Standalone license tools for scanning debian/ubuntu apps?
Dan Kegel
Hi all!
Coming up with a list of licenses a binary is bound by is a mind-boggling task that I avoid whenever possible. I've been watching spdx and friends from afar for some time in hopes they will help.

Recently I was asked to write a stateless, standalone tool that takes a path to a dynamically linked linux binary, and outputs an approximate list of licenses the shared libraries it uses are bound by. Here's my current draft:
https://github.com/Oblong/obs/blob/master/ob-list-licenses

Roughly, it uses ldd and dpkg-query to locate copyright files for all shared libraries it references, and then either just outputs the License: values for DEP-5 copyright files, or uses scancode to detect them for non-DEP-5 copyright files.

Now I'm plugging along, adding optional heuristics like "XXX of dependencies can be filtered out (because I'm only interested in the bits pulled in via dynamic linking)" where XXX is "files: debian/*" and "files: doc/*"

Am I duplicating work? I looked at fossology, but its complexity kind of disqualifies it (nothing about it seems standalone or stateless).

Thanks,
Dan
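A minimal sketch of the lookup described above, not the code in ob-list-licenses itself. It assumes a Debian-style system where `ldd` and `dpkg-query -S` are available and where each package ships its copyright file at /usr/share/doc/<package>/copyright. The function names are hypothetical, and the parsing handles only the common single-package output form.

```python
import re
import subprocess

# Matches ldd lines such as "libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)".
# Lines without a resolved absolute path (vdso, "not found") are skipped.
_LDD_LINE = re.compile(r"=>\s+(/\S+)")


def parse_ldd_output(text):
    """Return the resolved shared library paths from ldd's output, in order."""
    paths = []
    for line in text.splitlines():
        match = _LDD_LINE.search(line)
        if match:
            paths.append(match.group(1))
    return paths


def parse_dpkg_search(text):
    """Extract the package name from a line like 'libc6:amd64: /lib/...'.

    Only the simple "one package owns the path" form is handled here.
    """
    first = text.splitlines()[0] if text.strip() else ""
    package = first.split(": ", 1)[0]      # "libc6:amd64"
    return package.split(":", 1)[0] or None  # drop the architecture qualifier


def copyright_files_for(binary):
    """Map each package providing a shared library of `binary` to its copyright file."""
    ldd = subprocess.run(["ldd", binary], capture_output=True, text=True, check=True)
    result = {}
    for lib in parse_ldd_output(ldd.stdout):
        # dpkg-query -S exits non-zero when no package owns the path; skip those.
        found = subprocess.run(["dpkg-query", "-S", lib], capture_output=True, text=True)
        package = parse_dpkg_search(found.stdout)
        if package:
            result[package] = f"/usr/share/doc/{package}/copyright"
    return result
```

On a Debian or Ubuntu machine, `copyright_files_for("/bin/login")` would return one entry per owning package. The two parsing helpers are kept separate from the subprocess calls so they can be checked without either tool installed.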
|
Have you looked at the binary analysis tool?
Regards,
Jeremiah
This e-mail and any attachment(s) are intended only for the recipient(s) named above and others who have been specifically authorized to receive them. They may contain confidential information. If you are not the intended recipient, please do not read this email or its attachment(s). Furthermore, you are hereby notified that any dissemination, distribution or copying of this e-mail and any attachment(s) is strictly prohibited. If you have received this e-mail in error, please immediately notify the sender by replying to this e-mail and then delete this e-mail and any attachment(s) or copies thereof from your system. Thank you. |
|
Kate Stewart
On Mon, Feb 4, 2019 at 8:47 PM Jeremiah C. Foster <jfoster@...> wrote:
There's also BANG! (Binary Analysis Next Generation) that is in beta now.
Kate
|
Dan Kegel
I did look a bit at those, but they seemed more about unpacking
binaries than about wrangling copyrights.
|
If I'm not mistaken, copyright has to be a string because it has to be legible by humans. This means you can likely grep through source code as scancode does with a fair degree of confidence and use 'strings' on binaries.
Using DEP-5 and Debian Copyright files where you can should also be sufficient for due diligence in most jurisdictions, but I can't point to any legal precedent as evidence. SPDX helps by creating a framework for human and machine readable documentation of your work, but you'll still need to scan code for copyright. Binaries likely require a bit of reverse engineering.
|
Dan Kegel
On Tue, Feb 5, 2019 at 1:30 PM Jeremiah C. Foster <jfoster@...> wrote:
> If I'm not mistaken, copyright has to be a string because it has to be legible by humans. This means you can likely grep through source code as scancode does with a fair degree of confidence and use 'strings' on binaries.

Yes, absolutely. SPDX's set of standard licenses and ids (and scancode's somewhat expanded similar set) are great for stating license info succinctly. scancode is great at collecting the info that should go into the debian copyright file.

My goal for this iteration of our licensing process was to automate collection of license info for the shared libraries our binary uses. Here's the pipeline I set up to do that:

1) https://github.com/Oblong/obs/blob/master/ob-filter-licenses reads a DEP-5 (aka Debian copyright) file and filters out any clauses that (most likely) do not propagate to shared library artifacts
2) https://github.com/Oblong/obs/blob/master/ob-parse-licenses reads a Debian copyright file, filters it through ob-filter-licenses, and outputs spdx ids. (For non-DEP-5 copyright files, it uses scancode to guess licenses.)
3) https://github.com/Oblong/obs/blob/master/ob-list-licenses uses ldd to look up shared libraries used by a binary, uses dpkg-query to look up the containing packages, and runs ob-parse-licenses on them.

For instance, running "ob-list-licenses /bin/login" outputs:

    libaudit1 https://people.redhat.com/sgrubb/audit/ GPL-2 LGPL-2.1
    libc6 https://www.gnu.org/software/libc/libc.html libc6-special
    libcap-ng0 http://people.redhat.com/sgrubb/libcap-ng GPL-2.0-or-later LGPL-2.1-only GPL-1.0-or-later
    libpam0g http://www.linux-pam.org/ BSD-3-Clause GPL-1.0-or-later GPL-2.0-only

This of course only solves a small part of the license / copyright problem, and only approximately, but it found interesting things for us.
- Dan
|
Philippe Ombredanne
Hi Dan:
You are asking a simple question for which there is no simple answer: this is not yet a solved problem and there is no easy button to press. Hence the long answer.

On Mon, Feb 4, 2019 at 8:20 PM Dan Kegel <dank@...> wrote:

Since you are trying to figure out the license of a shared object (aka library, or DLL), you need to know the license of the files that are compiled/linked in it. And quite rightly, using Debian copyright files will help you find out the license of the source files. So will a ScanCode scan (and it will also look in the binaries and report anything that can be found there). But in many cases that would not help you find out which set of source files are baked in that shared object and what the effective license of the shared object is (short of applying some extra heuristics or work on top).

The most common problem is when a package provides a library and command line tools/utilities and each uses a different license (typically the library is LGPL and the command line utilities are GPL). There are also other files such as build scripts, test files, documentation, tools, etc. that may use several other licenses. These are not linked in the shared library and therefore would need to be parsed out to properly conclude what the effective shared object license is.

As an example, libcap-ng (one of the packages you listed) is both small and typical of the licensing and code organization of many Linux libraries, and of the issues that come up when you are trying to find which license applies to what.

- It is overall under LGPL-2.1 or later and in particular its library is using this license [1], but its RPM spec file (libcap-ng.spec) is not up to date and refers to LGPL-2.0+ instead of a 2.1 version.
- Command line utilities are under GPL 2.0 [2].
- Some build scripts are MIT-licensed (configure.ac), or use other similar licenses (INSTALL is FSFAP), or are GPL- or LGPL-licensed (Makefile.am).
- The root directory contains a copy of the GPL 2 (COPYING) and LGPL 2.1 (COPYING.LIB) and another copy of the LGPL (LICENSE). But there is no indication of which one applies to what, except for the not entirely correct spec file mentioned above.
- The corresponding Debian copyright file [3] is not structured to be machine readable yet. Yet it provides a bit more information: the top level license is properly reported as LGPL-2.1 or later, and there is a mention of the GPL-2.0-licensed build scripts and command line utilities. But it also introduces a new GPL-3.0 license for the Debian packaging, removing some clarity from the licensing documentation.

The overall licensing is pretty clear after a quick review when you are used to this (and with the help of a ScanCode scan of course ;)): the shared library license is LGPL-2.1-or-later and nothing else. But it is mighty difficult to automate things so that they come to the same correct conclusion.

If you could know which exact files are included in the shared object/library, you could get back to these source files to get the licensing information. For this there are a few ways to go:

1. trace which files are compiled and linked into the DLL
2. obtain that information from a DB or from a tool (without tracing)

1. For the tracing part:

1.1- In the world of ELFs, you could use a debug build and parse out debug symbols to get back the list of actual source files that are baked in that executable. There is contributed code in scancode [4] to extract DWARF debug symbols and get the corresponding source code file paths, but that has not been fully integrated yet in the main tool.

1.2- You could trace the build such that you know exactly which files are used and included in the .so. I maintain TraceCode for this [5], and quartermaster [6] is also doing similar things (with different approaches). TraceCode works from an strace system call trace of an unmodified/uninstrumented build to recreate/reverse engineer a build graph as it happens in user space. This is not a magically automated solution though, and results require review and interpretation. But it works not too badly on single libraries.

2. To obtain the information without tracing:

Some tool or database could tell, in a structured way, which files are in the library, which files are in the CLI utilities, and which are build scripts. These are called facets [7] in scancode, a concept borrowed from ClearlyDefined [8]. But being able to define facets and having facets defined are not the same thing. Scancode can report facets for each file if you tell it which files belong to which facet. But it does not yet have the ability to infer facets from the code, such as, for instance, assigning "Makefile.am" to a development/build scripts facet. Working together on this would go a long way. There are a few pending Scancode tickets on this [9] [10].

As for ClearlyDefined, our goal (I contribute to the project) is to have facets possibly contributed as part of the curation and review process. Any enhancement to Scancode to infer facets would also benefit ClearlyDefined, and would likely be of great help to Debian in improving copyright files that are not yet machine readable.
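To make the facet idea concrete, here is a small sketch of what rule-based facet assignment could look like. It is not ScanCode's or ClearlyDefined's implementation: the rule table, the facet names, the function names and the example file paths are all hypothetical, chosen to echo the libcap-ng layout discussed above. Each path is matched against glob patterns in order, and anything unmatched falls into a default "core" facet.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (glob pattern, facet). The first matching rule wins.
# Note that "*" in fnmatch patterns also matches "/" in a path.
FACET_RULES = [
    ("*Makefile.am", "dev"),
    ("configure.ac", "dev"),
    ("*.spec", "dev"),
    ("utils/*", "cli"),
    ("docs/*", "docs"),
    ("debian/*", "packaging"),
]


def facet_of(path, rules=FACET_RULES, default="core"):
    """Return the facet of a relative file path, or `default` if no rule matches."""
    for pattern, facet in rules:
        if fnmatchcase(path, pattern):
            return facet
    return default


def licenses_by_facet(file_licenses):
    """Group a {path: license} mapping into {facet: sorted list of licenses}."""
    grouped = {}
    for path, license_name in file_licenses.items():
        grouped.setdefault(facet_of(path), set()).add(license_name)
    return {facet: sorted(names) for facet, names in grouped.items()}


if __name__ == "__main__":
    # Illustrative per-file scan results; the paths and values are made up.
    scan = {
        "src/cap-ng.c": "LGPL-2.1-or-later",
        "utils/pscap.c": "GPL-2.0",
        "configure.ac": "MIT",
    }
    print(licenses_by_facet(scan))
```

With per-file licenses grouped this way, only the "core" group would be considered when concluding the license of the shared object. The hard part remains writing rules that are right for each package, which is why results would still need review.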
[1] https://github.com/stevegrubb/libcap-ng/blob/master/libcap-ng.spec#L7
[2] https://github.com/stevegrubb/libcap-ng/blob/master/libcap-ng.spec#L57
[3] https://metadata.ftp-master.debian.org/changelogs/main/libc/libcap-ng/libcap-ng_0.7.7-3_copyright
[4] https://github.com/nexB/scancode-toolkit-contrib/tree/develop/src/compiledcode
[5] https://github.com/nexB/tracecode-toolkit
[6] https://github.com/QMSTR/qmstr
[7] https://github.com/nexB/scancode-toolkit/blob/develop/src/summarycode/facet.py#L58
[8] https://clearlydefined.io/
[9] https://github.com/nexB/scancode-toolkit/issues/1036
[10] https://github.com/nexB/scancode-toolkit/issues/377

--
Cordially
Philippe Ombredanne
+1 650 799 0949 | pombredanne@...
ScanCode maintainer
|
Kate Stewart
On Tue, Feb 5, 2019 at 5:32 PM Dan Kegel <dank@...> wrote:
On Tue, Feb 5, 2019 at 1:30 PM Jeremiah C. Foster <jfoster@...> wrote:

Hi Dan,

I'm not sure what you're using for a build infrastructure, but there are some solutions emerging in Yocto that may be relevant, as well as the other projects that Philippe outlines. I checked with Richard and he confirms that

So if you're using Yocto for your builds and want to help make this capability available faster, rather than create a stand-alone tool, feel free to reach out to Richard (on cc).

Thanks,
Kate
|