Re: Standalone license tools for scanning debian/ubuntu apps?
Philippe Ombredanne
Hi Dan:
You are asking a simple question for which there is no simple answer: this is not yet a solved problem and there is no easy button to press. Hence the long answer.

On Mon, Feb 4, 2019 at 8:20 PM Dan Kegel <dank@...> wrote:

Since you are trying to figure out the license of a shared object (aka library, or DLL) you need to know the license of the files that are compiled/linked into it. And quite rightly, using Debian copyright files will help you find out the license of the source files. So will a ScanCode scan (and it will also look in the binaries and report anything that can be found there).

But in many cases that would not help you find out which set of source files is baked into that shared object and what the effective license of the shared object is (short of applying some extra heuristics or work on top).

The most common problem is when a package provides a library and command line tools/utilities and each uses a different license (typically the library is LGPL and the command line utilities are GPL). There are also other files such as build scripts, test files, documentation, tools, etc. that may use several other licenses. These are not linked into the shared library and therefore would need to be filtered out to properly conclude what the effective shared object license is.

As an example, libcap-ng (one of the packages you listed) is both small and typical of many Linux libraries in its licensing and code organization, and in the issues that come up when you try to find which license applies to what:

- It is overall under LGPL-2.1 or later and in particular its library is using this license [1], but its RPM spec file (libcap-ng.spec) is not up to date and refers to an LGPL-2.0+ instead of a 2.1 version.
- Command line utilities are under GPL 2.0 [2].
- Some build scripts are MIT-licensed (configure.ac), or use other similar licenses (INSTALL is FSFAP), or are GPL- or LGPL-licensed (Makefile.am).
- The root directory contains a copy of the GPL 2 (COPYING) and LGPL 2.1 (COPYING.LIB) and another copy of the LGPL (LICENSE). But there is no indication of which one applies to what, except for the not entirely correct spec file mentioned above.
- The corresponding Debian copyright file [3] is not structured to be machine readable yet. Yet it provides a bit more information: the top level license is properly reported as LGPL-2.1 or later, and there is a mention of the GPL-2.0-licensed build scripts and command line utilities. But it also introduces a new GPL-3.0 license for the Debian packaging, which removes some clarity from the licensing documentation.

The overall licensing is pretty clear after a quick review when you are used to this (and with the help of a ScanCode scan of course ;)): the shared library license is LGPL-2.1-or-later and nothing else. But it is mighty difficult to automate things so that they come to the same correct conclusion.

If you could know exactly which files are included in the shared object/library, you could get back to these source files to get the licensing information. For this there are a few ways to go:

1. trace which files are compiled and linked into the DLL
2. obtain that information from a DB or from a tool (without tracing)

1. For the tracing part

1.1- In the world of ELFs, you could use a debug build and parse out debug symbols to get back the list of actual source files that are baked into that executable. There is contributed code in scancode [4] to extract DWARF debug symbols and get the corresponding source code file paths, but that has not been fully integrated yet in the main tool.

1.2- You could trace the build such that you know exactly which files are used and included in the .so.
I maintain TraceCode for this [5], and quartermaster [6] is also doing similar things (with different approaches). TraceCode works from an strace system call trace of an unmodified/uninstrumented build to recreate/reverse engineer the build graph as it happens in user space. This is not a magically automated solution though, and results require review and interpretation. But it works not too badly on single libraries.

2. To obtain the information without tracing

Some tool or database could tell you in a structured way which files are in the library, which files are in the CLI utilities and which are build scripts. These are called facets [7] in scancode, a concept borrowed from ClearlyDefined [8].

But being able to define facets and having facets defined are not the same thing. Scancode can report facets for each file if you tell it which files belong to which facet. But it does not yet have the ability to infer facets from the code, for instance by assigning "Makefile.am" to a development/build scripts facet. Working together on this would go a long way. There are a few pending Scancode tickets on this [9] [10].

As for ClearlyDefined, our goal (I contribute to the project) is to have facets possibly contributed as part of the curation and review process. And any enhancement to Scancode to infer facets would also benefit ClearlyDefined, and would likely be of great help to Debian too, to improve copyright files that are not yet machine readable.
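
To make the idea of inferring facets a bit more concrete, here is a minimal Python sketch. It is not ScanCode or ClearlyDefined code and does not use their APIs: the rules, the "cli" facet name, the function names and the sample paths and licenses are all assumptions made up for illustration, loosely modeled on the libcap-ng layout described above. It assigns each file to the first facet whose glob pattern matches, then groups the detected licenses per facet so that the licenses of the "core" (library) files can be looked at on their own.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from posixpath import basename

# Hypothetical rules as (facet, glob patterns); the first matching rule wins.
# "cli" is an invented facet name used here for command line utilities.
FACET_RULES = [
    ("dev", ["Makefile.am", "configure.ac", "*.spec", "*.m4"]),
    ("docs", ["INSTALL", "README*", "docs/*"]),
    ("cli", ["utils/*"]),
]
DEFAULT_FACET = "core"


def infer_facet(path, rules=FACET_RULES, default=DEFAULT_FACET):
    """Return the first facet whose pattern matches the path or its file name."""
    name = basename(path)
    for facet, patterns in rules:
        for pattern in patterns:
            if fnmatchcase(path, pattern) or fnmatchcase(name, pattern):
                return facet
    return default


def licenses_by_facet(file_licenses, rules=FACET_RULES):
    """Group license expressions (one per path) by the inferred facet."""
    grouped = defaultdict(set)
    for path, license_expression in file_licenses.items():
        grouped[infer_facet(path, rules)].add(license_expression)
    return dict(grouped)


# Illustrative input only: not real scan output.
SAMPLE = {
    "src/cap-ng.c": "LGPL-2.1-or-later",
    "utils/pscap.c": "GPL-2.0",
    "utils/Makefile.am": "GPL-2.0",
    "configure.ac": "MIT",
    "INSTALL": "FSFAP",
}

if __name__ == "__main__":
    for facet, licenses in sorted(licenses_by_facet(SAMPLE).items()):
        print(facet, sorted(licenses))
```

Rule order matters in this sketch: "utils/Makefile.am" ends up in "dev" rather than "cli" because the build script rule is checked first. Real rules would need review per package, which is exactly the part that is hard to automate.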

[1] https://github.com/stevegrubb/libcap-ng/blob/master/libcap-ng.spec#L7
[2] https://github.com/stevegrubb/libcap-ng/blob/master/libcap-ng.spec#L57
[3] https://metadata.ftp-master.debian.org/changelogs/main/libc/libcap-ng/libcap-ng_0.7.7-3_copyright
[4] https://github.com/nexB/scancode-toolkit-contrib/tree/develop/src/compiledcode
[5] https://github.com/nexB/tracecode-toolkit
[6] https://github.com/QMSTR/qmstr
[7] https://github.com/nexB/scancode-toolkit/blob/develop/src/summarycode/facet.py#L58
[8] https://clearlydefined.io/
[9] https://github.com/nexB/scancode-toolkit/issues/1036
[10] https://github.com/nexB/scancode-toolkit/issues/377

--
Cordially
Philippe Ombredanne
+1 650 799 0949 | pombredanne@...
ScanCode maintainer