Re: Standalone license tools for scanning debian/ubuntu apps?

Philippe Ombredanne

Hi Dan:
You are asking a simple question for which there is no simple
answer: this is not yet a solved problem and there is no easy button
to press.
Hence the long answer.

On Mon, Feb 4, 2019 at 8:20 PM Dan Kegel <dank@...> wrote:

Hi all!

Coming up with a list of licenses a binary is bound by is
a mind-boggling task that I avoid whenever possible.
I've been watching spdx and friends from afar for some time
in hopes they will help.

Recently I was asked to write a stateless, standalone tool that takes
a path to a dynamically linked linux binary, and outputs an approximate
list of licenses the shared libraries it uses are bound by. Here's my
current draft:

Roughly, it uses ldd and dpkg-query to locate copyright files
for all shared libraries it references, and then either
just outputs the License: values for DEP-5 copyright files,
or uses scancode to detect them for non-DEP-5 copyright files.
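
The lookup described above could be sketched roughly as follows in Python. This is only an illustration, not the draft tool itself: the function names are made up here, the parsing assumes the usual `ldd` line shape ("name => /path (address)") and the usual `dpkg-query -S` line shape ("package:arch: /path"), and the copyright file is assumed to live at /usr/share/doc/<package>/copyright.

```python
import re
import subprocess

# Typical ldd line: "libcap-ng.so.0 => /lib/x86_64-linux-gnu/libcap-ng.so.0 (0x00007f...)"
LDD_LINE = re.compile(r"^\s*\S+\s+=>\s+(/\S+)")


def parse_ldd_output(text):
    """Return the resolved library paths found in ldd output, in order."""
    paths = []
    for line in text.splitlines():
        match = LDD_LINE.match(line)
        if match:  # skips the vdso, the loader line and "not found" entries
            paths.append(match.group(1))
    return paths


def parse_dpkg_search(text):
    """Return package names from `dpkg-query -S` output, without the arch suffix."""
    packages = []
    for line in text.splitlines():
        if ": " not in line or line.startswith("diversion"):
            continue
        owners = line.split(": ", 1)[0]
        for owner in owners.split(", "):
            name = owner.split(":", 1)[0].strip()
            if name and name not in packages:
                packages.append(name)
    return packages


def copyright_path(package):
    """Assumed conventional location of a Debian package's copyright file."""
    return f"/usr/share/doc/{package}/copyright"


def copyright_files_for(binary):
    """Run ldd and dpkg-query for a binary and list candidate copyright files."""
    ldd = subprocess.run(["ldd", binary], capture_output=True, text=True, check=True)
    found = []
    for lib in parse_ldd_output(ldd.stdout):
        # No check=True: a library not owned by any package is simply skipped.
        query = subprocess.run(["dpkg-query", "-S", lib], capture_output=True, text=True)
        for package in parse_dpkg_search(query.stdout):
            path = copyright_path(package)
            if path not in found:
                found.append(path)
    return found
```

Only `copyright_files_for` touches the system; the two parsers work on plain text, so they can be checked against captured output.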

Now I'm plugging along, adding optional heuristics like
"XXX of dependencies can be filtered out (because I'm only interested
in the bits pulled in via dynamic linking)"
where XXX is "files: debian/*" and "files: doc/*"

Am I duplicating work? I looked at fossology, but its complexity kind
of disqualifies it (nothing about it seems standalone or stateless).

Since you are trying to figure out the license of a shared object (aka
library, or DLL) you need to know the license of the files that are
compiled/linked in it.
And quite rightly, using Debian copyright files will help you find out
the license of the source files. So will a ScanCode scan (and it will
also look in the binaries and report anything that can be found there).

But in many cases that would not help you find out which set of source
files is baked into that shared object and what the effective license
of the shared object is (short of applying some extra heuristics or
work on top).

The most common problem is when a package provides a library and
command line tools/utilities and each uses a different license
(typically the library is LGPL and the command line utilities are GPL).
There are also other files such as build scripts, test files,
documentation, tools, etc. that may use several other licenses. These
are not linked in the shared library and therefore would need to be
filtered out to properly conclude what the effective shared object
license is.

As an example, libcap-ng (one of the packages you listed) is both small
and typical of the licensing and code organization of many Linux
libraries, and of the issues that come up when you are trying to find
which license applies to what.

- It is overall under LGPL-2.1 or later and in particular its library
is using this license [1], but its RPM spec file (libcap-ng.spec) is
not up to date and refers to LGPL-2.0+ instead of a 2.1 version.
- Command line utilities are under GPL 2.0 [2]
- Some build scripts are MIT-licensed, or use other similar licenses
(INSTALL is FSFAP), or are GPL- or LGPL-licensed
- The root directory contains a copy of the GPL 2 (COPYING) and LGPL
2.1 (COPYING.LIB) and another copy of the LGPL (LICENSE). But there
is no indication of which one applies to what, except for the not
entirely correct spec file mentioned above.
- The corresponding Debian copyright file [3] is not structured to be
machine readable yet. Yet it provides a bit more information: the top
level license is properly reported as LGPL-2.1 or later. And there
is a mention of the GPL-2.0-licensed build scripts and command line
utilities. But it also introduces a new GPL-3.0 license for the Debian
packaging, which takes some clarity away from the licensing
documentation.

The overall licensing is pretty clear after a quick review when you
are used to this (and with the help of a ScanCode scan of course ;)):
the shared library license is LGPL-2.1-or-later and nothing else. But
it is mighty difficult to automate coming to the same correct
conclusion.

If you could know which exact files are included in the shared
object/library, you could get back to these source files to get the
licensing information.
For this there are a few ways to go:
1. trace which files are compiled and linked into the DLL
2. obtain that information from a DB or from a tool (without tracing)

1. For the tracing part
1.1- in the world of ELFs, you could use a debug build and parse out
debug symbols to get back the list of actual source files that are
baked into that executable. There is contributed code in ScanCode [4]
to extract DWARF debug symbols and get the corresponding source code
file paths, but that has not been fully integrated yet in the main tool.
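
As a rough illustration of the debug-symbol route (this is not the contributed ScanCode code, just a sketch), the Python below parses the text printed by `readelf --debug-dump=info` and collects the DW_AT_name of each DW_TAG_compile_unit entry. It assumes a binary built with debug info and the usual binutils text layout; the function names are made up for this example, and it lists compilation units only, not the headers they include.

```python
import re
import subprocess

# DIE header, e.g. " <0><c>: Abbrev Number: 1 (DW_TAG_compile_unit)"
DIE_HEADER = re.compile(
    r"^\s*<\d+><[0-9a-f]+>: Abbrev Number: \d+(?: \((DW_TAG_\w+)\))?"
)
# Name attribute, either direct or "(indirect string, offset: 0x..): name"
NAME_ATTR = re.compile(r"DW_AT_name\s*:\s*(?:\(indirect[^)]*\):\s*)?(.+?)\s*$")


def compile_unit_names(dump):
    """Return the names of compile units found in a readelf info dump."""
    names, in_compile_unit = [], False
    for line in dump.splitlines():
        header = DIE_HEADER.match(line)
        if header:
            # Track whether the attributes that follow belong to a compile unit.
            in_compile_unit = header.group(1) == "DW_TAG_compile_unit"
            continue
        if in_compile_unit:
            match = NAME_ATTR.search(line)
            if match and match.group(1) not in names:
                names.append(match.group(1))
    return names


def source_files(binary):
    """Dump DWARF info with readelf and list the compile unit source paths."""
    result = subprocess.run(
        ["readelf", "--debug-dump=info", binary],
        capture_output=True, text=True, check=True,
    )
    return compile_unit_names(result.stdout)
```

The paths returned are whatever the compiler recorded, so they may be relative to the DW_AT_comp_dir of each unit.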

1.2- you could trace the build such that you know exactly which files
are used and included in the .so. I maintain TraceCode for this [5]
and quartermaster [6] is also doing similar things (with different
approaches). TraceCode works from an strace system calls trace of an
unmodified/uninstrumented build to recreate/reverse engineer a build
graph as it happens in user space. This is not a magically automated
solution though and results require review and interpretation. But it
works not too badly on single libraries.
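
To give a feel for the raw material such tracing starts from, here is a deliberately crude Python sketch. It is not how TraceCode or Quartermaster work (they reconstruct a build graph); it only scans the text of an strace log for successful read-only open/openat calls on files with source-like suffixes. The regular expression, the suffix list and the function name are assumptions made for this example, and it does not tell you which output file each source ended up in.

```python
import re

# Matches e.g.: 1234 openat(AT_FDCWD, "src/cap-ng.c", O_RDONLY|O_NOCTTY) = 3
OPEN_CALL = re.compile(
    r'\b(?:open|openat)\((?:AT_FDCWD, )?"([^"]+)", ([A-Z_|0-9]+)[^)]*\)\s*=\s*(-?\d+)'
)

# Hypothetical list of suffixes treated as source code in this sketch.
SOURCE_SUFFIXES = (".c", ".h", ".cc", ".cpp", ".hpp", ".S")


def sources_read(trace_text):
    """Return source-like files that were successfully opened for reading."""
    seen = []
    for line in trace_text.splitlines():
        match = OPEN_CALL.search(line)
        if not match:
            continue
        path, flags, result = match.group(1), match.group(2), int(match.group(3))
        if result < 0:
            continue  # the open failed (e.g. ENOENT during include search)
        if "O_WRONLY" in flags or "O_RDWR" in flags:
            continue  # opened for writing: an output, not an input
        if path.endswith(SOURCE_SUFFIXES) and path not in seen:
            seen.append(path)
    return seen
```

A log to feed it could come from something like `strace -f -e trace=open,openat -o build.trace make`, and the result still needs the review and interpretation mentioned above.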

2. To obtain the information without tracing:
Some tool or database could tell you, in a structured way, which files
are in the library, which files are in the CLI utilities and which are
build scripts. These are what are called facets [7] in ScanCode,
a concept borrowed from ClearlyDefined [8]. But being able to
define facets and having facets defined are not the same thing. ScanCode
can report facets for each file if you tell it which ones apply. But it
does not yet have the ability to infer facets from the code, such as
for instance assigning "" to a development/build scripts facet. Working
together on this would go a long way. There are a few pending ScanCode
tickets on this [9] [10].
As for ClearlyDefined, our goal (I contribute to the project) is to
have facets possibly contributed as part of the curation and review
process. And any enhancement to ScanCode to infer facets would also
benefit ClearlyDefined and would likely be of great help to Debian
to improve copyright files that are not yet machine readable too.
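
To make the facet idea concrete, here is a small Python sketch of what path-based facet assignment could look like. It is not ScanCode's or ClearlyDefined's implementation: the glob rules, the default of "core" for unmatched files, the extra "cli" facet and the sample paths and licenses in the usage are all invented for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (facet name, glob patterns). First matching rule wins.
DEFAULT_RULES = [
    ("tests", ["test/*", "tests/*", "*/test/*", "*/tests/*"]),
    ("docs", ["doc/*", "docs/*", "*.md", "README*"]),
    ("dev", ["configure*", "Makefile*", "*/Makefile*", "*.m4", "*.spec", "debian/*"]),
]


def facet_for(path, rules=DEFAULT_RULES, default="core"):
    """Return the facet of the first rule matching path, else the default."""
    for facet, patterns in rules:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return facet
    return default


def licenses_by_facet(file_licenses, rules=DEFAULT_RULES):
    """Group detected licenses (path -> list of license keys) by facet."""
    grouped = {}
    for path, found in file_licenses.items():
        grouped.setdefault(facet_for(path, rules), set()).update(found)
    return {facet: sorted(values) for facet, values in grouped.items()}


# Usage with a project-specific rule that separates CLI utilities from the
# library; the scan results below are made-up sample data.
rules = [("cli", ["utils/*"])] + DEFAULT_RULES
scan = {
    "src/cap-ng.c": ["LGPL-2.1-or-later"],
    "utils/netcap.c": ["GPL-2.0"],
    "configure.ac": ["MIT"],
}
print(licenses_by_facet(scan, rules))
```

With rules like these, the licenses left in the "core" facet are the candidates for the effective license of the shared library, which is the part that today still needs a human to define or review.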

Philippe Ombredanne

+1 650 799 0949 | pombredanne@...
ScanCode maintainer
