Standalone license tools for scanning debian/ubuntu apps?


Dan Kegel
 

Hi all!

Coming up with a list of licenses a binary is bound by is
a mind-boggling task that I avoid whenever possible.
I've been watching spdx and friends from afar for some time
in hopes they will help.

Recently I was asked to write a stateless, standalone tool that takes
a path to a dynamically linked Linux binary and outputs an approximate
list of licenses the shared libraries it uses are bound by. Here's my
current draft:
https://github.com/Oblong/obs/blob/master/ob-list-licenses

Roughly, it uses ldd and dpkg-query to locate copyright files
for all shared libraries it references, and then either
just outputs the License: values for DEP-5 copyright files,
or uses scancode to detect them for non-DEP-5 copyright files.
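
As a rough illustration of the lookup described above (not the actual ob-list-licenses code), the sketch below parses ldd output, asks dpkg-query which package owns each library, and derives the conventional copyright file path. The function names are hypothetical, and the sketch assumes a Debian/Ubuntu system where copyright files live under /usr/share/doc/<package>/copyright.

```python
import re
import subprocess

# Matches resolved entries such as
# "libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f...)".
_LDD_RESOLVED = re.compile(r"^\s*\S+\s+=>\s+(/\S+)")


def parse_ldd_output(text):
    """Return the resolved library paths listed in ldd output, without duplicates."""
    paths = []
    for line in text.splitlines():
        match = _LDD_RESOLVED.match(line)
        if match and match.group(1) not in paths:
            paths.append(match.group(1))
    return paths


def parse_dpkg_owner(line):
    """Extract the package name from a 'dpkg-query -S' line like 'libc6:amd64: /lib/...'."""
    return line.split(":", 1)[0].strip()


def copyright_file(package):
    """Conventional location of a package's copyright file on Debian/Ubuntu."""
    return f"/usr/share/doc/{package}/copyright"


def copyright_files_for(binary):
    """Map each package that provides a shared library of `binary` to its copyright file."""
    ldd = subprocess.run(["ldd", binary], capture_output=True, text=True, check=True)
    result = {}
    for lib in parse_ldd_output(ldd.stdout):
        query = subprocess.run(["dpkg-query", "-S", lib], capture_output=True, text=True)
        if query.returncode != 0:
            continue  # not owned by any package, or reached through a symlinked path
        lines = query.stdout.splitlines()
        if lines:
            package = parse_dpkg_owner(lines[0])
            result[package] = copyright_file(package)
    return result
```

Calling `copyright_files_for("/bin/login")` would return a dictionary keyed by package name. Unresolved libraries ("not found") and the vDSO entry are skipped because they have no path after `=>`.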

Now I'm plugging along, adding optional heuristics like
"XXX of dependencies can be filtered out (because I'm only interested
in the bits pulled in via dynamic linking)"
where XXX is "files: debian/*" and "files: doc/*"

Am I duplicating work? I looked at fossology, but its complexity kind
of disqualifies it (nothing about it seems standalone or stateless).

Thanks,
Dan


Jeremiah C. Foster
 

Have you looked at the binary analysis tool?

Regards,

Jeremiah 

This e-mail and any attachment(s) are intended only for the recipient(s) named above and others who have been specifically authorized to receive them. They may contain confidential information. If you are not the intended recipient, please do not read this email or its attachment(s). Furthermore, you are hereby notified that any dissemination, distribution or copying of this e-mail and any attachment(s) is strictly prohibited. If you have received this e-mail in error, please immediately notify the sender by replying to this e-mail and then delete this e-mail and any attachment(s) or copies thereof from your system. Thank you.


Kate Stewart
 


On Mon, Feb 4, 2019 at 8:47 PM Jeremiah C. Foster <jfoster@...> wrote:
> Have you looked at the binary analysis tool?

There's also BANG! (Binary Analysis Next Generation) that is in beta now.

Kate


Dan Kegel
 

I did look a bit at those, but they seemed more about unpacking
binaries than about wrangling copyrights.


Jeremiah C. Foster
 

If I'm not mistaken, copyright has to be a string because it has to be legible by humans. This means you can likely grep through source code as scancode does with a fair degree of confidence and use 'strings' on binaries.

Using DEP-5 and Debian Copyright files where you can should also be sufficient for due diligence in most jurisdictions, but I can't point to any legal precedent as evidence.

SPDX helps by creating a framework for human and machine readable documentation of your work, but you'll still need to scan code for copyright.

Binaries likely require a bit of reverse engineering.
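
A minimal sketch of the 'strings' plus grep idea, written in Python so it stays self-contained: it extracts printable ASCII runs from a binary blob and keeps those that look like copyright statements. The regular expressions and function names are illustrative assumptions; real scanners such as scancode use far more elaborate detection.

```python
import re

# Runs of at least 6 printable ASCII characters, a rough stand-in for what 'strings' reports.
_PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")

# Very rough pattern for copyright statements: "Copyright [(c)] YYYY" or "(c) YYYY".
_COPYRIGHT = re.compile(r"copyright\s+(?:\(c\)\s*)?\d{4}|\(c\)\s*\d{4}", re.IGNORECASE)


def printable_strings(data):
    """Return the printable ASCII runs found in a bytes object."""
    return [m.group().decode("ascii") for m in _PRINTABLE_RUN.finditer(data)]


def copyright_lines(data):
    """Return the printable runs that look like copyright statements."""
    return [s.strip() for s in printable_strings(data) if _COPYRIGHT.search(s)]


def copyright_lines_in_file(path):
    """Convenience wrapper that reads a file in binary mode."""
    with open(path, "rb") as handle:
        return copyright_lines(handle.read())
```

This only finds statements that survive compilation as literal strings, so an empty result says little about the actual licensing of the binary.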


From: Dan Kegel <dank@...>
Sent: Monday, February 4, 2019 23:49
To: spdx@...
Subject: Re: [spdx] Standalone license tools for scanning debian/ubuntu apps?

I did look a bit at those, but they seemed more about unpacking
binaries than about wrangling copyrights.








Dan Kegel
 

On Tue, Feb 5, 2019 at 1:30 PM Jeremiah C. Foster <jfoster@...> wrote:
> If I'm not mistaken, copyright has to be a string because it has to be legible by humans. This means you can likely grep through source code as scancode does with a fair degree of confidence and use 'strings' on binaries.
>
> Using DEP-5 and Debian Copyright files where you can should also be sufficient for due diligence in most jurisdictions, but I can't point to any legal precedent as evidence.
>
> SPDX helps by creating a framework for human and machine readable documentation of your work, but you'll still need to scan code for copyright.
>
> Binaries likely require a bit of reverse engineering.

Yes, absolutely.

SPDX's set of standard licenses and ids (and scancode's somewhat
expanded similar set) are great for stating license info succinctly.

scancode is great at collecting the info that should go into the
debian copyright file.

My goal for this iteration at our licensing process was to automate
collection of license info for the shared libraries our binary uses.

Here's the pipeline I set up to do that:

1) https://github.com/Oblong/obs/blob/master/ob-filter-licenses reads
a DEP-5 (aka Debian copyright) file and filters out any clauses that
(most likely) do not propagate to shared library artifacts
2) https://github.com/Oblong/obs/blob/master/ob-parse-licenses reads a
Debian copyright file, filters it through ob-filter-licenses, and
outputs spdx ids. (For non-DEP-5 copyright files, it uses scancode to
guess licenses.)
3) https://github.com/Oblong/obs/blob/master/ob-list-licenses uses ldd
to look up shared libraries used by a binary, uses dpkg-query to look
up the containing packages, and runs ob-parse-licenses on them.
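
Step 2 of the pipeline above could be approximated as in the sketch below. It checks whether a copyright file is DEP-5 by looking for the Format: header, collects the License: short names, and maps a few Debian short names to SPDX identifiers. This is not the ob-parse-licenses implementation: the mapping table is a small hypothetical sample, and the scancode fallback for non-DEP-5 files is left out.

```python
import re

# Hypothetical, partial mapping from Debian short names to SPDX identifiers.
DEBIAN_TO_SPDX = {
    "GPL-2": "GPL-2.0-only",
    "GPL-2+": "GPL-2.0-or-later",
    "GPL-3": "GPL-3.0-only",
    "GPL-3+": "GPL-3.0-or-later",
    "LGPL-2.1": "LGPL-2.1-only",
    "LGPL-2.1+": "LGPL-2.1-or-later",
    "BSD-3-clause": "BSD-3-Clause",
    "Expat": "MIT",
}

_LICENSE_FIELD = re.compile(r"^License:[ \t]*(.+)$", re.MULTILINE)
_SEPARATORS = re.compile(r"\s+(?:or|and)\s+|,\s*")


def is_dep5(text):
    """True if the first non-empty line is a DEP-5 'Format:' header."""
    for line in text.splitlines():
        if line.strip():
            return line.startswith("Format:")
    return False


def spdx_ids(text):
    """Return SPDX ids for every License: short name, in first-seen order."""
    ids = []
    for field in _LICENSE_FIELD.findall(text):
        for name in _SEPARATORS.split(field.strip()):
            # Unknown short names are passed through unchanged.
            spdx = DEBIAN_TO_SPDX.get(name, name)
            if spdx and spdx not in ids:
                ids.append(spdx)
    return ids
```

Passing unknown names through unchanged keeps package-specific labels visible in the output instead of hiding them.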

For instance, running "ob-list-licenses /bin/login" outputs:

libaudit1 https://people.redhat.com/sgrubb/audit/
GPL-2
LGPL-2.1

libc6 https://www.gnu.org/software/libc/libc.html
libc6-special

libcap-ng0 http://people.redhat.com/sgrubb/libcap-ng
GPL-2.0-or-later
LGPL-2.1-only
GPL-1.0-or-later

libpam0g http://www.linux-pam.org/
BSD-3-Clause
GPL-1.0-or-later
GPL-2.0-only

This of course only solves a small part of the license / copyright
problem, and only approximately, but it found interesting things for
us.
- Dan


Philippe Ombredanne
 

Hi Dan:
You are asking a simple question for which there is no simple
answer: this is not yet a solved problem and there is no easy button
to press.
Hence the long answer.

On Mon, Feb 4, 2019 at 8:20 PM Dan Kegel <dank@...> wrote:

> Am I duplicating work? I looked at fossology, but its complexity kind
> of disqualifies it (nothing about it seems standalone or stateless).

Since you are trying to figure out the license of a shared object (aka
library, or DLL) you need to know the license of the files that are
compiled/linked in it.
And quite rightly, using Debian copyright files will help you find out
the license of the source files. So will a ScanCode scan (and it will
also look in the binaries and report anything that can be found
there).

But that would not help you find out which set of source files are
baked in that shared object and what is the effective license of the
shared object in many cases (short of applying some extra heuristics
or work on top).

The most common problem is when a package provides a library and
command line tools/utilities, each under a different license (typically
the library is LGPL and the command line utilities are GPL). There are
also other files such as build scripts, test files, documentation,
tools, etc that may use several other licenses. These are not linked
in the shared library and therefore would need to be parsed out to
properly conclude what is the effective shared object license.

As an example, libcap-ng (one of the packages you listed) is both small
and typical of many Linux libraries in its licensing and code organization,
and in the issues that come up when you are trying to find which license
applies to what.

- It is overall under LGPL-2.1 or later and in particular its library
is using this license [1] but its RPM spec file (libcap-ng.spec) is
not up to date and refers to an LGPL-2.0+ instead of a 2.1 version.
- Command line utilities are under GPL 2.0 [2]
- Some build scripts are MIT-licensed (configure.ac), or use other
similar licenses (INSTALL is FSFAP) or are GPL- or LGPL-licensed
(Makefile.am)
- The root directory contains a copy of the GPL 2 (COPYING) and LGPL
2.1 (COPYING.LIB) and another copy of the LGPL (LICENSE). But there
is no indication of which one applies to what, except for the not
entirely correct spec file mentioned above.
- The corresponding Debian copyright file [3] is not structured to be
machine readable yet. Still, it provides a bit more information: the top
level license is properly reported as LGPL-2.1 or later, and there
is a mention of the GPL-2.0-licensed build scripts and command line
utilities. But it also introduces a new GPL-3.0 license for the Debian
packaging, which removes some clarity from the licensing documentation.

Once you are used to this, the overall licensing is pretty clear after
a quick review (and the help of a ScanCode scan of course ;)): the
shared library license is LGPL-2.1-or-later and nothing else. But
it is mighty difficult to automate things so that they come to the same
correct conclusion.

If you could know which exact files are included in the shared
object/library, you could get back to these source files to get the
licensing information.
For this there are a few ways to go:
1. trace which files are compiled and linked into the DLL
2. obtain that information from a DB or from a tool (without tracing)

1. For the tracing part
1.1- In the world of ELFs, you could use a debug build and parse out
debug symbols to get back the list of actual source files that are
baked in that executable. There is contributed code in scancode [4] to
extract DWARF debug symbols and get the corresponding source code file
paths, but that has not been fully integrated yet in the main tool.
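
One way to experiment with this idea without the scancode contrib code is to parse the text that binutils' readelf prints for `--debug-dump=info`. The sketch below collects the DW_AT_name of each compile unit. It assumes the usual readelf layout (DIE headers like `<0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)` followed by indented attribute lines), which can vary between binutils versions, and the function names are hypothetical. Compile units name only the primary source file; included headers would have to come from the line table.

```python
import re
import subprocess

_DIE_START = re.compile(r"^\s*<\d+><[0-9a-f]+>: Abbrev Number:")
_CU_START = re.compile(r"^\s*<0><[0-9a-f]+>: Abbrev Number: \d+ \(DW_TAG_compile_unit\)")
_ATTRIBUTE = re.compile(r"^\s*<[0-9a-f]+>\s+(DW_AT_\w+)\s*:\s*(.*)$")


def compile_unit_sources(dump):
    """Return the DW_AT_name of every compile unit found in a readelf info dump."""
    sources = []
    in_compile_unit = False
    for line in dump.splitlines():
        if _DIE_START.match(line):
            # A new DIE begins; remember whether it is a top-level compile unit.
            in_compile_unit = bool(_CU_START.match(line))
            continue
        attr = _ATTRIBUTE.match(line)
        if in_compile_unit and attr and attr.group(1) == "DW_AT_name":
            # Indirect strings print as "(indirect string, offset: 0x2f): name".
            sources.append(attr.group(2).rsplit("): ", 1)[-1].strip())
    return sources


def sources_of(binary):
    """Run readelf on a binary built with debug info and list its compile-unit sources."""
    dump = subprocess.run(
        ["readelf", "--debug-dump=info", binary],
        capture_output=True, text=True, check=True,
    ).stdout
    return compile_unit_sources(dump)
```

On a stripped binary the dump is empty and the result is an empty list, which is why a debug build (or separate debug symbols) is needed.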

1.2- you could trace the build such that you know exactly which files
are used and included in the .so. I maintain TraceCode for this [5]
and quartermaster [6] is also doing similar things (with different
approaches). TraceCode works from an strace system calls trace of an
unmodified/uninstrumented build to recreate/reverse engineer a build
graph as it happens in user space. This is not a magically automated
solution though and results require review and interpretation. But it
works not too badly on single libraries.
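
To give a feel for the raw material such tracing works from, here is a small sketch that reads an strace log and separates files opened for reading from files opened for writing. It is not how TraceCode is implemented. It assumes log lines in the common `pid openat(AT_FDCWD, "path", FLAGS) = fd` form, and the names are hypothetical.

```python
import re

# Matches successful or failed open()/openat() calls in strace output, e.g.
# 101 openat(AT_FDCWD, "src/cap-ng.c", O_RDONLY|O_NOCTTY) = 3
_OPEN_CALL = re.compile(
    r'\b(?:open|openat)\((?:AT_FDCWD, )?"([^"]+)", ([A-Z_|0-9]+)[^)]*\)\s+=\s+(-?\d+)'
)


def files_touched(log):
    """Return (reads, writes): sets of paths successfully opened for reading or writing."""
    reads, writes = set(), set()
    for line in log.splitlines():
        call = _OPEN_CALL.search(line)
        if not call:
            continue
        path, flags, result = call.group(1), call.group(2).split("|"), int(call.group(3))
        if result < 0:
            continue  # the open failed, so the file was not actually used
        if "O_WRONLY" in flags or "O_RDWR" in flags:
            writes.add(path)
        else:
            reads.add(path)
    return reads, writes
```

Files that appear in both sets (object files, for example) are intermediate build products; following them from the final .so back to files that are only ever read gives a first approximation of the sources that went into the library.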

2. To obtain the information without tracing:
Some tool or database could tell you, in a structured way, which files
are in the library, which files are in the CLI utilities, and which are
build scripts. These are called facets [7] in scancode,
a concept borrowed from ClearlyDefined [8]. But being able to
define facets and having facets defined are not the same thing. Scancode can
report facets for each file if you define them. But it does not yet have
the ability to infer facets from the code, for instance by
assigning "Makefile.am" to a development/build scripts facet. Working
together on this would go a long way. There are a few pending Scancode
tickets on this [9] [10].
As for ClearlyDefined, our goal (I contribute to the project) is to
have facets possibly contributed as part of the curation and review
process. And any enhancement to Scancode to infer facets would also
benefit ClearlyDefined and would likely be of a great help to Debian
to improve copyright files that are not yet machine readable too.
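
A path-based heuristic for inferring facets might start out like the sketch below. The facet names follow the core/dev/tests/docs/examples vocabulary used by ScanCode and ClearlyDefined, but the rules themselves are hypothetical guesses for a libcap-ng-like source tree, not anything either project ships.

```python
from fnmatch import fnmatchcase

# Hypothetical rules, checked in order; the first matching facet wins.
# Note that '*' in fnmatch patterns also matches '/'.
FACET_RULES = [
    ("tests", ("test/*", "tests/*", "*/test/*", "*/tests/*", "*_test.*")),
    ("docs", ("doc/*", "docs/*", "*.md", "README*", "INSTALL")),
    ("examples", ("examples/*", "*/examples/*")),
    ("dev", ("Makefile.am", "*/Makefile.am", "configure.ac", "*.spec", "m4/*", "debian/*")),
]


def infer_facet(path):
    """Guess the facet of a source-tree path; anything unmatched is treated as core."""
    for facet, patterns in FACET_RULES:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return facet
    return "core"


def core_files(paths):
    """Keep only the paths that the heuristic considers part of the shipped code."""
    return [p for p in paths if infer_facet(p) == "core"]
```

Treating unmatched files as core keeps the heuristic conservative: a file is excluded from the library's license conclusion only when a rule explicitly says it is a build script, test, document, or example.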

[1] https://github.com/stevegrubb/libcap-ng/blob/master/libcap-ng.spec#L7
[2] https://github.com/stevegrubb/libcap-ng/blob/master/libcap-ng.spec#L57
[3] https://metadata.ftp-master.debian.org/changelogs/main/libc/libcap-ng/libcap-ng_0.7.7-3_copyright
[4] https://github.com/nexB/scancode-toolkit-contrib/tree/develop/src/compiledcode
[5] https://github.com/nexB/tracecode-toolkit
[6] https://github.com/QMSTR/qmstr
[7] https://github.com/nexB/scancode-toolkit/blob/develop/src/summarycode/facet.py#L58
[8] https://clearlydefined.io/
[9] https://github.com/nexB/scancode-toolkit/issues/1036
[10] https://github.com/nexB/scancode-toolkit/issues/377
--
Cordially
Philippe Ombredanne

+1 650 799 0949 | pombredanne@...
ScanCode maintainer


Kate Stewart
 



On Tue, Feb 5, 2019 at 5:32 PM Dan Kegel <dank@...> wrote:
> My goal for this iteration at our licensing process was to automate
> collection of license info for the shared libraries our binary uses.

Hi Dan,
I'm not sure what you're using for a build infrastructure, but there
are some solutions emerging in Yocto that may be relevant, as well
as the other projects that Philippe outlines.

I checked with Richard and he confirms that
"The Yocto Project already builds everything with debug symbols which
get linked and separated into separate packages. It already uses
dwarfsrcfiles to generate a list of source code files which went into
creating a given binary.

The Project also has license information for each software recipe it
builds.

There are some work in progress patches, not quite ready to merge yet
but working which combine these two pieces of information, along with
scanning the source files for SPDX headers to give information about
the possible license a binary may be under."

So if you're using Yocto for your builds and want to help make this
capability available faster, rather than create a stand-alone tool, feel
free to reach out to Richard (on cc).
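
The last step Richard mentions, scanning the source files behind a binary for SPDX headers, can be sketched as follows. This is not the Yocto patch set. It assumes the list of source files is already known, and the function names and the "UNKNOWN" bucket are illustrative choices.

```python
import re

_SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")
_COMMENT_CLOSER = re.compile(r"\s*(\*/|-->)\s*$")


def spdx_expression(source_text, max_lines=20):
    """Return the SPDX license expression declared near the top of a file, or None."""
    for line in source_text.splitlines()[:max_lines]:
        tag = _SPDX_TAG.search(line)
        if tag:
            # Drop a trailing comment terminator such as "*/".
            return _COMMENT_CLOSER.sub("", tag.group(1).strip())
    return None


def group_by_license(sources):
    """Group {path: file text} by declared SPDX expression; untagged files go to 'UNKNOWN'."""
    groups = {}
    for path, text in sources.items():
        groups.setdefault(spdx_expression(text) or "UNKNOWN", []).append(path)
    return groups
```

The "UNKNOWN" group is the part that still needs a scanner such as scancode or a human review, since many source files carry no SPDX tag.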

Thanks, Kate