Re: [spdx-tech] An example of a super simple SPDX licenses registry, for discussion

Philippe Ombredanne

Richard, Jeff:

On Mon, Mar 11, 2019 at 10:32 PM Richard Fontana <rfontana@...> wrote:
> Use of "LicenseRef" (not to mention something like
> NOASSERTION) is a nonstarter for the use cases we are most interested
> in. What we've actually done in some cases is use the nonstandard
> identifiers created by nexB.
Agreed. What I am trying to achieve here is to make these become "standard" and
known at SPDX. I think this is possible.

On Sun, Mar 10, 2019 at 12:44 PM Jeff McAffer
<Jeff.McAffer@...> wrote:
> IMO the "ideal" here is that there is some automated way of
> "fingerprinting" license texts such that two parties, given more or less
> the same text, can independently come up with the same id. At that point
> you would not need a registry, just a shared algorithm. When/if eventually
> SPDX does recognize a given license and gives it a formal id, there could
> be a relatively simple aliasing step where SPDX id "SomeCoolLicense-1.0"
> is AKA "LicenseRef-43bdf298"
This ideal works in theory, but for several reasons that I outline below it
would be too brittle in practice: near-identical texts would get different
fingerprints too often for it to work. Running a full license detection is a
better way to dedupe things. That requires some form of centralization, but it
could be fully automated. The other thing is that IMO giving a name/id does
matter a lot: a license named 43bdf298 is not really human friendly.

Now even if license-text-fingerprint-as-id were to work out, the difficult part
is not so much the algorithm for computing these, but the content you feed for
fingerprinting. And that part is not easy to automate:

- For instance, is a copyright part of the license or not (I think not, but

- Or what about statements around a license? For instance these two SPDX
licenses may not really deserve a different id yet they have one: and

The LICENSE file in the original code archives does not have a patent
disclaimer statement footer seen in bzip2-1.0.5's SPDX license text.
That footer is present on the website only. I would not treat
this as part of the license, but this was treated as part of it here. This
is a judgment call.

- Or for instance, there are 6+ versions of the text of the GPL-2.0 which are
really the same license but would fingerprint differently.
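
To illustrate the brittleness, here is a minimal sketch in Python. The
fingerprint() function, its normalization rules and the sample texts are all
made up for this example; this is not an SPDX or scancode algorithm, just one
plausible "hash the normalized text" scheme. Whitespace and case differences
are absorbed by the normalization, but including or excluding the copyright
line changes the id:

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hypothetical id: SHA-1 of the lowercased, whitespace-collapsed text."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    # Keep 8 hex characters, in the style of "LicenseRef-43bdf298".
    return "LicenseRef-" + digest[:8]


# Made-up texts for illustration only, not real license texts.
bare_text = "Permission is hereby granted to use this software."
reflowed_text = "Permission  is hereby granted\nto use this Software."
with_copyright = "Copyright (c) 2019 Example Corp.\n" + bare_text

# Reflowing and case changes are absorbed by the normalization...
same_after_reflow = fingerprint(bare_text) == fingerprint(reflowed_text)
# ...but feeding the copyright line to the hash yields a different id.
same_with_copyright = fingerprint(bare_text) == fingerprint(with_copyright)

print(fingerprint(bare_text), same_after_reflow, same_with_copyright)
```

Every extra normalization rule (strip copyrights, strip footers, ignore
addresses) is another judgment call that both parties would have to share
exactly for their ids to agree.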

Therefore a fingerprint algorithm would be hard to generalize: either it would
need many exceptions, or a simple one would be too brittle in too many cases.
Deduping is best achieved by license detection with a full diff (which
is what scancode does FWIW).
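
As a rough sketch of what diff-based deduping could look like, here is a
small Python example using the standard library's difflib. To be clear, this
is not scancode's implementation: the match_known() helper, the 0.9
threshold, the "Example-1.0" id and the sample texts are assumptions made up
for illustration.

```python
import difflib
import re


def tokens(text):
    """Lowercased word tokens; punctuation and layout are ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(a, b):
    """Token-level diff ratio between two texts, from 0.0 to 1.0."""
    matcher = difflib.SequenceMatcher(None, tokens(a), tokens(b), autojunk=False)
    return matcher.ratio()


def match_known(text, known, threshold=0.9):
    """Return (id, score) of the closest known text, or (None, score) if below threshold."""
    best_id, best_score = None, 0.0
    for license_id, reference in known.items():
        score = similarity(text, reference)
        if score > best_score:
            best_id, best_score = license_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Made-up reference and candidate texts, for illustration only.
known = {
    "Example-1.0": "Permission is hereby granted to use, copy and modify "
                   "this software for any purpose.",
}
variant = ("Permission is hereby granted to use, copy, and modify "
           "this Software for any purpose, without fee.")
unrelated = "All rights reserved. No use is permitted."

print(match_known(variant, known))
print(match_known(unrelated, known))
```

A slightly different variant still maps to the known id because the diff
tolerates small insertions, where a hash of the text would not. The catch is
that you need the set of known texts in one place to compare against, which
is where the centralization comes in.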

Let me follow up with my suggestion.

Philippe Ombredanne
