Re: SPDX Legal call this Thursday


Philippe Ombredanne
 

On Wed, Sep 16, 2015 at 2:33 AM, J Lovejoy <opensource@...> wrote:
3) License matching templates/markup:
We have a task to add markup to some of the standard headers and have also
had input to add/edit markup on existing licenses. As a result of the
latter, it has been raised that perhaps the markup could be improved. Before
adding more markup (to standard headers, license text or both), it seemed
prudent to start a discussion as to whether the existing markup is
effective. Please ponder the following questions:
a) have you used the existing markup for matching purposes?
Yes and no: ScanCode uses an SPDX-inspired markup, but instead of
reusing the markup directly from the main license texts, the markup
is transformed into a simpler {{mustache-like}} syntax and added to
copies of these texts that are used only for detection purposes.

i) if no, why not?
Because:
- adding more markup to a reference license text eventually makes it
unusable as a reference text and harder for humans to read.
- the many variations found in the wild make it hard to fit them all
into a single template.
- the markup syntax eventually implies an implementation based on
regular expressions. ScanCode does not use regexes; it uses inverted
indexes and string alignments.
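
To illustrate the inverted index idea mentioned above, here is a rough
sketch. It is not ScanCode's actual code: the function names and the two
sample rule texts are made up for illustration, and the string alignment
step that would confirm a candidate is left out.

```python
from collections import Counter, defaultdict


def build_index(rules):
    """Map each word to the set of rule names whose text contains it."""
    index = defaultdict(set)
    for name, text in rules.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def candidates(index, query):
    """Rank rule names by how many distinct query words they share."""
    hits = Counter()
    for word in set(query.lower().split()):
        for name in index.get(word, ()):
            hits[name] += 1
    return [name for name, _ in hits.most_common()]


# Hypothetical, heavily shortened rule texts.
rules = {
    "mit": "permission is hereby granted free of charge",
    "apache": "licensed under the apache license version 2.0",
}
index = build_index(rules)
best = candidates(index, "Licensed under the Apache License")
```

The lookup only narrows the search to likely rules; an alignment pass
over the top candidates would still be needed to decide on a match.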

ii) if yes, has it been helpful/effective? Could it be improved, and if so,
how? (this will likely involve putting forward a proposal for review)
I think a simple markup is a very effective way to detect licenses
with minor text variations and still call this an exact match.
It is also a very effective way to indicate variations for humans.
Personally, I find it hard to mix human readability and technical
detection concerns in the same file without compromises.

As food for thought, here are some examples of markup as used in ScanCode:

https://github.com/nexB/scancode-toolkit/blob/b37be4de78152fbd3ed54761627c960010ce26a3/src/licensedcode/data/rules/apache-1.1_38.RULE#L17
https://github.com/nexB/scancode-toolkit/blob/b37be4de78152fbd3ed54761627c960010ce26a3/src/licensedcode/data/rules/bzip2-libbzip-1.0.5_1.RULE#L1

The syntax uses double curly braces to enclose variable parts.
There is no regex involved.
Optionally, a number after the opening braces sets the maximum number
of variable words; the default is 5 words.
For instance {{ Copyright (c) 2015 Myco }} would match up to 5 words
and {{ 10 Copyright (c) 2015 Myco inc.}} would match up to 10 words.
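
The syntax described above could be parsed and matched roughly as
follows. This is a minimal sketch under stated assumptions, not the
ScanCode implementation: the names Gap, parse_rule and matches are
invented here, "up to N words" is read as zero to N words, and matching
is a plain word-by-word walk rather than an index plus alignment.

```python
from dataclasses import dataclass

DEFAULT_GAP = 5  # default number of variable words, as described above


@dataclass
class Gap:
    """A variable part that may absorb up to max_words words."""
    max_words: int


def parse_rule(text):
    """Split a rule into literal words and Gap markers, without regex."""
    parts = []
    rest = text
    while "{{" in rest:
        before, _, rest = rest.partition("{{")
        inside, closing, rest = rest.partition("}}")
        if not closing:
            raise ValueError("unclosed {{ in rule")
        parts.extend(before.split())
        words = inside.split()
        # A leading integer sets the gap size; otherwise use the default.
        # Assumption: sample text that itself starts with a number would
        # be ambiguous under this reading.
        if words and words[0].isdigit():
            parts.append(Gap(int(words[0])))
        else:
            parts.append(Gap(DEFAULT_GAP))
    parts.extend(rest.split())
    return parts


def matches(parts, words):
    """True if words fit parts, each Gap absorbing 0..max_words words."""
    def walk(i, j):
        if i == len(parts):
            return j == len(words)
        part = parts[i]
        if isinstance(part, Gap):
            limit = min(part.max_words, len(words) - j)
            return any(walk(i + 1, j + k) for k in range(limit + 1))
        return j < len(words) and words[j] == part and walk(i + 1, j + 1)
    return walk(0, 0)


rule = parse_rule("{{ Copyright (c) 2015 Myco }} All rights reserved")
found = matches(rule, "Copyright (c) 2016 Acme Corp All rights reserved".split())
```

Here the sample text inside the braces is only documentation for
humans; the matcher ignores it and keeps just the word limit.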

I hope this helps even though this is a slightly different take.
--
Cordially
Philippe Ombredanne
