[spdx-tech] Proposed topic for this week's tech call: Extend license expressions to include OR-MAYBE


W. Trevor King
 

On Mon, Nov 27, 2017 at 08:49:08PM +0000, Wheeler, David A wrote:
gary@...:
- Do we agree the "OR-MAYBE" should be added?
I agree…
Philippe's recent points about weighted confidence (e.g. [1]) suggest
that, even if we decide to support incomplete conclusions, an
unweighted list of alternatives may not be sufficient. In that case,
we may want something like:

binary-confidence-expression-operator = "AND"
confidence-expression = license-expression space "CONFIDENCE" space "0." 1*DIGIT
confidence-list = confidence-expression *(space confidence-expression) [space license-expression]
/ confidence-list space binary-confidence-expression-operator space confidence-list
/ license-expression

where license-expression, space, and DIGIT are discussed in [2]. The
confidence weights would have to sum to something ≤ 1. ‘AND’
would have the same conjunctive semantics as the current
license-expression operator, but we don't want to support disjunctive
OR for confidence lists.

The ‘[space license-expression]’ (optional trailing
license-expression) has an implicit ‘CONFIDENCE {1 -
sum_of_previous_confidences}’, for folks who don't trust their math or
want to save a few characters.

The ‘/ license-expression’ case has an implicit ‘CONFIDENCE 1’ for
backwards compatibility with existing license-expression consumers who
choose to upgrade to confidence-list.

Then folks consuming confidence-list could use:

GPL-2.0-only CONFIDENCE 0.95 GPL-2.0-or-later

For “I am 95% sure this is GPL-2.0-only but it could be
GPL-2.0-or-later” with the implicit 5% confidence for
GPL-2.0-or-later.
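
As a rough illustration of the rules above, here is a minimal Python sketch of how a consumer might read a flat confidence-list. It is not part of any SPDX tooling: the function name parse_confidence_list is hypothetical, it treats every run of tokens before a CONFIDENCE keyword as an opaque license-expression, and it does not handle the ‘AND’ form that joins two confidence-lists. Decimal is used so that the implicit remainder comes out exactly.

```python
import re
from decimal import Decimal

# Weight syntax from the proposed grammar: "0." 1*DIGIT
WEIGHT = re.compile(r"0\.[0-9]+")


def parse_confidence_list(text):
    """Return (license-expression, weight) pairs for a flat confidence-list.

    Hypothetical helper: license-expressions are kept as opaque strings,
    and the AND-of-confidence-lists form is not supported.
    """
    tokens = text.split()
    pairs = []
    current = []  # tokens of the license-expression being collected
    i = 0
    while i < len(tokens):
        if tokens[i] == "CONFIDENCE":
            if not current or i + 1 >= len(tokens):
                raise ValueError("CONFIDENCE needs an expression before it and a weight after it")
            weight = tokens[i + 1]
            if not WEIGHT.fullmatch(weight):
                raise ValueError(f"bad confidence weight: {weight!r}")
            pairs.append((" ".join(current), Decimal(weight)))
            current = []
            i += 2
        else:
            current.append(tokens[i])
            i += 1

    total = sum((w for _, w in pairs), Decimal(0))
    if total > 1:
        raise ValueError("confidence weights sum to more than 1")
    if current:
        # Trailing license-expression: implicit CONFIDENCE 1 - sum of the others.
        # A bare license-expression therefore gets an implicit CONFIDENCE 1.
        pairs.append((" ".join(current), Decimal(1) - total))
    return pairs


print(parse_confidence_list("GPL-2.0-only CONFIDENCE 0.95 GPL-2.0-or-later"))
```

Under these assumptions a bare license-expression comes back with weight 1, and a list with no trailing expression simply leaves the remaining weight unassigned.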

- Should we disallow "OR-MAYBE" in declared license fields (it
would only be used in concluded license fields)?
No. Projects sometimes get inherited from others where the license
isn't clear to start with, so it needs to be *possible* to declare
ambiguities. Of course, a *declaration* using "OR MAYBE" should be
concerning, but that helps potential users know where to dig in.
Keeping a separate ABNF rule for license-expression allows consumers
to choose between license-expression and confidence-list as they see
fit. But yeah, the “inherited project” case is a good reason for
allowing confidence-list (or whatever we use for partial conclusions)
in declared-license fields.

- What is the exact definition of the "OR-MAYBE" we would include
in the spec?
For "OR MAYBE", in the definition of compound-expression, change:
compound-expression "OR" compound-expression ) /
to:
compound-expression "OR" ["MAYBE"] compound-expression ) /

If you want a MAYBE prefix to be allowed anywhere, you could change:
compound-expression = 1*1(simple-expression /
to:
compound-expression = ["MAYBE"] 1*1(simple-expression /

The latter allows MAYBE as a prefix in general, in case you have no
confidence in *anything*.
The CONFIDENCE approach allows you to handle that case with:

GPL-2.0-only CONFIDENCE 0.90

for “I'm 90% sure this is GPL-2.0-only, and am not expressing an
opinion on the 10% alternatives”. Using an OR-MAYBE like:

binary-alternatives-operator = "AND"
alternatives = license-expression *(space "OR-MAYBE" space license-expression)
/ alternatives space binary-alternatives-operator space alternatives

would not support weighting. But with [3], you could represent that
case with:

GPL-2.0-only OR-MAYBE NOASSERTION

So I don't see an upside to a separate MAYBE. It might work with
clear precedence rules, but without them:

APACHE-2.0 OR GPL-2.0-only OR MAYBE GPL-2.0-or-later

could mean ‘APACHE-2.0 OR GPL-2.0-only OR (MAYBE GPL-2.0-or-later)’:

A disjunctive choice between ‘APACHE-2.0’, ‘GPL-2.0-only’, and
something that I haven't been able to figure out yet but which might
be ‘GPL-2.0-or-later’.

or it could mean ‘(APACHE-2.0 OR GPL-2.0-only) OR MAYBE GPL-2.0-or-later’:

This might be ‘APACHE-2.0 OR GPL-2.0-only’, but I'm not sure. It
might also be ‘GPL-2.0-or-later’. I haven't been able to figure out
which yet.

depending on whether MAYBE had a higher precedence than OR or not.

With the former interpretation, you're safe if you want to use the
code under APACHE-2.0 or if you want to use it under GPL-2.0-only.
With the latter interpretation, you're only safe if you want to use
the code under GPL-2.0-only (since that's also a subset of
GPL-2.0-or-later).
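
One way to make the difference between the two readings concrete is a small Python sketch. The model is an assumption made for illustration only: each reading is a list of possible "worlds", each world is the set of licenses you could use the code under if that world is the true one, and GPL-2.0-or-later is simplified to the two licenses in the hypothetical GPL2_OR_LATER set. A choice is "safe" only if it is available in every world.

```python
# Simplification (assumption): treat GPL-2.0-or-later as offering these two.
GPL2_OR_LATER = {"GPL-2.0-only", "GPL-3.0-only"}


def safe_choices(worlds):
    """Licenses available no matter which world turns out to be true."""
    return set.intersection(*worlds)


# APACHE-2.0 OR GPL-2.0-only OR (MAYBE GPL-2.0-or-later)
former = [
    {"APACHE-2.0", "GPL-2.0-only"} | GPL2_OR_LATER,  # the MAYBE branch is GPL-2.0-or-later
    {"APACHE-2.0", "GPL-2.0-only"},                  # the MAYBE branch is something unknown
]

# (APACHE-2.0 OR GPL-2.0-only) OR MAYBE GPL-2.0-or-later
latter = [
    {"APACHE-2.0", "GPL-2.0-only"},  # the whole thing is the disjunction
    set(GPL2_OR_LATER),              # the whole thing is GPL-2.0-or-later
]

print(sorted(safe_choices(former)))
print(sorted(safe_choices(latter)))
```

With this model the former reading leaves both APACHE-2.0 and GPL-2.0-only safe, while the latter leaves only GPL-2.0-only.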

Even with OR-MAYBE, precedence for AND is going to be complicated (and
will decide whether a given AND is acting as a license expression
operator or an alternative operator). But using a hyphenated OR-MAYBE
at least avoids that confusion for OR.

Comparing OR-MAYBE with CONFIDENCE, the only actionable use I can
think of for weighting is a vendor with a report of confidence lists
for various components of their software. They might decide to
prioritize digging into the component with the least-confident
assertion. But they might also want to prioritize based on lines of
code under the unclear license, or on the importance of the particular
lines. For example, say you have a product with:

10k lines of core code under ‘GPL-3.0-only’
1k lines of core code under ‘GPL-2.0-or-later CONFIDENCE 0.9 GPL-2.0-only’
100 lines of build script under ‘MIT CONFIDENCE 0.5 NONE’
10 lines of build script under ‘MIT CONFIDENCE 0.1 NONE’

where NONE is [4]. What would the project be?

GPL-3.0-only AND
(GPL-2.0-or-later CONFIDENCE 0.9 GPL-2.0-only) AND
(MIT CONFIDENCE 0.5 NONE) AND
(MIT CONFIDENCE 0.1 NONE)

or would it be:

GPL-3.0-only AND
(GPL-2.0-or-later CONFIDENCE 0.9 GPL-2.0-only) AND
(MIT CONFIDENCE 0.4636 NONE)

using line-count weights (or similar) to combine the two ‘MIT OR-MAYBE
NONE’ cases?
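
For reference, the 0.4636 figure is what a plain line-count-weighted average of the two build-script confidences gives. A minimal Python sketch of that arithmetic follows; the function name line_weighted_confidence is hypothetical, and the weighting scheme is just one of the "or similar" options.

```python
from fractions import Fraction

# (lines of code, confidence that the lines are MIT) for the two build scripts
build_scripts = [(100, Fraction(1, 2)), (10, Fraction(1, 10))]


def line_weighted_confidence(items):
    """Average the confidences, weighting each by its line count."""
    total_lines = sum(lines for lines, _ in items)
    return sum(lines * conf for lines, conf in items) / total_lines


combined = line_weighted_confidence(build_scripts)
print(combined, round(float(combined), 4))
```

That is (100 × 0.5 + 10 × 0.1) / 110 = 51/110, or about 0.4636.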

Either way, that's probably going to focus people on the build script
(“reasonable chance that this is not open code at all!”), but they may
instead want to focus on the core code (“we think copy/pasting 110
lines could be fair use, but we don't want to waste time on those 1k
lines of possibly GPL-2.0-only code if we can't link them with the 10k
GPL-3.0-only code”). And we don't weight AND, so it's not clear to me
how actionable CONFIDENCE values would be for product-level
composites. Still, scancode-toolkit [1,5] and licensee [6] both
decided to set it, so I don't want to drop it without understanding
how it's used. My impression based on [7,8] is that both of these are
tunables for the tool-user, and that the tool-authors don't expect
them to be passed up the chain to folks reading compound confidence
lists, but it's worth getting more feedback from the tool authors on
that.
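
To show how much the choice of prioritization matters for the example product, here is a small Python sketch that ranks the four components two ways. The "uncertain lines" score, lines × (1 − confidence), is an assumption made for illustration and not something either tool defines; the component labels are likewise made up.

```python
# (label, lines of code, confidence in the primary license) from the example product
components = [
    ("core, GPL-3.0-only", 10_000, 1.0),
    ("core, GPL-2.0-or-later?", 1_000, 0.9),
    ("build script, MIT at 0.5", 100, 0.5),
    ("build script, MIT at 0.1", 10, 0.1),
]

# Ranking 1: least-confident assertion first.
by_confidence = sorted(components, key=lambda c: c[2])

# Ranking 2 (hypothetical score): most "uncertain lines" first.
by_uncertain_lines = sorted(
    components, key=lambda c: c[1] * (1 - c[2]), reverse=True
)

print([label for label, _, _ in by_confidence])
print([label for label, _, _ in by_uncertain_lines])
```

The first ranking puts the 10-line build script on top; the second puts the 1k lines of core code on top, which is the kind of disagreement described above.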

And I'm also fine with leaving a partial-conclusion syntax out of the
spec, and punting it to higher levels and third parties.

[1]: https://lists.spdx.org/pipermail/spdx-legal/2017-November/002351.html
Subject: Re: update on only/or later etc.
Date: 2017-11-22
Message-ID: <CAOFm3uFFfitvk-wK_TO3ZqWpGR6VD+R-26HrucnQ8MNbzx2Bag@...>
[2]: https://github.com/wking/spdx-spec/blob/922031a89e7f7dca19f20d17005d0f3feeb95af5/chapters/appendix-IV-SPDX-license-expressions.md#IV.2
https://github.com/spdx/spdx-spec/pull/37
[3]: https://github.com/spdx/spdx-spec/issues/50
Subject: Add “NOASSERTION” to the license expression syntax
[4]: https://github.com/spdx/spdx-spec/issues/49
Subject: Add “NONE” to the license expression syntax
[5]: https://github.com/nexB/scancode-toolkit/blame/v2.2.1/src/licensedcode/README.rst#L140-L141
[6]: https://github.com/benbalter/licensee/blob/v9.6.0/docs/usage.md#command-line-usage
[7]: https://github.com/nexB/scancode-toolkit/issues/342
Subject: Bare CPOL license detection rule detection issue
[8]: https://github.com/benbalter/licensee/pull/212
Subject: Fix for FCPL false positive

--
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy


Gary O'Neall
 

-----Original Message-----
From: W. Trevor King [mailto:wking@...]
Sent: Monday, November 27, 2017 4:18 PM
To: Wheeler, David A <dwheeler@...>
Cc: Gary O'Neall <gary@...>; spdx-tech@...; spdx-legal@...
Subject: Re: [spdx-tech] Proposed topic for this week's tech call: Extend license expressions to include OR-MAYBE

On Mon, Nov 27, 2017 at 08:49:08PM +0000, Wheeler, David A wrote:
gary@...:
- Do we agree the "OR-MAYBE" should be added?
I agree…
Philippe's recent points about weighted confidence (e.g. [1]) suggest that, even
if we decide to support incomplete conclusions, an unweighted list of
alternatives may not be sufficient. In that case, we may want something like:

binary-confidence-expression-operator = "AND"
confidence-expression = license-expression space "CONFIDENCE" space "0." 1*DIGIT
confidence-list = confidence-expression *(space confidence-expression) [space license-expression]
/ confidence-list space binary-confidence-expression-operator space confidence-list
/ license-expression
[G.O.] My preference is for the "OR-MAYBE" approach just due to the simplicity. In the audit use case, it is difficult to assign a confidence that has any precision. The weighting would work for a tool where there is some algorithm that results in a weighting or confidence measure.


W. Trevor King
 

On Mon, Nov 27, 2017 at 10:17:22PM -0800, Gary O'Neall wrote:
binary-confidence-expression-operator = "AND"
confidence-expression = license-expression space "CONFIDENCE" space "0." 1*DIGIT
confidence-list = confidence-expression *(space confidence-expression) [space license-expression]
/ confidence-list space binary-confidence-expression-operator space confidence-list
/ license-expression
[G.O.] My preference is for the "OR-MAYBE" approach just due to the
simplicity. In the audit use case, it is difficult to assign a
confidence that has any precision. The weighting would work for a
tool where there is some algorithm that results in a weighting or
confidence measure.
I agree that getting consistent confidence numbers is going to be
hard, and that without that (and maybe even with that), confidence
weights may not be very useful. But with two license tools returning
confidence-weighted alternatives, I want to make sure we understand
their intended use cases before we commit to backwards-compat for a
binary OR-MAYBE.

Cheers,
Trevor
