Validate license cross references: New fields to be added


Smith Tanjong Agbor
 

Hi all,

I am working on a Google Summer of Code project that grew out of this discussion/issue concerning the validation of license cross references. Here is a link to my GSoC proposal.

The focus is on improving the LicenseListPublisher repository so that the generated license data includes fields describing the validity of each crossref, among others.

In order to do this, the structure of the crossref shall change in some cases (e.g., JSON), and in others there shall be additional tags. In general, the following fields shall be added to the crossrefs:

"isValid": true/false,
Indicates whether or not the crossref URL is a valid URL (e.g., not a local file link)

"isWayBackLink": true/false,
Indicates whether or not the URL is a link to a previous version (Wayback Machine) of the site where the license is located

"extraText": true/false,
Indicates whether or not the license at the URL has extra text when compared to the license description in the current file.

"isMatch": true/false,
Indicates whether or not the license at the URL matches (perfectly) the license description in the current file.
This is the url of the license text/description

"isDead": true/false
Indicates whether or not the URL is a dead link (a link that returns a response other than HTTP_200, e.g. bad request HTTP_400, not found HTTP_404, forbidden HTTP_403, etc.)
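
As a rough illustration of the fields listed above, a single crossref entry in the generated JSON might look like the following. This is only a sketch: the "url" key, the example URL and the sample values are assumptions made for illustration, not the final schema.

```python
import json

# Hypothetical crossref entry carrying the proposed boolean fields.
# The "url" key, the URL itself and every value are illustrative assumptions.
crossref = {
    "url": "https://example.org/licenses/example-license.txt",
    "isValid": True,
    "isWayBackLink": False,
    "extraText": False,
    "isMatch": True,
    "isDead": False,
}

# Serialize the entry the way it might appear in a generated JSON file.
print(json.dumps(crossref, indent=2))
```

Formats that cannot nest the fields like this would presumably carry the same information as additional tags.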

Please consider this a proposal; any suggestions and/or modifications will be very much appreciated.

Thanks,
Smith



Brad Edmondson
 

Hi Smith,

Thanks for your well-laid-out email and your GSoC proposal. Thinking about this from the perspective of the LicenseListPublisher repository, I would expect the validity and other status of links to change over time. Links can rot, HTTP 302 redirects can differ from one day to the next, and the license text presented in HTML at a specific URL could be, and sometimes is, altered -- either with or without explicitly versioning the license. I think this necessitates some way of recording or representing validity information as a point in time, at minimum with a lastChecked value (e.g., in UTC). There may be use cases for representing validity over periods of time, for example:
  • Time-series: (in daily checks tagged with UTC): valid-valid-valid-invalid-invalid-valid
  • Last-known-modified: perhaps lastChecked and lastChanged so that one could say "this was checked every week since X date and hasn't changed"
  • Other: other time-related information that tooling providers might want
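
To make the first two options above a little more concrete, here is a minimal sketch of a record that keeps both a time series of checks and lastChecked/lastChanged timestamps. The class name, its attribute names and the sample dates are hypothetical; nothing like this exists in LicenseListPublisher as far as this thread shows.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class LinkStatus:
    """Point-in-time status of one crossref (hypothetical structure)."""
    is_valid: Optional[bool] = None
    last_checked: Optional[datetime] = None  # time of the most recent check
    last_changed: Optional[datetime] = None  # time the result last differed
    history: List[Tuple[datetime, bool]] = field(default_factory=list)

    def record(self, is_valid: bool, when: datetime) -> None:
        """Store one check result and update both timestamps."""
        if self.is_valid is None or is_valid != self.is_valid:
            self.last_changed = when
        self.is_valid = is_valid
        self.last_checked = when
        self.history.append((when, is_valid))


def june(day: int) -> datetime:
    """Helper for the example: a UTC timestamp in June 2020."""
    return datetime(2020, 6, day, tzinfo=timezone.utc)


# Daily checks: valid-valid-valid-invalid-invalid-valid
status = LinkStatus()
for day, result in enumerate([True, True, True, False, False, True], start=1):
    status.record(result, june(day))
```

After the six checks, last_checked and last_changed both point at the sixth day, because the result flipped back to valid on that check.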

Then, I wasn't sure if isValid represented a valid regex-matchable URL (which presumably could be local, or more likely, corporate intranet), or both validly-formed according to regex and accessible from [some place on] the global internet. In theory that might depend on DNS, firewall configurations, or both, which are subject to change or manipulation to e.g. mitigate DDoS, find the physically closest webserver for a CDN, or block specific IPs sending malicious traffic. When it comes down to the "bits on the wire," the server has the option whether and how to respond to a request, and the server can (and occasionally does) make its decision based on these types of connection metadata describing the "from" side of the connection. So in theory it may make sense to include things like the source IP address of the system performing the validation attempt. That raises privacy issues, although if it came from a Linux Foundation system (or something similar), then hiding the validating system's IP address wouldn't necessarily be a requirement. So it may make sense to evaluate these kinds of contextual data points, along with clarifying in the isValid name or definition which validity check you mean it to represent. At minimum, it's worth thinking through these things and how we would deal with the edge cases introduced by relying on DNS and HTTP to perform what is ultimately a connection-based point-in-time check.
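
For the narrower of the two readings (well-formed, with no network access involved), a check could be sketched roughly as below. The function name, the allowed scheme set and the decision to reject localhost, loopback and private addresses are all assumptions for illustration; the sketch deliberately performs no DNS lookup or HTTP request, so it says nothing about reachability.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumed scheme allow-list; which schemes should count is an open question.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_well_formed_public_url(url: str) -> bool:
    """Syntactic check only: no DNS lookup and no network request."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False  # e.g. a malformed IPv6 literal
    if parts.scheme not in ALLOWED_SCHEMES or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a host name rather than an IP literal
    # Reject loopback, private-range and link-local IP literals.
    return not (ip.is_loopback or ip.is_private or ip.is_link_local)
```

A corporate intranet host name would still pass this check, which is exactly the ambiguity that the name or definition of isValid would need to settle.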

Best,
Brad Edmondson

PS: Personally I am not in favor of SPDX tracking the validity of license-text links, but then again I am coming at this as a contributor on the SPDX-legal side of things, and not on the SPDX tech team nor a frequent user of tooling. If the tech team is happy with this idea generally, and with fully owning the process and collected data on the LicenseListPublisher side, then I would have no objection from the legal side. (Also, of course, I only represent my own view and not the official or finalized position of the legal team.)

--
Brad Edmondson, Esq.
brad.edmondson@...


On Wed, Jun 17, 2020 at 6:31 AM Smith Tanjong Agbor <stanjongagbor@...> wrote:



Kaelbling, Michael <michael.kaelbling@...>
 

In the spirit of “any suggestions and/or modifications will be very much appreciated”, I have inserted comments below.

 

From: Spdx-legal@... <Spdx-legal@...> On Behalf Of Smith Tanjong Agbor
Sent: Wednesday, June 17, 2020 12:32
To: Spdx-tech@...; Spdx-legal@...
Cc: Gary O'Neall <gary@...>; swinslow@...
Subject: Validate license cross references: New fields to be added

 

Hi all,

 

I am working on a Google Summer of Code project that grew out of this discussion/issue concerning the validation of license cross references. Here is a link to my GSoC proposal.

The focus is on improving the LicenseListPublisher repository so that the generated license data includes fields describing the validity of each crossref, among others.

In order to do this, the structure of the crossref shall change in some cases (e.g., JSON), and in others there shall be additional tags. In general, the following fields shall be added to the crossrefs:

 

"isValid": true/false,

Indicates whether or not the crossref URL is a valid URL (e.g., not a local file link)

Must a valid URL be based on one of only two/three schemes: http, https, and ftp? Is http://localhost/ or https://127.0.0.1 valid?


"isWayBackLink": true/false,

Indicates whether or not the URL is a link to a previous version (Wayback Machine) of the site where the license is located


"extraText": true/false,

Indicates whether or not the license at the URL has extra text when compared to the license description in the current file.


"isMatch": true/false,

Indicates whether or not the license at the URL matches (perfectly) the license description in the current file.

Rather than true/false perhaps allow the name of the matched algorithm:
  • verbatim
  • noassertion – if no test result is available (for invalid links perhaps)
  • todo – no match attempted
  • “” – no match asserted
  • verbatim2 – matches with \r == \r\n == \n
  • verbatim3 – matches “ignoring whitespace differences” reflowed text
  • verbatim4 – matches ignoring decoration (comments, flower-boxes)
  • template – matches template verbatim (see ppalaga’s comment)
  • et cetera as they become available
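
A minimal sketch of how the first few levels could be computed, trying the strictest comparison first and returning the name of the first one that succeeds. The function and helper names are hypothetical, and verbatim4 and template are left out because the thread does not define them precisely enough to implement.

```python
def unify_newlines(text: str) -> str:
    """Treat \\r\\n and \\r as \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def squash_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def match_level(reference: str, candidate: str) -> str:
    """Return the name of the strictest comparison under which the texts match."""
    if reference == candidate:
        return "verbatim"
    ref, cand = unify_newlines(reference), unify_newlines(candidate)
    if ref == cand:
        return "verbatim2"
    if squash_whitespace(ref) == squash_whitespace(cand):
        return "verbatim3"
    return ""  # no match asserted


print(match_level("Line one\nLine two", "Line one\r\nLine two"))
```

Because the result is a string rather than a boolean, further levels could be added later without changing the shape of the data.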

This is the url of the license text/description


"isDead": true/false

Indicates whether or not the URL is a dead link (a link that returns a response other than HTTP_200, e.g. bad request HTTP_400, not found HTTP_404, forbidden HTTP_403, etc.)

Rather than true/false (since dead sites can be reanimated), how about a date for the most-recent HTTP-200 response? “dateMRHTTP200”: “UTC date”
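
One way this could work, sketched under the assumption that each crossref is a plain dictionary and that the date is stored as an ISO 8601 UTC string (the function name and the storage format are illustrative choices, not something settled here):

```python
from datetime import datetime, timezone


def update_date_mr_http200(entry: dict, status_code: int, checked_at: datetime) -> dict:
    """Return a copy of entry, refreshing dateMRHTTP200 only on an HTTP 200."""
    updated = dict(entry)
    if status_code == 200:
        # checked_at is expected to be timezone-aware; normalize it to UTC.
        updated["dateMRHTTP200"] = checked_at.astimezone(timezone.utc).isoformat()
    return updated


entry = {"url": "https://example.org/licenses/example-license.txt"}
entry = update_date_mr_http200(entry, 200, datetime(2020, 6, 17, 12, 0, tzinfo=timezone.utc))
entry = update_date_mr_http200(entry, 404, datetime(2020, 6, 18, 12, 0, tzinfo=timezone.utc))
```

After the failed check on the second day the stored date is unchanged, so a reader can still see when the link last answered with HTTP 200.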

 

Please consider this as a proposal and any suggestions and/or modifications will be very much appreciated.

 

Thanks,

Smith

 

 


Smith Tanjong Agbor
 

Hi all,

1- I don't think http://localhost/ or https://127.0.0.1 should be valid URLs, so I shall treat those as exceptions in my code.
2- In addition to https, http and ftp, are there any other protocols you would like us to consider for license URLs?

3- I think this:

Rather than true/false perhaps allow the name of the matched algorithm:
  • verbatim
  • noassertion – if no test result is available (for invalid links perhaps)
  • todo – no match attempted
  • “” – no match asserted
  • verbatim2 – matches with \r == \r\n == \n
  • verbatim3 – matches “ignoring whitespace differences” reflowed text
  • verbatim4 – matches ignoring decoration (comments, flower-boxes)
  • template – matches template verbatim (see ppalaga’s comment)
  • et cetera as they become available

shall provide more information than what I suggested previously. It will also enable us to add values without changing the structure of the data.

4- Concerning the date of the most recent HTTP-200 response, we can have two values: the date of the most recent response (HTTP-200 or not) and a true/false flag. I think this will give us a date in every case, as well as whether the link is dead or not.

Concerning Brad's reply:

1- I would suggest storing the dates of events for all fields, except the url.
For instance:
isValid: {val: true/false, date: date_utc},
isDead: {val: true/false, date: date_utc}, etc

2- I would really like to have more input on this. I really do not know whether it is appropriate to take DNS, CDN, private networks, etc. into account when evaluating the validity of a URL. I am more inclined towards using a regex, and not requiring that a link be valid before establishing whether it is dead or not. I think that could help.
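
To illustrate the value-plus-date structure suggested in point 1, here is a small sketch of how such an entry could be built and serialized. The helper name, the "url" key, the example URL and the ISO 8601 date format are assumptions; only the val/date pairing comes from the suggestion itself.

```python
import json
from datetime import datetime, timezone


def timestamped(value: bool, checked_at: datetime) -> dict:
    """Pair a check result with the UTC time at which it was obtained."""
    return {"val": value, "date": checked_at.astimezone(timezone.utc).isoformat()}


# Hypothetical check time and crossref entry.
checked_at = datetime(2020, 6, 18, 9, 30, tzinfo=timezone.utc)
crossref = {
    "url": "https://example.org/licenses/example-license.txt",
    "isValid": timestamped(True, checked_at),
    "isDead": timestamped(False, checked_at),
}

print(json.dumps(crossref, indent=2))
```

The url stays a plain string, as suggested, while every checked field carries its own date.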

Any more comments/suggestions are welcome.

Thanks, 
Smith

On Wed, Jun 17, 2020 at 21:14, Kaelbling, Michael <michael.kaelbling@...> wrote:
