Re: Validate license cross references: New fields to be added


Kaelbling, Michael
 

In the spirit of “any suggestions and/or modifications will be very much appreciated”, I have inserted comments below.

 

From: Spdx-legal@... <Spdx-legal@...> On Behalf Of Smith Tanjong Agbor
Sent: Wednesday, June 17, 2020 12:32
To: Spdx-tech@...; Spdx-legal@...
Cc: Gary O'Neall <gary@...>; swinslow@...
Subject: Validate license cross references: New fields to be added

 

Hi all,

 

I am working on a Google Summer of Code project that grew out of this discussion/issue concerning the validation of license cross references. Here is a link to my GSoC proposal.

 

The focus is on improving the LicenseListPublisher repository so that the generated license data carries fields describing, among other things, the validity of each crossref.

 

In order to do this, the structure of the crossref shall change in some cases (e.g., JSON), and in others there shall be additional tags. In general, the following fields shall be added to the crossrefs:

 

"isValid": true/false,

Indicates whether or not the crossref URL is a valid URL (e.g., not a local file link)

Must a valid URL be based on one of only two/three schemes: http, https, and ftp? Is http://localhost/ or https://127.0.0.1 valid?
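One possible reading of the question above can be sketched in Python. This is only an illustration under stated assumptions: the allowed schemes (http, https, ftp) and the decision to reject localhost, loopback, and private addresses are choices the proposal has not yet settled, and the function name `is_valid_crossref_url` is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: only these schemes count as "valid" for a crossref.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_valid_crossref_url(url: str) -> bool:
    """Return True if url looks like a public, non-local crossref URL."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # lowercased by urlsplit, None if absent
    except ValueError:
        # Raised for malformed input such as an unclosed IPv6 bracket.
        return False
    if parts.scheme not in ALLOWED_SCHEMES or not host:
        return False
    # Assumption: localhost names are treated as invalid.
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP literal, so treat it as a DNS name
    # Assumption: loopback, private, and link-local addresses are invalid.
    return not (addr.is_loopback or addr.is_private or addr.is_link_local)
```

Under these assumptions, http://localhost/ and https://127.0.0.1 would both be reported as invalid; relaxing that is a one-line change.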


"isWayBackLink": true/false,

Indicates whether or not the URL is a link to a previous version (Wayback Machine) of the site where the license is located
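A minimal sketch of how this flag might be computed, assuming a Wayback Machine link is recognised purely by its host (web.archive.org) and its /web/ path prefix. The function name `is_wayback_link` is hypothetical, and other archive services are not considered.

```python
from urllib.parse import urlsplit

# Assumption: Wayback Machine snapshots are served from this host.
WAYBACK_HOST = "web.archive.org"


def is_wayback_link(url: str) -> bool:
    """Return True if url looks like a Wayback Machine snapshot link."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    host = parts.hostname or ""
    # Snapshot URLs have the form /web/<timestamp>/<original url>.
    return host == WAYBACK_HOST and parts.path.startswith("/web/")
```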


"extraText": true/false,

Indicates whether or not the license from the url has extra text in its description when compared to the license description in the current file.


"isMatch": true/false,

Indicates whether or not the license from the URL matches (perfectly) the license description in the current file.

Rather than true/false perhaps allow the name of the matched algorithm:
verbatim
noassertion – if no test result is available (for invalid links perhaps)
todo – no match attempted
“” – no match asserted
verbatim2 – matches with \r == \r\n == \n
verbatim3 – matches “ignoring whitespace differences” reflowed text
verbatim4 – matches ignoring decoration (comments, flower-boxes)
template – matches template verbatim (see ppalaga’s comment)
et cetera as they become available
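The first few levels in the list above could be sketched as follows. This is an illustration only: it covers verbatim, verbatim2, and verbatim3, returns the empty string when no match is asserted, and leaves verbatim4 and template out because their rules are not specified here. The function names are hypothetical.

```python
def normalize_newlines(text: str) -> str:
    """Treat \\r\\n, \\r, and \\n as the same line ending."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def collapse_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def match_level(reference: str, fetched: str) -> str:
    """Return the name of the strictest level at which the texts match."""
    if reference == fetched:
        return "verbatim"
    if normalize_newlines(reference) == normalize_newlines(fetched):
        return "verbatim2"
    if collapse_whitespace(reference) == collapse_whitespace(fetched):
        return "verbatim3"
    return ""  # no match asserted
```

Checking the strictest level first means the reported name always describes the closest match found, which keeps the field more informative than a bare true/false.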

This is the url of the license text/description


"isDead": true/false

Indicates whether or not the URL is a dead link (a link that returns a status other than HTTP 200, such as 400 Bad Request, 404 Not Found, or 403 Forbidden)

Rather than true/false (since dead sites can be reanimated), how about a date for the most-recent HTTP-200 response? “dateMRHTTP200”: “UTC date”
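The date-based alternative might look like the sketch below. It assumes the crossref is held as a plain dict and that the date is stored as an ISO 8601 UTC string; only the field name dateMRHTTP200 comes from the suggestion above, while the function name `update_last_ok` and the timestamp format are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def update_last_ok(record: dict, status_code: int,
                   checked_at: Optional[datetime] = None) -> dict:
    """Return a copy of record with dateMRHTTP200 refreshed on HTTP 200.

    checked_at should be timezone-aware; it defaults to the current UTC time.
    """
    checked_at = checked_at or datetime.now(timezone.utc)
    updated = dict(record)
    if status_code == 200:
        # Assumption: store the date as an ISO 8601 UTC timestamp.
        updated["dateMRHTTP200"] = (
            checked_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        )
    # Any other status leaves the previous date untouched, so a site that
    # goes down and later comes back keeps a meaningful history.
    return updated
```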

 

Please consider this a proposal; any suggestions and/or modifications will be very much appreciated.

 

Thanks,

Smith

 

 
