Re: Joint Call: Tuesday, Oct 25th w/Tech Team

Mark D. Baushke <mdb@...>

Hi Alexios,

Zavras, Alexios <alexios.zavras@...> writes:

I think there has been a misunderstanding.
Yes, that is very likely. I regret that I seem to be having trouble
understanding the topic. I will endeavor to make my point with more

The "encoding" item on the agenda simply means that there is a
proposal to standardize on UTF-8 for the file format in which the XML
version of the licenses (in the SPDX master license repo) are stored.
Yes. My question seems to have been unclear. I regret this.

The difficulty is in the word standardize. UTF-8 allows for many
possible expressions of the same token. In particular, the text
expected in a standard license in XML will contain a number of
different characters which have multiple representations.

One meaning of the term standardize would be to come up with a single
canonical representation for the template.

Will this meeting take up which of those many representations should be
used as the canonical representation in the SPDX XML master license

Items we see in a copyright and license file may include multiple
representations of:

Double Quote, Single quote, Copyright Sign, Registered Sign, Trade
Mark Sign, etc.

Will there be an SPDX specification of what to put into the template
even if it may also be necessary to look for the alternatives when doing
an extraction? Or, will there be an SPDX XML token that specifies the
class of representations that may be present?

fwiw: I would also hope that a full set of DTDs will be generated for
the SPDX dialect of XML.

As to what you should be looking for, in order to extract copyright
notices, the list is longer than what you include. For example, when
reading an HTML file, the copyright symbol might be encoded as the
characters "&#169;" or "&#xa9;" (besides the "&copy;" that you have).
And strings in C or Python code might use "\u00A9" or u"\u00A9",
although these are probably not a copyright notice for the file
True. However, looking at the XML prototype license, what canonical
form should be used to represent all of the other possible forms?
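To illustrate the point about multiple forms, here is a minimal Python sketch showing that the HTML encodings mentioned above all decode to the same code point, U+00A9. The variable names are mine and purely illustrative; nothing here is part of any SPDX tooling.

```python
import html

# HTML encodings of the copyright sign mentioned in this thread.
html_variants = ["&#169;", "&#xa9;", "&copy;"]

# html.unescape() resolves decimal, hexadecimal and named character references.
decoded = {html.unescape(v) for v in html_variants}

# The escape used in C or Python string literals names the same code point.
decoded.add("\u00A9")

# Every variant collapses to one character, so the set has a single member.
print(decoded)
print(["U+%04X" % ord(ch) for ch in decoded])
```

An extraction tool has to recognize each spelling, but the template only needs to name one of them (or a token standing for the whole class).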

My original question was not clear.

I am asking if we are going to see something like <copyright-sign/> as
the SPDX XML template to represent any of the various encodings that
could exist?

For example, in MIT.xml should I see

<p>Copyright (c) &lt;year&gt; &lt;copyright holder&gt;</p>

or

<p>Copyright <copyright-sign/> <year-range/> <copyright-holder/></p>

so that each element could be used as a processing token for pattern
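A rough sketch of what I mean by using the elements as processing tokens, in Python. The element names are the hypothetical ones from my example above, and the regular expressions attached to them are my own assumptions for illustration, not anything SPDX has specified.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical token elements mapped to the class of text each may stand for.
# These patterns are illustrative assumptions, not an SPDX definition.
TOKEN_PATTERNS = {
    "copyright-sign": r"(?:\(c\)|\u00a9|&copy;|&#169;|&#xa9;)",
    "year-range": r"\d{4}(?:\s*-\s*\d{4})?",
    "copyright-holder": r".+",
}

def template_to_regex(xml_fragment):
    """Turn one template paragraph into a compiled regular expression."""
    root = ET.fromstring(xml_fragment)
    # Literal text is escaped; each empty element becomes its token pattern.
    parts = [re.escape(root.text or "")]
    for child in root:
        parts.append(TOKEN_PATTERNS[child.tag])
        parts.append(re.escape(child.tail or ""))
    return re.compile("".join(parts), re.IGNORECASE)

pattern = template_to_regex(
    "<p>Copyright <copyright-sign/> <year-range/> <copyright-holder/></p>"
)

for line in (
    "Copyright (c) 2016 Example Holder",
    "Copyright \u00a9 2010-2016 Example Holder",
    "Copyright &#169; 2016 Example Holder",
):
    print(line, "->", bool(pattern.fullmatch(line)))
```

With the first form of the template, the matcher would have to guess that "(c)" is meant to stand for every other spelling of the sign; with the second form, that knowledge lives in one place.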

Also, in that file we have the text

(the "Software")

which uses U+0022 for the double quote. I have seen some documents that
are using the multibyte 'LEFT DOUBLE QUOTATION MARK' (U+201C) Software
'RIGHT DOUBLE QUOTATION MARK' (U+201D). What canonical representation
will be used in the XML templates? My personal preference is U+201D.
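For concreteness, a small Python sketch of the difference between the two spellings and of folding one into the other before comparison. The direction of the fold (toward U+0022) is chosen here only for illustration; which form should be canonical is exactly the open question, and the names QUOTE_FOLD and fold_quotes are made up for this example.

```python
# Map the curly double quotes to the ASCII double quote.
# The target character is an assumption made for this illustration only.
QUOTE_FOLD = str.maketrans({"\u201c": '"', "\u201d": '"'})

def fold_quotes(text):
    """Replace U+201C and U+201D with U+0022 so both spellings compare equal."""
    return text.translate(QUOTE_FOLD)

curly = "(the \u201cSoftware\u201d)"
plain = '(the "Software")'

# In UTF-8 the curly quotes take three bytes each; U+0022 takes one.
print("\u201c".encode("utf-8"), '"'.encode("utf-8"))

# The raw strings differ, but they match once the quotes are folded.
print(curly == plain, fold_quotes(curly) == plain)
```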

I hope this helps with the understanding of my question as it relates to
UTF-8 selection for XML templates.

Please pardon the length of this message; I only endeavor to make my
question more clear.
Mark D. Baushke
