Re: Joint Call: Tuesday, Oct 25th w/Tech Team
I think there has been a misunderstanding.
The “encoding” item on the agenda simply means that there is a proposal to standardize on UTF-8 as the encoding of the files in which the XML versions of the licenses (in the SPDX master license repo) are stored.
As to what you should be looking for in order to extract copyright notices, the list is longer than what you include. For example, when reading an HTML file, the copyright symbol might be encoded as the characters “&#169;” or “&#xA9;” (besides the “&copy;” that you have). And strings in C or Python code might use “"\u00A9"” or “u"\u00A9"”, although these are probably not a copyright notice for the file itself.
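A minimal sketch of one way to cope with the HTML case, assuming the file has already been read as text: Python's standard `html.unescape` resolves named and numeric character references alike, so all of the spellings collapse to a single U+00A9. The sample strings are invented for illustration.

```python
import html

# Hypothetical HTML fragments; each spells the copyright sign a different way.
samples = [
    "&copy; 2016 Example Corp",   # named character reference
    "&#169; 2016 Example Corp",   # decimal numeric reference
    "&#xA9; 2016 Example Corp",   # hexadecimal numeric reference
]

# html.unescape turns all three into the same U+00A9 character.
decoded = [html.unescape(s) for s in samples]
for text in decoded:
    print(repr(text))
```

This only helps for markup; an escape such as "\u00A9" inside a C or Python string literal is left untouched, which is probably what you want.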
-- zvr –
From: spdx-legal-bounces@... [mailto:spdx-legal-bounces@...] On Behalf Of Mark D. Baushke
Sent: Friday, 21 October, 2016 18:16
To: J Lovejoy <opensource@...>
Cc: SPDX-legal <spdx-legal@...>
Subject: Re: Joint Call: Tuesday, Oct 25th w/Tech Team
Hi Jilayne & Paul,
- Encoding (propose UTF-8)
I have no problem with this. I do think that some folks may not completely understand the implications.
I would like to see a table of all of the representations of various copyright signs that we need to consider when we extract from a file.
To date I have observed the following:
(c) - 0x28 0x63 0x29
(U+0028 U+0063 U+0029)
(C) - 0x28 0x43 0x29
(U+0028 U+0043 U+0029)
© - 0xc2 0xa9 (U+00A9) - 'COPYRIGHT SIGN'
Ⓒ - U+24B8 'circled latin capital letter c'
&copy; - 0x26 0x63 0x6f 0x70 0x79 0x3b
(U+0026 U+0063 U+006f U+0070 U+0079 U+003b)
Although I have only seen the graphic for the 'SOUND RECORDING COPYRIGHT' on labels, I thought it may also be worth mentioning:
(P) - 0x28 0x50 0x29 (U+0028 U+0050 U+0029)
℗ - 0xe2 0x84 0x97 (U+2117) 'SOUND RECORDING COPYRIGHT'
Ⓟ - 0xe2 0x93 0x85 (U+24C5) 'circled latin capital letter p'
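As a rough sketch only, the representations listed so far could be collected into one regular expression that is run over text after it has been decoded to Unicode. The names `COPYRIGHT_MARKERS` and `find_markers` are made up for this example and are not part of any SPDX tool.

```python
import re

# Markers taken from the list above; matching is done on decoded text.
COPYRIGHT_MARKERS = re.compile(
    r"\([cC]\)"    # (c) and (C)
    r"|\u00a9"     # COPYRIGHT SIGN
    r"|\u24b8"     # CIRCLED LATIN CAPITAL LETTER C
    r"|&copy;"     # HTML named reference that was left undecoded
    r"|\(P\)"      # (P)
    r"|\u2117"     # SOUND RECORDING COPYRIGHT
    r"|\u24c5"     # CIRCLED LATIN CAPITAL LETTER P
)

def find_markers(text: str) -> list[str]:
    """Return every copyright-style marker found in text, in order."""
    return COPYRIGHT_MARKERS.findall(text)

print(find_markers("Copyright (C) 2016 Foo, \u00a9 2016 Bar, (P) 2016 Baz"))
```

A pattern like this will also hit "(c)" used as a list label, so a real extractor would need to look at the surrounding words as well.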
Note that I have also seen a bare 0xa9 in a file without the preceding 0xc2 byte. Technically that is not a valid UTF-8 file representation. So, we may need to also consider how to handle those kinds of situations.
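One possible way to handle that situation, sketched under the assumption that a file which fails UTF-8 decoding is really ISO-8859-1 (where 0xa9 is the copyright sign). That assumption will be wrong for some files, and the helper name `decode_lenient` is invented for the example.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: a stray high byte such as 0xa9 means the file is
        # ISO-8859-1, in which every byte maps to the same code point.
        return raw.decode("latin-1")

print(repr(decode_lenient(b"\xc2\xa9 2016 Example")))  # valid UTF-8
print(repr(decode_lenient(b"\xa9 2016 Example")))      # bare 0xa9
```

Both inputs come out with U+00A9 in front, so later matching only has to deal with one form.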
There are other interesting multiple representations in licenses such as:
- ''as is'' (uses U+0027) and
- "as is" (uses quotation mark U+0022) and
- “as is” and
- <U+201C>as is<U+201D>
- <U+201F>as is<U+201F>
There may be a few others as well.
I guess the point I am trying to make is that it may be desirable to transcode some UTF-8 text into a canonical, recommended form when doing things like license extraction.
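To make the idea concrete, here is a small sketch of such a transcoding step for the "as is" variants above. Using U+0022 as the canonical quote is purely an assumption for the example, and `QUOTE_MAP` and `canonicalize` are hypothetical names.

```python
# Map the double-quote variants listed above onto U+0022 (assumed canonical form).
QUOTE_MAP = str.maketrans({
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u201f": '"',   # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
})

def canonicalize(text: str) -> str:
    """Rewrite the quote variants so that they all compare equal."""
    text = text.translate(QUOTE_MAP)
    # Two U+0027 apostrophes in a row are treated as one double quote.
    return text.replace("''", '"')

for variant in ["''as is''", '"as is"', "\u201cas is\u201d", "\u201fas is\u201f"]:
    print(canonicalize(variant))
```

Rewriting '' is only safe for license text; in source code it would mangle empty string literals.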
Mark D. Baushke