Joint Call: Tuesday, Oct 25th w/Tech Team


J Lovejoy
 

We will have a joint call with tech team, joining their regular call time on Tuesday, Oct 25th @ 18:00 GMT (10:00AM PT, 11:00 MT, 12:00PM CT, 1:00PM ET).  Please mark your calendars.

Dial-in (same as we use): http://uberconference.com/SPDXTeam or  Call: +1-857-216-2871
PIN # 38633

Agenda:

Close on the terms and discuss any next steps related to the following items:
 
-          Encoding (propose UTF-8)
-          The high level element name
-          Paragraph tag or p or some other term
-          Use of the <br> tags
 
All of the proposals except encoding are on the Google docs page:


Thanks,
Jilayne & Paul
SPDX Legal Team co-leads



Brad Edmondson
 

Works for me; thanks Jilayne and Gary.

Best,
Brad

--
Brad Edmondson, Esq.
512-673-8782 | brad.edmondson@...

On Fri, Oct 21, 2016 at 12:29 AM, J Lovejoy <opensource@...> wrote:
_______________________________________________
Spdx-legal mailing list
Spdx-legal@...
https://lists.spdx.org/mailman/listinfo/spdx-legal



Mark D. Baushke <mdb@...>
 

Hi Jilayne & Paul,

- Encoding (propose UTF-8)

I have no problem with this. I do think that some folks may not
completely understand the implications.

I would like to see a table of all of the representations of various
copyright signs that we need to consider when we extract from a file.

To date I have observed the following:

(c)    - 0x28 0x63 0x29 (U+0028 U+0063 U+0029)
(C)    - 0x28 0x43 0x29 (U+0028 U+0043 U+0029)
©      - 0xc2 0xa9 (U+00A9) 'COPYRIGHT SIGN'
Ⓒ      - U+24B8 'circled latin capital letter c'
&copy; - 0x26 0x63 0x6f 0x70 0x79 0x3b
         (U+0026 U+0063 U+006f U+0070 U+0079 U+003b)

Although I have only seen the graphic for the 'SOUND RECORDING
COPYRIGHT' on labels, I thought it may also be worth mentioning:

(P) - 0x28 0x50 0x29 (U+0028 U+0050 U+0029)
℗   - 0xe2 0x84 0x97 (U+2117) 'SOUND RECORDING COPYRIGHT'
Ⓟ   - 0xe2 0x93 0x85 (U+24C5) 'circled latin capital letter p'

Note that I have also seen a bare 0xa9 in a file without the preceding
0xc2 byte. Technically that is not a valid UTF-8 file representation. So,
we may need to also consider how to handle those kinds of situations.
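
One possible way to handle the bare 0xa9 case is sketched below in Python. It is only an illustration: the function name is hypothetical, and the fallback to Latin-1 (where 0xa9 is the copyright sign) is an assumption about what such files usually contain, not something agreed in this thread.

```python
from typing import Tuple


def decode_license_bytes(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for the raw contents of a file."""
    try:
        # Strict UTF-8 decoding rejects a lone 0xa9 that is not preceded by 0xc2.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: input that is not valid UTF-8 is treated as Latin-1,
        # in which the single byte 0xa9 is the copyright sign.
        return raw.decode("latin-1"), "latin-1"
```

Reporting which encoding was used lets a caller flag files that needed the fallback instead of silently accepting them.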

There are other interesting multiple representations in licenses such as:

- ''as is'' (uses U+0027) and
- "as is" (uses quotation mark U+0022) and
- &ldquo;as is&rdquo; and
- <U+201C>as is<U+201D>
- <U+201F>as is<U+201F>

There may be a few others as well.

I guess the point I am trying to make is that it may be desirable to
transcode some UTF-8 into a canonical and recommended encoding form
when doing things like license extraction.
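
As a rough illustration of such transcoding, here is a minimal Python sketch that folds the variants listed above into one form. The canonical targets chosen here, "(c)" and U+0022, are placeholders only; the thread does not settle which form should be canonical, and the function and constant names are hypothetical.

```python
import html
import re

# Hypothetical canonical choices; which form is canonical is an open question.
CANONICAL_COPYRIGHT = "(c)"
CANONICAL_QUOTE = '"'

# (c), (C), U+00A9 COPYRIGHT SIGN, U+24B8 CIRCLED LATIN CAPITAL LETTER C
_COPYRIGHT_VARIANTS = re.compile(r"\([cC]\)|\u00a9|\u24b8")
# Two apostrophes used as a double quote, or U+201C / U+201D / U+201F
_DOUBLE_QUOTE_VARIANTS = re.compile(r"''|[\u201c\u201d\u201f]")


def canonicalize(text: str) -> str:
    """Fold known copyright-sign and double-quote variants into one form."""
    # html.unescape turns &copy; into U+00A9 and &ldquo;/&rdquo; into U+201C/U+201D.
    # Caveat: it also unescapes every other entity, such as &lt; and &amp;.
    text = html.unescape(text)
    text = _COPYRIGHT_VARIANTS.sub(CANONICAL_COPYRIGHT, text)
    return _DOUBLE_QUOTE_VARIANTS.sub(CANONICAL_QUOTE, text)
```

With these placeholder choices, `canonicalize("Copyright &copy; 2016")` gives `Copyright (c) 2016`. Folding `''` into a double quote is a heuristic and would misfire on text that legitimately contains two adjacent apostrophes.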

--
Mark D. Baushke
mdb@...


Alexios Zavras
 

I think there has been a misunderstanding.

 

The “encoding” item on the agenda simply means that there is a proposal to standardize on UTF-8 for the file format in which the XML version of the licenses (in the SPDX master license repo) are stored.

 

As to what you should be looking for, in order to extract copyright notices, the list is longer than what you include. For example, when reading an HTML file, the copyright symbol might be encoded as the characters “&#169;” or “&#xa9;” (besides the “&copy;” that you have). And strings in C or Python code might use “"\u00A9"” or “u"\u00A9"”, although these are probably not a copyright notice for the file itself.
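
To make the point concrete, here is a small Python sketch that counts copyright signs in raw text while also recognizing the HTML and source-code forms mentioned above. The function name is hypothetical, and treating a literal `\u00A9` escape as a copyright sign is an assumption that may or may not be wanted in a real scanner.

```python
import html
import re

# Literal backslash-u escape as it appears in C or Python source text.
_UNICODE_ESCAPE = re.compile(r"\\u00[aA]9")


def count_copyright_signs(raw_text: str) -> int:
    """Count U+00A9 occurrences, including entity and escape spellings."""
    # html.unescape resolves &#169;, &#xa9; and &copy; to U+00A9.
    decoded = html.unescape(raw_text)
    # Assumption: a spelled-out \u00A9 escape should count as the sign itself.
    decoded = _UNICODE_ESCAPE.sub("\u00a9", decoded)
    return decoded.count("\u00a9")
```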

 

 

-- zvr –

 

-----Original Message-----
From: spdx-legal-bounces@... [mailto:spdx-legal-bounces@...] On Behalf Of Mark D. Baushke
Sent: Friday, 21 October, 2016 18:16
To: J Lovejoy <opensource@...>
Cc: SPDX-legal <spdx-legal@...>
Subject: Re: Joint Call: Tuesday, Oct 25th w/Tech Team

 




Mark D. Baushke <mdb@...>
 

Hi Alexios,

Zavras, Alexios <alexios.zavras@...> writes:

> I think there has been a misunderstanding.

Yes, that is very likely. I regret that I seem to be having trouble
understanding the topic. I will endeavor to make my point with more
clarity.

> The "encoding" item on the agenda simply means that there is a
> proposal to standardize on UTF-8 for the file format in which the XML
> version of the licenses (in the SPDX master license repo) are stored.

Yes. My question seems to have been unclear. I regret this.

The difficulty is in the word standardize. UTF-8 allows for many
possible expressions of the same token. In particular, the text
expected in a standard license in XML will contain a number of
different characters which have multiple representations.

One meaning of the term standardize would be to come up with a single
canonical representation for the template.

Will this meeting take up which of those many representations should be
used as the canonical representation in the SPDX XML master license
repository?

Items we see in a copyright and license file may include multiple
representations of:

Double Quote, Single quote, Copyright Sign, Registered Sign, Trade
Mark Sign, etc.

Will there be an SPDX specification of what to put into the template
even if it may also be needful to look for the alternatives when doing
an extraction? Or, will there be an SPDX XML token that specifies the
class of representations that may be present?

FWIW: I would also hope that a full set of DTDs is to be generated for
the SPDX dialect of XML.

> As to what you should be looking for, in order to extract copyright
> notices, the list is longer than what you include. For example, when
> reading an HTML file, the copyright symbol might be encoded as the
> characters "&#169;" or "&#xa9;" (besides the "&copy;" that you have).
> And strings in C or Python code might use ""\u00A9"" or "u"\u00A9"",
> although these are probably not a copyright notice for the file
> itself.

True. However, looking at the XML prototype license, what canonical
form should be used to represent all of the other possible forms?

My original question was not clear.

I am asking if we are going to see something like <copyright-sign/> as
the SPDX XML template to represent any of the various encodings that
could exist?

For example, in MIT.xml should I see

<p>Copyright (c) &lt;year&gt; &lt;copyright holder&gt;</p>

or

<p>Copyright <copyright-sign/> <year-range/> <copyright-holder/></p>

so that each element could be used as a processing token for pattern
matching?
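
To illustrate what I mean by processing tokens, here is a minimal Python sketch that expands such elements into a regular expression. The element names come from my example above; the pattern chosen for each token and the function name are purely hypothetical and not part of any SPDX specification.

```python
import re

# Hypothetical token-to-pattern table; nothing here is defined by SPDX.
TOKEN_PATTERNS = {
    "<copyright-sign/>": r"(?:\([cC]\)|\u00a9|\u24b8)",
    "<year-range/>": r"\d{4}(?:\s*-\s*\d{4})?",
    "<copyright-holder/>": r".+",
}

_TOKEN = re.compile("|".join(re.escape(token) for token in TOKEN_PATTERNS))


def template_to_regex(template: str):
    """Compile a template line into a regex, expanding the known tokens."""
    parts = []
    pos = 0
    for match in _TOKEN.finditer(template):
        # Literal text between tokens must match exactly.
        parts.append(re.escape(template[pos:match.start()]))
        parts.append(TOKEN_PATTERNS[match.group(0)])
        pos = match.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))


pattern = template_to_regex(
    "Copyright <copyright-sign/> <year-range/> <copyright-holder/>"
)
```

With this table, `pattern.fullmatch("Copyright (c) 2016 Jane Doe")` succeeds, while a line lacking any copyright sign does not match. A real matcher would work on the parsed XML instead of on strings, and would need an agreed list of alternatives per token.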

Also, in that file we have the text

(the "Software")

which uses U+0022 for the double quote. I have seen some documents that
are using the multibyte 'LEFT DOUBLE QUOTATION MARK' (U+201C) Software
'RIGHT DOUBLE QUOTATION MARK' (U+201D). What canonical representation
will be used in the XML templates? My personal preference is U+201D.

I hope this helps with the understanding of my question as it relates to
UTF-8 selection for XML templates.

Please pardon the length of this message; I only endeavor to make my
question clearer.
--
Mark D. Baushke
mdb@...