Re: Joint Call: Tuesday, Oct 25th w/Tech Team

Mark D. Baushke <mdb@...>

Hi Jilayne & Paul,

- Encoding (propose UTF-8)

I have no problem with this. I do think that some folks may not
completely understand the implications.

I would like to see a table of all of the representations of various
copyright signs that we need to consider when we extract from a file.

To date I have observed the following:

(c) - 0x28 0x63 0x29
(U+0028 U+0063 U+0029)
(C) - 0x28 0x43 0x29
(U+0028 U+0043 U+0029)
© - 0xc2 0xa9 (U+00A9) 'COPYRIGHT SIGN'
Ⓒ - 0xe2 0x92 0xb8 (U+24B8) 'CIRCLED LATIN CAPITAL LETTER C'
&copy; - 0x26 0x63 0x6f 0x70 0x79 0x3b
(U+0026 U+0063 U+006f U+0070 U+0079 U+003b)

Although I have only seen the graphic for the 'SOUND RECORDING
COPYRIGHT' on labels, I thought it may also be worth mentioning:

(P) - 0x28 0x50 0x29 (U+0028 U+0050 U+0029)
℗ - 0xe2 0x84 0x97 (U+2117) 'SOUND RECORDING COPYRIGHT'
Ⓟ - 0xe2 0x93 0x85 (U+24C5) 'CIRCLED LATIN CAPITAL LETTER P'
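
As a rough illustration of how an extraction tool might fold the variants
listed above into one form, here is a minimal Python sketch. The function
name canonical_copyright and the choice of U+00A9 and U+2117 as the
canonical targets are assumptions for illustration only, not anything we
have agreed on. Note that html.unescape() decodes every HTML entity, not
just &copy;, and that a bare "(c)" can also be a list label, so a real
tool would likely need more context than this.

```python
import html
import re

# Variants observed so far; the canonical targets are an assumption.
_C_VARIANTS = re.compile(r"\([cC]\)|\u24B8")   # (c), (C), circled capital C
_P_VARIANTS = re.compile(r"\(P\)|\u24C5")      # (P), circled capital P


def canonical_copyright(text: str) -> str:
    """Fold known copyright-sign spellings into U+00A9 / U+2117."""
    text = html.unescape(text)                 # &copy; becomes U+00A9
    text = _C_VARIANTS.sub("\u00a9", text)
    text = _P_VARIANTS.sub("\u2117", text)
    return text
```

For example, canonical_copyright("Copyright (c) 2011") yields the same
string as canonical_copyright("Copyright &copy; 2011").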

Note that I have also seen a bare 0xa9 in a file without the preceding
0xc2 byte. Technically, that is not valid UTF-8 (a lone 0xa9 is how
ISO-8859-1 encodes the copyright sign). So we may also need to consider
how to handle those kinds of situations.
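
One possible way to cope with a stray byte like that, sketched in Python:
decode as UTF-8, and reinterpret only the bytes that fail to decode as
ISO-8859-1. The handler name "latin1-fallback" and the function
decode_license_bytes are made up for this example, and treating every
invalid byte as ISO-8859-1 is itself an assumption that will be wrong for
files in other legacy encodings.

```python
import codecs


def _latin1_fallback(exc):
    """Decode only the offending bytes as ISO-8859-1, then resume."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return bad.decode("latin-1"), exc.end


# Hypothetical handler name, registered once at import time.
codecs.register_error("latin1-fallback", _latin1_fallback)


def decode_license_bytes(raw: bytes) -> str:
    """Decode UTF-8, tolerating stray single-byte characters."""
    return raw.decode("utf-8", errors="latin1-fallback")
```

With this, both b"\xc2\xa9 2011" and b"\xa9 2011" decode to a string that
starts with U+00A9, and valid UTF-8 elsewhere in the same file is left
alone.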

Licenses also contain other text with multiple representations, such as:

- ''as is'' (uses a doubled apostrophe, U+0027) and
- "as is" (uses quotation mark U+0022) and
- &ldquo;as is&rdquo; and
- <U+201C>as is<U+201D>
- <U+201F>as is<U+201F>

There may be a few others as well.
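
To make the quoting variants above compare equal, something along these
lines might work. This is only a sketch: the name canonical_quotes and
the choice of U+0022 as the canonical quotation mark are assumptions, and
replacing every doubled apostrophe would also rewrite things like empty
string literals in source code, so it is probably only safe on license
prose.

```python
import html

# Curly and reversed double quotes mapped to U+0022 (an assumed target).
_QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u201f": '"'})


def canonical_quotes(text: str) -> str:
    """Fold the double-quote spellings listed above into U+0022."""
    text = html.unescape(text)          # &ldquo; / &rdquo; become U+201C / U+201D
    text = text.translate(_QUOTE_MAP)
    return text.replace("''", '"')      # doubled apostrophe used as a quote
```

All five spellings of "as is" listed above come out as the same string
after this step.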

I guess the point I am trying to make is that it may be desirable to
normalize some of these UTF-8 variants into a canonical, recommended
form when doing things like license extraction.

Mark D. Baushke
