Use of escaped characters in SPDX XML files.
Hi,
I would like to resolve some queries about the strange-looking text in the SPDX XML files, such as &#8217;. The characters <, > and & are used in the structure of XML files, for example in the tag “<p>”, so they cannot be used directly in text; otherwise we would not know what is text and what is a tag. XML deals with this by allowing these characters, and others, to be escaped when they occur in text. For example, if the original text really contains a <, then we must replace it in the XML with &lt; or &#60;, which represent the name of the character and its number respectively. Strictly, the only characters that must be escaped in XML text are < and &. It is common, but not necessary, to also escape > for consistency with <. We see these in the SPDX XML files as:
<  &lt;
>  &gt;
&  &amp;
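As a rough illustration of the escaping described above, here is a minimal Python sketch using only the standard library. The sample string is made up for the example and is not taken from any SPDX file.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical license fragment containing all three special characters.
original = "if a < b && b > c"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(original)
print(escaped)  # if a &lt; b &amp;&amp; b &gt; c

# An XML parser converts the escapes back, so the text round-trips.
parsed = ET.fromstring(f"<p>{escaped}</p>").text
print(parsed == original)  # True

# The named form and the numbered form decode to the same character.
named = ET.fromstring("<p>&lt;</p>").text
numbered = ET.fromstring("<p>&#60;</p>").text
print(named == numbered)  # True
```

Note that `escape()` also escapes >, which matches the common practice mentioned above even though XML does not strictly require it.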
Escaping of other characters is optional, and whilst the escapes make the files harder for people to read, a computer program reading XML files should deal with them just fine. If we come across an escaped character and want to check what it means, we can paste the full name such as &lt; or the full number such as &#60; into Google, and the top hit will usually show which character it represents. So my take is that, so long as the escaped character, when converted back to a proper character, still matches what is in the original license, it is correct and acceptable.
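As an alternative to searching for each escape by hand, the check could be sketched in Python along these lines. The function names `describe` and `matches_original` are made up for this example, and the license fragment is hypothetical. One caveat: `html.unescape` also accepts HTML-only names that a plain XML parser would reject, so this is a convenience for looking characters up, not an XML validator.

```python
import html
import unicodedata


def describe(entity: str) -> str:
    """Return the character an escape stands for, with its Unicode name."""
    char = html.unescape(entity)
    if len(char) != 1:
        return f"{entity} -> not a single escaped character"
    return f"{entity} -> {char} ({unicodedata.name(char, 'unnamed')})"


def matches_original(xml_text: str, original_text: str) -> bool:
    """True if the escaped XML text decodes to exactly the original text."""
    return html.unescape(xml_text) == original_text


for entity in ("&lt;", "&#60;", "&amp;", "&#8217;"):
    print(describe(entity))

# Hypothetical fragment: the escaped apostrophe decodes to the original one.
print(matches_original("Licensor&#8217;s rights", "Licensor\u2019s rights"))
```

The comparison is meant for text content only (the words between tags), since the tags themselves are never escaped.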
In the corrective pass I made to fix the repeated list problem, some step of the processing got a little overzealous in its escaping, and so things like " are escaped too (&quot;). You might also have noticed that some of the SPDX tags got lowercased. I do plan to go back and normalize both of these items after the review is done, but if necessary I can do it beforehand. There are quite a few things I have planned, actually, but I don’t want to do any of them in the middle of this process.
Just a heads-up, so that nobody feels they should go through and fix these by hand; this can easily be automated.
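For anyone curious what such automation might look like, here is one possible sketch in Python. It is only an illustration under stated assumptions, not the actual tooling used for the SPDX files: the function name `normalize_escapes` is hypothetical, and it assumes it is given text content only (not whole files with tags, which it would wrongly escape).

```python
import html
from xml.sax.saxutils import escape


def normalize_escapes(text_fragment: str) -> str:
    """Decode every escape, then re-escape only &, < and >."""
    return escape(html.unescape(text_fragment))


# Over-escaped quotes become plain characters; required escapes stay.
print(normalize_escapes("&quot;AS IS&quot; &amp; &lt;p&gt;"))
# "AS IS" &amp; &lt;p&gt;
```

Decoding first and re-escaping second keeps the result well-formed, because the characters XML actually requires to be escaped are always put back.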
Kris
Sent: Monday, May 09, 2016 03:16
To: SPDX-legal (spdx-legal@...) <spdx-legal@...>
Subject: Use of escaped characters in SPDX XML files.