Nested lists in SPDX XML files.
For what it’s worth, pretty much everything is legal structure-wise; the final form of the XML will be guided by whatever problems we encounter and need to solve during this review process. One notable possibility that isn’t in the current version is dropping the <body> tag entirely and simply allowing nesting of the other entities (copyright, title, optional) as “one-or-none” matches.
Sent: Monday, May 09, 2016 11:33
To: Kris.re <Kris.re@...>
Cc: Philippe Ombredanne <pombredanne@...>; Sam Ellis <Sam.Ellis@...>; SPDX-legal (spdx-legal@...) <spdx-legal@...>
Subject: Re: Nested lists in SPDX XML files.
Thanks Kris for the good recap of where we are on error-correcting the XML representation of required, optional, and replaceable text. Of course it is no fun to manually edit XML, but I think I recall correctly that the group decided at the outset of this github project that since we can programmatically validate (at least the structural legality of) our edits, and eventually adapt existing XML-parsing and editing code into SPDX-specific tooling, this process will be worth it. Especially if Sam continues to pull everyone else's weight on github as well as his own. :-)
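The structural validation mentioned above can be illustrated with a minimal sketch. It assumes Python's standard xml.etree.ElementTree and checks only well-formedness (balanced, properly nested tags), not any SPDX-specific schema; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, "") if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser message includes the line and column of the problem.
        return False, str(err)
    return True, ""


# A balanced fragment parses cleanly.
ok, _ = is_well_formed("<list><li><b>1)</b><p>some text</p></li></list>")

# A hand edit that drops the closing </p> is rejected.
bad, message = is_well_formed("<list><li><b>1)</b><p>some text</li></list>")
```

A check like this could run over every edited file before a commit, so that hand edits at least cannot break the tag structure.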
Thanks also for confirming the possibility of nesting entities. I have made a few edits assuming that nesting is legal, but I made that assumption silently -- props to Sam for specifically raising the question.
Best,
Brad
--
Brad Edmondson, Esq.
512-673-8782 | brad.edmondson@...
On Mon, May 9, 2016 at 10:39 AM, Kris.re <Kris.re@...> wrote:
Sam:
They are definitely supposed to be nested. If you are seeing one like this, then either the original template's spacing was equally flat (I could see it for one or two, but not many), or the correction pass I made to remove the redundant list items wasn't quite correct. There were about 3-4 types of failures that I thought I was able to automatically correct, but obviously I didn't look quite deep enough. If you spot these, feel free to label them with a 'bug' label and I will address them in batch. I can roll back the git history and extract the correctly nested version, or at least part of it, or convert it by hand or something.
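One way such flat cases might be found in bulk is sketched below. It assumes Python's standard xml.etree.ElementTree, a hypothetical <license> wrapper element, and the <list>/<li>/<b>/<p> tags from Sam's example. It only flags candidates for review: two adjacent lists are not always wrong, since the original text may itself be flat.

```python
import xml.etree.ElementTree as ET


def find_flat_list_runs(root):
    """Return (parent tag, child index) for each <list> directly followed by another <list>."""
    hits = []
    for parent in root.iter():
        children = list(parent)
        for i in range(len(children) - 1):
            first, second = children[i], children[i + 1]
            if first.tag == "list" and second.tag == "list":
                # Non-whitespace text between the two lists means prose separates them.
                if not (first.tail or "").strip():
                    hits.append((parent.tag, i))
    return hits


# <license> is a hypothetical wrapper used only for this illustration.
FLAT = """<license>
<list><li><b>1)</b><p>some text</p></li></list>
<list><li><b>a)</b><p>some text</p></li></list>
</license>"""

NESTED = """<license>
<list>
<li><b>1)</b><p>some text</p></li>
<list><li><b>a)</b><p>some text</p></li></list>
</list>
</license>"""

flat_hits = find_flat_list_runs(ET.fromstring(FLAT))      # one candidate
nested_hits = find_flat_list_runs(ET.fromstring(NESTED))  # no candidates
```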
Philippe:
The initial proposal tried to minimize the use of XML tags at the cost of making whitespace significant. After some discussion, this seemed (besides being a bit of a messy solution) to not meet our needs, and so I revised it so that whitespace was NOT significant, which required more structural tags to identify paragraphs and the like.
Regarding "mixed format and structure", if you're referring to the <b> tag, it is not a format tag but was intended to be "b for bullet". Many of the tags are likely getting renamed and this is probably one of them. The original choice was geared towards having as little visual space taken up by tags as possible, but as you can see we've moved past the point where that's a useful compromise here. If you're referring to <p>, <br/>, <list> or <li>, these are indeed structural tags, though the line is a bit fuzzy because it's often easiest to think about them in formatting terms. We might think of "<br/>" as a new line, which is presentation, as opposed to a structural break between two sections of text, or "<p>" as two new lines as opposed to a grouping of text into a paragraph, for example. Their primary use IS for presentation: we do need to render HTML files as one of our outputs, so we need enough information about the *structure* of the document to *format* it usefully.
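As a rough illustration of how structural tags can drive HTML output, here is a minimal sketch assuming Python's standard library. The tag-to-HTML mapping (TAG_MAP, the "bullet" class, the <div> fallback) is hypothetical and is not the project's actual renderer.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from the structural tags to HTML elements.
TAG_MAP = {"list": "ul", "li": "li", "p": "p", "b": "span"}


def render(elem):
    """Recursively render one element, and the text that follows it, as HTML."""
    if elem.tag == "br":
        html = "<br/>"
    else:
        tag = TAG_MAP.get(elem.tag, "div")  # unknown tags fall back to <div>
        # <b> means "bullet", so it becomes a classed span rather than bold text.
        attrs = ' class="bullet"' if elem.tag == "b" else ""
        inner = escape(elem.text or "") + "".join(render(child) for child in elem)
        html = f"<{tag}{attrs}>{inner}</{tag}>"
    # Text after the closing tag belongs to the parent's content.
    return html + escape(elem.tail or "")


SOURCE = "<list><li><b>1)</b><p>some text<br/>more text</p></li></list>"
html_out = render(ET.fromstring(SOURCE))
```

Because the bullet text stays in the source, a stylesheet can suppress the browser's own list markers and show labels such as "2.a)" exactly as written.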
> But I think I lost track of the value and purpose of this editing in the first place... can someone refresh me?
Our purpose right now is not primarily editing, though I welcome Sam's contributions on this count, since otherwise it'll be me doing this work ;) The primary work right now is verifying the selection of "matchable sections" of text, which were done partly by an automated process, and partly by myself without full legal understanding and context of these licenses.
> I am questioning the use of XML in the first place, which may be a format that is barely OK for saving data files, but is quite terrible for editing IMHO.
I have to agree in most cases, but luckily for us, editing is not something we should have to do a lot of and when we do it will be pretty targeted. The bulk of the effort is the initial construction of the XML file, which can be eased by way of tooling. It does happen that XML is perfect for our particular use case, in my opinion, since what we need to do is mark up some source text somewhat arbitrarily to add information about it. That's something that a data-only format like JSON just doesn't have the capability to elegantly express.
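To make the "mark up some source text" point concrete, here is a minimal sketch assuming Python's standard xml.etree.ElementTree. The sample sentence is invented for illustration, and <optional> is used only because it is one of the entity names mentioned in this thread.

```python
import xml.etree.ElementTree as ET

# Invented sample text: the markup sits inline, in the middle of a sentence.
SOURCE = "<p>Permission is granted <optional>, free of charge,</optional> to any person.</p>"


def segments(elem):
    """Return (kind, text) pairs for one paragraph, in document order."""
    out = []
    if elem.text:
        out.append(("required", elem.text))
    for child in elem:
        kind = "optional" if child.tag == "optional" else "required"
        if child.text:
            out.append((kind, child.text))
        if child.tail:
            # Text after the child's closing tag is ordinary paragraph text again.
            out.append(("required", child.tail))
    return out


parts = segments(ET.fromstring(SOURCE))
```

The source text stays readable as a sentence while the annotation rides along inside it. A data-only format would have to split the sentence into an explicit sequence of typed fragments.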
> At least why not use plain HTML if you need to mix format and structure?
> You could then use some of the decent HTML WYSIWYG editing tools available and not have to spend more time on the form than the substance?
I don't know about you, but in the past when I've used such tools, the output HTML is far from minimal and oftentimes a mess. Different tools will produce different HTML, which is a problem when working in a shared space (git repository). A WYSIWYG editor doesn't necessarily have the tools to represent the information we are working with, either: replaceable text, you could perhaps hack as a hyperlink but that would be kind of nasty... and we still have other information to bind, such as the name of the variable to store the match in. Our lists are sometimes ordered, sometimes unordered; when they're ordered, they are often formatted in various ways. Adjusting the format to be correct (letters, numbers, roman numerals, whatever) is doable to some extent with CSS, but not in a way a WYSIWYG editor would necessarily support (what do you do about "2.a)"?). How do you represent sections of text that should be optional?
I like the idea of using an existing such editor naively, but when it comes down to the details we'd essentially be hacking or working around almost every part of the structure of the files except, maybe, list formatting.
I do think that in the future some work to make actually editing the files and creating them easier should be done, and I'll probably be volunteering for that, but I'd like to wrap up by bringing back to the first point: the current effort is not primarily intended to be focused on editing massive swathes of XML, and certainly not creating whole files. After we've vetted the structure of the documents from a legal perspective with regards to matching, there is a lot of programmatic work we can do without worry, so I wouldn't get too wrapped up in the details of the names of tags or stuff like that just yet.
Kris
-----Original Message-----
From: spdx-legal-bounces@... [mailto:spdx-legal-bounces@...] On Behalf Of Philippe Ombredanne
Sent: Monday, May 09, 2016 09:55
To: Sam Ellis <Sam.Ellis@...>
Cc: SPDX-legal (spdx-legal@...) <spdx-legal@...>
Subject: Re: Nested lists in SPDX XML files.
On Mon, May 9, 2016 at 9:25 AM, Sam Ellis <Sam.Ellis@...> wrote:
> Hi,
>
> When reviewing the SPDX XML files, I see many licenses that contain
> nested lists such as:
>
> 1) some text…
>    a) some text…
>
> And these are converted to XML like this:
>
> <list>
>   <li>
>     <b>1)</b><p>some text…</p>
>   </li>
> </list>
> <list>
>   <li>
>     <b>a)</b><p>some text…</p>
>   </li>
> </list>
>
> Note that the XML above places the bullets sequentially rather than
> being nested. I would like to check, does the XML syntax support
> nesting, and if so, should we be using it to represent cases such as this?
>
> To make it clearer, this is how the nested equivalent of the above
> might look, with the a) list inside the 1) list:
>
> <list>
>   <li>
>     <b>1)</b><p>some text…</p>
>   </li>
>   <list>
>     <li>
>       <b>a)</b><p>some text…</p>
>     </li>
>   </list>
> </list>
>
> My view is that by representing nested lists sequentially then we are
> losing some of the structure of the original text. On the other hand,
> if the main purpose of the lists is to allow for identification of
> bullets then the sequential representation is just fine for this.
I have a lot of respect for what you are embarking on:
I would not dare edit 100 such XML files by hand: I find this rather confusing and error prone.
And I am a programmer...
But I think I lost track of the value and purpose of this editing in the first place... can someone refresh me?
Now, if there is a purpose, you raise a good point in this post (and your previous post about XML entities escaping).
Why is this format mixing structure and formatting together, in a pseudo-HTML format?
Is this meant to become the reference text for SPDX licenses?
I am questioning the use of XML in the first place, which may be a format that is barely OK for saving data files, but is quite terrible for editing IMHO.
At least why not use plain HTML if you need to mix format and structure?
You could then use some of the decent HTML WYSIWYG editing tools available and not have to spend more time on the form than the substance?
--
Cordially
Philippe Ombredanne
_______________________________________________
Spdx-legal mailing list
Spdx-legal@...
https://lists.spdx.org/mailman/listinfo/spdx-legal