Tag-value RDF mapping proposal


Peter Williams <peter.williams@...>
 

I wrote up the beginnings of a proposal for doing the tag-value to RDF
mapping, <http://www.spdx.org/wiki/proposal-2010-12-07-1-tag-value-rdf-mapping>.
The proposal does not include any changes to the spec proper yet.
Once a consensus on the technicalities of the format and mapping
develops we can translate that into the spec.

I urge everyone to read the proposal and make comments/suggestions. This
is a very important part of the spec and the more eyes we have on it
the better it will be.

Peter
www.openlogic.com


Peter Williams <peter.williams@...>
 

I realized that there was a way to simplify the tag-value files a bit
so I updated the proposal
(<http://www.spdx.org/wiki/proposal-2010-12-07-1-tag-value-rdf-mapping>).

These changes make producing the files a little bit simpler but they
make writing a parser a bit more complicated. They also limit
non-standard extension of the format. However, the increased
simplicity is probably worth these trade-offs if we think anyone
is ever going to produce one of these by hand.

Which raises a question: do we actually anticipate humans producing
these files? That seems like an awful lot of tedious typing for any
human to actually do. If we don't expect humans to write these files
perhaps we should revisit the supported format discussion. If we do
expect humans to write these files I would be interested to hear in
what situations.

Peter



Peter Williams <peter.williams@...>
 

Hi all,

Kate posted some comments on the proposal. It is worth me responding
to the list so everyone can see where the debate is.

Also there are many special cases
regarding how following lines are parsed based on the tag. These
issues combined with the lack of an overview section on the tag-value
format make it hard to understand how the tag-value files should be
produced.
<KES>: it is one tag per data value all on the same line, unless it is
indicated as a multiple line field, in which case XML-like syntax was
discussed as a way to delimit it. Not sure why this is so
confusing? Are you looking at the WIKI or the .pdf? (WIKI could be
the source of the confusion due to its formatting limitations...???)
I was unaware of the `<text>...</text>` syntax. This approach
definitely solves my multi-line value issue. I think I joined the
group after that discussion occurred and it has not yet made its way
into the official spec. Using that approach the first example would
look like

SPDXVersion: SPDX-1.0
CreatedBy: Tool: spdx-gen 1.0

[Package http://oss.net/foo-1.0.tar.gz]
DeclaredLicense: FullLicense-1
DeclaredLicense: license:GPL2
Description: <text>This
is a long
multiline value</text>

[License FullLicense-1]
LicenseText: <text>
Some terms and conditions
</text>

A uniform way to declare new resources (entities) and link to them
will be introduced. A new resource would be declared by enclosing the
type of the resource and its URI, either full or a CURIE or
node-id if it is a blank node, in square brackets ("[", "]"). The
resource type will be Package, File, Project or License.
<KES> Positional order will give it, I'm not sure this is adding value
This generalized pattern will make it much easier to create backward
compatible improvements to the spdx format as time goes on. New types
of sections can be added and any spdx processor that does not
understand that type of section could just ignore it. In the current
positional approach you run the risk of having properties attached to
the wrong top-level item in that situation.

This uniformity would also make implementing improvements easier. The
parser component of spdx processors would not have to be changed at
all to support new versions of the spec. New sections and tags would
be parsed just like the existing structures. Only parts of the tool
that interpreted the information would need to be updated.
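
To illustrate the point, here is a rough sketch of the interpreting side of a processor. It assumes the parser has already produced `(type, identifier, properties)` tuples; the set of known types comes from the list in the proposal, while the function name `interpret` and the "Review" type in the usage note are hypothetical.

```python
# Section types this (hypothetical) version of a tool understands.
KNOWN_TYPES = {"Package", "File", "Project", "License"}


def interpret(sections):
    """Split parsed sections into those understood and the types skipped.

    `sections` is an iterable of (type, identifier, properties) tuples.
    """
    understood, skipped = [], []
    for section_type, identifier, properties in sections:
        if section_type in KNOWN_TYPES:
            understood.append((section_type, identifier, properties))
        else:
            # An unknown section is skipped as a whole, so none of its
            # properties can end up attached to a neighbouring resource.
            skipped.append(section_type)
    return understood, skipped
```

If a later spec added, say, a "Review" section, an older tool built this way would report it as skipped and still handle every Package and License section correctly.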

It will make it easier to write lint like tools for spdx in the face
of future versions of spdx. By encoding the structure explicitly in
the format a lint-ish tool will be able to use that information even
if it does not understand the sections/tags themselves.
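
As one possible example of such a check, the sketch below reports resources that are declared more than once. It relies only on the bracketed header syntax and knows nothing about individual section types or tags; the function name `duplicate_declarations` is made up for illustration.

```python
import re
from collections import Counter

HEADER_RE = re.compile(r"^\[(\w+)\s+(\S+)\]$")


def duplicate_declarations(source):
    """Return a sorted list of (type, identifier) pairs declared more than once."""
    counts = Counter()
    for line in source.splitlines():
        header = HEADER_RE.match(line.strip())
        if header:
            counts[(header.group(1), header.group(2))] += 1
    return sorted(key for key, seen in counts.items() if seen > 1)
```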

This approach would completely remove the positional nature of the
format. This, in and of itself, is a huge win in my book.
Remembering the appropriate order of the large number of tags in
spdx will be difficult and tedious. People do much better with
formats that have an explicit structure, rather than ones with an
implicit order-based one. Explicitly structured documents are easier
to read, produce and to parse/interpret reliably. This is borne out by
the fact that most, if not all, popular interchange specifications use
order independent formats such as xml.

It also makes the mapping to rdf easier to describe, understand and implement.
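
A sketch of why the mapping becomes simple: each explicit section names its own subject, so every tag turns into one triple. The vocabulary namespace below is a placeholder (the real SPDX terms are not given in this thread), and `section_to_triples` is a hypothetical helper; only the `rdf:type` URI is the standard one.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Placeholder namespace for this sketch, not the actual SPDX vocabulary.
VOCAB = "http://example.org/spdx-terms#"


def section_to_triples(section_type, identifier, properties):
    """Map one parsed section to a list of (subject, predicate, object) triples."""
    # The section header supplies the subject and its rdf:type.
    triples = [(identifier, RDF_TYPE, VOCAB + section_type)]
    for tag, value in properties:
        # Each tag-value pair becomes one property of that subject.
        triples.append((identifier, VOCAB + tag, value))
    return triples
```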

Arbitrary rdf properties will be
allowed as tags when they are enclosed in angle brackets ("<", ">").
This will support forward compatibility and innovation by allowing
vendors or communities to introduce new data into spdx files.
<KES> Whoa, the only place we have buy-in from the lawyers is in the comment field. Introducing this notation needs to be thought out; also, this conflicts with
the XML notation used to denote multiline fields.
I don't think it does actually conflict. The `<text>` and `</text>`
sequences can only occur in the value of a tag. This describes a
syntax that can only occur in the tag part of the line. I don't think
it would be hard at all to write a parser to distinguish between the two.
On the other hand, I am open to other ways to allow arbitrary rdf
values to be added.
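
A small sketch of how the two uses of angle brackets could be told apart by position alone. It assumes an extension line looks like `<uri>: value`, which is one reading of the proposal rather than settled syntax; the function name `classify_line` and the example.com URI in the usage note are made up.

```python
import re

# An extension tag is a URI in angle brackets at the very start of the line.
EXTENSION_TAG_RE = re.compile(r"^<([^>\s]+)>:\s*(.*)$")
# A standard tag is a plain word; angle brackets may only appear in its value.
STANDARD_TAG_RE = re.compile(r"^([A-Za-z][\w-]*):\s*(.*)$")


def classify_line(line):
    """Return (kind, tag, value) with kind either 'extension' or 'standard'."""
    match = EXTENSION_TAG_RE.match(line)
    if match:
        return ("extension", match.group(1), match.group(2))
    match = STANDARD_TAG_RE.match(line)
    if match:
        return ("standard", match.group(1), match.group(2))
    raise ValueError(f"not a tag-value line: {line!r}")
```

With this, `classify_line("<http://example.com/ns#reviewScore>: 7")` is reported as an extension, while `classify_line("Description: <text>short</text>")` is a standard tag whose value merely contains the `<text>` delimiters.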

I feel strongly that extensibility is an important feature. Spdx
needs a mechanism to allow experimentation and innovation. It is
preferable to have that happen in a way that will not cause problems
with future versions of the spec. Forcing experiments into globally
unique namespaces (using uris) will help prevent collisions between
experiments and future improvements in spdx.

I think the lawyers will be comfortable with this approach. This
ability does not constitute a legal judgment. Additionally, by
forcing such non-standard tags into their own namespaces we are also
making it clear that they are not part of the spdx standard.

Peter
www.openlogic.com


Kate Stewart <kate.stewart@...>
 

On Tue, 2010-12-14 at 12:15 -0700, Peter Williams wrote:

A uniform way to declare new resources (entities) and link to them
will be introduced. A new resource would be declared by enclosing the
type of the resource and its URI, either full or a CURIE or
node-id if it is a blank node, in square brackets ("[", "]"). The
resource type will be Package, File, Project or License.
<KES> Positional order will give it, I'm not sure this is adding value
This generalized pattern will make it much easier to create backward
compatible improvements to the spdx format as time goes on. New types
of sections can be added and any spdx processor that does not
understand that type of section could just ignore it. In the current
positional approach you run the risk of having properties attached to
the wrong top-level item in that situation.
The phone discussion highlighted that there was a terminology difference
on how sections are used in the rdf context, vs. in the specification
context. Peter is using sections in the rdf context, as Projects are
not sections in the specification.

The approach I'm advocating is to recognize keywords that start a
grouping of related tags, following the section approach used by the
spec. By recognizing keywords associated with Package, File, License
related fields, the same effect can be created, without adding
additional character syntax.
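
For comparison, a rough sketch of the keyword-driven grouping described here. The thread does not say which tags would open a section, so the keywords `PackageName`, `FileName` and `LicenseID` below are purely hypothetical stand-ins, as is the function name `group_by_keyword`.

```python
# Hypothetical section-opening keywords; the actual list would come from the spec.
SECTION_KEYWORDS = {
    "PackageName": "Package",
    "FileName": "File",
    "LicenseID": "License",
}


def group_by_keyword(pairs):
    """Group (tag, value) pairs into sections opened by designated keywords."""
    # Pairs seen before any keyword are filed under a "Document" group
    # (a label chosen for this sketch).
    sections = [("Document", [])]
    for tag, value in pairs:
        if tag in SECTION_KEYWORDS:
            sections.append((SECTION_KEYWORDS[tag], []))
        # The opening keyword itself stays in the section as a property.
        sections[-1][1].append((tag, value))
    return sections
```

Within a section the remaining tags can appear in any order; the trade-off is that the parser has to carry the keyword table to know where a section starts.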

The proposal from the meeting was to create a flex/bison (or lex/yacc -
depending on which syntax I can remember best) version of the tag
specification to clarify the keywords' meaning and the fields, and
possibly rename some tag keywords to be more meaningful. I took the
action to get a draft out for discussion.

The other point worth noting is that we'll be updating the specification
to split out the reviewedby into a separate section (in the spec
sense), and unifying the values associated with the tag with those used
in createdby.


This uniformity would also make implementing improvements easier. The
parser component of spdx processors would not have to be changed at
all to support new versions of the spec. New sections and tags would
be parsed just like the existing structures. Only parts of the tool
that interpreted the information would need to be updated.
I'm not sure I agree with the conclusion, and it would make remembering
the syntax to specify harder for those doing hand coding. Keeping
this as intuitive to use as DEP-5 seems a reasonable goal for the tag
version.


This approach would completely remove the positional nature of the
format. This, in and of itself, is a huge win in my book.
Remembering the appropriate order of the large number of tags in
spdx will be difficult and tedious. People do much better with
formats that have an explicit structure, rather than ones with an
implicit order-based one. Explicitly structured documents are easier
to read, produce and to parse/interpret reliably. This is borne out by
the fact that most, if not all, popular interchange specifications use
order independent formats such as xml.

It also makes the mapping to rdf easier to describe, understand and implement.
Positional nature is being overinterpreted here, I think. Within a
section, you should be able to associate related tags, without forcing a
specific order. I think that the lex/yacc will make it clearer.

I'll split the other issue referenced into its own thread.

Kate


Peter Williams <peter.williams@...>
 

On Wed, Dec 15, 2010 at 3:55 PM, Kate Stewart
<kate.stewart@...> wrote:

This uniformity would also make implementing improvements easier.  The
parser component of spdx processors would not have to be changed at
all to support new versions of the spec.  New sections and tags would
be parsed just like the existing structures.  Only parts of the tool
that interpreted the information would need to be updated.
I'm not sure I agree with the conclusion, and it would make remembering
the syntax to specify harder for those doing hand coding.   Keeping
this as intuitive to use as DEP-5 seems a reasonable goal for the tag
version.
I find DEP-5 quite unintuitive with regard to repeating structures
such as files. I prefer distinct syntactic structures to be visually
distinct. This allows humans to quickly detect the new context.
Blessing particular tags as the start of a new "section" turns those
tags into a different syntactic beast than all the other tags.
However, the uninitiated human reader is left without any hint of that
fact.

That being said, I think the biggest issue we need to deal with is
forward compatibility. We need a way for future versions of the spec
to be able to introduce meaningful improvements that are backwards
compatible with this version. Having implicit section boundaries
greatly limits the freedom of future versions. They will not be able
to introduce new sections without breaking backwards compatibility. I
think that limitation will basically make this format an evolutionary
dead end because many potential changes will be impossible, or
inordinately difficult, to implement without the use of sections.
Future versions of the spec will likely have to choose between
obsoleting a large number of tools or not implementing needed
improvements.

Peter

PS: in this email I am using "section" in the sense Kate usually does,
ie a group of related tag-value pairs in an spdx file.