Re: [EXT] [spdx-tech] Element IDs


Sean Barnum
 

>>1) namespaces must be globally unique, and UUIDs in practice collide, in part because they don't have a full 128 bits of uniqueness because they contain built-in structure, and in part because even 128 bits isn't enough.  So just for discussion/whitepaper purposes, assume that namespaces are 256 bit cryptographically-random values.

 

I would agree that the namespace portion of an identifier must be globally unique (i.e. nobody else is creating ids in that namespace without deconflicting the full identifiers).

My assertion on the call is that any presumption of “globally unique” based solely on the probability space of possible values is a poor general approach. It does not explicitly take into account the instantial value space, where the number of objects may be very large, which increases the probability of collisions, and it does not deterministically prevent collisions. While extremely unlikely, a conflict is possible with only two objects.
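
A rough sketch of this point, using the standard birthday-bound approximation. The function name and the sample object counts are illustrative assumptions, not figures from the call; the 122-bit figure is the random portion of a version 4 UUID.

```python
import math


def collision_probability(n_ids: int, random_bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / (2 * 2**bits))."""
    exponent = n_ids * (n_ids - 1) / (2 * 2**random_bits)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny.
    return -math.expm1(-exponent)


# A version 4 UUID carries 122 random bits (6 of its 128 bits are fixed).
for n in (2, 10**9, 10**15):
    print(f"{n:>25,} ids: p ~ {collision_probability(n, 122):.3e}")
```

The probability is never exactly zero once there are two objects, and it grows roughly with the square of the number of identifiers generated.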

On the call, I mentioned that one purpose for the namespace in the id is to ensure global uniqueness.

I did not mention another purpose: supporting linked data and similar use cases that require identifying objects across existing network ecosystems. Linked data relies on URIs that can be used via http not only to identify a specific object within an end datastore somewhere through an exposed SPARQL endpoint but also, via the namespace portion of the id, to route requests to the proper endpoint. So, to support use cases such as linked data, we need namespaces to be URIs themselves.

 

>>2) elements are always identified by namespace and local ID, where local means under the control of the namespace owner.  Don't get hung up on what owner means - anybody can become an owner by generating a 256 bit random number for their namespace.

 

I would rephrase this to be “elements are always identified by an identifier consisting of a combination of a namespace and a component id”. I am avoiding “local id” because it may imply that this portion of the identifier is local to only one namespace, whereas the same component id could exist in multiple different namespaces. It is the combination of namespace and component id that makes the identifier globally unique.

 

>>3) Element identifiers are built into the model.  The diagram shows Element having an "idString" property, but what it should have is namespace and local_id properties that together form the primary key for the element.  Sebastian is correct, a separator isn't needed in the model because that only comes into play when serializing Element identifiers.

 

I think it is important that we realize that the identifier (idString) is a valid URI that is composed of the namespace and the component id.

It is not adequate to split the identifier into its components and store them in separate properties.

The fully composed identifier is not simply an issue of serialization.

It is necessary at the model level to support any Element level ObjectProperties within the model (createdBy, originator, element, rootElement, etc.) and the ‘from’ and ‘to’ properties of Relationship Elements. These refer to a single composed identifier for each Element.

Aside from the inability to support very fundamental capabilities, it would also add unnecessary complexity to the model.

I do not see any value or advantage in wanting to split this out even if it could support necessary capabilities. I see only downsides.
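
A minimal sketch of what this looks like in practice, assuming a simplified model. The class layout, the example.com/example.org identifiers, the "dependsOn" type and the "--" separator are all hypothetical placeholders, not the SPDX model itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    id: str    # fully composed identifier, a single URI string
    name: str


@dataclass(frozen=True)
class Relationship:
    id: str
    from_: str              # 'from' is reserved in Python, hence the underscore
    to: tuple               # composed identifiers of the target Elements
    relationship_type: str


# The same component id ("0001") appears under two different namespaces.
pkg = Element("https://example.com/acme--0001", "foobar package")
dep = Element("https://example.org/fugazi--0001", "fugazi library")
rel = Relationship("https://example.com/acme--0002", pkg.id, (dep.id,), "dependsOn")

# One string per reference is enough to look up any Element, in any namespace.
store = {e.id: e for e in (pkg, dep)}
source = store[rel.from_]
targets = [store[t] for t in rel.to]
print(source.name, "->", [t.name for t in targets])
```

Each reference is one value; with split properties every such reference would have to carry a namespace and component id pair instead.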

 

>>4) Each serialization of Element identifiers MUST allow them to be unambiguously deserialized back to namespace and local_id properties, which is where the separator comes in.  If the deserialized value doesn't have a namespace then the Element inherits it from its containing document, or if the element doesn't have a document or other source for namespace, the local_id by itself is invalid.

Each serialization of Elements MUST maintain integrity and consistency of the fully composed identifier string during serialization and deserialization. See my above comment for the quick cut on reasoning.

Each supported serialization MUST define its own set of binding rules to and from the model. Each one will be different from the others based on the structure and expressivity of the serialization form.

For any serialization that supports expressing the model (which is a requirement for it to be supported), it should be simple to define binding rules for implicit inheritance of some properties from Document if they are not defined on Elements within that Document. This should not be a serialization issue. It should be straightforward.

 

I attached a tweaked version of the external map example. In this version I added some Element metadata properties to the Document and to various Elements within the Document to illustrate the simplicity of a serialization-based avoidance of duplicating metadata unnecessarily on all Elements while still maintaining integrity and expressivity where needed.

The basic rule to apply during deserialization or splitting up of the content is:

  • If a metadata property (specVersion, created, createdBy, profile, dataLicense) is defined on Document but not defined on an Element defined within the Document then that metadata property should be duplicated onto the Element.

 

In the example you will see that specVersion, createdBy and created are all defined on the Document.

  • For the ACME identity Element,
    • there is a ‘created’ property specified because that Element was created prior to the creation of the Document
    • There is no createdBy or specVersion defined on the Element as these are the same as for the Document and so these properties from the Document should be added to the Element upon deserialization
  • For the foobar file,
    • There is no created, createdBy or specVersion defined on the Element as these are the same as for the Document (this File Element was created at same time as Document was created) and so these properties from the Document should be added to the Element upon deserialization
  • For the 5 fugazi Elements,
    • There are ‘created’ and ‘createdBy’ properties specified because these Elements were created at a different time than the Document and by a different party
    • There is no specVersion defined on the Element as this is the same as for the Document and so this property from the Document should be added to the Element upon deserialization

I believe this simple example illustrates the simplicity of this approach and the necessity of Element based metadata properties for Elements created by different parties or by the same party at different times. They could also differ in specVersion, profile or dataLicense.
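
The deserialization rule can be sketched as below, assuming the Document has already been parsed into a plain dictionary. The "elements" key, the function name and every value shown are hypothetical stand-ins; they are not copied from the attached example.

```python
INHERITED = ("specVersion", "created", "createdBy", "profile", "dataLicense")


def expand_elements(document: dict) -> list:
    """Copy Document-level metadata onto Elements that do not define it."""
    defaults = {key: document[key] for key in INHERITED if key in document}
    # Values defined on the Element itself always win over the Document's.
    return [{**defaults, **element} for element in document.get("elements", [])]


document = {
    "specVersion": "3.0",
    "created": "2021-08-03T00:00:00Z",
    "createdBy": "https://example.com/acme--identity",
    "elements": [
        # Identity created before the Document: keeps its own 'created'.
        {"id": "https://example.com/acme--identity", "created": "2020-01-01T00:00:00Z"},
        # File created together with the Document: inherits everything.
        {"id": "https://example.com/acme--foobar"},
        # Element from another party: keeps its own 'created' and 'createdBy'.
        {"id": "https://example.org/fugazi--1", "created": "2019-06-01T00:00:00Z",
         "createdBy": "https://example.org/fugazi--identity"},
    ],
}

for element in expand_elements(document):
    print(element)
```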

 

Again, this approach is space efficient, fully supports the necessary independence of Elements, and has been in very successful and broad use for many years in other communities including cyber threat intelligence.


>>5) (My pet issue) although a namespace owner can choose anything as local_ids, Element also has information such as Class, name, and comment that can be used as a hint/label when serializing Element IDs.

I strongly agree with the value of identifiers incorporating a hint/label component. There are multiple possible ways to do this.

My earlier proposal was something like <namespace>--<hint>--<component id> but the order of hint and component id could be reversed and different separator characters could be used.
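
A small sketch of composing and splitting such an identifier, assuming “--” as the separator and assuming that neither the hint nor the component id may itself contain it. The function names and the sample values are illustrative only.

```python
SEP = "--"


def compose_id(namespace: str, hint: str, component_id: str) -> str:
    """Build <namespace>--<hint>--<component id>."""
    if SEP in hint or SEP in component_id:
        raise ValueError("hint and component id must not contain the separator")
    return SEP.join((namespace, hint, component_id))


def split_id(identifier: str) -> tuple:
    """Split from the right so a separator inside the namespace is harmless."""
    namespace, hint, component_id = identifier.rsplit(SEP, 2)
    return namespace, hint, component_id


identifier = compose_id(
    "https://example.org/spdx", "file", "3f2b8c1e-9d4a-4c6b-8a57-0e1d2c3b4a59"
)
print(identifier)
print(split_id(identifier))
```

Splitting from the right works because a UUID contains only single hyphens; a different separator or ordering would need its own rule.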


>>Building namespace and local_id into the model explicitly provides a foundation for discussing and accommodating all use cases.  IdString is an obstacle to that discussion.

I would strongly disagree with this assertion. Namespace and local id alone are inadequate as explained above.

The critical decision here is what an identifier consists of, what constraints apply to each identifier component, and how the identifier is structured when combining the components.

It is all about the fully composed identifier (idString).

 

 

My outline for identifier requirements would be something like the following:

 

  • All Element instances must have an identifier
  • All Element identifiers MUST be globally unique
  • All Element identifiers MUST be valid URIs supporting http resolution and redirection
  • All Element identifiers MUST consist, at least, of a URI-based namespace (controlled by the producer of the identifier and Element) and a component id that is guaranteed to be unique within the namespace
    • UUID is recommended for component id
      • UUID v4 MAY be used for randomized component ids
      • UUID v5 MAY be used for deterministic component ids
  • Element identifiers MAY contain a short descriptive label to convey human-readable information about the nature and/or purpose of the Element instance
  • The namespace, component id and label portions of an Element identifier MUST be explicitly separated using a consistent character. NOTE: we need to decide on this character and specify it
    • The separator character SHOULD NOT be ‘/’ as it would not explicitly disambiguate each identifier portion from the others
    • The separator character SHOULD NOT be ‘#’ as this would explicitly denote a fragment within the URI spec. It would denote that only the namespace portion could be resolved across http server infrastructure (including redirection) and that all processing of the component id must occur only on the client side. This would not support referencing or retrieval of individual elements across http (e.g. linked data).
    • Possibilities include ‘_’, ‘--’, etc.
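
To illustrate the UUID and ‘#’ points, here is a short sketch using Python's standard uuid and urllib.parse modules. The namespace URI, the file name used as the version 5 input and the “--” separator are assumptions made for the example.

```python
import uuid
from urllib.parse import urlsplit

NAMESPACE = "https://example.org/spdx"  # hypothetical producer namespace

# Version 4: random, a different value on every call.
random_component = uuid.uuid4()

# Version 5: deterministic, the same inputs always give the same value.
deterministic_component = uuid.uuid5(uuid.NAMESPACE_URL, NAMESPACE + "/foobar-1.0.tar.gz")

with_hash = urlsplit(f"{NAMESPACE}#{deterministic_component}")
with_sep = urlsplit(f"{NAMESPACE}--{deterministic_component}")

# With '#', the component id lands in the fragment, which http clients
# do not send to the server; only the namespace path can be resolved.
print("path:", with_hash.path, "| fragment:", with_hash.fragment)

# With a non-reserved separator the whole identifier stays in the path.
print("path:", with_sep.path, "| fragment:", repr(with_sep.fragment))
```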

 

 

Sean

 

Sean Barnum

C – 703-473-8262

sbarnum@...

We are here to change the world!


 

 

From: <Spdx-tech@...> on behalf of David Kemp <dk190a@...>
Date: Tuesday, August 3, 2021 at 2:34 PM
To: SPDX-list <Spdx-tech@...>
Subject: [EXT] [spdx-tech] Element IDs

 

Kate may be right that we'll need a whitepaper.  But not just yet.  Here is what I heard today:

1) namespaces must be globally unique, and UUIDs in practice collide, in part because they don't have a full 128 bits of uniqueness because they contain built-in structure, and in part because even 128 bits isn't enough.  So just for discussion/whitepaper purposes, assume that namespaces are 256 bit cryptographically-random values.

 

2) elements are always identified by namespace and local ID, where local means under the control of the namespace owner.  Don't get hung up on what owner means - anybody can become an owner by generating a 256 bit random number for their namespace.

 

3) Element identifiers are built into the model.  The diagram shows Element having an "idString" property, but what it should have is namespace and local_id properties that together form the primary key for the element.  Sebastian is correct, a separator isn't needed in the model because that only comes into play when serializing Element identifiers.

 

4) Each serialization of Element identifiers MUST allow them to be unambiguously deserialized back to namespace and local_id properties, which is where the separator comes in.  If the deserialized value doesn't have a namespace then the Element inherits it from its containing document, or if the element doesn't have a document or other source for namespace, the local_id by itself is invalid.

5) (My pet issue) although a namespace owner can choose anything as local_ids, Element also has information such as Class, name, and comment that can be used as a hint/label when serializing Element IDs.

Building namespace and local_id into the model explicitly provides a foundation for discussing and accommodating all use cases.  IdString is an obstacle to that discussion.

Dave
