Captain of the Ship


David Kemp
 

We discussed whether elements should be serialized as maps or arrays, and I provided an example map serialization for discussion.  The two serialization formats are equivalent, in that they deserialize to identical logical nodes.  But the discussion highlighted some practical distinctions:

1) Members of a map are pre-indexed by IRI, while an array must be searched member by member to find the element with a specified IRI.  Because looking up element references is a common operation, the first step after receiving an array of elements would be to build an index from IRI to element position in the array.

2) In order to find the captain of a ship with 1000 rooms, you'd need to search each room to look for someone wearing a captain's uniform.  Likewise, in order to find an SBOM element in an array of 1000 elements, you'd need to examine all elements to determine which one(s) are the SBOM type.  That's true whether the 1000 elements are serialized as a map or an array.  BUT, if the 1000 elements were serialized as a map AND a rootElements property existed to list the SBOM IRI(s), no searching would be required: the map points directly to the captain.

Conclusion: serialization as a map doesn't help finding the captain if the captain's ID isn't specified along with the map.  But if the captain's ID is specified, map serialization is hugely more efficient than having to search 1000 elements in an array to find that ID.
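
To make the two cases concrete, here is a minimal Python sketch. The element data is a toy stand-in with made-up ids, and rootElements is the hypothetical property mentioned above, not something defined in any schema.

```python
# Toy stand-in for a received array of elements (ids are illustrative).
elements = [
    {"id": "artifacts:a", "type": {"file": {}}},
    {"id": "artifacts:b", "type": {"package": {}}},
    {"id": "sboms:s", "type": {"sbom": {"elements": ["artifacts:a", "artifacts:b"]}}},
]

# Point 1: with an array, the first step is to build an index
# from IRI to position in the array (one pass over all members).
index = {element["id"]: position for position, element in enumerate(elements)}

# Point 2, without a root pointer: every element must be examined
# to find the one(s) of SBOM type, whether the input is a map or an array.
sbom_ids = [e["id"] for e in elements if "sbom" in e["type"]]

# Point 2, with a map plus a hypothetical rootElements property:
# the captain is found by direct lookup, with no scan.
document = {
    "rootElements": ["sboms:s"],
    "elements": {e["id"]: e for e in elements},
}
captains = [document["elements"][iri] for iri in document["rootElements"]]
```

The scan in the middle costs time proportional to the number of elements; the final lookup costs one dictionary access per root IRI.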

In any case, here is the JSON-serialized array equivalent of the previous map example, along with listing the 5 default properties at the top level instead of nested in a "defaults" property:

{
  "namespace": "urn:acme.dev:",
  "createdBy": ["identities:fred"],
  "created": "2022-04-05T22:00:00Z",
  "specVersion": "3.0",
  "profiles": ["Core", "Software"],
  "dataLicense": "CC0-1.0",
  "elementValues": [
    {
      "id": "artifacts:gnu-coreutils/v9.1/src/du.c",
      "type": {
        "file": {
          "filePurpose": ["APPLICATION", "SOURCE"]
        }
      }
    },
    {
      "id": "artifacts:gnu-coreutils/v9.1/src/echo.c",
      "type": {
        "file": {
          "filePurpose": ["APPLICATION", "SOURCE"]
        }
      }
    },
    {
      "id": "artifacts:gnu-coreutils/v9.1",
      "type": {
        "package": {
          "packagePurpose": ["APPLICATION", "SOURCE"],
          "downloadLocation": "http://mirror.rit.edu/gnu/coreutils/coreutils-9.1.tar.gz",
          "homePage": "https://www.gnu.org/software/coreutils/"
        }
      },
      "name": "GNU Coreutils"
    },
    {
      "id": "relationships:gnu-coreutils/v9.1",
      "type": {
        "relationship": {
          "relationshipType": "CONTAINS",
          "from": "urn:acme.dev:artifacts:gnu-coreutils/v9.1",
          "to": [
            "artifacts:gnu-coreutils/v9.1/src/du.c",
            "artifacts:gnu-coreutils/v9.1/src/echo.c"
          ]
        }
      }
    },
    {
      "id": "identities:fred",
      "type": {
        "actor": {}
      },
      "identifiedBy": [{"email": "fred@..."}]
    },
    {
      "id": "sboms:gnu-coreutils/v9.1",
      "type": {
        "sbom": {
          "elements": [
            "artifacts:gnu-coreutils/v9.1/src/du.c",
            "artifacts:gnu-coreutils/v9.1/src/echo.c",
            "artifacts:gnu-coreutils/v9.1",
            "relationships:gnu-coreutils/v9.1",
            "identities:fred"
          ]
        }
      }
    }
  ]
}

Regards,
David


Gary O'Neall
 

One additional consideration that came up in the 2.X discussion was how to handle the type for the elements.

In David's example, the type is one of the properties. For 2.X, we implemented separate arrays for each type. For some of the JSON serialization libraries, this affords a significant convenience when deserializing into objects of the same type.

Note that this isn't an issue for JSON-LD or RDF serialization formats which natively handle types.
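
As a rough illustration of that convenience, here is a Python sketch. The class names, field names, and the "files"/"packages" keys are assumptions made for this example, not the actual 2.X or 3.0 schema.

```python
import json
from dataclasses import dataclass


@dataclass
class File:
    id: str


@dataclass
class Package:
    id: str
    name: str


# Grouped-by-type layout: each top-level array holds one type only.
grouped = json.loads("""
{
  "files": [{"id": "artifacts:du.c"}, {"id": "artifacts:echo.c"}],
  "packages": [{"id": "artifacts:coreutils", "name": "GNU Coreutils"}]
}
""")

# Each array maps straight onto one class; no per-element type dispatch.
files = [File(**item) for item in grouped["files"]]
packages = [Package(**item) for item in grouped["packages"]]
```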

Gary



On July 20, 2022 11:57:01 AM CDT, David Kemp <dk190a@...> wrote:


William Bartholomew (CELA)
 

Unfortunately, that one is a two-edged sword. If you don’t know the type (e.g. you’re trying to look something up by ID) then you need to search through all the types to find the ID. Conversely, if you want to find everything of a certain type then grouping by type is beneficial.

 

I’d lean towards not grouping by type because you can always create a type->id mapping when deserializing. Given that we’ll have more types with profiles, I think grouping by type will have more downsides than upsides.
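
A minimal Python sketch of building that mapping in one pass while deserializing a flat array. The element shape follows David's example; the ids are shortened and purely illustrative.

```python
from collections import defaultdict

# Flat array in the shape of David's example (ids shortened for illustration).
element_values = [
    {"id": "artifacts:du.c", "type": {"file": {}}},
    {"id": "artifacts:echo.c", "type": {"file": {}}},
    {"id": "artifacts:coreutils", "type": {"package": {}}},
    {"id": "sboms:coreutils", "type": {"sbom": {}}},
]

by_id = {}
ids_by_type = defaultdict(list)
for element in element_values:
    by_id[element["id"]] = element
    # In this layout "type" is an object with exactly one key: the type name.
    # The unpacking raises ValueError if there is not exactly one key.
    (type_name,) = element["type"]
    ids_by_type[type_name].append(element["id"])
```

Both lookups (by ID and by type) are then direct, regardless of how the elements were grouped on the wire.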

 

William

 

From: Spdx-tech@... <Spdx-tech@...> On Behalf Of Gary O'Neall via lists.spdx.org
Sent: Wednesday, July 20, 2022 10:26 AM
To: Spdx-tech@...; David Kemp <dk190a@...>; SPDX-list <Spdx-tech@...>
Subject: [EXTERNAL] Re: [spdx-tech] Captain of the Ship

 



William Bartholomew (CELA)
 

There’s a meta-question here that we need to answer related to JSON serialization: would SPDX 3.0 support JSON and JSON-LD, just JSON, or just JSON-LD? I’d lean towards JSON-LD as long as we have a purely mechanical upgrade process from SPDX 2.x JSON to SPDX 3.x JSON-LD. If we adopt JSON-LD then a number of serialization design questions already have answers, and it is still parseable as JSON.

 

 

Regards,

 

William Bartholomew (he/him)

Principal Security Strategist

Global Cybersecurity Policy – Microsoft

 

My working day may not be your working day. Please don’t feel obliged to reply to this e-mail outside of your normal working hours.

 

From: Spdx-tech@... <Spdx-tech@...> On Behalf Of William Bartholomew (CELA) via lists.spdx.org
Sent: Thursday, July 21, 2022 12:16 PM
To: gary@...; Spdx-tech@...; David Kemp <dk190a@...>
Subject: [EXTERNAL] Re: [spdx-tech] Captain of the Ship

 



Gary O'Neall
 

> Unfortunately, that one is a two-edged sword. If you don’t know the type (e.g. you’re trying to look something up by ID) then you need to search through all the types to find the ID. Conversely, if you want to find everything of a certain type then grouping by type is beneficial.

 

Good point if you’re using the serialization format to represent your internal storage of the graph.  In all my SPDX software, I use a different internal representation of the SPDX graph than what is represented in the serialization format, so this particular situation never comes up.  This brings up another meta-issue – should we be optimizing the serialization format to be used as an internal storage format, or optimizing it for deserialization and reserialization?  If the latter, then having arrays of types is much easier IMHO.  If you go the type property route, all the deserializers I’m familiar with would require writing custom deserialization code, whereas the arrays can be handled with just off-the-shelf libraries.  I’m happy to be proven wrong on this point if anyone knows of a deserializer for JSON (not JSON-LD) that can understand the type property.

 

To your second meta-issue, below are my thoughts based on past experience maintaining some of the SPDX tooling:

 

  • If we ONLY support JSON-LD, a number of issues go away and the tooling is vastly simplified.
  • Supporting JSON-LD and the RDF dialects is just slightly more complicated for the tooling, since JSON-LD can be viewed as another dialect of RDF.
  • Supporting YAML and/or XML introduces some of the same issues as supporting a simplified JSON format.  If we support one of these, we might as well support all IMHO.
  • Tag/Value has its own set of (rather large) complexities.
  • Spreadsheets have a similar set of complexities as Tag/Value, but they are distinct enough that there isn’t much leverage between solving both at the same time.  I will be using spreadsheets myself, so I’ll probably continue to support some type of spreadsheet format in 3.0 if it is at all feasible.

 

Gary

 

 

From: Spdx-tech@... <Spdx-tech@...> On Behalf Of William Bartholomew (CELA) via lists.spdx.org
Sent: Thursday, July 21, 2022 12:20 PM
To: William Bartholomew (CELA) <willbar@...>; gary@...; Spdx-tech@...; David Kemp <dk190a@...>
Subject: Re: [spdx-tech] Captain of the Ship

 



David Kemp
 

> Good point if you’re using the serialization format to represent your internal storage of the graph.  In all my SPDX software, I use a different internal representation of the SPDX graph than what is represented in the serialization format so this particular situation never comes up.

+1.

It is not just a matter of your software; it is a fundamental design question whether to maintain separation between the logical model and its serializations.  Maintaining separation shouldn't be a matter of personal preference; it's good software engineering.  The OWL Web Ontology Language overview (https://www.w3.org/TR/owl2-overview/) has an excellent diagram illustrating the separation between semantics and syntax.  Several serializations are defined in OWL (Manchester Syntax, Functional Syntax, RDF/XML, OWL/XML, and Turtle), and more syntaxes have been added since (JSON-LD, RDF-star, ...).

The SPDX graph should have an opaque internal representation; developers should be able to implement it in any programming language using any variable types or classes supported by the language. The software just needs to be able to set and get the value of every property of every element in the graph irrespective of the data formats and structures used to serialize them.
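
As one possible illustration (a sketch under assumptions, not a proposed API), the Python below keeps the graph's storage private and exposes only get and set. The class name, the loader names, and the sample element are all hypothetical.

```python
class Graph:
    """Logical model: properties of elements addressed by IRI.

    The storage (a dict here) is an implementation detail; callers only
    use get() and set().
    """

    def __init__(self):
        self._props = {}

    def set(self, iri, prop, value):
        self._props.setdefault(iri, {})[prop] = value

    def get(self, iri, prop):
        return self._props[iri][prop]


def from_array(items):
    # Array serialization: each member carries its own "id".
    graph = Graph()
    for item in items:
        for prop, value in item.items():
            if prop != "id":
                graph.set(item["id"], prop, value)
    return graph


def from_map(members):
    # Map serialization: the key is the IRI.
    graph = Graph()
    for iri, props in members.items():
        for prop, value in props.items():
            graph.set(iri, prop, value)
    return graph


as_array = [{"id": "artifacts:coreutils", "name": "GNU Coreutils"}]
as_map = {"artifacts:coreutils": {"name": "GNU Coreutils"}}
g1, g2 = from_array(as_array), from_map(as_map)
```

Either loader produces a graph that answers the same get calls, so code written against the graph does not need to know which serialization was received.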

> I’m happy to be proven wrong on this point if anyone knows of a deserializer for JSON (not JSON-LD) that can understand the type property.

The design convention that works for us is to just make type a normal property, since JSON doesn't have anything except normal properties :-):

type: {
    package: { ... package properties ...}
}

The schema says type is oneOf the schemas for each of the possible types:

"ElementType": {
  "type": "object",
  "additionalProperties": false,
  "minProperties": 1,
  "maxProperties": 1,
  "properties": {
    "annotation": {"$ref": "#/definitions/Annotation"},
    "relationship": {"$ref": "#/definitions/Relationship"},
    "identity": {"$ref": "#/definitions/Identity"},

    ...
  }
}
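
With a plain JSON library this convention still needs a small dispatch step after parsing. A minimal Python sketch follows; the classes, their fields, and the TYPE_CLASSES registry are hypothetical and cover only two of the types from the earlier example.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    packagePurpose: list = field(default_factory=list)
    downloadLocation: str = ""
    homePage: str = ""


@dataclass
class File:
    filePurpose: list = field(default_factory=list)


# Hypothetical registry from type name to class.
TYPE_CLASSES = {"package": Package, "file": File}


def parse_type(type_value):
    # Mirror minProperties/maxProperties = 1: exactly one key is allowed.
    if len(type_value) != 1:
        raise ValueError("type must have exactly one property")
    ((name, props),) = type_value.items()
    if name not in TYPE_CLASSES:
        raise ValueError(f"unknown element type: {name}")
    return TYPE_CLASSES[name](**props)


parsed = parse_type({"file": {"filePurpose": ["APPLICATION", "SOURCE"]}})
```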

Regards,
David


On Thu, Jul 21, 2022 at 3:52 PM Gary O'Neall <gary@...> wrote:

> Unfortunately, that one is a two-edged sword. If you don’t know the type (e.g. you’re trying to look something up by ID) then you need to search through all the types to find the ID. Conversely, if you want to find everything of a certain type then grouping by type is beneficial.

 

Good point if you’re using the serialization format to represent your internal storage of the graph.  In all my SPDX software, I use a different internal representation of the SPDX graph than what is represented in the serialization format so this particular situation never comes up.  This brings up another meta-issue – should we be optimizing the serialization format to be used as an internal storage format or optimizing it for deserialization and reserialization?  If the latter, than having arrays of types is much easier IMHO.  If you go the type property route, all the deserializers I’m familiar with would require writing custom deserialization code whereas using the arrays can use just the of the shelf libraries.  I’m happy to be proven wrong on this point if anyone knows of a deserializer for JSON (not JSON-LD) that can understand the type property.

 

To your second meta issue, Below are my thoughts based on past experience maintaining some of the SPDX tooling:

 

  • If we ONLY support JSON-LD, a number of issues go away and the tooling is vastly simplified.
  • Supporting JSON-LD and the RDF dialects are just slightly more complicated for the tooling since JSON-LD can be viewed as another dialect of RDF.
  • Supporting YAML and/or XML introduces some of the same issues as supporting a simplified JSON format.  If we support one of these, we might as well support all IMHO.
  • Tag/Value is it’s own set of (rather large) complexities.
  • Spreadsheets have a similar set of complexities as Tag/Value, but they are distinct enough that there isn’t much leverage between solving both at the same time.  I will be using spreadsheets myself, so I’ll probably continue to support some type of spreadsheet format in 3.0 if it is at all feasible.

 

Gary

 

 

From: Spdx-tech@... <Spdx-tech@...> On Behalf Of William Bartholomew (CELA) via lists.spdx.org
Sent: Thursday, July 21, 2022 12:20 PM
To: William Bartholomew (CELA) <willbar@...>; gary@...; Spdx-tech@...; David Kemp <dk190a@...>
Subject: Re: [spdx-tech] Captain of the Ship

 

There’s a meta-question here that we need to answer related to JSON serialization, would SPDX 3.0 support JSON and JSON-LD, just JSON, or just JSON-LD? I’d lean towards JSON-LD as long as we have a purely mechanical upgrade process from SPDX 2.x JSON to SPDX 3.x JSON-LD. If we adopt JSON-LD then a number of serialization design questions already have answers, and it is still parseable as JSON.

 

 

Regards,

 

William Bartholomew (he/him) – Let’s chat

Principal Security Strategist

Global Cybersecurity Policy – Microsoft

 

My working day may not be your working day. Please don’t feel obliged to reply to this e-mail outside of your normal working hours.

 

From: Spdx-tech@... <Spdx-tech@...> On Behalf Of William Bartholomew (CELA) via lists.spdx.org
Sent: Thursday, July 21, 2022 12:16 PM
To: gary@...; Spdx-tech@...; David Kemp <dk190a@...>
Subject: [EXTERNAL] Re: [spdx-tech] Captain of the Ship

 

Unfortunately, that one is a two-edged sword. If you don’t know the type (e.g. you’re trying to look something up by ID) then you need to search through all the types to find the ID. Conversely, if you want to find everything of a certain type then grouping by type is beneficial.

 

I’d lean towards not grouping by type because you can always create a type->id mapping when deserializing. Given that we’ll have more types with profiles, I think grouping by type will have more downsides than upsides.

 

William

 

From: Spdx-tech@... <Spdx-tech@...> On Behalf Of Gary O'Neall via lists.spdx.org
Sent: Wednesday, July 20, 2022 10:26 AM
To: Spdx-tech@...; David Kemp <dk190a@...>; SPDX-list <Spdx-tech@...>
Subject: [EXTERNAL] Re: [spdx-tech] Captain of the Ship

 

One additional consideration that came up in the 2.X discussion was how to handle the type for the elements.

In David's example, the type is one of the properties. For 2.X, we implemented separate arrays for each type. For some of the JSON serialization libraries, this affords a significant convenience when deserializing into objects of the same type.

Note that this isn't an issue for JSON-LD or RDF serialization formats which natively handle types.

Gary

On July 20, 2022 11:57:01 AM CDT, David Kemp <dk190a@...> wrote:

We discussed whether elements should be serialized as maps or arrays, and I provided an example map serialization for discussion.  The two serialization formats are equivalent, in that they deserialize to identical logical nodes.  But the discussion highlighted some practical distinctions:

1) Members of a map are pre-indexed by IRI, while an array must be searched member by member to find the element with a specified IRI.  Because looking up element references is a common operation, the first step after receiving an array of elements would be to build an index from IRI to element position in the array.

2) In order to find the captain of a ship with 1000 rooms, you'd need to search each room to look for someone wearing a captain's uniform.  Or in order to find an SBOM element in an array of 1000 elements, you'd need to examine all elements to determine which one(s) are the SBOM type.  That's true whether the 1000 elements are serialized as a map or an array.  BUT, if the 1000 elements were serialized as a map AND a rootElements property existed to list the SBOM IRI(s), no searching is required, the map points directly to the captain.

Conclusion: serialization as a map doesn't help finding the captain if the captain's ID isn't specified along with the map.  But if the captain's ID is specified, map serialization is hugely more efficient than having to search 1000 elements in an array to find that ID.
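
To make the two points concrete, here is a minimal Python sketch under stated assumptions: the "rootElements" property is only the hypothetical property proposed above, and the helper name build_iri_index is invented for illustration. It builds the IRI-to-position index for an array, then goes straight to the SBOM without examining element types.

```python
def build_iri_index(element_values):
    """One pass over the array: IRI -> position of that element in the array."""
    return {element["id"]: position
            for position, element in enumerate(element_values)}


document = {
    # Hypothetical property naming the "captain" (the SBOM IRI or IRIs).
    "rootElements": ["sboms:gnu-coreutils/v9.1"],
    "elementValues": [
        {"id": "artifacts:gnu-coreutils/v9.1", "type": {"package": {}}},
        {"id": "identities:fred", "type": {"actor": {}}},
        {"id": "sboms:gnu-coreutils/v9.1", "type": {"sbom": {"elements": []}}},
    ],
}

iri_index = build_iri_index(document["elementValues"])

# With the root IRIs given, no element has to be inspected to find the SBOM.
roots = [document["elementValues"][iri_index[iri]]
         for iri in document["rootElements"]]
```

With a map serialization the index-building step disappears, because the received map already is the index.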

In any case, here is the JSON-serialized array equivalent of the previous map example, along with listing the 5 default properties at the top level instead of nested in a "defaults" property:

{
  "namespace": "urn:acme.dev:",
  "createdBy": ["identities:fred"],
  "created": "2022-04-05T22:00:00Z",
  "specVersion": "3.0",
  "profiles": ["Core", "Software"],
  "dataLicense": "CC0-1.0",
  "elementValues": [
    {
      "id": "artifacts:gnu-coreutils/v9.1/src/du.c",
      "type": {
        "file": {
          "filePurpose": ["APPLICATION", "SOURCE"]
        }
      }
    },
    {
      "id": "artifacts:gnu-coreutils/v9.1/src/echo.c",
      "type": {
        "file": {
          "filePurpose": ["APPLICATION", "SOURCE"]
        }
      }
    },
    {
      "id": "artifacts:gnu-coreutils/v9.1",
      "type": {
        "package": {
          "packagePurpose": ["APPLICATION", "SOURCE"],
          "downloadLocation": "http://mirror.rit.edu/gnu/coreutils/coreutils-9.1.tar.gz",
          "homePage": "https://www.gnu.org/software/coreutils/"
        }
      },
      "name": "GNU Coreutils"
    },
    {
      "id": "relationships:gnu-coreutils/v9.1",
      "type": {
        "relationship": {
          "relationshipType": "CONTAINS",
          "from": "urn:acme.dev:artifacts:gnu-coreutils/v9.1",
          "to": [
            "artifacts:gnu-coreutils/v9.1/src/du.c",
            "artifacts:gnu-coreutils/v9.1/src/echo.c"
          ]
        }
      }
    },
    {
      "id": "identities:fred",
      "type": {
        "actor": {}
      },
      "identifiedBy": [{"email": "fred@..."}]
    },
    {
      "id": "sboms:gnu-coreutils/v9.1",
      "type": {
        "sbom": {
          "elements": [
            "artifacts:gnu-coreutils/v9.1/src/du.c",
            "artifacts:gnu-coreutils/v9.1/src/echo.c",
            "artifacts:gnu-coreutils/v9.1",
            "relationships:gnu-coreutils/v9.1",
            "identities:fred"
          ]
        }
      }
    }
  ]
}
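
The claim that the two forms are equivalent can be sketched in a few lines of Python. This assumes the map form keys each element by its "id" and carries the remaining properties as the value; the function names are illustrative only.

```python
def array_to_map(element_values):
    """Key each element by its id; the remaining properties become the value."""
    return {
        element["id"]: {k: v for k, v in element.items() if k != "id"}
        for element in element_values
    }


def map_to_array(element_map):
    """Inverse of array_to_map: put the key back as the "id" property."""
    return [{"id": iri, **props} for iri, props in element_map.items()]


element_values = [
    {"id": "artifacts:gnu-coreutils/v9.1",
     "type": {"package": {"packagePurpose": ["APPLICATION", "SOURCE"]}},
     "name": "GNU Coreutils"},
    {"id": "identities:fred", "type": {"actor": {}}},
]

as_map = array_to_map(element_values)
round_tripped = map_to_array(as_map)
```

Converting in either direction loses nothing, which is what "deserialize to identical logical nodes" amounts to in practice.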

Regards,
David

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


Dick Brooks
 

REA also uses an abstract in-memory SBOM model that is populated from SPDX and CycloneDX SBOMs of different formats.

 

Thanks,

 

Dick Brooks

 

Active Member of the CISA Critical Manufacturing Sector,

Sector Coordinating Council – A Public-Private Partnership

 

Never trust software, always verify and report!

http://www.reliableenergyanalytics.com

Email: dick@...

Tel: +1 978-696-1788

 

From: Spdx-tech@... <Spdx-tech@...> On Behalf Of David Kemp
Sent: Thursday, July 21, 2022 5:39 PM
To: Gary O'Neall <gary@...>
Cc: William Bartholomew (CELA) <willbar@...>; SPDX-list <Spdx-tech@...>
Subject: Re: [spdx-tech] Captain of the Ship

 

Good point if you’re using the serialization format to represent your internal storage of the graph.  In all my SPDX software, I use a different internal representation of the SPDX graph than what is represented in the serialization format so this particular situation never comes up.


+1.

It is not just a matter of your software; whether to maintain separation between the logical model and its serializations is a fundamental design question.  Maintaining that separation shouldn't be a matter of personal preference; it's good software engineering.  The OWL Web Ontology Language overview (https://www.w3.org/TR/owl2-overview/) has an excellent diagram illustrating the separation between semantics and syntax.  Several serializations are defined in OWL (Manchester Syntax, Functional Syntax, RDF/XML, OWL/XML, and Turtle), and more syntaxes have been added since (JSON-LD, RDF-star, ...).

The SPDX graph should have an opaque internal representation; developers should be able to implement it in any programming language using any variable types or classes supported by the language. The software just needs to be able to set and get the value of every property of every element in the graph irrespective of the data formats and structures used to serialize them.
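
A minimal Python sketch of that idea follows. The class name ElementGraph and its storage layout are hypothetical, chosen only to show that callers see get and set while the serialization (array or map) is produced separately.

```python
import json


class ElementGraph:
    """Hypothetical in-memory graph; the storage layout is private to the class."""

    def __init__(self):
        self._props = {}  # (element IRI, property name) -> value

    def set(self, iri, prop, value):
        self._props[(iri, prop)] = value

    def get(self, iri, prop):
        return self._props[(iri, prop)]

    def _elements(self):
        # Regroup the flat store into one property dict per element.
        grouped = {}
        for (iri, prop), value in self._props.items():
            grouped.setdefault(iri, {})[prop] = value
        return grouped

    def to_array_json(self):
        return json.dumps([{"id": iri, **props}
                           for iri, props in self._elements().items()])

    def to_map_json(self):
        return json.dumps(self._elements())


graph = ElementGraph()
graph.set("artifacts:gnu-coreutils/v9.1", "name", "GNU Coreutils")
```

Swapping the internal dictionary for a database or a set of typed classes would not change either serializer's output, which is the separation being argued for.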

 I’m happy to be proven wrong on this point if anyone knows of a deserializer for JSON (not JSON-LD) that can understand the type property.


The design convention that works for us is to just make type a normal property, since JSON doesn't have anything except normal properties :-):

type: {
    package: { ... package properties ... }
}

The schema says type is oneOf the schemas for each of the possible types:

"ElementType": {
  "type": "object",
  "additionalProperties": false,
  "minProperties": 1,
  "maxProperties": 1,
  "properties": {
    "annotation": {"$ref": "#/definitions/Annotation"},
    "relationship": {"$ref": "#/definitions/Relationship"},
    "identity": {"$ref": "#/definitions/Identity"},

    ...
  }
}
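
For what it's worth, the custom deserialization code this convention needs can be small. The Python sketch below is an assumption-laden illustration, not our actual tooling: the File and Package classes and the CLASS_FOR_TYPE table are invented here, and common properties such as "name" are ignored to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    id: str
    filePurpose: list = field(default_factory=list)


@dataclass
class Package:
    id: str
    packagePurpose: list = field(default_factory=list)
    downloadLocation: str = ""
    homePage: str = ""


# Hypothetical dispatch table: type member name -> class to construct.
CLASS_FOR_TYPE = {"file": File, "package": Package}


def deserialize_element(raw):
    """Build a typed object from the single member of the "type" property."""
    type_obj = raw["type"]
    if len(type_obj) != 1:
        # Mirrors minProperties/maxProperties of 1 in the schema.
        raise ValueError("type must have exactly one member")
    type_name, type_props = next(iter(type_obj.items()))
    return CLASS_FOR_TYPE[type_name](id=raw["id"], **type_props)


du = deserialize_element({
    "id": "artifacts:gnu-coreutils/v9.1/src/du.c",
    "type": {"file": {"filePurpose": ["APPLICATION", "SOURCE"]}},
})
```

The dispatch is a single dictionary lookup on the one key inside "type"; whether that counts as acceptable custom code is the judgment call under discussion.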


Regards,
David

 

On Thu, Jul 21, 2022 at 3:52 PM Gary O'Neall <gary@...> wrote:

> Unfortunately, that one is a two-edged sword. If you don’t know the type (e.g. you’re trying to look something up by ID) then you need to search through all the types to find the ID. Conversely, if you want to find everything of a certain type then grouping by type is beneficial.

 

Good point if you’re using the serialization format to represent your internal storage of the graph.  In all my SPDX software, I use a different internal representation of the SPDX graph than what is represented in the serialization format, so this particular situation never comes up.  This brings up another meta-issue: should we be optimizing the serialization format to be used as an internal storage format, or optimizing it for deserialization and reserialization?  If the latter, then having arrays of types is much easier IMHO.  If you go the type property route, all the deserializers I’m familiar with would require writing custom deserialization code, whereas the arrays can just use off-the-shelf libraries.  I’m happy to be proven wrong on this point if anyone knows of a deserializer for JSON (not JSON-LD) that can understand the type property.
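
To illustrate the convenience of per-type arrays, here is a small Python sketch. The property names and the two classes are simplified assumptions for illustration, not the actual SPDX 2.x field names.

```python
import json
from dataclasses import dataclass


@dataclass
class File:
    id: str
    name: str


@dataclass
class Package:
    id: str
    name: str


# Grouped-by-type layout: one top-level array per element type.
grouped_json = """
{
  "files": [{"id": "artifacts:gnu-coreutils/v9.1/src/du.c", "name": "du.c"}],
  "packages": [{"id": "artifacts:gnu-coreutils/v9.1", "name": "GNU Coreutils"}]
}
"""

doc = json.loads(grouped_json)

# Each array maps onto exactly one class, so no per-element dispatch is needed.
files = [File(**item) for item in doc["files"]]
packages = [Package(**item) for item in doc["packages"]]
```

The target class is known from the array an item sits in, which is what lets generic object mappers handle this layout without custom code.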

 

To your second meta issue, below are my thoughts based on past experience maintaining some of the SPDX tooling:

 

  • If we ONLY support JSON-LD, a number of issues go away and the tooling is vastly simplified.
  • Supporting JSON-LD and the RDF dialects is only slightly more complicated for the tooling, since JSON-LD can be viewed as another dialect of RDF.
  • Supporting YAML and/or XML introduces some of the same issues as supporting a simplified JSON format.  If we support one of these, we might as well support all IMHO.
  • Tag/Value has its own set of (rather large) complexities.
  • Spreadsheets have a similar set of complexities as Tag/Value, but they are distinct enough that there isn’t much leverage between solving both at the same time.  I will be using spreadsheets myself, so I’ll probably continue to support some type of spreadsheet format in 3.0 if it is at all feasible.

 

Gary

 

 



Sean Barnum
 

I agree with William.

 

Adding extra type-grouping layers to the serialization also brings other complexities that, in my opinion, outweigh their usefulness for the one use case of “if you want to find everything of a certain type”.

 

sean

 

From: Spdx-tech@... <Spdx-tech@...> on behalf of William Bartholomew (CELA) via lists.spdx.org <willbar=microsoft.com@...>
Date: Thursday, July 21, 2022 at 3:15 PM
To: gary@... <gary@...>, Spdx-tech@... <Spdx-tech@...>, David Kemp <dk190a@...>
Subject: [EXT] Re: [spdx-tech] Captain of the Ship

Unfortunately, that one is a two-edged sword. If you don’t know the type (e.g. you’re trying to look something up by ID) then you need to search through all the types to find the ID. Conversely, if you want to find everything of a certain type then grouping by type is beneficial.

 

I’d lean towards not grouping by type because you can always create a type->id mapping when deserializing. Given that we’ll have more types with profiles, I think grouping by type will have more downsides than upsides.

 

William

 



Sean Barnum
 

Honestly, JSON-LD is 100% JSON.

It just defines and requires some reserved terms.

 

With a properly defined JSON-LD context the same serialization could easily be both.

 

It could be serialized as simple JSON and the context does all of the work of providing the specific details for expanding the content into full JSON-LD and the user never even has to know it exists.

 

I strongly believe our default serialization should be JSON-LD. It provides a lot of advantages including clean and explicit alignment to the model; support by an entire ecosystem of RDF tooling including lossless translation to/from a wide array of serialization formats, validation, etc.; ability to support both JSON-LD and JSON as stated above, etc.
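
As a toy illustration of the point about contexts, the Python sketch below attaches an "@context" to a plain JSON object and maps short property names to IRIs. The example.org IRIs are hypothetical, and expand_terms is a deliberate simplification, not the JSON-LD expansion algorithm; real processing should use a JSON-LD library.

```python
import json

# Hypothetical vocabulary IRIs, for illustration only.
CONTEXT = {
    "name": "https://example.org/spdx-sketch/name",
    "created": "https://example.org/spdx-sketch/created",
}

plain = {"name": "GNU Coreutils", "created": "2022-04-05T22:00:00Z"}


def attach_context(doc, context):
    """Return a copy of a plain JSON object with an @context added."""
    return {"@context": context, **doc}


def expand_terms(doc):
    """Toy expansion: replace each key with the IRI its @context maps it to."""
    context = doc["@context"]
    return {context[key]: value
            for key, value in doc.items() if key != "@context"}


linked = attach_context(plain, CONTEXT)

# A consumer that only knows JSON still reads the same properties.
as_plain_json = json.loads(json.dumps(linked))
```

The document body is unchanged by adding the context, so a JSON-only consumer can ignore "@context" while a linked-data consumer uses it to recover full IRIs.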

 

sean

 

From: Spdx-tech@... <Spdx-tech@...> on behalf of William Bartholomew (CELA) via lists.spdx.org <willbar=microsoft.com@...>
Date: Thursday, July 21, 2022 at 3:20 PM
To: William Bartholomew (CELA) <willbar@...>, gary@... <gary@...>, Spdx-tech@... <Spdx-tech@...>, David Kemp <dk190a@...>
Subject: [EXT] Re: [spdx-tech] Captain of the Ship

There’s a meta-question here that we need to answer related to JSON serialization: would SPDX 3.0 support JSON and JSON-LD, just JSON, or just JSON-LD? I’d lean towards JSON-LD as long as we have a purely mechanical upgrade process from SPDX 2.x JSON to SPDX 3.x JSON-LD. If we adopt JSON-LD then a number of serialization design questions already have answers, and it is still parseable as JSON.

 

 

Regards,

 

William Bartholomew (he/him) – Let’s chat

Principal Security Strategist

Global Cybersecurity Policy – Microsoft

 

My working day may not be your working day. Please don’t feel obliged to reply to this e-mail outside of your normal working hours.

 



Sean Barnum
 

Agree. This is what I alluded to in a prior response.

Adding the type properties rather than a simple array makes serialization/deserialization more complicated and does not work with most standards-based ecosystem tooling.

 

sean

 

From: Spdx-tech@... <Spdx-tech@...> on behalf of Gary O'Neall <gary@...>
Date: Thursday, July 21, 2022 at 3:53 PM
To: willbar@... <willbar@...>, Spdx-tech@... <Spdx-tech@...>, 'David Kemp' <dk190a@...>
Subject: [EXT] Re: [spdx-tech] Captain of the Ship

> Unfortunately, that one is a two-edged sword. If you don’t know the type (e.g. you’re trying to look something up by ID) then you need to search through all the types to find the ID. Conversely, if you want to find everything of a certain type then grouping by type is beneficial.

 

Good point if you’re using the serialization format to represent your internal storage of the graph.  In all my SPDX software, I use a different internal representation of the SPDX graph than what is represented in the serialization format, so this particular situation never comes up.  This brings up another meta-issue: should we be optimizing the serialization format to be used as an internal storage format, or optimizing it for deserialization and reserialization?  If the latter, then having arrays of types is much easier IMHO.  If you go the type property route, all the deserializers I’m familiar with would require writing custom deserialization code, whereas the arrays can just use off-the-shelf libraries.  I’m happy to be proven wrong on this point if anyone knows of a deserializer for JSON (not JSON-LD) that can understand the type property.

 

To your second meta issue, below are my thoughts based on past experience maintaining some of the SPDX tooling:

 

  • If we ONLY support JSON-LD, a number of issues go away and the tooling is vastly simplified.
  • Supporting JSON-LD and the RDF dialects is only slightly more complicated for the tooling, since JSON-LD can be viewed as another dialect of RDF.
  • Supporting YAML and/or XML introduces some of the same issues as supporting a simplified JSON format.  If we support one of these, we might as well support all IMHO.
  • Tag/Value has its own set of (rather large) complexities.
  • Spreadsheets have a similar set of complexities as Tag/Value, but they are distinct enough that there isn’t much leverage between solving both at the same time.  I will be using spreadsheets myself, so I’ll probably continue to support some type of spreadsheet format in 3.0 if it is at all feasible.

 

Gary

 

 

From: Spdx-tech@... <Spdx-tech@...> On Behalf Of William Bartholomew (CELA) via lists.spdx.org
Sent: Thursday, July 21, 2022 12:20 PM
To: William Bartholomew (CELA) <willbar@...>; gary@...; Spdx-tech@...; David Kemp <dk190a@...>
Subject: Re: [spdx-tech] Captain of the Ship

 

There’s a meta-question here that we need to answer related to JSON serialization, would SPDX 3.0 support JSON and JSON-LD, just JSON, or just JSON-LD? I’d lean towards JSON-LD as long as we have a purely mechanical upgrade process from SPDX 2.x JSON to SPDX 3.x JSON-LD. If we adopt JSON-LD then a number of serialization design questions already have answers, and it is still parseable as JSON.

 

 

Regards,

 

William Bartholomew (he/him) – Let’s chat

Principal Security Strategist

Global Cybersecurity Policy – Microsoft

 

My working day may not be your working day. Please don’t feel obliged to reply to this e-mail outside of your normal working hours.

 

From: Spdx-tech@... <Spdx-tech@...> On Behalf Of William Bartholomew (CELA) via lists.spdx.org
Sent: Thursday, July 21, 2022 12:16 PM
To: gary@...; Spdx-tech@...; David Kemp <dk190a@...>
Subject: [EXTERNAL] Re: [spdx-tech] Captain of the Ship

 

Unfortunately, that one is a two-edged sword. If you don’t know the type (e.g. you’re trying to look something up by ID) then you need to search through all the types to find the ID. Conversely, if you want to find everything of a certain type then grouping by type is beneficial.

 

I’d lean towards not grouping by type because you can always create a type->id mapping when deserializing. Given that we’ll have more types with profiles, I think grouping by type will have more downsides than upsides.

 

William

 

From: Spdx-tech@... <Spdx-tech@...> On Behalf Of Gary O'Neall via lists.spdx.org
Sent: Wednesday, July 20, 2022 10:26 AM
To: Spdx-tech@...; David Kemp <dk190a@...>; SPDX-list <Spdx-tech@...>
Subject: [EXTERNAL] Re: [spdx-tech] Captain of the Ship

 

One additional consideration that came up in the 2.X discussion was how to handle the type for the elements.

In David's example, the type is one of the properties. For 2.X, we implemented separate arrays for each type. For some of the JSON serialization libraries, this affords a significant convenience when deserializing into objects of the same type.

Note that this isn't an issue for the JSON-LD or RDF serialization formats, which natively handle types.

Gary



David Kemp
 

I agree.

As I said earlier, if you don't serialize as a map, the first thing you would do after deserializing an array of elements is build a primary index (IRI: value).  It would be almost as easy to build a secondary index (type: [values]).  The first is a single line of Python; the second might take two.
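
As a minimal sketch of those two indexes, assuming the array layout from the example earlier in this thread, where each element's "type" is an object whose single key names the type (the variable names and the trimmed element list are illustrative):

```python
from collections import defaultdict

# Trimmed version of the "elementValues" array from the example message.
elements = [
    {"id": "artifacts:gnu-coreutils/v9.1/src/du.c",
     "type": {"file": {"filePurpose": ["APPLICATION", "SOURCE"]}}},
    {"id": "artifacts:gnu-coreutils/v9.1/src/echo.c",
     "type": {"file": {"filePurpose": ["APPLICATION", "SOURCE"]}}},
    {"id": "artifacts:gnu-coreutils/v9.1",
     "type": {"package": {}}, "name": "GNU Coreutils"},
    {"id": "sboms:gnu-coreutils/v9.1",
     "type": {"sbom": {"elements": ["artifacts:gnu-coreutils/v9.1"]}}},
]

# Primary index (IRI: value): a single dict comprehension.
by_iri = {element["id"]: element for element in elements}

# Secondary index (type: [values]): a short loop over the same array.
by_type = defaultdict(list)
for element in elements:
    type_name = next(iter(element["type"]))  # the single key under "type"
    by_type[type_name].append(element)

print(by_iri["artifacts:gnu-coreutils/v9.1"]["name"])
print([element["id"] for element in by_type["sbom"]])
```

With the secondary index in place, finding the SBOM element (the "captain") is a lookup of `by_type["sbom"]` rather than a scan of the whole array.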

David



On Tue, Jul 26, 2022 at 10:55 AM Sean Barnum <sbarnum@...> wrote:

I agree with William.

 

Adding extra type-grouping layers to the serialization also brings other complexities that, in my opinion, outweigh their usefulness for the one use case of “if you want to find everything of a certain type.”

 

sean

 
