Hi Jilayne, since you asked for input ASAP, here are a few immediate gut reactions :)
I think collecting data on the many different ways people have said "this code is released into the public domain" is _slightly_ useful, but not very useful. My guess is that there are a ton of variations that say substantively the same thing, but in ways that would be extremely difficult to meaningfully capture in a few categories with regexes / pattern matching.
If the goal is really to find one or a few different regular-expression-matchable phrases that would go on the license list in its current form and format, then maybe that would be helpful data. But I guess I'm skeptical that we would find those patterns in a way that fits the current approach to license IDs on the license list, without ending up with a hundred variations of basically the same thing.
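As a rough illustration of that skepticism, here is a minimal Python sketch. The sample statements and both patterns are made up for this example (they are not taken from any real scan and are not SPDX matching rules); the point is only that a pattern tight enough to be meaningful misses most phrasings, while a looser one still misses statements that never use the words "public domain".

```python
import re

# Hypothetical sample statements, invented for illustration only.
SAMPLES = [
    "This code is released into the public domain.",
    "I hereby place this file in the public domain.",
    "Public domain; do whatever you want with it.",
    "The author disclaims copyright to this source code.",
    "No rights reserved.",
]

# Two assumed patterns: one tied to a specific phrasing, one much looser.
NARROW = re.compile(r"released\s+into\s+the\s+public\s+domain", re.IGNORECASE)
BROAD = re.compile(r"public\s+domain", re.IGNORECASE)


def matching(samples, pattern):
    """Return the samples in which the pattern occurs anywhere."""
    return [s for s in samples if pattern.search(s)]


narrow_hits = matching(SAMPLES, NARROW)
broad_hits = matching(SAMPLES, BROAD)
print(f"narrow: {len(narrow_hits)} of {len(SAMPLES)}")
print(f"broad:  {len(broad_hits)} of {len(SAMPLES)}")
```

Even the looser pattern only finds the statements that happen to contain the literal phrase, and it would also fire on text that merely mentions the public domain without dedicating anything to it.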
Maybe I'm jumping ahead to "what are the options?" before getting the data, but it seems to me like there are basically 4 options for whether and how to capture public domain statements:
1. No change: Don't add "this is in the public domain" statements to the license list. People can use LicenseRef- identifiers if they want.
Pro: Maintains the current approach that the License List is for licenses with specific text.
Con: Doesn't solve the problem people are having: wanting to represent public domain statements generally with a common identifier.
2. Add a category ID to the spec: Alongside NONE and NOASSERTION as values defined in the SPDX spec, add PUBLIC-DOMAIN as another option defined in the spec rather than on the license list. Unlike NONE and NOASSERTION, PUBLIC-DOMAIN would presumably be usable in complex expressions (e.g. MIT AND PUBLIC-DOMAIN).
Pro: Provides a general identifier for public domain statements. Also maintains the current approach that the License List is for licenses with specific text.
Con: We're frankly too late to get this in as a substantive change for the SPDX 2.3 spec.
3. Add a category ID to the license list: Rather than changing the spec, add a category ID for "Public-Domain" (or similar) to the License List. Modify the license list schema somehow to indicate that this ID is meant to represent the collection of texts stating that a work is in the public domain, rather than one specific text.
Pro: Wouldn't be tied to a change to the spec. Would probably represent the way that most human users tend to think about public domain statements.
Con: Breaks expectations about all other License List entries, that they are tied to a particular text. Might also have implications for the SPDX spec that aren't coming to mind at the moment.
4. Add each statement individually: Add Public-Domain-1, Public-Domain-2, ... to the License List as separate entries, to capture every non-matching way of saying "this is in the public domain" that we run into.
Pro: Maintains the current approach that the License List is for licenses with specific text.
Con: Get ready for 700 new Public Domain entries on the license list :) Probably becomes unwieldy for humans to meaningfully make use of this.
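To make the difference between options 1 and 2 a bit more concrete from a tooling point of view, here is a small Python sketch of an operand check. Everything in it is an assumption for illustration: PUBLIC-DOMAIN is not a defined SPDX value today, the function name is_allowed is made up, and the check only looks at operands joined by AND / OR rather than implementing the real SPDX expression grammar.

```python
# Values the spec defines that stand alone rather than inside expressions.
STANDALONE_ONLY = {"NONE", "NOASSERTION"}
# Hypothetical spec-defined value under option 2; not part of SPDX today.
SPEC_OPERANDS = {"PUBLIC-DOMAIN"}
# Simplification: only AND / OR are handled here.
OPERATORS = {"AND", "OR"}


def is_allowed(expression, license_list_ids):
    """Rough operand check; does not validate the expression's grammar."""
    tokens = expression.replace("(", " ").replace(")", " ").split()
    operands = [t for t in tokens if t not in OPERATORS]
    if not operands:
        return False
    # NONE / NOASSERTION are only acceptable as the entire value.
    if len(tokens) == 1 and tokens[0] in STANDALONE_ONLY:
        return True
    for operand in operands:
        if operand in STANDALONE_ONLY:
            return False
        known = (
            operand in license_list_ids
            or operand in SPEC_OPERANDS
            or operand.startswith("LicenseRef-")
        )
        if not known:
            return False
    return True


print(is_allowed("MIT AND PUBLIC-DOMAIN", {"MIT"}))
print(is_allowed("MIT AND LicenseRef-public-domain-1", {"MIT"}))
print(is_allowed("MIT AND NONE", {"MIT"}))
```

Under option 1 the SPEC_OPERANDS set would simply be empty and only the LicenseRef- form would pass; under option 3 or 4 the new identifiers would instead show up in license_list_ids.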
Definitely open to other options, but these are the ones that come to mind offhand. (And the above is intentionally ignoring public domain dedications that really do have a set standard text, such as CC-PDDC.)
Personally, if we weren't about to have SPDX 2.3 released imminently, I'd probably lean towards option 2. Given that it is about to be released, I could be persuaded to consider option 3, though I suspect we would need significant input from the tooling community as to whether this breaks too many current expectations on their side.
Steve