Encode identifiers as dereferenceable HTTP URIs
I agree but I don't think there is a need to make any changes in the DCAT specification. It's an automatic consequence of DCAT being an RDF vocabulary.
Yes - that was my impression too. No changes necessary to satisfy this requirement.
This issue is strictly related to providing guidance on how to use DCAT to specify identifiers as DOIs, ISBNs, etc. - see the related use case.
Currently, these IDs are encoded as simple strings, unless they are used as part of the primary resource URI. One option would be to encourage the use of owl:sameAs whenever the ID is resolvable when encoded as a URI (as for DOIs).
So there may be no need to create new properties or classes, but rather to describe how to use the existing ones to address these use cases.
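A minimal sketch of the owl:sameAs option, assuming the DOI is resolved through the https://doi.org/ proxy. The dataset URI and the helper names `doi_to_uri` and `same_as_triple` are hypothetical and only serve the illustration; they are not part of DCAT.

```python
from urllib.parse import quote

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DOI_RESOLVER = "https://doi.org/"  # public DOI proxy


def doi_to_uri(doi: str) -> str:
    """Encode a bare DOI string as an HTTP URI via the DOI proxy."""
    # Keep "/" as-is; percent-encode anything not allowed in a URI path.
    return DOI_RESOLVER + quote(doi.strip(), safe="/")


def same_as_triple(dataset_uri: str, doi: str) -> str:
    """Return one N-Triples line linking the dataset to its DOI-based URI."""
    return f"<{dataset_uri}> <{OWL_SAME_AS}> <{doi_to_uri(doi)}> ."


# Hypothetical dataset URI, DOI taken from the example used in this thread.
print(same_as_triple("https://example.org/dataset/1", "10.1109/5.771073"))
```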
It could be part of DCAT guidance, maybe in the usage note of https://www.w3.org/TR/vocab-dcat/#Property:dataset_identifier? In fact, the current usage note only suggests that the "identifier might be used as part of the URI of the dataset" but it would be good to mention other identifiers in the usage note as well.
+1 from me. One of the options mentioned in the related use case is to use dct:identifier with a datatype denoting the identifier type (DOI, etc.). But these datatypes need to be defined. There's of course also the other option of using specific properties for each type of identifier (prism:doi, bibo:doi, etc.).
But for specifying multiple identifiers as HTTP URIs we need a property such as owl:sameAs, which would need to be added to the DCAT spec.
Should we add the 'documentation' tag for this requirement then?
The library world has struggled with this same problem. There are many identifiers that are not (yet) expressed as IRIs. As these are just alpha-numeric strings, there is a need to give a context so that they are meaningful/useful. This has led to some awkward models of identifiers being at least 2-part: the identifier string, and the "provenance" of the identifier. So although one should prefer IRI forms when available, what should be done with a string like "098378297" when it is the identifier from some agency? That's the hard part.
@kcoyle - could you provide a pointer to a catalogue from the library world with that situation? In those examples, is it not possible to get a description of the resource being identified at all?
For the case of life science data, which I imagine would also be applicable to other scientific domains, our paper "Identifiers for the 21st century: how to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data" (https://doi.org/10.1371/journal.pbio.2001414) presents the situation and could be a useful reference.
Here are some (there are about a dozen common ones) (type followed by example):
I'm not sure what you mean about getting a description of the resource - the bibliographic record is a description of the resource; what is problematic is that not all identifiers have a URI form. These are not "web-based identifiers" and I suspect that other data providers also have older identifiers that are not (yet) web-based. These are in library data. In addition, the identifiers for the records in a data file in library data are simple strings, like "##2001627090" (the octothorpes represent blanks). These are exported as URIs when the data is converted to RDF ("https://lccn.loc.gov/2015020100"), but the majority of the data is still not in RDF.
The concept of a "data catalog" is not applied to these files of records although there are sites that provide files of the records for download. This may or may not fit into the context of DCAT.
The European DCAT-AP includes a property adms:identifier with range adms:Identifier for Dataset. adms:Identifier is based on the UN/CEFACT Identifier class and consists of:
Whatever mechanism we decide on for handling identifiers, it should be comprehensive enough to be used in DCAT2 and also elsewhere, since we have the requirement of referring to alternate identifiers (non-HTTP URIs) for things like physical samples in SOSA ontology catalogues.
Discussed at some length in meeting https://www.w3.org/2018/02/07-dxwgdcat-minutes
AndreaPerego: main issue is that a number of other identifier systems are used for data citation, publishers, etc
... e.g. DataCite supports quite a few identifier systems
... DCAT-AP also discussed this at length
... agencies want to use their internal identifiers, not necessarily URIs
... may connect datasets using SPARQL queries, etc not just URIs
... what is needed by different communities is ability to specify different kinds of identifiers, and their type
... need to indicate that a string _is_ an identifier
... whenever possible make identifiers resolvable by _encoding_ them as URIs, but this does not apply to all identifier systems
... situation is quite complicated
... there are some other URI systems, but not necessarily resolvable
... case sensitivity is also an issue
... proposal made in UC is to try to address both issues
... 1. encode as http URIs where possible
... 2. encode as a string using dct:identifier property, and note the type of the identifier using ^^type indicator
... UC is about _providing guidance_ where standard RDF http URI does not apply
... also for how to use SPARQL queries, for example
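The two options in the proposal above can be sketched as follows. This is an illustration under stated assumptions, not an agreed encoding: the `RESOLVERS` table contains only the DOI proxy, and the datatype namespace `https://example.org/id-type/` is hypothetical, since the identifier-type datatypes have not been defined anywhere.

```python
# Resolver prefixes for identifier types that have an HTTP form.
# Only DOI is listed here; any further entries would be assumptions.
RESOLVERS = {"doi": "https://doi.org/"}

# Hypothetical namespace for identifier-type datatypes (not defined by DCAT).
ID_TYPE_NS = "https://example.org/id-type/"


def encode_identifier(value: str, id_type: str) -> str:
    """Return the Turtle object term for an identifier of the given type."""
    key = id_type.lower()
    prefix = RESOLVERS.get(key)
    if prefix is not None:
        # Option 1: the identifier has an HTTP form, so emit an IRI.
        return f"<{prefix}{value}>"
    # Option 2: keep the string and record its type as the literal's datatype.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{ID_TYPE_NS}{key}>'


print(encode_identifier("10.1109/5.771073", "DOI"))
print(encode_identifier("098378297", "agency"))
```

The first call yields an IRI that could be the object of owl:sameAs; the second yields a typed literal suitable as the object of dct:identifier.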
NicholasCar: We had the same issue with physical samples.
... recommendation there is to supplement identifier field with identifier-type
... need a comprehensive schema for alternative identifiers
SimonCox: Makx suggested looking at ADMS.
... https://www.w3.org/TR/vocab-adms/#identifier
SimonCox: There's an adms:Identifier class there.
SimonCox: It is based on UN/CEFACT. So it appears that this fulfills the proposal you made, AndreaPerego.
... Adopt or clone adms:Identifier
AndreaPerego: alternative proposals: PRISM, BIBO,
... specific fields for well-known identifier schemes, e.g. bibo:DOI
... these are already used by some important services, e.g. crossRef
... need to explain how these different approaches map to each other
Proposal: promote adms:Identifier to DCAT
I suggest we clone ADMS identifier or use an ontology addressing only identifiers such as http://data.press.net/ontology/identifier/ or http://ows.usersmarts.com/owldocgen/owldoc?url=http://www.opengis.net/ont/common/identifier# . The reason to use a micro ontology for just identifiers is that it can be reused for many other purposes. It would be nice to submit this small ontology for standardization by W3C.
I like! I think the first micro-ontology needs a few more things though: notes on identifier formats, whether they are structured or opaque strings, etc. These could be optional.
On this topic, one of my priorities at GS1 now is to make barcodes dereferenceable. In more formal terms, we're defining how GTINs (the numbers you see beneath a barcode) and our other less well-known identifiers, can be encoded in HTTP URIs. I mention it here because there is a close relationship between our GTINs and ISBN and ISSN (ISBNs all begin 978 or 979 but are part of the EAN/UPC/GTIN world). Therefore, if this WG has use cases for dereferenceable ISBNs, I'd be pleased to know, especially if you have any idea where they should dereference to!
In fact, you could call the definition of adms:Identifier a micro-ontology: it defines a class and a set of properties to describe it, plus a note that "_it may also be useful to provide further properties_".
Therefore, if this WG has use cases for dereferenceable ISBNs, I'd be pleased to know, especially if you have any idea where they should dereference to!
I don't have a use case, but my first reaction would be that ISBNs should dereference to the national library for the jurisdiction where the publication was published (or where the publisher is located). But maybe I'm biased since I work in a national library...
Also relevant to this discussion is the schema.org discussion on identifiers and the sdo:identifier term.
I've been involved with the International GeoSample Numbers (IGSN), which are Handles that resolve on HTTP via igsn.org. IGSNs were created a number of years ago because the effort to get DOIs for physical (geo) samples was too great.
ISBNs could implement a Handle network including Handle nodes at various ISBN minting agencies (national libraries) around the world and then an HTTP resolver, perhaps isbn.org or similar (the USA is currently hogging that!).
@nicholascar yes, resolvers in the Handle style are along the lines we're thinking for our GTINs and other identifiers, but it's early days yet in terms of process and thinking. And @larsgsvensson, yes, I'd immediately thought of national libraries as a good end point. I believe the ISBN identifier space is not as clear cut as one would like but that's life. Again, early days for what I'm working on. @agbeltran yes, that discussion about identifiers on schema.org is very reminiscent of the ADMS discussion that @makxdekkers referred to.
Ever get the feeling there's nothing new under the Sun?
@philarcher
I believe the ISBN identifier space is not as clear cut as one would like but that's life
Of course it isn't... Some thoughts on resolution of urn:isbn that might apply to ISBN resolution in general can be found in the ISBN URN namespace registration at IANA (Section "Resolution")
LinkedIn use a helpful pattern for their URIs that is a hybrid incorporating urn-style components:
https://www.linkedin.com/feed/update/urn:li:activity:6442389840137842688
Slightly OT: This seems a clear case of URN namespace squatting, since the urn namespace li isn't registered with IANA...
The latest DCAT meeting discussed the likelihood that we can progress this adequately for the second public working draft. After some discussion, the consensus of the (small) meeting was that versioning might influence our views on identifiers enough to make any limited update somewhat confusing. Inclusion in 2PWD to be kept under review.
Dropping here a link to Wikidata identifiers as it was mentioned today in the F2F: https://www.wikidata.org/wiki/Wikidata:Identifiers
after a review of the discussion, it looks like there are two proposals:
ADMS kind of approach-- identifiers have a datatype like skos:notation, i.e. typed literal, and the value for the typed literal is the identifier type. e.g.
dcat:identifier "978-3-16-148410-0"^^<https://www.iso.org/standard/36563.html>
It's not clear to me how ADMS would serialize the other properties (version and managing authority)
schema.org, ISO19115, DATS approach-- make identifier an object/class with a code property (the identifier string), a scheme property, maybe an authority property.
Personally I think the second approach is more transparent and widely used.
Schema.org implements the identifier as a PropertyValue, which obfuscates things;
DATS uses 'identifier' and 'identifierSource' as the property names;
ISO19115-1 uses 'code', 'codespace', and 'version', with a citation for the 'authority'
DataCite has 'identifier' and 'identifierType'
proposal:
class: dcat:identifier
Properties:
dcat:code -- the identifier string; for a well formed URI this would be all that's necessary
dcat:identifierType -- literal or URI
dcat:version -- literal
authority -- foaf:Organization
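To make the shape of this proposal concrete, here is a minimal sketch of the structured identifier as a plain data record. The Python field names mirror the proposed properties but are otherwise hypothetical, and the authority is kept as a simple string (a name or URI) for brevity.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Identifier:
    """Structured identifier following the proposal above (illustrative only)."""

    code: str                              # the identifier string itself
    identifier_type: Optional[str] = None  # literal or URI naming the scheme
    version: Optional[str] = None          # literal
    authority: Optional[str] = None        # name or URI of the organisation

    def as_record(self) -> dict:
        """Return only the properties that were actually provided."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# A well-formed URI needs nothing but the code.
print(Identifier(code="https://doi.org/10.1109/5.771073").as_record())
# A bare string needs at least the identifier type to be meaningful.
print(Identifier(code="978-3-16-148410-0", identifier_type="ISBN").as_record())
```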
@smrgeoinfo ADMS also makes the identifier a class, namely adms:Identifier.
The spec at https://www.w3.org/TR/vocab-adms/#identifier indeed does not provide a full recommendation on how to express the other properties of the Identifier, but I would suggest:
I would not be in favour of defining a dcat:Identifier class alongside the adms:Identifier class that basically does the same thing.
adms:identifier is already adopted in some DCAT application profiles, so I second the idea of using it rather than introducing new terms, at least as a first attempt.
As part of action 259, which was assigned to me in last week's DCAT call, I have drafted the following wiki page, DCAT-Identifiers.
On that page, I have tried to set up a proposal based on existing adms:identifier examples.
The page is still in progress, and I certainly need to update it with the latest suggestions from @makxdekkers. Though it is not yet complete, and corrections might be needed, I hope it can help the discussion.
@riccardoAlbertoni thanks, that wiki page is helpful. A couple comments:
in the Representing HTTP dereferenceable secondary identifier section, there seems to be an assumption that the ^^xsd:anyURI type implies that the literal is an HTTP URI, but the data type allows any valid RFC-3986 URI (e.g. urn:), and these might not be dereferenceable.
Also, in the example, with a doi:
skos:notation "10.1109/5.771073"^^dcat:doi ;
adms:schemeAgency "International DOI Foundation" .
I would suggest that the issuing authority of interest should be the registrant for the 10.1109 doi space, "IEEE Xplore Digital Library", perhaps this should be added as a dct:creator. There are two concerns-- the authority that defined the identifier scheme (DOI foundation), and the authority responsible for assigning and maintaining identifiers using that scheme (IEEE).
@makxdekkers I got the impression from the ADMS documentation that the identifier scheme is encoded as the data type in the skos:notation typed literal, so using skos:inScheme would be redundant, and I think it's also not consistent with the intention of skos:inScheme.
in the Representing HTTP dereferenceable secondary identifier section, there seems to be an assumption that the ^^xsd:anyURI type implies that the literal is an HTTP URI, but the data type allows any valid RFC-3986 URI (e.g. urn:), and these might not be dereferenceable.
I see your point @smrgeoinfo, the title is slightly misleading.
I suspect that the only way to know whether a URI is HTTP dereferenceable is to try to resolve it, since it can be broken.
As far as I understand, indicating a URN is useful as well. Regardless of their dereferenceability, secondary IDs are given to say that others might refer to the same dataset with different IDs; they are useful for managing and grouping duplicates. So I have made the distinction between dereferenceable and non-dereferenceable URIs less sharp.
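The point that xsd:anyURI says nothing about the URI scheme can be illustrated with a short sketch. The function name and the three category labels are hypothetical. The check only inspects the scheme; whether an http(s) URI actually dereferences can still only be determined by trying to resolve it.

```python
from urllib.parse import urlsplit


def classify_uri(value: str) -> str:
    """Classify an identifier string by its URI scheme alone."""
    scheme = urlsplit(value).scheme.lower()
    if scheme in ("http", "https"):
        return "http"       # candidate for dereferencing; may still be broken
    if scheme:
        return "other-uri"  # e.g. urn: or info:, valid URIs but not HTTP
    return "not-a-uri"      # a bare string such as a DOI without a scheme


for value in (
    "https://lccn.loc.gov/2015020100",
    "urn:isbn:9783161484100",
    "info:doi/10.1109/5.771073",
    "10.1109/5.771073",
):
    print(value, "->", classify_uri(value))
```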
@smrgeoinfo wrote
I would suggest that the issuing authority of interest should be the registrant for the 10.1109 doi space, "IEEE Xplore Digital Library", perhaps this should be added as a dct:creator. There are two concerns-- the authority that defined the identifier scheme (DOI foundation), and the authority responsible for assigning and maintaining identifiers using that scheme (IEEE).
@smrgeoinfo Please take a look at example 7. Have I correctly interpreted your suggestion?
To answer 'Question 1' in 'Proposal 1' from @riccardoAlbertoni's notes on the wiki, the DataCite schemas include an XSD with a list of identifier types/schemes here:
https://schema.datacite.org/meta/kernel-4.1/include/datacite-relatedIdentifierType-v4.xsd
Also FAIRsharing keeps a registry of identifier schemes: https://fairsharing.org/standards/?q=&selected_facets=type_exact:identifier%20schema
Regarding @smrgeoinfo's point on identifying both the identifier scheme and the organisation minting the identifiers, it seems to me that this use case is not covered by ADMS: adms:schemeAgency gives the name of the "agency that manages the identifier scheme" as a literal, while dct:creator would point to a representation of that same organisation rather than a separate one. Is that correct, @makxdekkers?
Apart from that interpretation of ADMS, example 7 would cover accounting for both the identifier scheme/type and the organisation maintaining it IMO.
@agbeltran Yes, dct:creator and adms:schemeAgency should be for the same organisation. The literal option was provided because scheme agencies might not be in Linked Data space and have no URI.
Then, assuming we want to distinguish between (a) the authority that defined the identifier scheme (DOI foundation), and (b) the authority responsible for assigning and maintaining identifiers using that scheme (IEEE), we need to consider a property distinct from dct:creator for (b)
I see two alternative options here
Which of the two does the group think is more reasonable?
Does anyone see further options?
@riccardoAlbertoni I am not in favour of your proposal.
As I understand it, the DOI Foundation is the schema agency for DOI. Period. The fact that DOI is organised in such a way that there are registration agencies and registrants for sub-spaces under DOI should be irrelevant. Moreover, naming the registrant goes against the philosophy of DOI where the sub-spaces are abstracted from the organisation that registers them, with the advantage that DOIs don't change when the organisation changes or the responsibility for that sub-space is handed over to someone else. Your proposal risks creating a dependency that DOI itself tries to avoid.
So, in summary, I vote against both options, and suggest to use adms:Identifier as specified allowing only one single agency.
Thanks @makxdekkers for your comment.
If I have correctly interpreted your message you are not in favour of the requirement behind my modelling attempt, namely the need to mention both
a) the authority that defined the identifier scheme (DOI foundation), and
b) the authority responsible for assigning and maintaining identifiers using that scheme (IEEE),
as it was suggested by @smrgeoinfo. @smrgeoinfo Have I misinterpreted your suggestion?
I've found @makxdekkers' motivations convincing, I also guess that similar considerations might hold for other identifier schemes.
So I have included your motivations for not representing (b) in example 7.
Correct, I am not in favour of the requirement to model more than one authority for identifiers.
short story:
I think what a user really needs to know is the identifier scheme (not who defined it): in particular, if those identifiers can be dereferenced, how they are dereferenced, and what kinds of representations of the identified resource should be available. The agent defining the scheme is not the information needed for this use case. Back to the original question: if identifiers are required to be http: URIs, the base identifier scheme is known (http), but in practice various agents embed identifiers within the HTTP URI, and the identifier scheme that matters to the user is not http but the embedded scheme, e.g. doi, ark, igsn...
details
a) the authority that defined the identifier scheme (DOI foundation), and
b) the authority responsible for assigning and maintaining identifiers using that scheme (IEEE),
@riccardoAlbertoni yes you are interpreting my suggestion as intended, and I think @makxdekkers point about the registering agent is valid.
If a registered URI type is used (following RFC-3986), the identifier scheme is part of the URI; a separate identifier scheme property is redundant in that case. If the skos:notation in the adms:identifier has type ^^xsd:anyURI, then the identifier for the scheme should be the prefix on the ID string ('http:' in the example 7).
DOI is registered as a namespace in the 'info' URI scheme (see faq #11), so it would appear that a DOI formally encoded as an RFC 3986 URI would look like 'info:doi/10.1109/5.771073'. The info namespace registry was offline when I tried to check this.
As for dct:creator, it seems odd to me that the dct:creator property on an adms:Identifier is not the creator of the identifier instance but the creator of the identifier scheme. This would be confusing to anyone not conversant with the usage recommendations for ADMS; if that's the convention, we should stick with it.
To me, the major use case for knowing the identifier scheme is that it should tell you how to dereference the identifier and, ideally, what kinds of representations of the identified resource are available. So there is no particular need to identify the agent responsible for actually issuing the identifier and maintaining its lifecycle. In the case of a DOI, knowing the scheme lets a user know that the registering agent is specified by the prefix part of the ID string and that there are ways to dereference it.
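As an illustration of how knowing the scheme exposes the registrant prefix, a DOI can be split at its first "/". This is a sketch only: the function name is hypothetical, and the validation is deliberately minimal (it checks just that the prefix starts with "10." and that a suffix is present).

```python
from typing import Tuple


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a bare DOI into (registrant prefix, suffix)."""
    prefix, sep, suffix = doi.strip().partition("/")
    if not sep or not suffix or not prefix.startswith("10."):
        raise ValueError(f"not a DOI of the form 10.<registrant>/<suffix>: {doi!r}")
    return prefix, suffix


# DOI taken from the example discussed in this thread.
prefix, suffix = split_doi("10.1109/5.771073")
print(prefix)  # registrant prefix
print(suffix)  # suffix assigned within that prefix
```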
Marking this issue as 'due for closing' given PR https://github.com/w3c/dxwg/pull/614
Closing after merging #614