Dxwg: Primary and alternative identifier [RIDALT]

Created on 18 Jan 2018 · 15 Comments · Source: w3c/dxwg

Primary and alternative identifier [RIDALT]

Provide means to distinguish the primary and alternative (legacy) identifiers.

Related use cases: Modeling identifiers and making them actionable [ID11]

dcat Dataset identification referencing requirement

jpullmann

All 15 comments

Related to #53 & #68.

andrea-perego on 20 Jan 2018

Following from discussions in #53 and @riccardoAlbertoni's proposal in the wiki, neither Dublin Core nor ADMS has terms to represent alternative identifiers.

In DATS we followed the DataCite approach by having a representation of primary identifiers, alternate identifiers and related identifiers.

agbeltran on 21 Nov 2018

Proposal 1 in the wiki suggests using

dct:identifier for primary id
adms:identifier for secondary id

As far as I understand, this was the guideline given by DCAT-AP for managing duplicates, though I am not 100% sure this suggestion is still valid in the newest DCAT-AP release.
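To make Proposal 1 concrete, here is a minimal Python sketch that renders a Turtle description with `dct:identifier` for the primary id and `adms:identifier` for each secondary id. The helper names (`escape`, `dataset_turtle`), the example URI and the use of `skos:notation` on a blank-node `adms:Identifier` are assumptions made for illustration; they are not taken from the wiki proposal.

```python
# Hypothetical helper illustrating Proposal 1; not part of DCAT or DCAT-AP tooling.
PREFIXES = """\
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
"""


def escape(value: str) -> str:
    """Escape backslashes and double quotes for a short, single-line Turtle literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def dataset_turtle(dataset_uri: str, primary_id: str, secondary_ids: list[str]) -> str:
    """Render one dataset with a primary and zero or more secondary identifiers."""
    lines = [f"<{dataset_uri}> a dcat:Dataset ;"]
    # Primary identifier: a plain literal on dct:identifier.
    lines.append(f'    dct:identifier "{escape(primary_id)}"')
    for sid in secondary_ids:
        lines[-1] += " ;"
        # Secondary identifier: an adms:Identifier node carrying the value
        # (assumed here to be given as skos:notation).
        lines.append(
            f'    adms:identifier [ a adms:Identifier ; skos:notation "{escape(sid)}" ]'
        )
    lines[-1] += " ."
    return PREFIXES + "\n" + "\n".join(lines) + "\n"


example = dataset_turtle(
    "https://example.org/dataset/1",   # placeholder URI
    "https://example.org/dataset/1",   # primary id
    ["XYZ123"],                        # secondary (e.g. local) id
)
print(example)
```

The secondary identifier is a node rather than a literal, which leaves room to attach the context (for example, who manages the scheme) that a bare string cannot carry.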

@agbeltran Is this acceptable, or do you think we should add new specific terms in DCAT 2?

I guess @andrea-perego and @makxdekkers might have their own view on the opportunities/issues behind the reuse of such a guideline.

riccardoAlbertoni on 22 Nov 2018

That proposal sounds OK to me.

makxdekkers on 22 Nov 2018

+1 from me as well.

andrea-perego on 23 Nov 2018

The issue I see with the proposed solution is that one may want/need to specify the agency that manages the identifier scheme also for primary identifiers, rather than just for secondary identifiers.

agbeltran on 28 Nov 2018

@agbeltran In my mind, the distinction between 'primary' and 'secondary' identifiers is related to what you want to do with them.
The 'primary' identifier in @riccardoAlbertoni's proposal and at https://joinup.ec.europa.eu/release/dcat-ap-how-manage-duplicates is used (a) to link back to the original publication of that dataset, and (b) to (string-)compare identifiers to see if two descriptions refer to the same dataset.
The 'secondary' identifiers are ones that play a role in a wider context, and for which you need to declare that context to understand what they are.
From what I remember of the development of ADMS, the adms:Identifier class was created primarily for non-resolvable identifiers. For example, a publisher might have an identifier "XYZ123", either a local production number or one coined in some other (non-Web) context, in which case it would be necessary to express what it was or where to look it up.
During development of DCAT-AP, it was noted that in situations where descriptions were exchanged, shared or harvested, intermediaries could change, e.g. correct or enhance, a description along the way.
It was then agreed that there needed to be a way to refer back to the original description of a dataset, and the notion of primary and secondary identifiers was introduced with different usage.
It might be that this is more an issue for a profile than for the base standard, though.
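As a rough illustration of use (b) above, the string comparison of primary identifiers across harvested descriptions could be sketched as follows. The record layout, the catalogue names and the function `group_by_primary_id` are hypothetical; the sketch assumes exact string equality with no normalisation.

```python
from collections import defaultdict


def group_by_primary_id(descriptions):
    """Return primary ids that occur in more than one description,
    mapped to the catalogues in which they were found."""
    groups = defaultdict(list)
    for desc in descriptions:
        # Exact string comparison only; no case folding or URI normalisation.
        groups[desc["dct:identifier"]].append(desc["catalogue"])
    return {pid: cats for pid, cats in groups.items() if len(cats) > 1}


# Hypothetical harvested records from two catalogues.
harvested = [
    {"catalogue": "catalogue-a", "dct:identifier": "https://example.org/dataset/1"},
    {"catalogue": "catalogue-b", "dct:identifier": "https://example.org/dataset/1"},
    {"catalogue": "catalogue-b", "dct:identifier": "https://example.org/dataset/2"},
]

duplicates = group_by_primary_id(harvested)
print(duplicates)
```

This only works if intermediaries leave the primary identifier untouched when they correct or enhance a description, which is the usage the duplicates guideline relies on.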

makxdekkers on 28 Nov 2018

Thanks @makxdekkers. First, while I realise that the wiki and the discussions have focused on identifiers for datasets, the solution we provide should also tackle identifiers for other entities (catalogues, people, services, and even distributions).

For datasets, as we are mostly considering them in the context of a catalogue (even though we do have an issue about the relationship between datasets and catalogues #62), the primary identifier would be the identifier for the dataset in the catalogue being considered IMO. When dealing with these identifiers programmatically, I think it would still be useful to be able to indicate what is the identifier type for those primary identifiers.

agbeltran on 28 Nov 2018

@agbeltran , I think the catalogue context may apply also to identifiers for other resources.

Take ORCIDs as an example: we are using them in the JRC Data Catalogue for dataset authors/contributors, along with their name and, possibly, email. In the JRC Data Catalogue, contributors are identified by a specific URI, whereas the ORCID is specified both as an alternative URI (owl:sameAs) and identifier. On the other hand, in the ORCID catalogue / registry, their primary URI / ID is the ORCID.
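The ORCID case can be sketched in a few lines of Python: the same person carries several identifiers, and which one counts as primary depends on the catalogue or registry being considered. The `Agent` class, the context keys and the example values are hypothetical and only illustrate the idea; they do not describe how the JRC Data Catalogue is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    # Maps a context (catalogue or registry) to the identifier used there.
    identifiers: dict = field(default_factory=dict)

    def primary_in(self, context: str) -> str:
        """The identifier that is primary in the given context."""
        return self.identifiers[context]

    def alternatives_in(self, context: str) -> list:
        """All other identifiers, which are alternative in that context."""
        return sorted(v for k, v in self.identifiers.items() if k != context)


# Illustrative values only.
contributor = Agent(
    name="Jane Doe",
    identifiers={
        "jrc-data-catalogue": "https://example.org/jrc/contributor/42",
        "orcid": "https://orcid.org/0000-0002-1825-0097",
    },
)

print(contributor.primary_in("jrc-data-catalogue"))
print(contributor.alternatives_in("jrc-data-catalogue"))
print(contributor.primary_in("orcid"))
```

Swapping the context swaps the roles of the two identifiers, which is why a single context-free "primary" flag on the identifier itself would not capture this case.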

andrea-perego on 9 Dec 2018

@andrea-perego I think your comment indicates that the notions of primary and secondary identifiers are very much application-specific. I am still of the opinion that the base standard should not try to solve the issue. It might mention that there is a requirement in particular applications to make the distinction, but it should not mandate a general approach.

makxdekkers on 10 Dec 2018

@agbeltran @andrea-perego Considering @makxdekkers' comment, I have changed the text of issue 67 about primary and secondary ids, which is now

The need to distinguish between primary and legacy identifiers for a dataset has been posed as a requirement. However, it is very much application-specific and should be better addressed in application profiles rather than by mandating a general approach.

I have moved the issue before the duplication guidelines. If the issue text captures the group agreement fairly well, we might reuse it directly in the document and close the issue. Otherwise please feel free to suggest a rephrasing ...

riccardoAlbertoni on 11 Dec 2018

@riccardoAlbertoni, I am happy with the revision, although we may need to come back to this, or at least the text may need to be further elaborated. E.g., we need to define what we mean by primary, legacy, and alternative / secondary identifiers.

I would only recommend, following @makxdekkers's consideration in https://github.com/w3c/dxwg/pull/614#issuecomment-444594450, replacing "legacy" with "alternative" or at least with "legacy, and alternative".

andrea-perego on 11 Dec 2018

@andrea-perego @riccardoAlbertoni My worry was indeed that it is not clear what the meaning of 'primary' and 'alternative' is, basically because it depends entirely on context. There are many examples of situations where things (people, cars etc.) have several identifiers that may be primary in one context and secondary in others (e.g. tax identification, social security numbers). In the case of datasets, @agbeltran argues that "_the primary identifier would be the identifier for the dataset in the catalogue being considered_"; other people think that the primary identifier should be the one in the catalogue where the dataset was first published. So I think it will be hard to come up with a generally applicable definition of the terms 'primary' and 'secondary/alternative'.
Having said that, I have no objections to the current formulation "_it is very much application-specific and should be better addressed in application profiles_" as this means we don't have to say anything more about it.

makxdekkers on 12 Dec 2018

While there are one or two minor edits still to be done on PR #614, there appears to be consensus around the following conclusion:

"The group has agreed that distinguishing between primary and secondary (alternative) identifiers is very much application-specific and should be better addressed in application profiles rather than by mandating a general approach."

Flagging this as "due for closing" to encourage any concerns to be raised.