Provide a way to link to structured information about the provenance of a dataset, including:
- the input data used to create the dataset
- the software used to produce the dataset
- an extensible model for different types of agent roles

dct:creator, dct:publisher, etc. are special cases that require guidance; further roles may be defined in provenance or other richer models. The requirement is to establish an extensible mechanism, and for profiles to specify canonical equivalents for the special-case properties of dcat:Dataset.
About agent roles, I have a specific proposal: relax the domain of dcat:contactPoint so that it can be used not only for datasets but also for other resources (e.g., catalogues, catalogue records).
This issue popped up during the development of GeoDCAT-AP, since all the agent roles (dcat:contactPoint included) supported in ISO 19115 can be specified for any resource.
Besides this, the "contact point" is probably the most important role for data consumers, not only for datasets. For instance, for a dcat:Catalog it is possible to specify the dct:publisher, but if I need to ask questions and/or report issues about the catalogue I need to get in touch with the publisher's dedicated contact point.
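A minimal sketch of what the relaxed domain would permit, assuming dcat:contactPoint may be used on a dcat:Catalog as proposed here. The `ex:` resources, the help desk name and the e-mail address are hypothetical; the vCard terms are used as they commonly are for contact points on datasets.

```turtle
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix dct:   <http://purl.org/dc/terms/> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
@prefix ex:    <http://example.org/> .

# Hypothetical catalogue. Using dcat:contactPoint here assumes the
# relaxed domain proposed above.
ex:catalogue a dcat:Catalog ;
    dct:publisher ex:agency ;
    dcat:contactPoint [
        a vcard:Organization ;
        vcard:fn "Catalogue help desk" ;
        vcard:hasEmail <mailto:helpdesk@example.org>
    ] .
```

The publisher and the contact point stay separate resources, so questions about the catalogue can go to the dedicated contact rather than to the publishing organisation as a whole.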
I wonder whether this (and similar revisions to DCAT) requires the creation of a separate use case.
:warning: As decided at the end of the DCAT subgroup telecon of 31 Jan 2018, a separate issue has been created (#95)
Re-represented as RDA Prov Patterns WG Use Case 41: http://patterns.promsns.org/usecase/41
Several patterns for "providing a way to link to structured information about the provenance of a dataset" are given both in PROV and in patterns by the RDA Prov Patterns WG, such as http://patterns.promsns.org/pattern/12. We should reuse these.
I propose to untag "quality" from this issue, as this issue is more related to provenance than to quality. Clearly "provenance" might influence quality, but given that we have the tag "provenance", I think we can remove "quality".
A placeholder section/sub-section or proposal for the DCAT document would be appreciated - to alert the community when we release the FPWD.
Regarding the first of the three items in the description of this requirement:
[Provide a way to link to] the input data used to create a dataset to the dataset:
Assuming a close prov:Entity/dcat:Dataset relationship has been established, use two of the three patterns described in the RDA pattern "Associating metadata in documents with graph provenance" (the third pattern is not relevant):
Pattern 1: store provenance in a different document/service to the Dataset metadata and link with either prov:has_provenance or prov:has_query_service relations
This is appropriate when potentially detailed provenance information cannot be well catered for within the standard DCAT document. This will be the case in purpose-built systems that cater for DCAT but not all the possibilities of PROV, even for Dataset/Dataset (Entity/Entity) relations.
Example: Dataset X was derived from Dataset Y and Dataset Z:
Within the DCAT document:
:Dataset_X prov:has_provenance :Bundle_N .
Here the DCAT record for Dataset_X points to a document defined as a prov:Bundle, within which Dataset_X is referenced and any number of PROV provenance relationships are given, to other datasets in the same catalogue or in others.
Instead of a provenance document, a dataset could be linked to a provenance query service using prov:has_query_service.
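Pattern 1 can be sketched in two parts. This is only an illustration: the `http://example.org/` namespace and the query service URL are placeholders, and treating the linked document as the content of `:Bundle_N` is an assumption about how the bundle is published.

In the DCAT document, only the pointer is recorded:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/> .

:Dataset_X a dcat:Dataset ;
    prov:has_provenance :Bundle_N .

# Alternative: point to a (hypothetical) provenance query service instead.
# :Dataset_X prov:has_query_service <http://example.org/prov-service> .
```

In the separate provenance document, the detailed relations are given:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/> .

:Bundle_N a prov:Bundle .

:Dataset_X a prov:Entity ;
    prov:wasDerivedFrom :Dataset_Y , :Dataset_Z .
```

The DCAT side stays small, and the provenance side can grow to whatever level of PROV detail is needed.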
Pattern 2: link datasets directly to others with PROV-O relations
This is appropriate when the system used to store DCAT information can store any PROV relationships.
Example: Dataset X was derived from Dataset Y and Dataset Z:
Within the DCAT document:
:Dataset_X prov:wasDerivedFrom :Dataset_Y , :Dataset_Z .
or, qualified forms (see https://www.w3.org/TR/prov-o/#qualifiedDerivation):
:Dataset_X prov:qualifiedDerivation [
    a prov:Derivation ;
    prov:entity :Dataset_Y ;
    # More details about the activity underpinning the derivation
    prov:hadGeneration :a_detailed_generation ;
    ...
] , [
    a prov:Derivation ;
    prov:entity :Dataset_Z ;
    prov:hadGeneration :different_detailed_generation ;
    ...
] .
Regarding the second of the three items in the description of this requirement:
[Provide a way to link to] the software used to produce the dataset to the dataset:
Interpret software as a specialised form of prov:Entity, a prov:Plan and then apply all of the Entity/Entity mapping patterns described above.
Example:
Dataset X was derived from Dataset Y and the derivation was made using Software Z
As long as the specific instance of software that was used can be recorded (i.e. not the URI of the GitHub repo but of the specific commit that was used), the above can simply be recorded as:
:Dataset_X prov:wasDerivedFrom :Dataset_Y , :Software_Z .
where the derivation from Software Z is understood to be a derivation by instruction due to :Software_Z being a prov:Plan. If this requires more spelling out:
:Dataset_X
    prov:wasDerivedFrom :Dataset_Y ;
    prov:qualifiedDerivation [
        prov:entity :Software_Z ; # still subclassed from Entity as Plan!
        prov:hadRole :some_special_role_for_software ;
    ] .
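To make the "specific commit, not the repo" point concrete, here is a small sketch. The `:Software_Z_repository` resource and the `http://example.org/` namespace are hypothetical, and using prov:specializationOf to tie the commit to its repository is one possible choice, not something this thread prescribes.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/> .

# The repository as a whole (hypothetical identifier).
:Software_Z_repository a prov:Entity .

# The exact commit that was used; a prov:Plan is also a prov:Entity.
:Software_Z a prov:Plan ;
    prov:specializationOf :Software_Z_repository .

:Dataset_Y a prov:Entity .

:Dataset_X a prov:Entity ;
    prov:wasDerivedFrom :Dataset_Y , :Software_Z .
```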
Regarding the third of the three items in the description of this requirement:
[Recommend] an extensible model for different types of agent roles
For the general case of role or other qualifications, see Qualified forms [RQF] #79 where a proposal for qualified forms is made with agent roles as an example.
@nicholascar wrote:
> Interpret software as a specialised form of prov:Entity, a prov:Plan and then apply all of the Entity/Entity mapping patterns described above.
I don't know if that's possible: Usually software is considered a prov:Agent, more specifically a prov:SoftwareAgent: "A software agent is running software."
@nicholascar , some time ago I added examples of provenance patterns in the wiki, and some of them relate to yours:
https://github.com/w3c/dxwg/wiki/Provenance-patterns
Would you mind having a check, and see if you think they should be revised/extended?
This gets messier with things like Shacl and Spin where the software is data.
Software is an entity, an instance of running software is an agent?
This fits with software being subject to processes such as automated testing.
@rob-metalinkage Can you expound on this a bit?
"This gets messier with things like Shacl and Spin where the software is data"
Which software is data? I read Shacl as taking instance data as input, so I'm not sure which software you mean. But I may be thinking of something other than what you meant.
@rob-metalinkage @larsgsvensson we have a long-standing precedent of instances of software being modelled as prov:Plan (subClassOf prov:Entity) objects that guide a prov:Activity, which then produces things, with an executing agent, like a server, being a prov:Agent. This is shown in quick outline in the RDA Provenance Patterns WG's pattern 18: https://patterns.promsns.org/pattern/18.
I have run this pattern, in which the instance of software used is modelled as a prov:Plan and the execution system running it as a prov:Agent or a prov:SoftwareAgent, past several original members of PROV (Luc and Paolo) as well as many PROV practitioners over many years, and it works fine, although it wasn't expressly catered for in the 2013 PROV publications.
The pattern is generalisable to include methods other than software, such as scientific methods.
I will re-document that pattern for the RDA WG in more detail shortly.
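Until that documentation exists, one reading of the software-as-Plan pattern can be sketched as follows. This is an illustration only, not the RDA WG's pattern text: `:Run_1`, `:Server_S` and the `http://example.org/` namespace are hypothetical names.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/> .

:Software_Z a prov:Plan .            # the specific software version used
:Server_S   a prov:SoftwareAgent .   # the system that executed it
:Dataset_Y  a prov:Entity .

:Run_1 a prov:Activity ;
    prov:used :Dataset_Y ;
    prov:wasAssociatedWith :Server_S ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent   :Server_S ;
        prov:hadPlan :Software_Z
    ] .

:Dataset_X a prov:Entity ;
    prov:wasGeneratedBy :Run_1 ;
    prov:wasDerivedFrom :Dataset_Y .
```

Replacing `:Software_Z` with a description of a scientific method would leave the rest of the graph unchanged, which is the sense in which the pattern generalises.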
@nicholascar It seems that we need to define exactly what we mean by software... I'd say that we need to differentiate between the sequence of commands being executed (aka a _programme_), the execution of that programme (aka a _process_) and any input passed to that process (let's call it _input_).
If we look at the case of a SHACL engine validating a piece of RDF using a SHACL file, it seems to me that the execution of the validation is a prov:Activity executed by the SHACL engine (_programme_; prov:SoftwareAgent rdfs:subClassOf prov:Agent) and uses the SHACL file (_input_; prov:Plan rdfs:subClassOf prov:Entity). The _process_ is tricky and doesn't always map to PROV-O, since the validation can be done by a server process (e.g. a servlet) that's already running and that handles several validations in parallel.
If that's what you mean, I fully agree. And we need a better word than "software".
@larsgsvensson the differentiation you describe is how I describe things so I agree with your general characterisations.
I do agree that defining a process can be tricky but if we stick to the "provenance that we want", not a "provenance that could be modelled" then we can usually do something sensible. In the example you give of a servlet validating something I would model it thus:
- the validation: a prov:Activity - starting and ending with the processing of the RDF of interest, regardless of any other jobs it may be doing (we don't care about those)
- the servlet: a prov:SoftwareAgent, if that's important to know, or perhaps the server itself
- the RDF input: a prov:Entity
- the SHACL input: a prov:Entity - not a prov:Plan
- the output: a prov:Entity that prov:wasDerivedFrom the two inputs AND the prov:Plan that instructed that the SHACL input be applied to the RDF input

So this modelling will allow someone to see when (Activity) something (whichever Agent) did what (Plan) with what inputs (Entity x 2) and what output (Entity). Sure, you could model things differently but what's the Use Case?
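The servlet example above could be written out roughly like this. All resource names and the two timestamps are made up for illustration, and attaching the Plan through a qualified association is an assumption about how the "did what" part would be recorded.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix :     <http://example.org/> .

:rdf_input       a prov:Entity .
:shacl_input     a prov:Entity .        # an input, not the plan
:validation_plan a prov:Plan .          # "apply the SHACL input to the RDF input"
:servlet         a prov:SoftwareAgent .

# Only the processing of this one RDF input; other jobs are ignored.
:validation a prov:Activity ;
    prov:startedAtTime "2018-03-01T10:00:00Z"^^xsd:dateTime ;   # illustrative
    prov:endedAtTime   "2018-03-01T10:00:02Z"^^xsd:dateTime ;   # illustrative
    prov:used :rdf_input , :shacl_input ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent   :servlet ;
        prov:hadPlan :validation_plan
    ] .

:validation_report a prov:Entity ;
    prov:wasGeneratedBy :validation ;
    prov:wasDerivedFrom :rdf_input , :shacl_input , :validation_plan .
```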
Could someone clarify the relevance of profile_negotiation to this issue, or remove the tag?
Provenance information should probably be available at the level of representations (dcat:Distributions) as well as dcat:Datasets
(Does this need a new Issue?)
I think we should consider here cases where "provenance" is expressed in a discursive way - e.g., when describing the dataset lineage (as mentioned in UC9).
This is quite a common practice for scientific data and in some domains, such as the geospatial one. In most cases, these lineage descriptions cannot easily be converted into a machine-actionable representation.
In DCAT-AP, this is done by using dct:provenance/dct:ProvenanceStatement/rdfs:label, and according to the report on DCAT-AP usage statistics from the European Data Portal, this information is included in more than 50% of the EDP records (391,616).
It may be worth considering its inclusion in DCAT.
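For reference, the DCAT-AP practice mentioned above looks roughly like this. The dataset URI and the lineage text are invented for illustration.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

ex:dataset a dcat:Dataset ;
    dct:provenance [
        a dct:ProvenanceStatement ;
        # Hypothetical free-text lineage description.
        rdfs:label "Digitised from paper maps and manually corrected against aerial imagery."@en
    ] .
```

The lineage stays human-readable free text, so it can sit alongside the structured PROV patterns rather than replace them.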
As there has been no further discussion on this issue, I propose to close it.
Noting no objections, I am closing this issue.