Dxwg: Provenance information [RPIF]

Created on 18 Jan 2018 · 20 Comments · Source: w3c/dxwg

Provide a way to link to structured information about the provenance of a dataset including:

the input data used to create the dataset
the software used to produce the dataset
an extensible model for different types of agent roles
funders

dct:creator, dct:publisher, etc. are special cases that require guidance; further roles may be defined in provenance or other richer models. The requirement is to establish an extensible mechanism, and for profiles to specify canonical equivalents for the special-case properties of dcat:Dataset.

dcat Dataset contactPoint provenance referencing requirement roles

jpullmann

All 20 comments

About agent roles, I have a specific proposal: relax the domain of dcat:contactPoint so that it can be used not only for datasets but also for other resources (e.g., catalogues, catalogue records).

This issue popped up during the development of GeoDCAT-AP, since all the agent roles (dcat:contactPoint included) supported in ISO 19115 can be specified for any resource.

Besides this, the "contact point" is probably the most important role for data consumers, not only for datasets. For instance, for a dcat:Catalog it is possible to specify the dct:publisher, but if I need to ask questions and/or report issues about the catalogue I need to get in touch with the publisher's dedicated contact point.
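As a minimal sketch of what the relaxed domain would permit, the Turtle below attaches a contact point to a dcat:Catalog alongside its dct:publisher. All resource names and the e-mail address are hypothetical, and describing the contact point with vCard follows common DCAT practice rather than anything decided in this issue.

```turtle
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix dct:   <http://purl.org/dc/terms/> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
@prefix :      <http://example.org/> .

# Hypothetical catalogue with both a publisher and a dedicated contact point.
:Catalog_A a dcat:Catalog ;
    dct:publisher :Agency_A ;
    dcat:contactPoint [
        a vcard:Organization ;
        vcard:fn "Catalogue help desk" ;
        vcard:hasEmail <mailto:helpdesk@example.org>
    ] .
```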

I wonder whether this (and similar revisions to DCAT) requires the creation of a separate use case.

⚠️ As decided at the end of the DCAT subgroup telecon of 31 Jan 2018, a separate issue has been created (#95).

andrea-perego on 20 Jan 2018

Re-represented as RDA Prov Patterns WG Use Case 41: http://patterns.promsns.org/usecase/41

nicholascar on 30 Jan 2018

Several patterns for "providing a way to link to structured information about the provenance of a dataset" are given both in PROV and in patterns by the RDA Prov Patterns WG, such as http://patterns.promsns.org/pattern/12. We should reuse these.

nicholascar on 30 Jan 2018

I propose to untag "quality" from this issue, as it is more related to provenance than to quality. Clearly "provenance" might influence quality, but given that we have the tag "provenance", I think we can remove "quality".

riccardoAlbertoni on 2 Feb 2018

A placeholder section/sub-section or proposal for the DCAT document would be appreciated - to alert the community when we release the FPWD.

dr-shorthair on 7 Mar 2018

Regarding the first of the three items in the description of this requirement:
[Provide a way to link to] the input data used to create a dataset to the dataset:

Proposal

Assuming a close, established relationship between prov:Entity and dcat:Dataset, use two of the three patterns described in the RDA pattern "Associating metadata in documents with graph provenance" (the third pattern is not relevant):

Pattern 1: store provenance in a different document/service to the Dataset metadata and link with either prov:has_provenance or prov:has_query_service relations

This is appropriate when potentially detailed provenance information cannot be well catered for within the standard DCAT document. This will be the case in purpose-built systems that cater for DCAT but not all the possibilities of PROV, even for Dataset/Dataset (Entity/Entity) relations.

Example: Dataset X was derived from Dataset Y and Dataset Z:

Within the DCAT document:

:Dataset_X prov:has_provenance :Bundle_N .

Here the DCAT record for Dataset_X points to a document defined as a prov:Bundle, within which Dataset_X is referenced and any number of PROV provenance relationships are given, to other datasets in the same catalogue or in others.

Instead of a provenance document, a dataset could be linked to a provenance query service using prov:has_query_service.
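Both linking options of Pattern 1 could look roughly like the sketch below. The example.org namespace and the :provenance_service resource are hypothetical, and the two parts would normally live in separate documents; they are shown together only for brevity.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix :     <http://example.org/> .

# In the DCAT document: point to the provenance instead of embedding it.
:Dataset_X a dcat:Dataset ;
    prov:has_provenance :Bundle_N ;                # a separate provenance document
    prov:has_query_service :provenance_service .   # hypothetical query endpoint

# In the separate provenance document identified by :Bundle_N:
:Bundle_N a prov:Bundle .
:Dataset_X prov:wasDerivedFrom :Dataset_Y , :Dataset_Z .
```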

Pattern 2: link datasets directly to others with PROV-O relations
This is appropriate when the system used to store DCAT information can store any PROV relationships.

Example: Dataset X was derived from Dataset Y and Dataset Z:

Within the DCAT document:

:Dataset_X prov:wasDerivedFrom :Dataset_Y , :Dataset_Z .

or, qualified forms (see https://www.w3.org/TR/prov-o/#qualifiedDerivation):

:Dataset_X prov:qualifiedDerivation [
    a prov:Derivation ;
    prov:entity :Dataset_Y ;
    # More details about the activity underpinning the derivation
    prov:hadGeneration :a_detailed_generation ;
    ...
] , [
    a prov:Derivation ;
    prov:entity :Dataset_Z ;
    prov:hadGeneration :different_detailed_generation ;
    ...
] .
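For readers who want something that parses as written, here is one self-contained reading of the qualified form for a single source dataset. The activity and generation resources (:processing_run_1, :generation_of_X) are hypothetical placeholders, not names from the proposal.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/> .

:Dataset_X a prov:Entity ;
    prov:wasDerivedFrom :Dataset_Y ;               # unqualified shortcut
    prov:qualifiedDerivation [
        a prov:Derivation ;
        prov:entity :Dataset_Y ;
        prov:hadActivity :processing_run_1 ;       # hypothetical activity
        prov:hadGeneration :generation_of_X        # hypothetical generation event
    ] .

:processing_run_1 a prov:Activity .

:generation_of_X a prov:Generation ;
    prov:activity :processing_run_1 .
```

Stating the unqualified prov:wasDerivedFrom alongside the qualified form keeps the simple relation available to consumers that do not follow qualified patterns.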

nicholascar on 23 May 2018

Regarding the second of the three items in the description of this requirement:
[Provide a way to link to] the software used to produce the dataset to the dataset:

Proposal

Interpret software as a specialised form of prov:Entity, a prov:Plan and then apply all of the Entity/Entity mapping patterns described above.

Example:

Dataset X was derived from Dataset Y, and the derivation was made using Software Z.

As long as the specific instance of software that was used can be recorded (i.e. not the URI of the GitHub repo but that of the specific commit that was used), the above can simply be recorded as:

:Dataset_X prov:wasDerivedFrom :Dataset_Y , :Software_Z .

where the derivation from Software Z is understood to be a derivation by instruction due to :Software_Z being a prov:Plan. If this requires more spelling out:

:Dataset_X
    prov:wasDerivedFrom :Dataset_Y ;
    prov:qualifiedDerivation [
        a prov:Derivation ;
        prov:entity :Software_Z ;  # still subclassed from Entity as Plan!
        prov:hadRole :some_special_role_for_software
    ] .

nicholascar on 23 May 2018

Regarding the third of the three items in the description of this requirement:
[Recommend] an extensible model for different types of agent roles

Proposal

For the general case of role or other qualifications, see Qualified forms [RQF] #79 where a proposal for qualified forms is made with agent roles as an example.

nicholascar on 23 May 2018

@nicholascar wrote:

Interpret software as a specialised form of prov:Entity, a prov:Plan and then apply all of the Entity/Entity mapping patterns described above.

I don't know if that's possible: Usually software is considered a prov:Agent, more specifically a prov:SoftwareAgent: "A software agent is running software."

larsgsvensson on 23 May 2018

@nicholascar , some time ago I added examples of provenance patterns in the wiki, and some of them relate to yours:

https://github.com/w3c/dxwg/wiki/Provenance-patterns

Would you mind checking them to see whether they should be revised or extended?

andrea-perego on 26 May 2018

This gets messier with things like SHACL and SPIN, where the software is data.

Software is an entity; an instance of running software is an agent?

This fits with software being subject to processes such as automated testing.

rob-metalinkage on 26 May 2018

@rob-metalinkage Can you expound on this a bit?

"This gets messier with things like Shacl and Spin where the software is data"

Which software is data? I read SHACL as taking instance data as input, so I'm not sure which software you mean. But I may be thinking of something other than what you meant.

kcoyle on 26 May 2018

@rob-metalinkage @larsgsvensson we have long-standing precedent for modelling instances of software as prov:Plan (subClassOf prov:Entity) objects that guide a prov:Activity, which then produces things, with an executing agent, such as a server, being a prov:Agent. This is outlined briefly in the RDA Provenance Patterns WG's pattern 18: https://patterns.promsns.org/pattern/18.

I have run this pattern, in which the instance of software used is modelled as a prov:Plan and the execution system running it as a prov:Agent or a prov:SoftwareAgent, past several original members of PROV (Luc and Paolo) as well as many PROV practitioners over many years. It works fine, although it wasn't expressly catered for in the 2013 PROV publications.

The pattern is generalisable to include methods other than software, such as scientific methods.

I will re-document that pattern for the RDA WG in more detail shortly.
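A rough Turtle sketch of the Plan/Agent pattern described above, assuming a hypothetical activity :run_1, server :server_A and software commit :Software_Z_commit (none of these names come from pattern 18 itself):

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/> .

:Dataset_X a prov:Entity ;
    prov:wasGeneratedBy :run_1 ;
    prov:wasDerivedFrom :Dataset_Y .

:run_1 a prov:Activity ;
    prov:used :Dataset_Y ;
    prov:wasAssociatedWith :server_A ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent :server_A ;               # the executing system
        prov:hadPlan :Software_Z_commit      # the specific software version that guided the run
    ] .

:server_A a prov:SoftwareAgent .
:Software_Z_commit a prov:Plan .
```

The same shape would apply to non-software methods: a documented scientific method could take the place of :Software_Z_commit as the prov:Plan.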

nicholascar on 28 May 2018

@nicholascar It seems that we need to define exactly what we mean by software... I'd say that we need to differentiate between the sequence of commands being executed (aka a _programme_), the execution of that programme (aka a _process_) and any input passed to that process (let's call it _input_).

If we look at the case of a SHACL engine validating a piece of RDF using a SHACL file, it seems to me that the execution of the validation is a prov:Activity that is executed by the SHACL engine (_programme_; prov:SoftwareAgent rdfs:subClassOf prov:Agent) and uses the SHACL file (_input_; prov:Plan rdfs:subClassOf prov:Entity). The _process_ is tricky and doesn't always map to PROV-O, since the validation can be done by a server process (e.g. a servlet) that is already running and handles several validations in parallel.

If that's what you mean, I fully agree. And we need a better word than "software".

larsgsvensson on 30 May 2018

@larsgsvensson the differentiation you describe is how I describe things so I agree with your general characterisations.

I do agree that defining a process can be tricky but if we stick to the "provenance that we want", not a "provenance that could be modelled" then we can usually do something sensible. In the example you give of a servlet validating something I would model it thus:

- the process as a prov:Activity, starting and ending with the processing of the RDF of interest, regardless of any other jobs it may be doing (we don't care about those)
- the servlet as a prov:SoftwareAgent, if that's important to know, or perhaps the server itself
  - the choice of which Agent to model will come down to which facts are most important to know for a Use Case, such as recording info for potential process recreation
- the input of the RDF file being validated as a prov:Entity
- the input of a SHACL file as a prov:Entity, not a prov:Plan
  - here the SHACL file is not instructing the Activity. It determines a validation assessment, but the conduct of the Activity itself is guided by the code that applies the validation to the data, i.e. the SHACL file to the input RDF.
- the output of the validation task (pass, fail, error messages etc.) as a prov:Entity that prov:wasDerivedFrom the two inputs AND the prov:Plan that instructed that the SHACL input be applied to the RDF input

So this modelling will allow someone to see when (Activity) something (whichever Agent) did what (Plan) with what inputs (Entity x 2) and what output (Entity). Sure, you could model things differently but what's the Use Case?
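The modelling listed above might be written out as follows. This is only a sketch: every resource name and both timestamps are hypothetical, and attaching the Plan through prov:qualifiedAssociation is one possible way to express "the Plan that instructed" the Activity.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix :     <http://example.org/> .

# When: the validation of the RDF of interest only.
:validation_run a prov:Activity ;
    prov:startedAtTime "2018-06-02T10:00:00Z"^^xsd:dateTime ;
    prov:endedAtTime   "2018-06-02T10:00:05Z"^^xsd:dateTime ;
    prov:used :input_rdf , :shapes_file ;
    prov:wasAssociatedWith :servlet ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent :servlet ;
        prov:hadPlan :validator_code         # the code that applies shapes to data
    ] .

# Who and what.
:servlet        a prov:SoftwareAgent .
:validator_code a prov:Plan .
:input_rdf      a prov:Entity .
:shapes_file    a prov:Entity .              # deliberately not a prov:Plan

# Output: derived from both inputs and from the Plan.
:validation_report a prov:Entity ;
    prov:wasGeneratedBy :validation_run ;
    prov:wasDerivedFrom :input_rdf , :shapes_file , :validator_code .
```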

nicholascar on 2 Jun 2018

Could someone clarify the relevance of profile_negotiation to this issue, or remove the tag?

azaroth42 on 21 Jun 2018

Provenance information should probably be available at the level of representations (dcat:Distributions) as well as dcat:Datasets

(Does this need a new Issue?)

dr-shorthair on 25 Jul 2018

I think we should consider here cases where "provenance" is expressed in a discursive way - e.g., when describing the dataset lineage (as mentioned in UC9).

This is quite a common practice for scientific data and in some domains, such as the geospatial one. In most cases, these lineage descriptions cannot be easily converted into a machine-actionable representation.

In DCAT-AP, this is done by using dct:provenance/dct:ProvenanceStatement/rdfs:label, and according to the report on DCAT-AP usage statistics from the European Data Portal, this information is included in more than 50% of the EDP records (391,616).

It may be worth considering its inclusion in DCAT.
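For illustration, the DCAT-AP approach mentioned above could be sketched in Turtle as follows. The dataset name and the lineage text are hypothetical.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix :     <http://example.org/> .

# Free-text lineage attached to a dataset; not machine-actionable.
:Dataset_X a dcat:Dataset ;
    dct:provenance [
        a dct:ProvenanceStatement ;
        rdfs:label "Derived from field survey data collected in 2015 and harmonised manually."@en
    ] .
```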

andrea-perego on 16 Feb 2019

As there has been no further discussion on this issue, I propose to close it.

andrea-perego on 29 Oct 2020

As there has been no further discussion on this issue, I propose to close it.

Noting no objections, I am closing this issue.

andrea-perego on 13 Mar 2021
