Hi, I've been studying DCAT alongside VoID and SD (SPARQL Service Description), and I'm looking for a way to express that a dataset with quads contains a subset of triples, selected by their graph (default or named graph). As far as I've seen, there is no way to express this using DCAT, but I may of course have missed something :).
I'm planning to mainly use it for internal data management, but it might also be useful when exchanging RDF quads.
My current best attempt in case of a SPARQL endpoint:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix dcat-ext: <http://example.org/dcat-extension#> .
@prefix : <http://example.org/data#> . # placeholder namespace for the example resources

:dataset1 a dcat:Dataset ; # dataset of quads: default graph + NG1
    void:subset :dataset1_NG1 , :dataset1_defaultGraph ; # there is no similar term in DCAT for sub-datasets?
    dcat:distribution :dataset1_all_sparql .

:dataset1_all_sparql a dcat:Distribution ;
    dcat:accessService :mysparqlEndpoint .

:mysparqlEndpoint a dcat:DataService ;
    dcat:endpointURL <http://mydomain.org/sparql> ; # SPARQL endpoint serving triples in a default graph and NG1
    dcat:servesDataset :dataset1 , :dataset1_NG1 , :dataset1_defaultGraph .

:dataset1_NG1 a dcat:Dataset ;
    dcat:distribution :dataset1_NG1_sparql .

:dataset1_defaultGraph a dcat:Dataset ;
    dcat:distribution :dataset1_defaultGraph_sparql .

:dataset1_NG1_sparql a dcat:Distribution ;
    dcat-ext:graph :NG1 ; # sd:name cannot be used since its rdfs:domain is sd:NamedGraph
    dcat:accessService :mysparqlEndpoint .

:NG1 a sd:NamedGraph ;
    sd:name <http://mydomain.org/NG1> . # the object is the actual URI of the named graph (as in SD)

:dataset1_defaultGraph_sparql a dcat:Distribution ;
    dcat-ext:graph dcat-ext:defaultGraph ; # sd:name cannot be used since its rdfs:domain is sd:NamedGraph
    dcat:accessService :mysparqlEndpoint .
My current best attempt in case of an RDF quad format (e.g. TriG):
(can coexist with SPARQL distribution description above)
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix dcat-ext: <http://example.org/dcat-extension#> .
@prefix : <http://example.org/data#> . # placeholder namespace for the example resources

:dataset1 a dcat:Dataset ; # dataset of quads: default graph + NG1
    void:subset :dataset1_NG1 , :dataset1_defaultGraph ; # there is no similar term in DCAT for sub-datasets?
    dcat:distribution :dataset1_all_trig .

:dataset1_all_trig a dcat:Distribution ;
    dcat:downloadURL <http://mydomain.org/myfile.trig> ; # internal or shared TriG file
    dcat:mediaType <https://www.iana.org/assignments/media-types/application/trig> .

:dataset1_NG1 a dcat:Dataset ;
    dcat:distribution :dataset1_NG1_trig .

:dataset1_defaultGraph a dcat:Dataset ;
    dcat:distribution :dataset1_defaultGraph_trig .

:dataset1_NG1_trig a dcat:Distribution ;
    dcat-ext:graph :NG1 ; # sd:name cannot be used since its rdfs:domain is sd:NamedGraph
    dcat:downloadURL <http://mydomain.org/myfile.trig> ; # internal or shared TriG file. Not sure if this is correct, as the TriG file contains more than :dataset1_NG1
    dcat:mediaType <https://www.iana.org/assignments/media-types/application/trig> .

:NG1 a sd:NamedGraph ;
    sd:name <http://mydomain.org/NG1> . # the object is the actual URI of the named graph (as in SD)

:dataset1_defaultGraph_trig a dcat:Distribution ;
    dcat-ext:graph dcat-ext:defaultGraph ; # sd:name cannot be used since its rdfs:domain is sd:NamedGraph
    dcat:downloadURL <http://mydomain.org/myfile.trig> ; # internal or shared TriG file. Not sure if this is correct, as the TriG file contains more than :dataset1_defaultGraph
    dcat:mediaType <https://www.iana.org/assignments/media-types/application/trig> .
Note that I had to add dcat-ext:graph (property) and dcat-ext:defaultGraph (instance).
Good work @mathib . This general approach is what we expect - extend the DCAT model with elements from other RDF vocabularies, while being appropriately careful about imported entailments.
Regarding the subset issue: DCAT provides a generic mechanism for dataset-dataset relationships - see https://www.w3.org/TR/vocab-dcat-2/#qualified-forms . This requires a separate enumeration of 'roles' for the related resource.
Regarding the description of a SPARQL endpoint, with named graphs etc: I made a start on this in the background, and you can see traces of it in https://www.w3.org/TR/vocab-dcat-2/#ex-elaborated-bag but I was not sufficiently familiar with the details of the SPARQL Service Description vocabulary to do it justice, so I deferred a more rigorous version. However, the general expectation was that the dcat:endpointDescription would do the heavy lifting.
It looks like you have gone in a slightly different direction, attempting to describe each named graph as a distinct DCAT dataset, with associated distributions. This is reasonable. But the level of detail that you are suggesting here is very RDF-specific, and as the scope of DCAT is much broader, my hunch is that this probably would not be suitable for the DCAT core. In fact it looks more like VoID work.
Thanks for the reply. That's very useful information. Some replies on the comments:
> Regarding the subset issue: DCAT provides a generic mechanism for dataset-dataset relationships - see https://www.w3.org/TR/vocab-dcat-2/#qualified-forms . This requires a separate enumeration of 'roles' for the related resource.
I looked at dcat:qualifiedRelation, but I'm a bit afraid that if I select one of the many roles that are out there, the role might be misinterpreted. The closest IANA relation I've found is http://www.iana.org/assignments/relation/item. I don't have access to ISO 19115-1, and DataCite doesn't have URIs for the roles, which are needed in DCAT (dcat:hadRole is an object property). I think it's probably better and simpler if I stay with void:subset (domain and range are void:Dataset). I'm thinking of making void:Dataset a subclass of dcat:Dataset in my extension.
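To make the two options concrete, here is a minimal sketch of both. It is only an illustration under assumptions: the role dcat-ext:subsetRole is a hypothetical term that an extension would have to define itself, and the `:` namespace is a placeholder.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat-ext: <http://example.org/dcat-extension#> .
@prefix : <http://example.org/data#> . # placeholder namespace

# Option 1: DCAT qualified relation with a self-defined role.
# dcat-ext:subsetRole is hypothetical, not a term from any published vocabulary.
dcat-ext:subsetRole a dcat:Role .

:dataset1 a dcat:Dataset ;
    dcat:qualifiedRelation [
        a dcat:Relationship ;
        dcterms:relation :dataset1_NG1 ;
        dcat:hadRole dcat-ext:subsetRole
    ] .

# Option 2: keep void:subset and align the two classes in the extension.
void:Dataset rdfs:subClassOf dcat:Dataset .

:dataset1 void:subset :dataset1_NG1 .
```

With option 2, any resource used with void:subset is inferred to be a void:Dataset (through the property's domain and range) and therefore, through the added axiom, a dcat:Dataset.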
> Regarding the description of a SPARQL endpoint, with named graphs etc: I made a start on this in the background, and you can see traces of it in https://www.w3.org/TR/vocab-dcat-2/#ex-elaborated-bag but I was not sufficiently familiar with the details of the SPARQL Service Description vocabulary to do it justice, so I deferred a more rigorous version. However, the general expectation was that the dcat:endpointDescription would do the heavy lifting.
I wondered about the usefulness of dcat:endpointDescription in the case of SPARQL endpoints, since most (all?) are self-describing using SD via their endpoint URL (and probably should be, since SD is a W3C Recommendation); the objects of dcat:endpointURL and dcat:endpointDescription are thus the same. The standard usage of SD is to describe the SPARQL endpoint in terms of functionality and available graphs. I'm looking for a method to reference a specific named graph (or default graph) as a dataset, to add specific metadata to it, and to be able to define separate distributions for external persons/organizations.
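As a rough illustration of that point, the sketch below shows a DCAT service description whose endpoint URL and endpoint description coincide, followed by the kind of SD description such an endpoint might return about itself. The `:` namespace is a placeholder, and the SD part is an assumed example rather than the output of a real endpoint.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix : <http://example.org/data#> . # placeholder namespace

# DCAT side: both properties point at the same URL.
:mysparqlEndpoint a dcat:DataService ;
    dcat:endpointURL <http://mydomain.org/sparql> ;
    dcat:endpointDescription <http://mydomain.org/sparql> ;
    dcat:servesDataset :dataset1 .

# SD side (assumed self-description): lists the available graphs,
# but offers no hook for treating one graph as a dataset with its own distributions.
<http://mydomain.org/sparql> a sd:Service ;
    sd:endpoint <http://mydomain.org/sparql> ;
    sd:defaultDataset [
        a sd:Dataset ;
        sd:defaultGraph [ a sd:Graph ] ;
        sd:namedGraph [
            a sd:NamedGraph ;
            sd:name <http://mydomain.org/NG1>
        ]
    ] .
```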
> It looks like you have gone in a slightly different direction, attempting to describe each named graph as a distinct DCAT dataset, with associated distributions. This is reasonable. But the level of detail that you are suggesting here is very RDF-specific, and as the scope of DCAT is much broader, my hunch is that this probably would not be suitable for the DCAT core. In fact it looks more like VoID work.
Happy to see that you see my initial approach as reasonable 😄. Indeed, I'm aware of the overlap of my current approach with VoID (RDF datasets). I hoped that it's still somehow valid DCAT for the rest. In fact, I feel that parts of VoID can be seen as a subset of DCAT (well, void:Dataset subclassOf dcat:Dataset), but DCAT has elaborated more on the distributions and is a W3C Recommendation. In VoID, "distributions" are just a flat list on the void:Dataset using void:sparqlEndpoint and void:dataDump; they also lack terminology for pointing to a specific graph inside a SPARQL endpoint (or RDF quad file), and because of the flat list of VoID "distributions", it's also not possible to mention the name of a graph associated with one of the distributions.
Hmm, maybe it's better if I reintroduce the useful VoID terminology in my DCAT extension, since the VoID terminology has not been dereferenceable since at least 2015...
Try https://github.com/cygri/void/blob/master/rdfs/void.ttl
(not the canonical namespace, for sure).
FWIW - ISO 19115 codelists are cached here: http://registry.it.csiro.au/def/isotc211 and will (real soon now) be published using canonical URIs starting with https://www.isotc211.org/
> Try https://github.com/cygri/void/blob/master/rdfs/void.ttl
> (not the canonical namespace, for sure).
Whoops, just noticed that RDF is still served at the VoID namespace via content negotiation (e.g. http://rdfs.org/ns/void#Dataset). Only the HTML page is missing when navigating to the VoID namespace URI in the browser (content negotiation to HTML still works though, redirecting to http://vocab.deri.ie/void.html).
@mathib , do you have any further point you would like to discuss? Otherwise, we are going to close this issue.
Hi @andrea-perego !
I have not been following the DCAT progress too closely lately. Is there now a modeling approach for referencing named graphs in quadstores and RDF quad files? If this is not yet done but still wanted, I'll share below the final modeling approach I took in my research, using a self-made extension of DCAT, i.e. CDC. The modeling would be done as follows for default graph and specific named graph distributions (served either via an RDF quad file or a quadstore service):
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix cdc: <https://w3id.org/cdc#> .
@prefix : <http://example.org/data#> . # placeholder namespace for the example resources

:dataset1 a dcat:Dataset ;
    dcat:distribution :dataset1_quadStoreDistr , :dataset1_quadFileDistr .

:dataset1_quadStoreDistr a cdc:DefaultGraphDistribution ; # the content of :dataset1 is available in the default graph of this SPARQL endpoint service
    dcat:accessService :mysparqlEndpoint .

:dataset1_quadFileDistr a cdc:NamedGraphDistribution ; # the content of :dataset1 is available in the named graph :myNamedGraph of the TriG file
    cdc:graphURI :myNamedGraph ;
    dcat:downloadURL <https://mydomain.org/file1.trig> ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/application/trig> .

:mysparqlEndpoint a dcat:DataService ;
    dcat:endpointURL <http://mydomain.org/sparql> ;
    dct:conformsTo <https://www.w3.org/TR/sparql11-query> ;
    dcat:servesDataset :dataset1 .
cdc:DefaultGraphDistribution and cdc:NamedGraphDistribution are subclasses of dcat:Distribution. If no such specific class is used on a distribution and the RDF file or data service supports quads, the content of the dataset should be assumed to be located in all named graphs and the default graph.
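The class hierarchy and the fallback rule just described could be written down roughly as follows. This is a sketch based only on the description above, not a copy of the published CDC ontology; the `:` namespace, :datasetX, :datasetX_plainDistr and the download URL are placeholders.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix cdc: <https://w3id.org/cdc#> .
@prefix : <http://example.org/data#> . # placeholder namespace

# Both graph-specific classes specialize dcat:Distribution.
cdc:DefaultGraphDistribution rdfs:subClassOf dcat:Distribution .
cdc:NamedGraphDistribution rdfs:subClassOf dcat:Distribution .

# Fallback case: a plain dcat:Distribution of a quad file.
# Without a graph-specific class, the dataset content is assumed to be
# spread over the default graph and all named graphs of the file.
:datasetX a dcat:Dataset ;
    dcat:distribution :datasetX_plainDistr .

:datasetX_plainDistr a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/datasetX.trig> ; # placeholder URL
    dcat:mediaType <https://www.iana.org/assignments/media-types/application/trig> .
```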
Thus, instead of creating different datasets, one dcat:Dataset with optionally different distributions suffices, as the content of the reflected RDF dataset remains the same. Each distribution then indicates whether the content of the dataset is spread over the entire triplestore/quadstore or RDF triple/quad file, or whether it's located in a specific named graph or the default graph of a quadstore or RDF quad file. I refrained from reusing SD (SPARQL Service Description) terminology as I wanted a solution that is applicable beyond SPARQL endpoint services, e.g. for quad RDF files and quad pattern fragment servers (triple pattern fragment servers with quad support, such as this one). The SD modeling patterns are also relatively awkward to use in combination with the concept of dcat:Distribution.
Personally, I would be in favor of seeing the concepts cdc:DefaultGraphDistribution, cdc:NamedGraphDistribution and cdc:graphURI included in DCAT, as I believe they can be useful for others who want to indicate the named/default graphs used in distributions as well!
Thanks for sharing this work, @mathib .
About your question, DCAT at the moment does not include the features you mention - the main point is whether they are in scope with DCAT or better left to DCAT profiles (as yours). We'll discuss this.
On a different note, looking at the CDC profile, I see that you have notions such as cdc:AdditionAndDeletionDistribution. Would you mind providing a summary of them? I have the impression that they may be in line with requirements contributed in https://github.com/w3c/dxwg/issues/1289 by @riannella about release types (although the use case is from another domain).
> About your question, DCAT at the moment does not include the features you mention - the main point is whether they are in scope with DCAT or better left to DCAT profiles (as yours). We'll discuss this.
I think it is, as I'm probably not the only one dealing with RDF publication mechanisms that include quads.
> I see that you have notions such as cdc:AdditionAndDeletionDistribution. Would you mind providing a summary of them?
This is a bit more difficult to explain, and different from the dataset versioning required in #1289. I developed the notion of dataset complements, i.e. a distribution of a dataset2 that complements an older dataset1 from another or the same organization (e.g. in the case of heavy collaboration, as is normal during a construction project). As such, a stakeholder who wants to add and/or remove parts of a received dataset when creating their own contributions does not need to literally copy someone else's triples into their new dataset. It's enough that, at the DCAT level, there's an indication of which dataset distribution adds (cdc:AdditionDistribution), deletes (cdc:DeletionDistribution) or both adds and deletes (cdc:AdditionAndDeletionDistribution) content. For practical situations such as querying and reasoning over a dataset combined with a deletion, there's also an option to calculate the resulting dataset based on the additions and/or deletions. Using the cdc:StandaloneDistribution class and cdc:standaloneOf property, it's possible to keep track of this 'standalone' distribution at the dataset metadata level. Maybe the following diagram with an example might help to clarify things (where dataset C comes from stakeholder1 and dataset D comes from stakeholder2, who proposes a correction that consists of a deletion and an addition):

The benefit of this approach is that, as long as the additions and/or deletions are small compared to the original dataset, it's easier to exchange changes and to keep track of who said what at the data level (responsibilities). I would say that this collaboration approach is orthogonal to dataset versioning, as you always complement a specific dataset version.