Dxwg: Related vocabularies mapping [RVM]

Created on 18 Jan 2018 · 23 Comments · Source: w3c/dxwg

Define guidelines on how to create a DCAT description of a VOID or Data Cube dataset


Related use cases: Cross-vocabulary relationships [ID36] 
dcat Dataset Distribution future-work requirement void

All 23 comments

:warning: Examples have been revised as per https://github.com/w3c/dxwg/issues/88#issuecomment-360281475. Changes are shown as diffs

About the relationship between dcat:Dataset and qb:DataSet, I tend to agree that qb:DataSet could be more similar to the notion of dcat:Distribution. I think, however, that, to better grasp their relationship, we should try to think of examples where they are used together, DCAT for the metadata and QB for the actual data.
For instance, suppose that the distribution of a dataset is provided as a zipped bundle of CSV and QB-RDF files:

````diff
a:Dataset a dcat:Dataset ;
````

We had a similar case, and we considered two options:

````diff
<http://example.org/data.rdf> a qb:DataSet ;
  ...
````

````diff
a:QbDataSet a qb:DataSet ;
- dct:isPartOf <http://example.org/data.rdf> ;
+ dct:isPartOf <http://example.org/data.zip> ;
  ...
````

Note that, in the latter option, there's only an indirect relationship between the qb:DataSet and the dcat:Distribution (i.e., the distribution download URL is not the distribution URI, but it is denoting another type of resource).

qb:DataSet allows one to make statements about the internal structure and, even more significantly, the semantics of the data set - what's a dimension vs a measure is one of the most important things. It also provides us with a means to bind SKOS hierarchies to property ranges. Ignoring this or reinventing it would be Bad, I think.

The range of the qb property (e.g. qb:dimension) that identifies the property in the described dataset is defined as a subclass of qb:ComponentProperty - but that simply means that when you reference a property in your dataset, that type can be entailed - so it can be used to describe any dataset where you can identify an equivalent rdf:Property...

So the question is, does this mean it can only be used to describe data encoded in RDF (a distribution) or any dataset for which an equivalent RDFS schema can be defined or inferred?

Simply adding an attribute to a Property definition to identify its "local name" in a given distribution would be adequate in cases where there is no formal RDF mapping. It could be used to define XPath elements of an XML encoding, for example.
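A minimal sketch of that idea, assuming a hypothetical annotation property `ex:localName` (it is not part of QB, DCAT or any published vocabulary) and an invented `ex:` namespace and XPath expression:

````turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ns#> .

# Hypothetical annotation property: the name a component carries in a
# concrete, non-RDF encoding (column header, XPath expression, ...).
ex:localName a rdf:Property ;
  rdfs:label "local name"@en .

# A dimension defined once with QB and annotated with the XPath
# that locates it in an XML encoding of the data.
ex:refArea a rdf:Property, qb:DimensionProperty ;
  rdfs:label "reference area"@en ;
  ex:localName "/observations/observation/@area" .
````

A single literal ties the name to the property globally; scoping it to one distribution would need an intermediate node, which is left out here.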

Let's put examples up - but I would expect it is useful to describe data structures at the Dataset level, and maybe just add some annotation properties to these descriptions for specific Distributions, and it will work quite happily for datasets that are available via APIs that slice and dice across dimensions - the most basic form of REST API as it turns out.

IMHO this doesn't require anything more than guidance - don't reinvent this wheel (structure definition) - this is how you describe it using QB... and it will work perfectly if and when an RDF version of the data is available, but in the meantime it is still the best way to describe data and API semantics (remember it buys us dimension identification, classification vocabulary identification, and breaks things into predictable components we can annotate however we want to).
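As a rough sketch of what such guidance might look like, the following attaches a QB structure definition to a dataset that is only distributed as CSV. All `ex:` names and the download URL are invented for illustration:

````turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:   <http://example.org/ns#> .

ex:dataset a dcat:Dataset ;
  dct:title "Population by area and year"@en ;
  # qb:structure has domain qb:DataSet, so this triple also
  # entails that ex:dataset is a qb:DataSet.
  qb:structure ex:dsd ;
  dcat:distribution ex:csvDistribution .

# The structure is described once, at the dataset level.
ex:dsd a qb:DataStructureDefinition ;
  qb:component [ qb:dimension ex:refArea ] ,
               [ qb:dimension ex:refPeriod ] ,
               [ qb:measure   ex:population ] .

ex:refArea    a rdf:Property, qb:DimensionProperty .
ex:refPeriod  a rdf:Property, qb:DimensionProperty .
ex:population a rdf:Property, qb:MeasureProperty .

# No RDF version of the data exists; only a CSV file is published.
ex:csvDistribution a dcat:Distribution ;
  dcat:downloadURL <http://example.org/population.csv> .
````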

@rob-metalinkage , about this point

> Let's put examples up - but I would expect it is useful to describe data structures at the Dataset level, and maybe just add some annotation properties to these descriptions for specific Distributions, and it will work quite happily for datasets that are available via APIs that slice and dice across dimensions - the most basic form of REST API as it turns out.

I'm not completely sure that it applies to all cases. In my understanding, the definition of dcat:Dataset is not so strict to require that all its distributions share the same data structure definition.

Good point - but it is probably something we would entail: if no specific structure is defined for a distribution, use the structure scoped for the dataset. Otherwise we may have a lot of redundancy if there are many distributions. In general I suspect the semantics are best captured in most cases by defining dimensions, measures and attributes for the Dataset, and each distribution perhaps as a slice.

@andrea-perego, I haven't yet read all the discussion, sorry,
but I wonder if in your example dcat:Dataset should be replaced with dcat:Distribution?

Reading the DCAT 1.0, the property dcat:downloadURL has domain dcat:Distribution.

As a consequence, my understanding is that the triple
a:Dataset dcat:downloadURL <http://example.org/data.zip> .
entails
a:Dataset a dcat:Distribution
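Spelled out in Turtle, the inference looks roughly like this. The namespace bound to `a:` is an assumption, since the original example does not declare it:

````turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix a:    <http://example.org/a#> .

# Domain axiom from the DCAT vocabulary, as cited above.
dcat:downloadURL rdfs:domain dcat:Distribution .

# Triple asserted in the original example.
a:Dataset dcat:downloadURL <http://example.org/data.zip> .

# Triple entailed under RDFS (rule rdfs2), written out for illustration.
a:Dataset a dcat:Distribution .
````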

Moreover,
although dcat:Distribution and dcat:Dataset are probably not defined as disjoint classes, section 4 in the DCAT REC says

> Notice that a dataset in DCAT is defined as a "collection of data, published or curated by a single agent, and available for access or download in one or more formats". A dataset does not have to be available as a downloadable file.

Am I missing anything?

Thanks for spotting this, @riccardoAlbertoni . It's indeed a typo - sorry.

What I actually meant is:

````turtle
a:Dataset a dcat:Dataset ;
  dcat:distribution a:Distribution .

a:Distribution a dcat:Distribution ;
  dcat:downloadURL <http://example.org/data.zip> .
````

I'll correct the original examples.

The big difference between qb:DataSet and dcat:Dataset is that the former actually holds the data whereas the latter describes what data is present. So I think that actually a cube DSD is more related to dcat:Dataset because the latter may also want to talk about the dimensions (and their values).

@rob-metalinkage responded by email:

There is no particular reason qb: can't be used as metadata for datasets - they do not strictly need to be RDF encoded datasets - and the same with VoID (and I had a conversation with Richard Cyganiak when I was looking into this). True, both have RDF-friendly special semantics - such as void:sparqlEndpoint - which indirectly requires an RDF representation via SPARQL, I guess. But it's optional.

The qb:DataSet definition is "Represents a collection of observations, possibly organized into various slices, conforming to some common dimensional structure." - it makes no statement about RDF or data being present - in fact the standard says "This cube model is very general and so the Data Cube vocabulary can be used for other data sets such as survey data, spreadsheets and OLAP data cubes [OLAP]."

So, under the open-world assumption, qb:DataSet may just be metadata. I'm not sure if dcat:Dataset works equally well for metadata or datasets.

qb:DSD is the structural description part of QB, and could certainly be used as properties to qualify either dcat:Dataset or dcat:Distribution instances.

AFAICT the only tricky thing to manage here is how the rdf:Property object described using QB is mapped to the structure of a dataset - for those instances such as spreadsheets etc where elements do not natively have URI names.

NB: There may need to be some clever entailment rules for profiles too - to allow narrower definitions of individual components to specialise a DSD defined in an ancestor profile.

(such as a commonly used dimension based on biological taxa codes, but a dataset where the range is limited to a specific genus)
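One possible shape for such a specialisation, sketched with plain QB and RDFS terms rather than any dedicated profile mechanism. All `ex:` names are hypothetical, and whether `rdfs:subPropertyOf` is the right way to relate the two components is an open assumption:

````turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/ns#> .

# Dimension defined in the ancestor profile: any biological taxon code.
ex:taxon a rdf:Property, qb:DimensionProperty, qb:CodedProperty ;
  rdfs:range skos:Concept ;
  qb:codeList ex:allTaxa .

# Narrower dimension for one dataset: codes limited to a single genus.
ex:canisTaxon a rdf:Property, qb:DimensionProperty, qb:CodedProperty ;
  rdfs:subPropertyOf ex:taxon ;
  rdfs:range skos:Concept ;
  qb:codeList ex:canisTaxa .

ex:allTaxa   a skos:ConceptScheme .
ex:canisTaxa a skos:ConceptScheme .
````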

Also see #60

@rob-metalinkage wrote:

> There is no particular reason qb: can't be used as metadata for datasets - they do not strictly need to be RDF encoded datasets - and the same with VoID

Can you expand a bit on VoID not being only for RDF? Given the definition "A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider", I'd say that VoID is _only_ about RDF triples...

@dr-shorthair I agree that QB can be used to represent dataset statistics. It's important not to muddy the waters by confusing:

  • the use of DCAT to represent stats datasets (eg StatDCAT-AP)
  • vs the use of QB to represent dataset stats

> Challenge... how the rdf:Property object described using QB is mapped to the structure of a dataset

  • Agree: as #161 says "The real challenge is how to do it for other datasets."

But I also see other challenges:

  • how to harmonize this "DCAT using QB" with VOID because VOID is very prevalent for RDF datasets
  • how to capture specific subsets, eg (see #161) "startup companies in Italy". AFAIK VOID can't express this (class/property partitions don't fix a property value) but maybe some VOID extensions can. And I think that qb:DSDs/slices can express it
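To illustrate the last point, here is a sketch of how a QB slice might pin dimension values to express "startup companies in Italy". The dimensions, measure and code values under `ex:` are invented for the example:

````turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/ns#> .

ex:dsd a qb:DataStructureDefinition ;
  qb:component [ qb:dimension ex:country ] ,
               [ qb:dimension ex:companyType ] ,
               [ qb:dimension ex:refPeriod ] ,
               [ qb:measure   ex:numberOfCompanies ] ;
  qb:sliceKey ex:byCountryAndType .

# The slice key names the dimensions that are held constant.
ex:byCountryAndType a qb:SliceKey ;
  qb:componentProperty ex:country, ex:companyType .

ex:companies a qb:DataSet ;
  qb:structure ex:dsd ;
  qb:slice ex:italianStartups .

# The slice itself carries the fixed values for those dimensions.
ex:italianStartups a qb:Slice ;
  qb:sliceStructure ex:byCountryAndType ;
  ex:country ex:Italy ;
  ex:companyType ex:Startup .
````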

> for those instances such as spreadsheets etc where elements do not natively have URI names.

But see CSVW. The future DCAT should interplay with such RDFization standards...
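A sketch of how CSVW could supply the missing URI names for spreadsheet columns. CSVW metadata is normally written as JSON, so this Turtle rendering with the `csvw:` vocabulary is an approximation; the file URL, column names and `ex:` properties are assumptions:

````turtle
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/ns#> .

ex:csvDistribution a dcat:Distribution ;
  dcat:downloadURL <http://example.org/population.csv> .

# CSVW description of the same file: each column name is mapped
# to the property IRI it should produce when converted to RDF.
[] a csvw:Table ;
  csvw:url "http://example.org/population.csv" ;
  csvw:tableSchema [
    a csvw:Schema ;
    csvw:column (
      [ a csvw:Column ;
        csvw:name "area" ;
        csvw:propertyUrl "http://example.org/ns#refArea" ]
      [ a csvw:Column ;
        csvw:name "pop" ;
        csvw:propertyUrl "http://example.org/ns#population" ]
    )
  ] .
````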

> so, under the open-world assumption qb:DataSet may just be metadata

+1, the question of whether the data is actually there (especially in RDF) is another matter to what the definitions say. I can imagine that there are things that are considered (dcat) datasets, but not (qb) datasets, because they're not usefully modeled as data cubes. Maybe a domain-specific file format, or e.g. a trained machine learning model (like a parameter set for https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html ), ... could be considered non-cube datasets.

FWIW Schema.org's Dataset type is slightly more inclusive than DCAT's in that it de-emphasises any requirement to be managed/curated/published in a catalog or portal.

Answer per email from @rob-metalinkage

@larsgsvensson asked "Can you expand a bit on VoID not being only for RDF? Given the definition "A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider", I'd say that VoID is _only_ about RDF triples..."

Several years ago I had a conversation with one of the authors, Richard Cyganiak (it was when I was at CSIRO and I don't have access to the mail thread any more).

My Use Case was describing datasets that are not _currently_ published as RDF, but that in an evolving Linked Data environment would ideally be, so we could kick start the process of providing fine grained semantics about data and services. I asked whether there is a need for a dataset to be stored as RDF - and if merely capable of being expressed as RDF was sufficient. From memory Richard C confirmed that this was a reasonable interpretation.

Given pretty much all the metadata in VoID is optional, there is no problem with describing TechnicalFeatures that relate to non-RDF access methods or distributions, and leaving out the RDF-specific sparqlEndpoint.
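Under that reading, a VoID description of a dataset that is not (yet) published as RDF might look like the sketch below. The dataset, publisher and the `ex:CsvDownload` feature are hypothetical, and this deliberately stretches VoID beyond its formal definition of a dataset:

````turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ns#> .

ex:survey a void:Dataset ;
  dct:title "Household survey 2017"@en ;
  dct:publisher ex:agency ;
  # Hypothetical feature describing a non-RDF access method.
  void:feature ex:CsvDownload .

# No void:sparqlEndpoint or void:vocabulary is given, because
# no RDF representation of the data exists yet.

ex:CsvDownload a void:TechnicalFeature ;
  rdfs:label "CSV file download"@en .
````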

Only properties like void:vocabulary, classPartition etc reference IRI identifiers, and hence a contract around the RDF model that would be assumed.

(QB helps resolve the shortfall, but void:vocabulary is not actually that useful anyway, as discussed above.) So the same issue applies in all these cases: how to map IRI-based identifiers to local identifiers (e.g. column names in a spreadsheet) is the one mechanism we need to think about.

Following @agbeltran's pointer to DISCO in https://github.com/w3c/dxwg/issues/164#issuecomment-373355860 :

DISCO includes, in a specific section, some considerations about QB, according to which it seems they consider a qb:DataSet to be related to dcat:Distribution. Quoting:

> [...] The Discovery Vocabulary contains a property “aggregation” (pointing from a disco data set to a Data Cube dataset) that indicates that a Cube dataset was derived by tabulating a record-level dataset.
>
> Data Cube provides for the description of the structure of such cubes, but also for the representation of the cube data itself, that is, the observations that make up the cube dataset. This is not the case for the Discovery Vocabulary, which only describes the structure of a dataset, but is not concerned with representing the actual data in it. The actual data is assumed to sit in a data file (e.g., a CSV file, or in a proprietary statistics package file format) that is not represented in RDF.
>
> The interplay of Data Cube and Disco needs further exploration regarding the relationship of aggregate data, aggregation methods, and the underlying microdata. The goal would be to drill down to the related microdata based on a search resulting in aggregate data. On the one hand aggregate data are often easily available and gives a quick overview. On the other hand microdata enable more detailed analyses.

@andrea-perego Thanks for the link!
http://rdf-vocabulary.ddialliance.org/discovery.html#useOfOtherVocabularies elaborates on the reuse and connections to other vocabs including DCAT.
But that's squarely aimed at statistical datasets. The challenge is how to partition metadata in two groups: for general vs for statistical datasets.

@larsgsvensson "Only properties like void:vocabulary, classPartition etc reference IRI identifiers, and hence a contract around the RDF model"

The challenge is how to represent such dataset descriptors for non-RDF datasets. And we need such descriptors to be compatible with RDF datasets, which I think means they should still be classes and properties!
Eg DISCO gives some examples at http://rdf-vocabulary.ddialliance.org/discovery.html#fig-descriptivestatistics. I feel there's a deep connection between Variable and rdf:Property that is not explored here.

@VladimirAlexiev there will be other groups as well - for example Geo, for which an "Application Profile" has been developed (GeoDCAT-AP). Perhaps you need something similar for statistical datasets?

Perhaps I'm pointing out the obvious, but there is StatDCAT-AP. Comparing DISCO to that should be a very profitable exercise.

(I myself have contributed to the HCLS profile)

StatDCAT-AP is at https://joinup.ec.europa.eu/release/statdcat-ap-v100. It's a fairly basic profile, an extension of the European DCAT-AP. It adds a small number of properties to DCAT-AP:

````turtle
<http://data.europa.eu/m8g/attribute> a rdf:Property ;
  rdfs:label "attribute"@en ;
  vann:usageNote "Aditional optional property. Cardinality [0..n]. This property links to a component used to qualify and interpret observed values, e.g. units of measure, any scaling factors and metadata such as the status of the observation (e.g. estimated, provisional). Attribute is a ‘conceptual’ entity that applies to all distribution formats, e.g. in case a dataset is provided both in SDMX and in Data Cube."@en ;
  rdfs:range qb:AttributeProperty .

<http://data.europa.eu/m8g/dimension> a rdf:Property ;
  rdfs:label "dimension"@en ;
  vann:usageNote "Aditional optional property. Cardinality [0..n]. This property links to a component that identifies observations, e.g. the time to which the observation applies, or a geographic region which the observation covers. Dimension is a ‘conceptual’ entity that applies to all distribution formats, e.g. in case a dataset is provided both in SDMX and in Data Cube."@en ;
  rdfs:range qb:DimensionProperty .

<http://data.europa.eu/m8g/numSeries> a rdf:Property ;
  rdfs:label "number of data series"@en ;
  vann:usageNote "Aditional optional property. Cardinality [0..n]. This property contains the number of data series contained in the Dataset. This property should be defined as rdfs:Lideral typed as xsd:date or xsd:dateTime."@en ;
  rdfs:range rdfs:Literal .

<http://www.w3.org/ns/dqv#hasQualityAnnotation> a rdf:Property ;
  rdfs:label "quality annotation"@en ;
  vann:usageNote "Aditional optional property. Cardinality [0..n]. This property links to a statement related to quality of the Dataset, including rating, quality certificate, feedback that can be associated to the Dataset."@en .

<http://data.europa.eu/m8g/statUnitMeasure> a rdf:Property ;
  rdfs:label "unit of measurement"@en ;
  vann:usageNote "Aditional optional property. Cardinality [0..n]. This property links to a unit of measurement of the observations in the dataset, for example Euro, square kilometre, purchasing power standard (PPS), full-time equivalent, percentage. Unit of measurement is a ‘conceptual’ entity that applies to all distribution formats, e.g. in the case when a dataset is provided both in SDMX and in Data Cube."@en ;
  rdfs:range skos:Concept .
````

You'll see that the ranges are from Data Cube, but it is not necessary that the objects are part of a QB description. As stated in the usage notes, those properties can apply irrespective of the encoding of the data, in SDMX or QB or otherwise.
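A sketch of how these properties might be used on a dataset whose only distribution is SDMX. The `stat:` prefix name, the `ex:` resources, the download URL and the integer value for the number of series are all assumptions made for illustration:

````turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix stat: <http://data.europa.eu/m8g/> .
@prefix ex:   <http://example.org/ns#> .

ex:unemployment a dcat:Dataset ;
  dct:title "Unemployment rate by region"@en ;
  # Conceptual components, stated independently of any encoding.
  stat:dimension ex:refArea, ex:refPeriod ;
  stat:attribute ex:obsStatus ;
  stat:statUnitMeasure ex:Percentage ;
  stat:numSeries 12 ;
  dcat:distribution ex:sdmxDistribution .

# The data itself is only available as SDMX, not as QB RDF.
ex:sdmxDistribution a dcat:Distribution ;
  dcat:downloadURL <http://example.org/unemployment.sdmx.xml> .
````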

To move this issue forward we need some options:
1) do nothing - reject related requirements that can be satisfied by other vocabularies as out of scope
2) extend DCAT to possibly duplicate or subclass other vocabularies
3) put in "hard guidance" in DCAT regarding recommended vocabularies
4) push problem to profiles of DCAT, indirectly via Profile Guidance deliverable
5) push the problem to the description of the profiles that the data conforms to
6) other options?

(Note that 3, 4 and 5 can kind of cascade - the hard guidance is to use a DCAT profile, which in turn specifies a form of data description profile that uses the vocabularies required... - but we can do 5 without 3 or 4, and 4 without 3.)

I think it is time to discuss options and vote.

I believe it is of interest to express some general guidance on how DCAT and the other dataset-defining vocabularies are related and what the best practices are in combining them. However, the best way is to put that relationship/guidance in the other vocabularies (too). Hence they have to be revised/updated also, otherwise one has 2 messages: one from the DCAT-2018 perspective and one from datacube(2014)/void(2011) perspective. And that leads to more confusion.

For me this is a meta expression over all the vocabularies & application profiles. Not sure if there is a place to put this.

What you describe is an Application Profile. Best done with RDF shapes, in addition to prose.
I don't see how you can change the TRs of those other vocabs because it's not easy to reconvene those working groups. Best to do it externally, as you suggest.

http://vocab.getty.edu/doc/#Descriptive_Information describes one such combination (void, dcat and adms) in prose and diagrams.
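As a small illustration of the "shapes" part, here is a sketch of one constraint an application profile might carry. SHACL is assumed here (ShEx would be another option), and the shape name and the specific constraints are invented rather than taken from any published profile:

````turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/ns#> .

# Every dcat:Dataset must have a title and at least one
# distribution that is typed as dcat:Distribution.
ex:DatasetShape a sh:NodeShape ;
  sh:targetClass dcat:Dataset ;
  sh:property [
    sh:path dct:title ;
    sh:minCount 1
  ] ;
  sh:property [
    sh:path dcat:distribution ;
    sh:class dcat:Distribution ;
    sh:minCount 1
  ] .
````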

On Tue, Sep 4, 2018, 16:08 bertvannuffelen wrote:

> I believe it is of interest to express some general guidance on how DCAT and the other dataset-defining vocabularies are related and what the best practices are in combining them. However, the best way is to put that relationship/guidance in the other vocabularies (too). Hence they have to be revised/updated also, otherwise one has 2 messages: one from the DCAT-2018 perspective and one from datacube(2014)/void(2011) perspective. And that leads to more confusion.
>
> For me this is a meta expression over all the vocabularies & application profiles. Not sure if there is a place to put this.


Including the comment from an email by Guillaume Duffes:

- Define guidelines how to create a DCAT description of a VOID or Data Cube dataset (https://github.com/w3c/dxwg/issues/88):
    ** We agree that this "DCAT using QB" should be harmonised with VOID and clarified. It is crucial for the official statistics community to use the proper semantics and relationships between these vocabularies.
    ** In our mind, qb:DataStructureDefinition and qb:Slice are meant to express subsets (see also #161, https://github.com/w3c/dxwg/issues/161) like "startups in Italy". A similar use case is provided in the RDF Data Cube Recommendation.

Closing the issue as we do not have enough use cases grounding a solution; see the resolution and related discussion at https://www.w3.org/2021/01/20-dxwgdcat-minutes.html#r04.
