Dxwg: How to express available formats for a dcat:DataService

Created on 6 Sep 2019  ·  22 Comments  ·  Source: w3c/dxwg

Sorry for coming late to this discussion, at the risk of missing obvious solutions and asking ignorant questions. We (the Norwegian DCAT-AP-NO working group) are looking forward to being able to describe APIs in a DCAT catalog, and the introduction of the dcat:DataService class seems to address that need neatly. Our biggest concern now is that users of a catalog of dcat:DataServices would expect to get hints on the available formats for the cataloged APIs.

Based on user needs, information on formats is so important to the users of the data catalog that we made it one of the elements on the listings page, see: https://fellesdatakatalog.brreg.no/apis

In DCAT2 this need seems to be neglected, leaving the publishers with these three options:

  1. Repeat dcat:Distribution for each format, populating (or polluting) the catalog with several dcat:Distributions for each dcat:DataService, with dct:format/dcat:mediaType (and possibly dcat:accessURL) as the only difference.

  2. Exclude information on available formats from the dcat:Catalog, forcing the users to leave the data portal and follow either dcterms:conformsTo or dcat:endpointDescription (if the dcat:endpointDescription is a URL) to detect formats.

  3. Use dct:description or dcat:endpointDescription (for dcat:DataService) and provide format-information as text only.

Option 1)
We think this is tedious for providers/publishers and not very user-friendly for the catalog end users. For dcat:DataServices providing content negotiation, it makes even less sense (to us), since the dcat:accessURL will be identical for each dcat:Distribution. This approach also implies that every dcat:DataService has at least one dcat:Distribution.
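
To illustrate the repetition that option 1 implies, here is a rough sketch of a content-negotiating service described with one distribution per format. All ex: resources and the example.org URLs are made up for illustration; only the dcat: terms come from DCAT 2.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <https://example.org/> .   # hypothetical namespace

ex:weather-api a dcat:DataService ;
    dcat:endpointURL <https://example.org/api/weather> ;
    dcat:servesDataset ex:weather-observations .

ex:weather-observations a dcat:Dataset ;
    dcat:distribution ex:weather-json , ex:weather-csv .

# The two distributions below differ only in their media type;
# the access URL and the access service are identical.
ex:weather-json a dcat:Distribution ;
    dcat:accessURL <https://example.org/api/weather> ;
    dcat:accessService ex:weather-api ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/application/json> .

ex:weather-csv a dcat:Distribution ;
    dcat:accessURL <https://example.org/api/weather> ;
    dcat:accessService ex:weather-api ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> .
```

Each additional format adds another near-identical dcat:Distribution block.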

Option 2)
We are aware that information on formats is in the dcat:endpointDescription, but rarely in a way that makes sense in an RDF/linked data environment. This leaves the users with no filtering options and the catalog provider with no easy way of enriching the catalog service itself with information on formats for dcat:DataServices by harvesting from dcat:endpointDescriptions.

Option 3)
Option 3 (possibly combined with option 2) may be sufficient for some, but is not a very machine-readable approach. Filtering options will also be limited.

User stories:
As a portal end user I would like to know in which formats a given data service provides data, so that I can get information on available formats without leaving the catalog, and search and filter the datasets / data services based on available formats.

As a catalog provider I would like to express available formats for a dcat:DataService without having to repeat new dcat:Distributions for each available format.

As a dcat:DataService provider I would like to provide information on available formats for APIs that do not distribute any datasets (e.g. a currency-conversion service or a CSV-to-DCAT transformation service) and therefore have no relation to any dcat:Distribution.

Proposal:
Add a “dcat:availableFormats” property to dcat:DataService, allowing publishers to list all data serializations the service offers in a machine-readable way.
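
A minimal sketch of how the proposed property might be used. Note that dcat:availableFormats is only the proposal made in this issue and is not defined in the DCAT namespace; the ex: service and its endpoint URL are hypothetical.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <https://example.org/> .   # hypothetical namespace

# A standalone service that serves no dataset and has no distribution.
# dcat:availableFormats is the PROPOSED property, not part of DCAT.
ex:currency-api a dcat:DataService ;
    dcat:endpointURL <https://example.org/api/currency> ;
    dcat:availableFormats
        <https://www.iana.org/assignments/media-types/application/json> ,
        <https://www.iana.org/assignments/media-types/text/csv> .
```

With something like this, a portal could filter services by format without traversing any dcat:Distribution.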

PS: A humble thank you to all that have contributed to this work.

(Posted on behalf of the DCAT-AP-NO working group in Norway)

dcat DataService feedback future-work

All 22 comments

As I understand the model, a dcat:Distribution may have multiple dct:format values, and by implication, those formats would be available via any dcat:accessService/dcat:DataService associated with that distribution. I think this addresses your first and second user story.
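
Under this reading of the model, the description might look roughly as follows. This is only a sketch of the interpretation stated above, with hypothetical ex: resources and URLs.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <https://example.org/> .   # hypothetical namespace

ex:observations a dcat:Dataset ;
    dcat:distribution ex:observations-via-api .

# One distribution carrying several dct:format values (the reading above).
ex:observations-via-api a dcat:Distribution ;
    dcat:accessService ex:observations-api ;
    dct:format
        <https://www.iana.org/assignments/media-types/application/json> ,
        <https://www.iana.org/assignments/media-types/text/csv> .

ex:observations-api a dcat:DataService ;
    dcat:endpointURL <https://example.org/api/observations> .
```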

The third user story is about a dcat:DataService that is a data processing service, and I agree that the current DCAT model does not handle this, but depends on the dcat:endpointDescription/rdfs:Resource object to provide information about operations, input and output formats, templates for URL parameters, etc. From the discussion in #432 it seems that the data processing functions included in the dcat:DataService definition are construed in the context of access to a particular dataset (subset, reproject, upscale, downscale, regrid...).

Your user story (if I understand correctly) is about a data processing service that takes an input resource, does some processing, and returns an output resource. There is other information one would want to have to make a searchable catalog for such data processing services, for example a 'function' property (e.g. currency-conversion, CSV-to-DCAT-transformation, units conversion, geographic reprojection...). I think these kinds of data processing services would need to be represented as another kind of dcat:Resource for a catalog.

Thanks for the comments and suggestions. Unfortunately they arrive largely too late to be considered for the current release of DCAT. However, they should be added to the backlog for a future update. The plan is to maintain DCAT as an 'evergreen standard' so revisions should come reasonably frequently. Of course there is also nothing to stop you developing a DCAT extension for data-processing-services to test the waters.

Meanwhile, some observations, partly in response to @smrgeoinfo's comment: since a data-processing-service will usually return a serialized dataset (i.e. a dcat:Distribution), I think it usually is a kind of dcat:DataService. (I disagree with @smrgeoinfo here ;-) )

Nevertheless, describing the details of a particular data-service type, or the specifics of a particular endpoint, was considered beyond the scope of this DCAT version. Rather, we delegated those details to external specifications through the values of the dct:conformsTo and dcat:endpointDescription properties of the service description. Various standards for these already exist, as mentioned in the document.
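
As a rough illustration of that delegation, a service description might point to external documents like this. The ex: service, the specification URL, and the description URL are all placeholders, not real resources.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <https://example.org/> .   # hypothetical namespace

ex:observations-api a dcat:DataService ;
    dcat:endpointURL <https://example.org/api/observations> ;
    # The standard the service implements (placeholder URL).
    dct:conformsTo <https://example.org/specs/some-api-standard> ;
    # Machine-readable description of this particular endpoint (placeholder URL).
    dcat:endpointDescription <https://example.org/api/observations/description.json> .
```

Any format information then lives inside the linked documents rather than in the catalog graph itself.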

Thanks a lot for the comments.
@smrgeoinfo: We have to conform to the EU's DCAT-AP, which puts 0..1 on both dcat:mediaType and dct:format for distributions. We will feed this back to their AP work, but it is a hard sell, since multiple formats for a distribution seems to be in breach of the concept of a dcat:Distribution(?)

@dr-shorthair: Yes, we are rudely late. I see the point of what to bring in and what to leave out of the scope. Still, we find formats for APIs as so vital for the users that for the future, we suggest lifting it into the dcat:Catalog itself.

:)

@oystein-asnes wrote:

we find formats for APIs as so vital for the users that for the future, we suggest lifting it into the dcat:Catalog itself.

Could you provide a concrete proposal?

Two things:

  1. I agree with @oystein-asnes regarding the interpretation that a dcat:Distribution can only have one format, in part due to the DCAT-AP constraint and in part from the definition of dcat:Distribution in the original DCAT recommendation, which says that distributions are used to "represent different formats of the dataset or different endpoints".
  2. Maybe we do not need to introduce a new property like dcat:availableFormats. Maybe we can reuse the dct:format property on DataService. I see nothing in the Dublin Core specification hindering repeating it; the limitation on repeating it on dcat:Distribution is based on the semantics of the class, not the property itself.
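
Point 2 could look roughly like the sketch below. It assumes dct:format is simply repeated on the service, which is the suggestion made here rather than something the DCAT text spells out; the ex: service and its URL are hypothetical.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <https://example.org/> .   # hypothetical namespace

# dct:format repeated directly on the service, one value per offered format.
ex:transform-api a dcat:DataService ;
    dcat:endpointURL <https://example.org/api/transform> ;
    dct:format
        <https://www.iana.org/assignments/media-types/text/turtle> ,
        <https://www.iana.org/assignments/media-types/application/json> .
```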

I agree with matthiaspalmer on this one. If a dataset is available in multiple formats, each can be described as a distribution.

The cardinality of dc:format looks like an issue with the EU's DCAT-AP; DCAT isn't the source of the problem. I don't think DCAT should need to change here. You could choose your own profile which uses an entailment rule to transform to DCAT-AP; that's a system decision about which profiles to support. Better to fix DCAT-AP though.

Thanks @agreiner - indeed, each format corresponds with a :Distribution.

There is no direct relationship from :DataService to :Distribution (see https://w3c.github.io/dxwg/dcat/images/DCAT-summary-all-attributes.png) but I think the model is correct.
So in order to indicate multiple formats you could do something like

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix my:   <https://example.org/my#> .   # placeholder namespace for the example

my:DataProcessingService987 a dcat:DataService ;
    dcat:servesDataset [ a dcat:Dataset ;
        dcat:distribution [ a dcat:Distribution ;
            dcat:mediaType  <https://www.iana.org/assignments/media-types/text/csv> ;
        ] ;
        dcat:distribution [ a dcat:Distribution ;
            dcat:mediaType  <https://www.iana.org/assignments/media-types/text/turtle> ;
        ] ;
        dcat:distribution [ a dcat:Distribution ;
            dcat:mediaType  <https://www.iana.org/assignments/media-types/application/json> ;
        ] ;
    ] ;
.

@rob-metalinkage

The cardinality of dc:format looks like an issue with the EU's DCAT-AP; DCAT isn't the source of the problem. I don't think DCAT should need to change here. You could choose your own profile which uses an entailment rule to transform to DCAT-AP; that's a system decision about which profiles to support. Better to fix DCAT-AP though.

I don't agree. As @matthiaspalmer notes at https://github.com/w3c/dxwg/issues/1055#issuecomment-529869464, the definition of dcat:Distribution in DCAT 2014 suggests a Distribution has only one format: "_Represents a specific available form of a dataset_". In addition, the definition of dct:format in the DCAT spec at https://www.w3.org/TR/vocab-dcat/#Property:distribution_format is "_The file format of the distribution_"; the format, not a format, implying a single one.

Wording is a little loose but the intention was clear.

I agree with @makxdekkers but also add that the object of dct:format is the format of the subject of dct:format, and it is hard to imagine single resources that are simultaneously more than one format.

Just in case it was not clear: since dcat:mediaType is an rdfs:subPropertyOf dct:format, the snippet above could also be written

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix my:   <https://example.org/my#> .   # placeholder namespace for the example

my:DataProcessingService987 a dcat:DataService ;
    dcat:servesDataset [ a dcat:Dataset ;
        dcat:distribution [ a dcat:Distribution ;
            dct:format <https://www.iana.org/assignments/media-types/text/csv> ;
        ] ;
        dcat:distribution [ a dcat:Distribution ;
            dct:format <https://www.iana.org/assignments/media-types/text/turtle> ;
        ] ;
        dcat:distribution [ a dcat:Distribution ;
            dct:format <https://www.iana.org/assignments/media-types/application/json> ;
        ] ;
    ] ;
.

If you cannot contemplate content negotiation occurring in your implementations then you are free to use the long-hand version @dr-shorthair suggests. If you have content negotiation and use that pattern then there is a lot of redundant duplication of the resource identifier, but it doesn't break. DCAT-AP is effectively saying that's the pattern you MUST use. If that is unacceptable to a user of DCAT-AP (as implied by the feedback), that's still an issue with DCAT-AP, not DCAT itself. The response to the question should not be dictated by whether people wish to use content negotiation in their own environments, or understand its application in general, but by whether DCAT itself needs to change, and I don't think it does, because it's not the source of the restriction which is apparently problematic.

@rob-metalinkage
As far as I am concerned, DCAT-AP implements the approach implied by DCAT, namely that one distribution is associated with one format. I do not understand how you can read https://www.w3.org/TR/vocab-dcat/#class-distribution differently. If DCAT had contemplated a shortcut for content negotiation, it would have said so.

@makxdekkers that's _an_ interpretation, but it explicitly states "This represents a general availability of a dataset it implies no information about the actual access method of the data" [1] and nowhere can I see where it explicitly states "one distribution is associated with one format." Can you provide a pointer to where that is stated?

[1] http://www.w3.org/ns/dcat#Distribution

Unless it is not obvious that "Examples of distributions include a downloadable CSV file, an API or an RSS feed" includes "API", and APIs can definitely support multiple formats...?
Anyway, this is all just information. It's up to DCAT contributors to decide if they wish to support an end-run around this, and if the intent is truly to restrict one format per Distribution, then they should add an axiom to declare this.

@rob-metalinkage
I already admitted that the wording in DCAT 2014 is a little loose, so it indeed doesn't contain the exact text that you are looking for: "one distribution is associated with one format." But you knew that already; it wasn't really necessary for you to ask me.

However, at the same time, the specification does not anywhere imply that a single distribution could have more than one format; none of the examples give that impression and the text that I referred to does imply, at least to me, that there is indeed a maximum of one format for a distribution: _a specific_ form, _the_ format.

Other than that, I am just stating what my honest understanding was of the intention of the group that developed DCAT 2014 and of which I was a member.

It seems to me that the situation where a distribution might have multiple formats is via dcat:Distribution/dcat:accessService/dcat:DataService, where the data service offers different formats via content negotiation. As far as I can tell, there really isn't any mechanism to describe content negotiation options other than via the dcat:DataService/dcat:endpointDescription. The interpretation seems to be that in dcat:Distribution each dcat:downloadURL is associated with one dcat:mediaType (or dcat:packageFormat). @dr-shorthair's encoding of various formats for a Dataset accessed via a service is fine, but really doesn't help the user figure out how they are supposed to access a particular format. Perhaps there would need to be an understanding to look at the dcat:DataService/dcat:endpointDescription.

@makxdekkers thanks for the clarification around the intent. In that case, to uphold that intention, it looks like there are two fixes needed for DCAT:

1) make that intention absolutely clear in unambiguous wording and axiomatisation
2) clearly identify how APIs including content-negotiation (in various forms) can advertise the multiple formats they do offer (maybe revisit the proposed approach of adding new properties)

The alternative is to clearly state that a single value is _not_ the intention by providing examples and making the wording more consistent. I could live with either way, but it does appear the current ambiguity is problematic.
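
For fix 1, one possible axiomatisation is an OWL cardinality restriction such as the sketch below. This is not part of DCAT, only an illustration of what "adding an axiom" might mean. Under OWL's open-world semantics it would lead a reasoner to infer that two stated formats denote the same thing rather than reject the data, so a profile that wants validation may prefer a different mechanism.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Illustrative only: every dcat:Distribution has at most one dct:format value.
dcat:Distribution rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty dct:format ;
    owl:maxCardinality "1"^^xsd:nonNegativeInteger
] .
```

Because dcat:mediaType is a subproperty of dct:format, such an axiom would also constrain distributions that state both properties.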

Could we have specific wording proposals please? Preferably in a PR!

Thank you all for an enlightening discussion. Besides the discussion on cardinality for dcat:Distribution, our concern is that a format property is the only one we need to complement an API description based on dcat:DataService properties. Unless there are downloadable files involved, the rest of the dcat:Distribution class does not add any value.

By using dcat:distribution to express format (regardless of content negotiation and multiple formats), we need to add dcat:accessURL (mandatory in DCAT-AP), and it also makes sense to add dcat:distribution (on the dcat:Dataset) to associate the distribution with the related dataset. The result is a dcat:Dataset/dcat:Distribution/dcat:accessService/dcat:DataService model instead of a much simpler dcat:Dataset/dcat:DataService (or only dcat:DataService, if standalone data services are accepted in the catalog).

We are fine with a catalog service (a data portal) being agnostic on _how_ to access data in a certain format, leaving this to the endpoint description, but our users expect to be able to filter/search on formats.

Adding a dcat:Distribution for each format seems to be a file-oriented way of describing APIs, making catalogs (data portals) less user-friendly for humans and the model more complex. We might have missed out on the basics, so feel free to convince us that we need both dcat:Distribution and dcat:DataService to provide an API description.

The addition of DataService to the DCAT vocabulary is perhaps the most significant enhancement in this version. It was in response to a number of use-cases submitted in the early phase of the work. We worked on the details of the model starting around 18 months ago. DataService made it into the head on 10 June 2018 (sha 471a597427b8770c678c2ba03bb02938ee821a47), and the model was finalized with the last details merged into the head 28 March 2019 (sha 33035d5fa4076fce19b221d173eba2b2b6ed98e9) .

The specific requirement that you are raising now was not visible to us while we were preparing the solution. In the example above I show that the available formats from a dcat:DataService can be made available on the path dcat:servesDataset/dcat:distribution/dct:format . So the model is competent to express the information in this requirement, though perhaps not optimized for it. The model is integrated and consistent with the original DCAT view (which separates conceptual datasets from their serializations) and I disagree that this path is particularly 'file-oriented', especially when implemented with blank-nodes as shown. However, in every model (ontology) choices are made about which relationships to give names to, and which ones to leave as paths through a graph. These choices are usually informed by known applications.

This issue is now tagged as a future priority, so let's look again when the dust on the CR/Rec process has settled.
