Dxwg: A separate class for DatasetSeries (?)

Created on 12 Nov 2020 · 34 Comments · Source: w3c/dxwg

Do we need a separate class for DatasetSeries?
Below are some leftovers from the discussion in issue #868.

For future reference (v3?): I agree DatasetSeries should be a separate subclass of dcat:Resource. Noting that a series would have all the properties that are specific to a dataset, from a modeling perspective it might be treated as a subclass of Dataset, with the addition of a mandatory (2..N) 'hasPart' relationship, and properties indicating how the 'granules' in the collection are defined (time, space...).
_Originally posted by @smrgeoinfo in https://github.com/w3c/dxwg/issues/868#issuecomment-518337457_

Yes @smrgeoinfo that is my thinking as well.
Richer treatment of relations between resources (esp. datasets) is one of the features that has been added in DCAT2, so we have the platform already.
https://www.w3.org/TR/vocab-dcat-2/#qualified-relationship
_Originally posted by @dr-shorthair in https://github.com/w3c/dxwg/issues/868#issuecomment-518389845_

dataset-series dcat due for closing

All 34 comments

Agree that this is very useful addition and would be welcomed in the Aviation industry, as datasets are published as part of a series each month.

We have modelled the dcat:Catalog as representing a "Dataset Series" and dcat:Dataset for all the datasets in that Series.
The main reason for this is that a dcat:Dataset is defined as "A collection of data..." and dcat:Catalog as "A curated collection of metadata about resources". This closely maps the Series/Dataset relationship.

The proposals in section 11.1 do not seem to fit this model.
1) Publishing a dcat:Dataset as a "series" and the "dataset" as a dcat:Distribution breaks the DCAT model.
2) Using dcat:Dataset as both a "series" and a Dataset breaks the DCAT model.

Since a dcat:Catalog can refer to many other dcat:Catalog entities, it can be used to model the idea that an overarching Catalog is made of many Catalog "dataset series" (each of which contains the actual Datasets).

We suggest that Section 11 makes this the preferred solution.

To make it even more specific (and inference-able!), we should add a predicate from dcat:Dataset to dcat:Catalog called dcat:series (that is used when the Catalog acts as a Series metadata)
This latter point would also address https://github.com/w3c/dxwg/issues/1288
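A minimal sketch of what this proposal might look like in Turtle. All `ex:` resources and titles are made up for illustration, and `dcat:series` is only the property suggested in this comment; it is not a term defined by DCAT.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

# Overarching catalog that contains a nested catalog acting as a "series".
ex:aviationCatalog a dcat:Catalog ;
    dcterms:title "Aviation data catalogue"@en ;
    dcat:catalog ex:obstacleSeries .

# Nested catalog playing the role of the dataset series.
ex:obstacleSeries a dcat:Catalog ;
    dcterms:title "Vertical obstacles (monthly series)"@en ;
    dcat:dataset ex:obstacles-2020-10 , ex:obstacles-2020-11 .

# Member dataset pointing back at its series.
# NOTE: dcat:series is hypothetical (proposed above), not part of DCAT.
ex:obstacles-2020-11 a dcat:Dataset ;
    dcterms:title "Vertical obstacles, November 2020"@en ;
    dcat:series ex:obstacleSeries .
```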

I would argue against saying that a dcat:Catalog should be sometimes "reused" to also be a dcat:DatasetSeries just because a Catalog is often also connected to a software instance, web site, user interface, data dumps, etc., whereas a Dataset Series is something recorded in such Catalog.

In addition, in Czechia, we model some dcat:Datasets as dcat:DatasetSeries and we have an additional constraint that a dcat:Dataset which is a dataset series cannot have any distributions, just other datasets as parts.
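One way such a constraint could be expressed is as a SHACL shape. This is only a sketch under assumptions: the shape name is made up, and `dcat:DatasetSeries` is used as the class under discussion here, not as an established DCAT term. It does not reproduce the actual Czech profile.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix sh:      <http://www.w3.org/ns/shacl#> .
@prefix ex:      <http://example.org/shapes/> .

# Hypothetical shape: a dataset series has no distributions of its own,
# and everything it lists as a part is a dataset.
ex:DatasetSeriesShape a sh:NodeShape ;
    sh:targetClass dcat:DatasetSeries ;
    sh:property [
        sh:path dcat:distribution ;
        sh:maxCount 0 ;            # no distributions directly on the series
    ] ;
    sh:property [
        sh:path dcterms:hasPart ;
        sh:class dcat:Dataset ;    # parts must be datasets
    ] .
```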

@riannella Why do you think a dataset series is not also "A collection of data...", just structured differently - in partial datasets?

I think we clearly and unambiguously need to define what a "Dataset Series" is first.
Section 11 says

Dataset series are defined in [ISO-19115] as a collection of datasets […] sharing common characteristics.

Hence, a dataset series is a collection of datasets, not a dataset itself.

I agree that we need a clear definition of the series.

First, I am not sure if the ISO-19115 definition (for geodata) should be taken as is also for DCAT.

Second, we could iterate that question and ask whether a collection of collections of data is a collection of data, i.e. whether in this sense, the collections can be flattened, in which case a dataset series would also be a dataset.

If not, then I think there is no need to forcibly merge dataset series with either catalog or dataset. It can be a completely standalone subclass of dcat:Resource, similarly to dcat:CatalogRecord.

@riannella it depends on how you conceptualize it. I think there is a case for saying that a dataset series is _conceptually_ a single dataset, but composed of several subsets, which share most of their properties. A catalog is more heterogeneous.

Both datasets and dataset series are resources that can be catalogued. Different dataset series can be created through different aggregations of datasets. These relations should be allowable in DCAT2

I agree with the view expressed by @dr-shorthair and the flattening idea from @jakubklimek. This is also the direction followed in the early draft in PR #1292.

Considering how dataset series are meant in different portals, standards, or application domains besides those already considered in the document might help us to ground and generalize better the current definition.

@sjskhalsa wrote:

Both datasets and dataset series are resources that can be catalogued. Different dataset series can be created through different aggregations of datasets. These relations should be allowable in DCAT2

@sjskhalsa, I think this is the idea supported in section 11.1 (Dataset series specification), which uses dct:hasPart and its inverse dct:isPartOf. Do you have anything different in mind?

@riccardoAlbertoni, no, I think 11.1 is a good solution and agree with @jakubklimek that dcat:DatasetSeries should become a separate subclass of dcat:Resource. Datasets can be thought of as the data payload (e.g. files on a storage medium) whereas a Dataset Series could be a "virtual" Dataset in that it exists only by virtue of the metadata describing a collection of Datasets.

I would be happy to see a new dcat:DatasetSeries class added that supports the metadata describing the collection of Datasets (as per @sjskhalsa's comment above).

Might be related to the Dublin Core Collection Description Application Profile (#244).

I see a distinct difference between a collection and a series. A collection is any grouping based on whatever characteristics you choose. A collection can be created after the fact with items that have no inherent interdependence (a collection of teacups, a collection of paintings). A series generally has a temporal basis and there is the implication of interdependence in the process of creation. Also, a collection is bounded and a series can be open-ended. (Series are defined as "continuing resources" in library cataloging parlance.)

I would find it hard to state that a series is a single dataset, but can easily see a single metadata record for the series as well as individual metadata records for the members of the series. A certain amount depends on how distinct the members of the series are; if they contain the same data representing a different time then a single metadata description may suffice; if there are significant variants (change in creator, or the addition of elements) then dataset-level metadata may be needed.

Geospatial datasets arranged as adjacent or overlapping tiles or scenes are a common non-temporal series. Go into any map shop (!) and you will find map-series of the same theme and scale, covering a jurisdiction. So I don't think we would want to limit the idea of 'dataset series' to temporal series. The point is that _most_ of the metadata is shared, which is why it is more specific than 'collection'. And I would assert quite strongly that such a series is _conceptually_ a single dataset. That is certainly how they are thought of in the geospatial community.

@dr-shorthair Yes, I am aware that there are non-temporal uses of "series" in the world. Perhaps I misunderstood, but I was under the impression that there was an interest in defining time-based, ongoing resources, things that are considered to be "serial" in nature. Those pose some particular metadata requirements and afford certain services which users of DCAT may need. For "serial" resources the seriality is key; sharing metadata isn't what defines that type of relationship, as important metadata can change over time, for example when the organization providing the data changes name (as government offices tend to do). This might be a separate discussion if dataset series is being defined based on commonality of metadata.

I think it might be good to distinguish between time series on one hand and slices on the other hand. In my mind, different maps covering a geographic area are more like slices than a series. But it does depend on the definition of 'series'.

@makxdekkers said:

I think it might be good to distinguish between time series on one hand and slices on the other hand. In my mind, different maps covering a geographic area are more like slices than a series. But it does depend on the definition of 'series'.

IMO, we should address these issues stepwise:

  1. First we should agree on a general definition of dataset series, and decide which are the relevant properties
  2. Then, if need be, we can specialise this notion (e.g., dataset series that are time series), and decide which additional properties should be added.

About point (1), I think the current definition does the job:

A collection of datasets that are published separately, but share some common characteristics that groups them, and could also be made available as a single dataset.

although, considering what has been discussed so far, it should probably be slightly revised (by dropping the last clause):

A collection of datasets that are published separately, but share some common characteristics that groups them.

About point (2), actually, we already have something specifically related to time series. Quoting from §11.1 How to specify dataset series:

It is worth noting that a dataset series may evolve over time, by acquiring new datasets. E.g., a dataset series about yearly budget data will acquire a new child dataset every year. In such cases, it might be important to link the yearly releases with relationship specifying the previous, next, and latest ones. In such scenario, DCAT recommends [...] using the [VOCAB-ADMS] properties adms:prev, adms:next, and adms:last, respectively.
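As a rough illustration of the quoted guidance, the yearly budget case could be written as follows. The `ex:` identifiers and titles are invented, and the series is typed as a plain `dcat:Dataset` on the assumption that no dedicated class exists yet at this point in the discussion.

```turtle
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

# The series, linking to its yearly child datasets.
ex:budget a dcat:Dataset ;
    dct:title "Budget data"@en ;
    dct:hasPart ex:budget-2018 , ex:budget-2019 , ex:budget-2020 .

# Each yearly release points to its neighbours and to the latest release.
ex:budget-2018 a dcat:Dataset ;
    dct:isPartOf ex:budget ;
    adms:next ex:budget-2019 ;
    adms:last ex:budget-2020 .

ex:budget-2019 a dcat:Dataset ;
    dct:isPartOf ex:budget ;
    adms:prev ex:budget-2018 ;
    adms:next ex:budget-2020 ;
    adms:last ex:budget-2020 .

ex:budget-2020 a dcat:Dataset ;
    dct:isPartOf ex:budget ;
    adms:prev ex:budget-2019 .
```

Note that every `adms:last` link has to be updated whenever a new yearly release is added.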

@kcoyle , are these properties (at least partially) addressing your comments? Which additional metadata do you think should be added for time series?

Thanks, @andrea-perego. The use cases that I'm aware of for time-based datasets are:

  • For an ongoing serially issued dataset, the ability to know when each issuing cycle or date is expected such that one can program automatic downloads (e.g. first of every month)
  • Conversely, this also allows a user to quantify if a scheduled update has not been received
  • The complication is that there are often exceptions to the regularity (e.g. published every week except for the week including December 25 and no publication from August 15-31). Although it may be hard to develop a predictive algorithm* it could be useful to respond via API with helpful comments.

The "next" could be a solution if "next" will include a future dataset. (Feb 27 current; next March 6, as recorded on Feb 27)

I'm still concerned that "datasetSeries" may eventually need a more specific definition. "... share some common characteristics that groups them" could possibly be augmented with "area of coverage, source of data collection..." and other "characteristics" that will help users know if what they have is a series.

* Libraries have these for serial publications but they are horribly complex

A Map or Image Series usually has in common

  • scale (e.g. 1:25,000, 1:50,000, 30m pixels)
  • themes (e.g. roads, topography, history, wavelength bands)
  • publisher
  • some quality standards
  • maybe revision cycle (version 1, version 1990)
  • license

What might vary between tiles, scenes, or sheets is

  • survey and publication dates
  • surveyor or sensor
  • area of interest (coverage) - which often has overlaps between tiles

Thanks, @kcoyle & @dr-shorthair .

@kcoyle , at the moment, issuing cycles / updates are covered in DCAT by dct:accrualPeriodicity. Do you think something more specific / different is needed?

About metadata shared / varying between datasets in a series, the draft includes a section on "property values inheritance" trying to provide guidance on which information should be specified in records of dataset series:

https://raw.githack.com/w3c/dxwg/dcat-dataseries-issue1272/dcat/index.html#dataset-series-properties

Do you see any gaps? Should any additional aspects be covered?

@kcoyle said:

I'm still concerned that "datasetSeries" may eventually need a more specific definition. "... share some common characteristics that groups them" could possibly be augmented with "area of coverage, source of data collection..." and other "characteristics" that will help users know if what they have is a series.

Maybe we can include some examples (either in the definition or in the usage note) to clarify this. E.g.:

A collection of datasets that are published separately, but share some common characteristics that groups them.
Examples include: budget datasets released on a yearly basis, population statistics split into age groups, datasets of images from a satellite network, each covering a different geographical area.

WDYT?

@andrea-perego Thanks again. Because irregular periodicity is very complex it is probably best to wait until there is a concrete use case by someone employing DCAT. Solving it will be a misery akin to the versioning problem. Perhaps it can sit in the background until needed.

The section you refer to is quite complete. I do note that it says that items in a series are a subset:

A dataset series can be seen as the result of subsetting (or slicing) a single dataset based on the values of one or more metadata element.

Which is different to the definition above:

Dataset series are defined in [ISO-19115] as a collection of datasets […] sharing common characteristics.

This implies to me that the dataset series that @riannella is working with is not precisely the result of subsetting but instead can be formed by collecting. It also appears to me that @dr-shorthair's example may not result from subsetting based on values. Perhaps the definition in the draft needs to include datasets created separately but that form a coherent collection. That is, if one considers these both as a series in spite of the difference in their creation.

Indeed, @kcoyle . Besides the definition of dataset series, we need to revise also the guidance section, to make it very clear that subsetting is just one of the possible cases of dataset series.

About the section on property value inheritance, I think it would be worth reviewing it to see if some of the discussed properties are specific to the dataset series itself, and not inherited from child dataset. One case is actually dct:accrualPeriodicity, which, based on the discussion above, should not be inherited from child datasets but rather should specify the frequency of update of the dataset series itself (i.e., when a new child dataset is supposed to be added).
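A small sketch of that reading, where the frequency describes how often the series gains a new child dataset. The `ex:` names are invented, `dcat:DatasetSeries` is the class under discussion here, and the EU Publications Office frequency URI is just one possible choice of code list.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

ex:budget a dcat:DatasetSeries ;
    dct:title "Budget data"@en ;
    # Stated on the series itself: a new child dataset is expected every year.
    dct:accrualPeriodicity <http://publications.europa.eu/resource/authority/frequency/ANNUAL> ;
    dct:hasPart ex:budget-2019 , ex:budget-2020 .

# The child datasets do not repeat the series-level frequency.
ex:budget-2019 a dcat:Dataset ;
    dct:isPartOf ex:budget .

ex:budget-2020 a dcat:Dataset ;
    dct:isPartOf ex:budget .
```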

The dataset series use case I am looking at (in the aviation sector) is based on both a theme (features) and time.
Such as aerodromes, vertical obstacles, airspace, etc
There are well-defined release phases (called "AIRAC" cycles) that are released every X days (depending on the State, i.e. the country).
Here are the EU dates: https://www.nm.eurocontrol.int/RAD/common/airac_dates.html
Here are the AU dates: https://www.airservicesaustralia.com/industry-info/aeronautical-information-management/document-amendment-calendar/

There may be releases/amendments in between these dates too.
For our use case, we want to also indicate that the dataset was released "on an AIRAC cycle" as part of the series.
(so we will use dct:accrualMethod with a URI)
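A hedged sketch of how that might look. Every `ex:` URI below, including the one standing for the AIRAC cycle, is a placeholder invented for illustration, not an identifier published by any aviation authority.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

# One release in the aerodromes series, flagged as an on-cycle release.
ex:aerodromes-2020-11 a dcat:Dataset ;
    dct:title "Aerodromes, November 2020 release"@en ;
    dct:isPartOf ex:aerodromes ;
    # Hypothetical URI meaning "released on an AIRAC cycle".
    dct:accrualMethod ex:airacCycle .
```

An amendment released between cycles would simply omit this statement or point to a different method URI.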

I am happy with the definition above proposed by @andrea-perego

@riannella Would you want to specify future dates, or would periodicity suffice? In other words, could "next" indicate a future release?

@kcoyle Probably not at this stage. If we set the future dates as part of the description, then the expectation would be high that we meet it :-)
We would generally have a "system" that allowed customers to be notified of new additions to a dataset series (regardless of the actual dates).

These revisions are now implemented in PR https://github.com/w3c/dxwg/pull/1292

Preview of the relevant sections:

The description of dcat:Dataset should be supplemented by recommendation to use dcterms:isPartOf (or dcat:inSeries) to match the recommendation to use dcterms:hasPart from a series to one of its parts.
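To illustrate the two directions, a sketch with made-up `ex:` resources. `dcat:DatasetSeries` and `dcat:inSeries` are the terms proposed in this thread; whether and how they end up in the specification is an assumption here.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

# Series to parts.
ex:budget a dcat:DatasetSeries ;
    dcterms:hasPart ex:budget-2019 , ex:budget-2020 .

# Part to series, using the generic inverse...
ex:budget-2019 a dcat:Dataset ;
    dcterms:isPartOf ex:budget .

# ...or the more specific property suggested as an alternative.
ex:budget-2020 a dcat:Dataset ;
    dcat:inSeries ex:budget .
```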

In Example 35, can we add a parent dcat:Catalog entity to show the complete hierarchy:

ex:EUCatalogue a dcat:Catalog ;
    dcterms:title "European Data Catalogue"@en ;
    dcterms:hasPart ex:budget , ex:employment , ex:finance ;
  .

added the catalog using dcat:dataset instead of dcterms:hasPart in PR #1328
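For comparison, the catalog from the request above, rewritten with dcat:dataset, might read roughly as follows. This is a reconstruction for illustration, not the text of PR #1328, and the `ex:` prefix is assumed.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

# Same catalog as requested, but listing its entries via dcat:dataset.
ex:EUCatalogue a dcat:Catalog ;
    dcterms:title "European Data Catalogue"@en ;
    dcat:dataset ex:budget , ex:employment , ex:finance .
```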

The dcat:dataset property makes _sense_ when referring to a dcat:Dataset, but not a dcat:DatasetSeries.

In Example 35, it uses dcterms:hasPart to refer to the dcat:Dataset.

So, there are inconsistencies.

What about making the dcat:DatasetSeries a subclass of dcat:Catalog?
I am not really a fan of this, but it seems like we are discussing a lot of parallel ideas.

@init-dcat-ap-de Please scroll up in this thread and you will see that this was already discussed - e.g. https://github.com/w3c/dxwg/issues/1272#issuecomment-784014802

A class DatasetSeries is going to be added by PR #1292. If there are no objections I think we can close this issue as soon as the PR is merged.
