Dxwg: Generalize dcat:byteSize to dcat:size

Created on 9 Aug 2018 · 22 Comments · Source: w3c/dxwg

At the moment, DCAT provides a property to indicate the size of a distribution in bytes (dcat:byteSize). We discussed that this should be generalized to dcat:size with an additional indication of the unit of measurement. For the latter, we would consider an existing ontology (such as UO, QUDT, OM etc).

As per discussions in meeting (https://www.w3.org/2018/07/19-dxwgdcat-minutes.html#x07) and action (https://www.w3.org/2017/dxwg/track/actions/158).

dcat due for closing future-work statistics

agbeltran

👍1

Most helpful comment

@riccardoAlbertoni The issue of granularity/scale -- whether the size is expressed in bytes, kilobytes, megabytes etc -- is really a case of trying to be helpful to people at the expense of efficiency of data. Creating a complex mechanism with an additional class to reduce the number of digits, e.g. from "1000000000000" (bytes) to "1" (terabyte), will actually increase the number of bytes on the wire.
dcat:byteSize "1000000000000" is actually shorter than (inventing some properties) dcat:scaledSize [dcat:scale "TB" ; dcat:number "1"].
The other thing is a potential requirement to express different _types_ of sizes, e.g. number of observations, number of rows in a spreadsheet, number of articles in a legal text etc. If there is a small number of such types, the VOID approach makes sense. If there are a large number of types, a structured approach should be better, which is what Data Cube does with sdmx-attribute:unitMeasure.
In my mind, in DCAT we're just talking about byte size so I don't see the need for a more complex approach.

makxdekkers on 30 Aug 2018

👍3
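
The comparison in the comment above can be checked with a rough sketch. It only measures the two Turtle fragments as plain strings; dcat:scaledSize, dcat:scale and dcat:number are the invented properties from the comment, not DCAT terms, and the variable names are illustrative.

```python
# Two ways of stating "one terabyte" in Turtle, held as plain strings.
# dcat:scaledSize, dcat:scale and dcat:number are the invented properties
# from the comment above; they are not part of DCAT.
flat = 'dcat:byteSize "1000000000000"'
structured = 'dcat:scaledSize [dcat:scale "TB" ; dcat:number "1"]'

# Size of each fragment when encoded as UTF-8, i.e. bytes on the wire.
flat_len = len(flat.encode("utf-8"))
structured_len = len(structured.encode("utf-8"))

print(flat_len, structured_len)
```

Under these assumptions the flat form is the shorter one, even though its literal has thirteen digits.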

All 22 comments

I agree that a discussion on this topic is merited. But I am not sure an additional property is warranted.

Having too many ways to do the same thing has a cost. While a more flexible property makes things easier for the data provider, it creates more work for the consumer. I believe dcat:byteSize is enough, though I suggest that its range should be xsd:positiveInteger (which is a valid OWL 2 datatype, see https://www.w3.org/TR/owl2-quick-reference/#Built-in_Datatypes ); see also #125.

dr-shorthair on 13 Aug 2018

👍1

In fact, DCAT 2014 originally had a property dcat:size that was deprecated:

dcat:size a rdf:Property;
    rdfs:isDefinedBy dcat:;
    rdfs:label "size (Deprecated)";
    rdfs:comment "the size of a distribution. This term has been deprecated";
    rdfs:domain dcat:Distribution;
    owl:deprecated true ;
    rdfs:subPropertyOf dct:extent .

I found some of the old discussions here:

https://lists.w3.org/Archives/Public/public-gld-wg/2012Oct/0117.html

agbeltran on 14 Aug 2018

👍1

Right. The additional point in that post, "stating that the value can be approximate", addresses the lurking issue ('what if I only want to indicate the size in round numbers?').
The usage note https://w3c.github.io/dxwg/dcat/#Property:distribution_size states:

The size in bytes can be approximated when the precise size is not known.

Perhaps this could be clarified with an example or two?

dr-shorthair on 15 Aug 2018

@agbeltran It seems to me that dcat:size is fundamentally different from dcat:byteSize, because it would necessarily have a resource as its range -- as you point out, and as the example in the old discussion shows, it needs a link to a controlled vocabulary for its unit of measurement. As I understand it, dcat:byteSize was defined as a simpler way to express something that people saw as a main requirement at the time. So you can't generalise dcat:byteSize, but would need to define a parallel property for more general cases.
One process question: would we be able to 'un-deprecate' a property in the namespace, or would this need to get a new URI?

makxdekkers on 24 Aug 2018

Thanks @makxdekkers - according to the discussion, I think we would not un-deprecate the property, but keep dcat:byteSize while revising its axioms (see also #125 #110), for example considering the change of range to xsd:positiveInteger as suggested by @dr-shorthair above. On the point raised by this issue, we also need to confirm that keeping byte as the unit provides enough flexibility to describe large distributions (and we need to decide whether to relax the domain).

agbeltran on 30 Aug 2018

Some examples of the use of dcat:byteSize in this SPARQL endpoint with query select * where { ?s a dcat:Dataset. ?d a dcat:Distribution. ?s dcat:distribution ?d. ?d dcat:byteSize ?size. FILTER ( STRLEN(?size) > 10) } LIMIT 100

agbeltran on 30 Aug 2018

@agbeltran Are you now proposing to drop the idea of adding a more general 'size' property, and just revise the axiom (datatype) of dcat:byteSize?
As to changing the datatype from xsd:decimal to xsd:positiveInteger, I wonder if that would break implementations that currently specify a number with ^^xsd:decimal? For example, I see
<dcat:byteSize rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">246629.0</dcat:byteSize> in https://www.govdata.de/ckan/catalog/catalog.rdf. I guess this would become non-conformant if the axiom were changed.
If it's just for elegance, does it make sense to force people to convert their existing data?

makxdekkers on 30 Aug 2018

👍1
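
To make the conformance concern concrete, here is a minimal sketch of how a consumer or publisher might test and convert such values. The regular expressions follow the lexical forms of xsd:decimal and xsd:positiveInteger; the function names are hypothetical and not part of any DCAT tooling.

```python
import re
from decimal import Decimal

# Lexical forms as defined by XML Schema datatypes.
_XSD_DECIMAL = re.compile(r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")
_XSD_POSITIVE_INTEGER = re.compile(r"\+?[0-9]+")


def is_positive_integer_lexical(text: str) -> bool:
    """True if text is a valid xsd:positiveInteger lexical form."""
    return bool(_XSD_POSITIVE_INTEGER.fullmatch(text)) and int(text) > 0


def decimal_to_positive_integer(text: str) -> str:
    """Convert an xsd:decimal lexical form to an xsd:positiveInteger one.

    Raises ValueError if the input is not an xsd:decimal, has a
    non-zero fractional part, or is not greater than zero.
    """
    if not _XSD_DECIMAL.fullmatch(text):
        raise ValueError(f"not an xsd:decimal lexical form: {text!r}")
    value = Decimal(text)
    if value != value.to_integral_value() or value <= 0:
        raise ValueError(f"not a positive whole number: {text!r}")
    return str(int(value))


print(is_positive_integer_lexical("246629.0"))
print(decimal_to_positive_integer("246629.0"))
```

The value from the govdata.de example fails the stricter check as written, but converts without loss because its fractional part is zero.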

Yes, we opened the issue to investigate whether the property dcat:size plus a unit of measurement would provide more flexibility than dcat:byteSize, but we traced back the reasons why dcat:size was deprecated (e.g. it would usually require the use of a blank node, see link above for more info). So, unless you (or others) think it would be necessary, I don't think we need to undeprecate dcat:size.

I wonder, though, whether the current representation is too cumbersome (or has limitations) for dataset distributions that are actually terabytes of data (e.g. multi-dimensional microscopy images can weigh up to several TB each and datasets can be hundreds of TB in total).

agbeltran on 30 Aug 2018
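
As a rough illustration of what such a value looks like, the sketch below renders a dcat:byteSize statement for a size given in whole terabytes. It assumes decimal terabytes (1 TB = 10**12 bytes) and the xsd:decimal range mentioned earlier in the thread; the 300 TB figure and the function name are hypothetical.

```python
# Assumption: decimal (SI) terabytes, i.e. 1 TB = 10**12 bytes.
BYTES_PER_TB = 10**12


def byte_size_literal(terabytes: int) -> str:
    """Render a dcat:byteSize statement for a size in whole terabytes."""
    return f'dcat:byteSize "{terabytes * BYTES_PER_TB}"^^xsd:decimal'


# Hypothetical 300 TB dataset.
literal = byte_size_literal(300)
print(literal)  # dcat:byteSize "300000000000000"^^xsd:decimal
```

The literal is long for a human reader but remains an ordinary fifteen-digit number.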

About changing the datatype, I agree that we should be careful about current implementations. Maybe we can continue that discussion in the specific issue #125

agbeltran on 30 Aug 2018

I just saw the proposal from @riccardoAlbertoni in today's call https://www.w3.org/2018/08/30-dxwgdcat-minutes#x10 to create a new class for size with a number and a unit of measurement. @agbeltran then said that the object would be assigned an IRI.
I think this is not realistic. Who would assign IRIs to "1024 bytes" and to any other number of bytes? In my mind, assigning IRIs to these kinds of things with low reusability does not make sense. Don't forget that to do this right, you would need to resolve an IRI like http://foo.bar/size/bytes/1024. As I wrote in https://github.com/w3c/dxwg/issues/300, minting a URI creates a maintenance commitment.
It's far more likely that it would be done as a blank node. This was precisely why the original dcat:size did not make it into DCAT2014.
Also note that VOID took the simple approach, defining a set of properties for various measures: https://www.w3.org/TR/void/#statistics.

makxdekkers on 30 Aug 2018

👍1

Thanks @makxdekkers - I totally agree with your view. My comment on the call was pointing out that I don't think creating a size object is useful, as it would require assigning an IRI to such an object, which is not really reusable and bears the maintenance costs that you referred to.

agbeltran on 30 Aug 2018

@makxdekkers I am not sure about what is realistic and what is not. My comment in today's call was more a reaction to an emerging proposal to have distinct size properties for every possible unit of measure, which sounds to me like bad modelling, and dangerous in a longer-term perspective.

If the rationale behind this discussion is to make users more comfortable in expressing and reading the size, we have to consider that the names for multiples of bytes will evolve and that the scale to use might be application dependent: if we add the property terabyteSize, sooner or later we might need to add exabyteSize ... etc.

I am not against the use of a blank node in this specific case of an n-ary relation if there is such a dire need to express the size in different units of measure.

However, I tend to agree with you: if we do not want to have blank nodes, and no solutions other than adding new properties with a hard-coded scale/size are on the table, we should replicate the simple approach from VOID, which probably amounts to living with byteSize.

riccardoAlbertoni on 30 Aug 2018

makxdekkers on 30 Aug 2018

👍3

There seem to be four distinct issues with dcat:byteSize as the only option:
1) very large numbers - certainly not human readable practically
2) exact semantics - is this expected to be exact or approximate? what if the resource varies over time and the exact value cannot be predicted
3) cost of computation of exact bytesize
4) difference in values with different encoding choices that may be negotiated for a distribution

what feels to me "reasonable" is to keep byteSize with tighter definition about its expected semantics and introduce a new term with a simple string literal with a microformat

eg dcat:approxSize "23 MB"

such microformats are extremely common, but I haven't had much luck tracking down a standard for such a format, though there are ones for the actual postfix part

There is also confusion over whether K means 1000 or 1024, along with some ISO rules, and there are explicit postfixes (e.g. KB and KiB) for these cases. IMHO this would not matter if approximation is the semantics, though we would still need to be careful about byte-vs-bit (KB vs Kb), which is effectively an order of magnitude.

Here are two major development platforms that explicitly support such formats, without citing standards conformance, but do reference this issue of interpretation.

https://developer.android.com/reference/android/text/format/Formatter#formatFileSize(android.content.Context,%20long)

https://docs.microsoft.com/en-us/windows/desktop/api/shlwapi/nf-shlwapi-strformatbytesizeex

rob-metalinkage on 31 Aug 2018
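
The K = 1000 versus K = 1024 point can be seen in a small formatting sketch. This is not the Android or Windows API linked above, just a hypothetical helper; the unit lists stop at TB/TiB for brevity and the one-decimal rounding is an arbitrary choice.

```python
def format_size(num_bytes: int, binary: bool = False) -> str:
    """Format a byte count with a unit suffix.

    binary=False divides by powers of 1000 (KB, MB, ...);
    binary=True divides by powers of 1024 (KiB, MiB, ...).
    """
    base = 1024 if binary else 1000
    units = ["KiB", "MiB", "GiB", "TiB"] if binary else ["KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    unit = "B"
    for next_unit in units:
        if value < base:
            break  # value already fits the current unit
        value /= base
        unit = next_unit
    return f"{value:.1f} {unit}"


print(format_size(23_000_000))               # 23.0 MB
print(format_size(23_000_000, binary=True))  # 21.9 MiB
```

The same byte count yields visibly different numbers under the two conventions, which matters less if the value is declared approximate.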

@makxdekkers @agbeltran As in #300, I have to say that I do not see the problem in creating an IRI such as https://mycatalog.com/resource/dataset/XXX/distribution/YYY/size as an instance of, probably, https://schema.org/QuantitativeValue.

If I manage https://mycatalog.com/resource/dataset/XXX/distribution/YYY, the additional cost of managing https://mycatalog.com/resource/dataset/XXX/distribution/YYY/size still seems minimal to me - I probably use a generic method of assuring dereference of IRIs, so it does not matter how many IRIs there are.
As to the (re)usability of such IRIs - you never know when someone decides to monitor the size of a given distribution. For them, the IRI would make sense. Nevertheless I admit that the reusability of such an IRI will definitely be lower than that of a dataset.

jakubklimek on 31 Aug 2018

I have a strong preference for using actual values rather than URIs for things like numbers or timestamps. For programming and for human readability, looking up a URI for such a thing strikes me as far more complex than necessary, to the point of being somewhat comical.

agreiner on 4 Sep 2018

👍1

The examples of programmatic formatting of numbers of bytes are the reverse of what I would call programmatic support of the suggested microformats: they take a long and turn it into a string with a convenient number and unit, whereas support of the suggested microformats would require a function to read the particular microformat and return the long. Still, I don't think it's too much to ask of a programmer to write such a thing, if we can specify the microformat. I would not worry about KiB etc., as they can be converted to KB etc. and are rarely used.

agreiner on 4 Sep 2018
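
A reader for such a microformat could look roughly like the sketch below. The grammar ("<number> <unit>", e.g. "23 MB") is an assumption, since the thread does not specify one, and parse_size and its unit table are hypothetical names. Unit matching is deliberately case-sensitive so that a bit suffix such as "Mb" is rejected rather than read as bytes.

```python
import re
from decimal import Decimal

# Assumed grammar: "<number> <unit>", e.g. "23 MB" or "1.5 GiB".
_SIZE = re.compile(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*")

# Byte units only; decimal and binary multiples are kept distinct.
_MULTIPLIERS = {
    "B": 1,
    "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4,
}


def parse_size(text: str) -> int:
    """Parse a size string such as "23 MB" into a number of bytes."""
    match = _SIZE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised size string: {text!r}")
    number, unit = match.groups()
    if unit not in _MULTIPLIERS:
        raise ValueError(f"unknown or unsupported unit: {unit!r}")
    # Decimal avoids float rounding for inputs like "1.5".
    return round(Decimal(number) * _MULTIPLIERS[unit])


print(parse_size("23 MB"))    # 23000000
print(parse_size("1.5 GiB"))  # 1610612736
```

Converting KiB-style values to KB-style ones, as suggested above, would then be a matter of parsing with one table and formatting with the other.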

Any reason not to relate dcat:byteSize to qudt:bytes (from http://qudt.org/2.0/schema/SCHEMA_QUDT-DATATYPES-v2.0.ttl - no domain, range is xsd:integer)? Then people can use a single, named property (simple case) but those wanting more detail can apply QUDT qualifiers like qudt:Mega or qudt:Mebi (http://qudt.org/2.0/vocab/VOCAB_QUDT-UNITS-BASE-v2.0.ttl) if desired.

So even for the simple, single-property-including-units case, you relate to a comprehensive ontology for complex cases.

There is also qudt:bits.

I can't see anything in QUDT about approximate values, but perhaps there is something.

nicholascar on 5 Sep 2018

@nicholascar What would be the advantage of including a relationship between dcat:byteSize and qudt:bytes?
One reason maybe not to rely on QUDT is that it is developed by an organisation that does not seem to be a formal standards organisation. Their website does not say anything about their processes other than stating that the Board of Directors have the power of approval, but there is no visible community beyond that board.
Just as a minor comment, I browsed through the QUDT specification and could not find a definition of the semantic meaning of qudt:bytes: clicking on the link in http://www.qudt.org/doc/2017/DOC_SCHEMA-QUDT-DATATYPES-v2.0.html just tells you it is an owl:DatatypeProperty. Now it might be obvious -- "the number of bytes in the described resource" -- but I think it would be good practice to actually say that somewhere.

makxdekkers on 5 Sep 2018

This is clearly an area that has the potential for revision as part of future work beyond DCAT 2. As well as dcat:byteSize, there is the adjacent area of statistics for datasets as a whole (#84), which could pick up other "dimensions" (such as the number of entities in some logical view of the dataset) beyond the size of the physical representation.

Tagging for future work, and moving to future milestone (alongside #84)

davebrowning on 25 Sep 2019

There has been no further discussion on this issue since 2018, and in the end DCAT 2 did not include a dcat:size property.

@agbeltran , do you think we can close it?

andrea-perego on 13 Mar 2021

Noting no objections, I'm closing this issue.

andrea-perego on 20 Mar 2021
