Dxwg: dcat:byteSize - check constraints

Created on 15 Feb 2018  ·  22 Comments  ·  Source: w3c/dxwg

In DCAT v1 the property dcat:byteSize is axiomatized as follows:

dcat:byteSize
  rdf:type owl:DatatypeProperty ;
  rdfs:domain dcat:Distribution ;
  rdfs:range rdfs:Literal ;
.
  1. Verify that the range is appropriate and necessary
  2. Verify that the domain is appropriate and necessary (see #110)
  3. Consider whether any guarded constraints (using owl:Restriction) should be introduced (see #105)

All 22 comments

Active discussion related to this on #106

The usage note (skos:scopeNote) says "The literal value of dcat:byteSize should by typed as xsd:decimal".

  1. why 'decimal' and not 'integer'? do we anticipate fractional bytes??!
  2. I propose this be axiomatized more fully:
dcat:byteSize
  rdf:type owl:DatatypeProperty ;
  rdfs:domain dcat:Distribution ;
  rdfs:range xsd:integer ;
.

Note that xsd:integer has no limit on size - it is just xsd:decimal with zero digits after the decimal point. See https://www.w3.org/TR/xmlschema11-2/#integer
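The relationship between the two datatypes can be illustrated with a small Python sketch. It checks strings against simplified versions of the xsd:decimal and xsd:integer lexical patterns (whitespace handling is ignored). The regular expressions and function names are assumptions written for this example, not part of any DCAT tooling.

```python
import re

# Simplified lexical patterns based on the XSD 1.1 datatype definitions.
XSD_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")
XSD_INTEGER = re.compile(r"[+-]?[0-9]+")


def is_xsd_decimal(lexical: str) -> bool:
    """Return True if the string is a valid xsd:decimal lexical form."""
    return XSD_DECIMAL.fullmatch(lexical) is not None


def is_xsd_integer(lexical: str) -> bool:
    """Return True if the string is a valid xsd:integer lexical form."""
    return XSD_INTEGER.fullmatch(lexical) is not None


# Every integer lexical form is also a decimal one, but not the reverse.
# The second sample is 2**80, to show that size is not limited.
for sample in ["20000000", "1208925819614629174706176", "3.5", "10MB"]:
    print(sample, is_xsd_decimal(sample), is_xsd_integer(sample))
```

Shorthands such as "10MB" match neither pattern, which is why they only fit the looser rdfs:Literal range.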

Relaxing the domain constraint seems to be needed now that loosely structured catalogues can be represented using dct:relation to point to individual files, so that the sizes of each of these can be described.

Yes - maybe change domain to proposed dcat:Representation - see #317

It's good to see that dcat:byteSize may be changed from xsd:decimal to xsd:integer, since a distribution cannot have a fractional number of bytes (unless I'm mistaken).

However, since a distribution also cannot have a negative number of bytes, xsd:nonNegativeInteger would be even better / semantically more accurate.

Unfortunately this didn't get any attention during this phase of DCAT revision. While it is a small, logical change, folks always get concerned about backward compatibility.

In fact the current rdfs:range (from DCAT 2014) is rdfs:Literal, with a _recommendation_ to use xsd:decimal, but it also allows shorthands like "3.5k" or "10MB", which are sometimes used as approximations. IMHO those should probably go in a new property, 'approximateSize' or similar, since byteSize is so clearly specific. But I think this has to stay on the backlog until we get the CR/PR/Rec out.

@dr-shorthair, would you like to revive this issue?

Otherwise, I propose we close it.

xsd:decimal does not make sense for byteSize. Should be xsd:positiveInteger or at least xsd:integer.

@dr-shorthair said:

xsd:decimal does not make sense for byteSize. Should be xsd:positiveInteger or at least xsd:integer.

No particular concern from my side.

If we are going to change it, then probably we should go for xsd:positiveInteger or xsd:nonNegativeInteger.

In principle, I agree with specializing the range of dcat:byteSize. It makes the semantics of dcat:byteSize more consistent and consequently improves overall metadata quality.

However, this change might place a practical burden on existing implementations. If we restrict the range of dcat:byteSize, we force people to correct metadata in which real numbers or abbreviations are used.

I wonder how much impact this change would have on people managing tons of metadata.
Probably, @dr-shorthair or @andrea-perego have already considered this aspect.
Do we think the gain in quality outweighs the extra work required of metadata managers?
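One way to estimate that impact would be to tally how many existing byteSize values already conform to a narrower range. The Python sketch below is only an illustration: the category names, the `classify` and `tally` helpers, and the sample values are all hypothetical, and the patterns are simplified versions of the XSD lexical forms.

```python
import re
from collections import Counter

# Lexical pattern for xsd:nonNegativeInteger: digits with an optional "+",
# or a zero written with a leading "-".
NON_NEGATIVE_INTEGER = re.compile(r"\+?[0-9]+|-0+")
# Simplified lexical pattern for xsd:decimal.
DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def classify(lexical: str) -> str:
    """Place one byteSize value into a (hypothetical) review category."""
    if NON_NEGATIVE_INTEGER.fullmatch(lexical):
        return "conforming"
    if DECIMAL.fullmatch(lexical):
        return "other-decimal"  # negative, or written with a fraction part
    return "non-numeric"  # shorthands such as "3.5k" or "10MB"


def tally(values):
    """Count how many values fall into each category."""
    return Counter(classify(v) for v in values)


# Hypothetical sample of byteSize values harvested from a catalogue.
sample = ["20000000", "0", "1024.0", "-1", "3.5k", "10MB"]
print(tally(sample))
```

Only the values outside the "conforming" category would need attention if the narrower range were enforced.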

@riccardoAlbertoni ,

As we are not going to change the range in the RDF definition of this property, I see this revision just as an indication of the preferred range of values to be used (i.e., use integers >0).

It would, however, be crucial to get some feedback from implementers, but I think we can go ahead with this revision and roll it back afterwards if need be.

Thanks, @andrea-perego .
You are right, my worries are not justified! ... reading the bare HTML, I had missed we were not changing the range.

If we are going to change it, then probably we should go for xsd:positiveInteger or xsd:nonNegativeInteger.

xsd:nonNegativeInteger contains '0' while xsd:positiveInteger doesn't.

A usage scenario in which dcat:byteSize is equal to 0 doesn't come to my mind. Is there an exemplary use case to refer to?

Otherwise, considering it is an indication rather than an actual constraint (as we are not changing the actual range of the property), I think we can choose the most restrictive option, 'xsd:positiveInteger'.

A usage scenario in which dcat:byteSize is equal to 0 doesn't come to my mind. Is there an exemplary use case to refer to?

@riccardoAlbertoni I thought about such examples in the past when I suggested using xsd:nonNegativeInteger for dcat:byteSize.

Use case 1: An automated process creates distributions for datasets by opening files and writing data to them. Due to a bug in the distribution creation process, some files are opened but no data gets written to them. A SPARQL query can be written to check for distributions that are suspicious / probably incorrect. The criteria for finding such suspicious dataset distributions can include the SPARQL filter FILTER(?byteSize = 0).

Use case 2: A logging process writes a line to a log file for each visitor that visits a particular web page on a particular day. On Saturday the web page has six visitors, so the log file for that day contains six lines. On Sunday nobody visits that same web page, so the file for Sunday contains 0 lines (and therefore 0 bytes). Notice that the empty file is not a bug in this case: it communicates that the web page has 0 visitors. This empty file could be considered to encode legitimate information.
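Both use cases can be mimicked in a few lines of Python. This is a sketch only: the `Distribution` class, the example URIs, and the byte counts are hypothetical stand-ins, and `find_empty` is a plain-Python analogue of the SPARQL filter rather than an actual query.

```python
from dataclasses import dataclass


@dataclass
class Distribution:
    """Hypothetical stand-in for a dcat:Distribution record."""
    uri: str
    byte_size: int


def satisfies_non_negative_integer(value: int) -> bool:
    """Value-space check corresponding to xsd:nonNegativeInteger."""
    return value >= 0


def satisfies_positive_integer(value: int) -> bool:
    """Value-space check corresponding to xsd:positiveInteger."""
    return value > 0


def find_empty(distributions):
    """Plain-Python analogue of FILTER(?byteSize = 0)."""
    return [d for d in distributions if d.byte_size == 0]


# Hypothetical daily log files; the Sunday file is legitimately empty.
logs = [
    Distribution("http://example.org/log/saturday", 312),
    Distribution("http://example.org/log/sunday", 0),
]
for d in find_empty(logs):
    print(d.uri,
          satisfies_non_negative_integer(d.byte_size),
          satisfies_positive_integer(d.byte_size))
```

A size of 0 can be recorded and queried under xsd:nonNegativeInteger, whereas xsd:positiveInteger would rule the value out before any such query could find it.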

Aren't those use-cases actually a justification for why the constraint makes sense? It would allow the OWL reasoner to pick up the anomalous instances.

@dr-shorthair I'm not 100% sure what you mean by "constraint". (I would need to read the full sequence of comments, but can't at the moment :-( ) My intention was to show that there are legitimate use cases for having triples such as [] dcat:byteSize "0"^^xsd:nonNegativeInteger. as part of a DCAT metadata record.

This implies that I believe that [1] is too strict and that [2] is the right choice for standardization:

[1] dcat:byteSize rdfs:range xsd:positiveInteger.
[2] dcat:byteSize rdfs:range xsd:nonNegativeInteger.

This issue was discussed during the last DCAT call (https://www.w3.org/2021/04/14-dxwgdcat-minutes). Our understanding is that there is a consensus on opting for xsd:nonNegativeInteger. PR https://github.com/w3c/dxwg/pull/1326 has been revised accordingly and merged.

Please let us know if our interpretation is incorrect. Otherwise, we can close this issue.

@wouterbeek what I mean is that if there is an owl:Restriction on the type of dcat:byteSize, and some data has a value that conflicts with the type, then an OWL reasoner will throw an exception. That is, you can rely on the OWL layer to trap the fact that there is a problem with the data, rather than having to write a rule in a separate layer. If it is illogical to have a dcat:Distribution whose size is non-positive (including 0) then this should be stated in the ontology. The rest of the system should then be configured to respond in an appropriate way.

@dr-shorthair Thanks for clarifying; I agree with you.

@dr-shorthair are you suggesting changing the range of dcat:byteSize instead of mentioning xsd:nonNegativeInteger only as an indication in the REC (as in https://w3c.github.io/dxwg/dcat/#Property:distribution_size)?
In this case, could you comment on the impact this can have on existing implementations? See my comment below.

In principle, I agree with specializing the range of dcat:byteSize. It makes the semantics of dcat:byteSize more consistent and consequently improves overall metadata quality.

However, this change might place a practical burden on existing implementations. If we restrict the range of dcat:byteSize, we force people to correct metadata in which real numbers or abbreviations are used.

I wonder how much impact this change would have on people managing tons of metadata.
Probably, @dr-shorthair or @andrea-perego have already considered this aspect.
Do we think the gain in quality outweighs the extra work required of metadata managers?

I have the same worry as @riccardoAlbertoni on this. As always, my opinion is that changes should only be made when the existing approach has proved not to work. I don't see that here. In this case, it is very conceivable that all implementations that use byteSize will have used values that are non-negative integers, and such integers are perfectly valid values for xsd:decimal. So I don't see a problem that needs to be solved. The change of datatype would only be for reasons of elegance, and it would break existing implementations that have correctly asserted things like dcat:byteSize "20000000"^^xsd:decimal and would now be in error.
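If existing xsd:decimal literals did have to be retyped, the conversion could in principle be mechanical for whole, non-negative values. The following Python sketch shows one possible approach; the function name is hypothetical, the lexical pattern is a simplified version of the XSD one, and nothing here reflects how any existing DCAT implementation handles the change.

```python
import re
from decimal import Decimal
from typing import Optional

# Simplified lexical pattern for xsd:decimal.
DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def retype_as_non_negative_integer(lexical: str) -> Optional[str]:
    """Return an xsd:nonNegativeInteger lexical form for a decimal literal.

    Returns None when the input is not a valid decimal, is negative,
    or has a non-zero fraction part, so that it can be reviewed by hand.
    """
    if DECIMAL.fullmatch(lexical) is None:
        return None
    value = Decimal(lexical)
    if value < 0 or value != value.to_integral_value():
        return None
    return str(int(value))


for sample in ["20000000", "20000000.0", "3.5", "-1", "10MB"]:
    print(sample, retype_as_non_negative_integer(sample))
```

Values for which the function returns None would still need a manual decision, which is the part of the burden that cannot be automated.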

The revision implemented via PR https://github.com/w3c/dxwg/pull/1326 is now included in DCAT3 2PWD.

Unless there are any objections, I propose we close this issue.
