Dxwg: http or https in dcat:mediaType?

Created on 15 Jan 2020 · 13 Comments · Source: w3c/dxwg

Currently, both http and https based IRIs are used in DCAT 2 examples of items from IANA registries, which is definitely not good for interoperability (they are simply different IRIs).

In DCAT 2014, the http variant was used (probably because IANA was not running on HTTPS).
Unfortunately, IANA does not publish the registries in RDF, so it is not clear which IRIs are more correct and the only IRIs available to use are those of the XML files/HTML pages representing the registry entries. Currently, both the http and https variants work, even though upgrade-insecure-requests makes web browsers switch to https.

On one hand, I am all for https. However, this is backward incompatible. Either way, the examples in DCAT 2 should be made consistent and a recommendation in one way or the other should be made to ensure interoperability. Leaving both options open unnecessarily increases the complexity of applications working with DCAT data.

  • Example for https: https://www.iana.org/assignments/media-types/application/ld+json
  • Example for http: http://www.iana.org/assignments/media-types/text/csv
dcat feedback future-work

All 13 comments

I'd vote for http, mostly for backward compatibility.

I prefer https. Can the https and http-URLs become related by sameAs?

Personally, I think we should bite the bullet once.

Many organisations impose rules allowing only https and software implementations are starting to block access to http, so we should consider moving away from http.

At the same time, we can impose the rule that the protocol in a URI is not a semantically distinguishing factor, that is, that http and https resources coincide by definition.

In my opinion, this issue is a problem inherent in the mixing of identification and resolution in http URIs. Apparently, http URIs turn out not to be really persistent identifiers!
A possible solution would indeed be to ignore the protocol part, but I don't think we can decide to do that in our corner of the Web -- it should then be raised as an issue more widely at W3C.
In the meantime, we will have to live with duplicate identifiers. Maintainers of vocabularies can recommend using https URIs for their terms, but it will be hard to get people to convert their 'legacy' data that contains references to http URIs, so those maintainers are obliged to continue serving http URIs to honour their commitment to persistence.

I am all for having a broader guideline on this issue.

Note that this is due to the fact that RDF/Semantic Web binds the notion of identifier to its string serialization. Since http and https URLs differ in their serialization, they are treated as different identities.

This is probably unavoidable in general, but a common approach to the issue would be welcome.

I think the best practice is to proxy all http traffic over to the https equivalents, not to block http. That way, the http one still resolves but doesn't have to be maintained separately. I think we should use the http urls only when the https forms don't work (if that is even the case anywhere). Here, the interoperability issue is on the iana side, and they seem to already be doing the right thing.

+1 to Annette's approach
This is easy with e.g. nginx
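
For readers who want to see what "proxying http over to https" amounts to, here is a minimal sketch in Python rather than nginx. It answers every plain-http GET or HEAD with a permanent redirect to the same host and path on https. The names `https_location`, `RedirectToHttps` and `serve`, and the port 8080, are made up for this illustration, and the host handling assumes ordinary host names (no IPv6 literals). It only shows the idea and is not how IANA or any real server is configured.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def https_location(host_header: str, path: str) -> str:
    """Build the https URL that a plain-http request should be redirected to."""
    # Drop an explicit port such as ":80"; assumes no IPv6 literal in the header.
    host = host_header.split(":", 1)[0]
    return f"https://{host}{path}"


class RedirectToHttps(BaseHTTPRequestHandler):
    """Hypothetical handler: reply 301 with the https equivalent of the request."""

    def do_GET(self):
        target = https_location(self.headers.get("Host", "localhost"), self.path)
        self.send_response(301)
        self.send_header("Location", target)
        self.end_headers()

    # HEAD requests get the same redirect, without a body.
    do_HEAD = do_GET


def serve(port: int = 8080) -> None:
    """Run the redirecting server until interrupted (port is an arbitrary choice)."""
    HTTPServer(("", port), RedirectToHttps).serve_forever()
```

Calling `serve()` starts the redirector. Note that this only affects dereferencing: the http and https strings remain two distinct IRIs in any RDF data that uses them.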

@agreiner @pwin Actually the issue is not with the implementation of dereferencing (nginx, redirects, etc.) but with the fact that the two IRIs (http and https) are different identifiers, which worsens interoperability. Every application working with DCAT will have to know that media types (used in multiple properties) sometimes use http and sometimes https identifiers, and will have to take this into account when checking for identity.

Saying that "we can ignore the scheme in URL" is IMHO not the right way to go, because it breaks the concept of URIs.

I see only three options, where each has to be clearly stated in the documentation, since IANA has no guidance on this and DCAT is the specification saying that the IANA URIs are to be used:

  1. Use HTTPS and break backward compatibility
  2. Use HTTP - in this case, we need to say that the URI you get from the web browser when viewing the actual Media type document is not the one to use (because it is HTTPS)
  3. Allow usage of both. Then we need to say this explicitly, with the consequences for worse interoperability (example above) in mind.
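
To make the cost of option 3 concrete, here is a rough sketch in Python of the extra step every consuming application would need before it could compare two media type IRIs. The helper names `canonical_media_type_iri` and `same_media_type` are hypothetical, and the choice of `http` as the canonical scheme is an arbitrary assumption for the example, not a decision of the group. It does not endorse ignoring the scheme in general. It only rewrites IRIs under the IANA media types path and leaves everything else untouched.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed location of the IANA media type pages, taken from the examples in this issue.
IANA_HOST = "www.iana.org"
IANA_PREFIX = "/assignments/media-types/"


def canonical_media_type_iri(iri: str, scheme: str = "http") -> str:
    """Force the scheme of an IANA media type IRI; return other IRIs unchanged."""
    parts = urlsplit(iri)
    is_iana_media_type = (
        parts.scheme in ("http", "https")
        and parts.netloc.lower() == IANA_HOST
        and parts.path.startswith(IANA_PREFIX)
    )
    if not is_iana_media_type:
        return iri
    return urlunsplit((scheme, IANA_HOST, parts.path, parts.query, parts.fragment))


def same_media_type(a: str, b: str) -> bool:
    """Compare two media type IRIs while tolerating the http/https difference."""
    return canonical_media_type_iri(a) == canonical_media_type_iri(b)


http_csv = "http://www.iana.org/assignments/media-types/text/csv"
https_csv = "https://www.iana.org/assignments/media-types/text/csv"

# As plain identifiers the two strings differ; only the special-case helper equates them.
plain_equal = http_csv == https_csv
tolerant_equal = same_media_type(http_csv, https_csv)
```

With options 1 or 2, a plain string comparison would be enough and none of this special-casing would be needed.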

And in any case, I would suggest that the group (or W3C) contacts IANA about this.

I am with @jakubklimek on this one. Ignoring the protocol is not how identifiers are supposed to work -- an identifier is a string of characters, and a different string is a different identifier. I note that W3C has a policy that says "_The actual namespace will continue to use HTTP, even if it is also served through HTTPS_".
The problem with IANA is that the references to their media types are not really identifiers, they're just URLs of pages on their website, and IANA, as far as I know, have no published persistence policy -- they can rearrange their site as they please and no-one can complain.

Maybe the best path forward would be to ask IANA how they would like people to refer to their media types in RDF? That way at least we'd find out if they have a policy.

Is there any feedback from IANA ?

As we are moving to FPWD for DCAT3, this inconsistency should be fixed. Based on the discussion, the preference seems to lean towards the http: URI scheme.

I created a PR implementing this revision: https://github.com/w3c/dxwg/pull/1261

@akuckartz said:

Is there any feedback from IANA ?

Not to my knowledge.

Anyway, as I said in my previous comment, this inconsistency needs fixing. In case we get feedback from IANA, or the group eventually decides to go for https:, we can implement the corresponding revision in a future WD.
