I think we should start the conversation about how to standardise the use of Zstandard with HTTP and REST environments.
Does Zstandard already have an IANA-defined Content-Encoding token? If not, an obvious candidate would be `zstd` itself, for example:
```
GET /api/posts/ HTTP/1.1
Accept: */*
Accept-Encoding: zstd, gzip
…

HTTP/1.1 200 OK
Content-Type: application/json
Content-Encoding: zstd
…
```
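A server answering with `Content-Encoding: zstd` first has to check the request's Accept-Encoding list. A minimal sketch of that negotiation (the function names and the server-side preference order are illustrative, not from any spec):

```python
def parse_accept_encoding(header):
    """Parse an Accept-Encoding value into a {coding: qvalue} dict."""
    codings = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        coding, _, params = part.partition(";")
        q = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        codings[coding.strip().lower()] = q
    return codings

def choose_encoding(header, supported=("zstd", "gzip", "identity")):
    """Pick the first server-preferred coding the client accepts (q > 0).

    Simplification: server preference wins over client q-value ordering.
    """
    accepted = parse_accept_encoding(header)
    for coding in supported:
        if accepted.get(coding, 0.0) > 0.0:
            return coding
    return "identity"

print(choose_encoding("zstd, gzip"))              # zstd
print(choose_encoding("gzip;q=1.0, zstd;q=0.5"))  # zstd (server prefers it)
```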
The end of section 14.11 of RFC 2616 (HTTP/1.1) says:

> Additional information about the encoding parameters MAY be provided by other entity-header fields not defined by this specification.

Dictionaries can qualify as _"additional information about the encoding parameters"_. Ideally, client and server should agree on a pre-defined, built-in dictionary, but it can be useful to change dictionaries on the fly in some scenarios. For those cases, I propose the custom header Zstd-Dict, which points to a URL with the dictionary used to encode the data being sent:
```
GET /api/posts/ HTTP/1.1
Accept: */*
Accept-Encoding: zstd, gzip
…

HTTP/1.1 200 OK
Content-Type: application/json
Content-Encoding: zstd
Zstd-Dict: https://mydomain.com/zdicts/myapi_json_bc1b45d.zdict
…
```
Notice that this header is not prefixed with `X-`, adhering to RFC 6648.
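On the client side, handling the proposed Zstd-Dict header would mostly mean fetching and caching the dictionary before decompressing the body. A rough sketch of that flow; the `ZstdDictClient` name is made up, and the fetch and decompress callables are injected so any zstd binding could be plugged in for the real decompression:

```python
class ZstdDictClient:
    """Caches dictionaries referenced by the proposed Zstd-Dict header."""

    def __init__(self, fetch_url, decompress):
        # fetch_url(url) -> bytes
        # decompress(payload, dict_bytes_or_None) -> bytes
        self._fetch = fetch_url
        self._decompress = decompress
        self._dict_cache = {}  # url -> dictionary bytes

    def get_dictionary(self, url):
        if url not in self._dict_cache:
            self._dict_cache[url] = self._fetch(url)
        return self._dict_cache[url]

    def decode_body(self, headers, body):
        if headers.get("Content-Encoding", "").lower() != "zstd":
            return body
        dict_url = headers.get("Zstd-Dict")
        dictionary = self.get_dictionary(dict_url) if dict_url else None
        return self._decompress(body, dictionary)

# Usage with stand-in callables (a real client would plug in a zstd binding):
fetched = []
def fake_fetch(url):
    fetched.append(url)
    return b"sample-dict"

def fake_decompress(payload, dictionary):
    return payload + (dictionary or b"")

client = ZstdDictClient(fake_fetch, fake_decompress)
headers = {"Content-Encoding": "zstd",
           "Zstd-Dict": "https://mydomain.com/zdicts/myapi_json_bc1b45d.zdict"}
client.decode_body(headers, b"x")
client.decode_body(headers, b"y")
print(len(fetched))  # dictionary fetched only once, then served from cache
```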
Thoughts?
I'd love this, since it perfectly fits into the narrative that Google is following with Brotli.
Zstd-Dict would include some hash or timestamp, to prevent issues from operator error that would result in an updated dictionary at the same URI (copying the wrong file, a bug in the dictionary generator that would reuse previous names, ...).

How would this work in the reverse direction, for POST/PUT?
@KrzysFR
> It would be great if the Zstd-Dict would include some hash or timestamp, to prevent issues from operator error that would result in an updated dictionary at the same URI (copying the wrong file, a bug in the dictionary generator that would reuse previous names, ...).
We can check the zdict URL for ETag and/or Last-Modified.
> What would be the expected behavior if the client cannot retrieve the dictionary? (timeout, 200/500/418, firewall, server offline, web farm where the new dict has not made it to all the hosts yet...)
Should we specify this behaviour?
> How can the client determine that the downloaded dictionary is complete (not corrupted or truncated)?
This looks like a job for `Content-Length` and `Content-MD5` on the zdict URL.
> When inspecting captured HTTP traffic, the capture tool may not be able to retrieve the corresponding dictionary (offline, server is dead, dictionary has long been removed/revoked, ...), which would make inspecting the body impossible.
Yeap! You are right! ...this approach requires better tooling and some guarantees from both sides (especially the backend) to work.
> How would this work in the reverse direction, for POST/PUT?
Having a Zstd-Dict header only makes sense in the same message as a Content-Encoding header and, to be honest, I didn't know if the RFC allowed it on requests. The specification doesn't say so directly, but based on this paragraph it is safe to infer that it is legal:
> An origin server MAY respond with a status code of 415 (Unsupported Media Type) if a representation in the request message has a content coding that is not acceptable.
>
> -- Last paragraph of RFC 7231, Section 3.1.2.2.
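So, for illustration, the reverse direction could look like this (a hypothetical exchange, mirroring the response example above):

```
POST /api/posts/ HTTP/1.1
Content-Type: application/json
Content-Encoding: zstd
Zstd-Dict: https://mydomain.com/zdicts/myapi_json_bc1b45d.zdict
…

HTTP/1.1 201 Created
…
```

The open question is how the client learns, before sending, that the server supports zstd and which dictionaries it accepts.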
> We can check the zdict URL for ETag and/or Last-Modified.
This would work if you have already downloaded the dictionary once, but not for the first time.
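Revalidating a cached dictionary via its ETag could be sketched like this; `http_get` is an injected stand-in for whatever HTTP client is used, and the header handling follows standard conditional-GET semantics:

```python
def revalidate_dict(http_get, url, cached):
    """Fetch or revalidate a dictionary.

    http_get(url, headers) -> (status, response_headers, body)
    cached: None, or a previous (etag, dict_bytes) pair.
    Returns a fresh (etag, dict_bytes) pair.
    """
    headers = {}
    if cached is not None:
        headers["If-None-Match"] = cached[0]
    status, resp_headers, body = http_get(url, headers)
    if status == 304 and cached is not None:
        return cached  # still fresh, no body transferred
    if status == 200:
        return (resp_headers.get("ETag", ""), body)
    raise RuntimeError("cannot fetch dictionary: HTTP %d" % status)

# Usage with a fake transport: 200 on first fetch, 304 on revalidation.
def fake_get(url, headers):
    if headers.get("If-None-Match") == '"v1"':
        return (304, {}, b"")
    return (200, {"ETag": '"v1"'}, b"dict-bytes")

first = revalidate_dict(fake_get, "https://example.test/api.zdict", None)
again = revalidate_dict(fake_get, "https://example.test/api.zdict", first)
print(again == first)  # True
```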
> Should we specify this behaviour?
Someone (app or client) would need to handle this case anyway, so general guidance on how to do it would help implementors avoid falling into this trap.
> This looks like a job for `Content-Length` and `Content-MD5` on the zdict URL.
Yes, but then this means that the download of the dictionary is not a regular GET: the HTTP server must know to include the Content-MD5 header for these files, and the HTTP client must know to verify it.
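Checking a downloaded dictionary against a Content-MD5 header is at least easy to sketch on the client side; per RFC 2616, the header value is the base64 encoding of the binary MD5 digest of the body:

```python
import base64
import hashlib

def verify_content_md5(body, content_md5_header):
    """Return True if body matches the base64 MD5 digest from Content-MD5."""
    digest = hashlib.md5(body).digest()
    return base64.b64encode(digest).decode("ascii") == content_md5_header

payload = b"example dictionary bytes"
header_value = base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")
print(verify_content_md5(payload, header_value))         # True
print(verify_content_md5(payload + b"!", header_value))  # False
```

The server-side burden remains, as noted above: the HTTP server must actually emit the header for these files.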
Dictionaries are supposed to be used for very small files (a few KB or less), which makes me think that this would probably target REST APIs where you GET /SomeEntity/1234 from a single-page or mobile app and want to reduce the size of the JSON (or other payload) downloaded. This means that the dictionary stuff should be specified at the level of the API, not at the level of a generic HTTP client implementation. In that case, it seems to me that specifying these headers (and the behavior expected to validate the dictionaries, manage their lifetime, ...) would make sense, because then it would be part of the API contract.
If this is targeting APIs, then some of the points above become less of an issue (how to validate a dictionary, how to retry if there is an issue, ...), and the client -> server case could also be specified as part of this (the client knows in advance that the server supports zstd with dictionaries because it is part of the API contract, and there would be some endpoint to discover the list of supported dictionary URIs).
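Such a discovery endpoint could be as simple as a JSON listing; a purely hypothetical shape (the field names are invented for illustration):

```json
{
  "dictionaries": [
    {
      "uri": "https://mydomain.com/zdicts/myapi_json_bc1b45d.zdict",
      "applies-to": "application/json"
    }
  ]
}
```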
I think there should be support for Zstd-Dict-less compression as well, for two reasons:
gzip, deflate and zstd is just a matter of calling the right function).

Request for HTTP content encoding has been formally started:
https://tools.ietf.org/id/draft-kucherawy-dispatch-zstd-00.html
Zstandard is now published as __RFC 8478__.
It's also fully registered as an IANA media type.
With these two conditions fulfilled, it's now possible to make progress on the topics above.
The zstd content encoding is now defined and becomes a real option.
It's possible to deliver content in zstd format using this nginx module.
Client-side support is still sparse but exists; wget2 is one example.
We wish to build up support throughout 2019.