Couchdb: Add support for MessagePack, or other compact/binary format

Created on 24 Sep 2019 · 8 Comments · Source: apache/couchdb

JSON is great for interoperability, but not so much for efficient network communication.

I would love to see a future version of CouchDB that supports a more compact data streaming format, such as MessagePack. (Ideally in both CouchDB and PouchDB, as PouchDB is really where it would be the most useful--but it needs CouchDB support, first).

I haven't looked deeply into it, but I expect MessagePack would be a good choice, as it can be converted fairly cleanly 1:1 to/from JSON.

Desired Behaviour

This should not, in any way, replace JSON as the primary content type used by CouchDB. I see it as an optional format that clients select through the Accept: and Content-Type: headers. The conversion should be trivially handled in a middleware, so it shouldn't need to affect any core logic, and would make for an easy prototype implementation entirely outside of CouchDB.

So, for incoming requests (POSTs, PUTs, etc.):

If the Content-Type header matches MessagePack, simply convert to JSON, then pass along to normal processing.

If the Accept: header matches MessagePack, perform the normal operation, then before streaming the response, convert from JSON to MessagePack.
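The two rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the MIME type `application/msgpack` is a placeholder (no MIME type is settled on in this proposal), the function names are hypothetical, and the actual MessagePack codec is passed in as `pack`/`unpack` so the sketch itself depends on nothing beyond the standard library.

```python
import json
from typing import Any, Callable, Optional, Tuple

# Assumption: placeholder MIME type, not an agreed-upon value.
MSGPACK_MIME = "application/msgpack"
JSON_MIME = "application/json"


def _mime(header_value: Optional[str]) -> str:
    """Drop parameters such as '; charset=utf-8' or ';q=0.5' and lowercase."""
    return (header_value or "").split(";", 1)[0].strip().lower()


def accepts_msgpack(accept_header: Optional[str]) -> bool:
    """True if any entry in the Accept header names the MessagePack type."""
    return any(_mime(part) == MSGPACK_MIME
               for part in (accept_header or "").split(","))


def request_to_json(content_type: Optional[str], body: bytes,
                    unpack: Callable[[bytes], Any]) -> Tuple[Optional[str], bytes]:
    """Convert a MessagePack request body to JSON; pass anything else through."""
    if _mime(content_type) != MSGPACK_MIME:
        return content_type, body
    return JSON_MIME, json.dumps(unpack(body)).encode("utf-8")


def response_from_json(accept_header: Optional[str], json_body: bytes,
                       pack: Callable[[Any], bytes]) -> Tuple[str, bytes]:
    """Convert a complete JSON response body to MessagePack if the client asked."""
    if not accepts_msgpack(accept_header):
        return JSON_MIME, json_body
    return MSGPACK_MIME, pack(json.loads(json_body))
```

With the third-party `msgpack` package installed, `msgpack.unpackb` and `msgpack.packb` could be passed as `unpack` and `pack`. Note that this sketch buffers the whole body before converting, so it does not address chunked responses.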

Additional context

There may be some corner cases around data conversion that need addressing to ensure 100% compatibility, particularly around large numbers.
There's some confusion about which MIME type to use for MessagePack. If this proposal and format are ultimately selected, a small amount of bike-shedding will need to be done.
There may be other data formats to consider, as well. I'd love to hear other suggestions. (Although, IMO, it should be an existing, open standard, rather than inventing our own.)
BSON might be a candidate, but is not as attractive, at first blush, as it is both larger than MessagePack and not as easily converted 1:1 to/from JSON.
As a first step, writing a stand-alone CouchDB proxy and a PouchDB plugin would probably be worthwhile, to allow testing the functionality for compatibility and performance. If there is broader interest, I may try to work on this in time.
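On the large-number corner case mentioned above, a compatibility test could start with checks like the following. This is a sketch, not an exhaustive list of problems: it assumes the two likely trouble spots are integers that a float64-based JSON parser (such as JavaScript's) cannot represent exactly, and integers outside the 64-bit range that MessagePack's integer types cover. The function names are hypothetical.

```python
# MessagePack integer types are at most 64 bits wide (signed or unsigned).
INT64_MIN = -(2**63)
UINT64_MAX = 2**64 - 1


def survives_float64(n: int) -> bool:
    """True if n is unchanged after a round trip through an IEEE 754 double."""
    try:
        return int(float(n)) == n
    except OverflowError:
        # n is too large to be converted to a double at all.
        return False


def fits_msgpack_int(n: int) -> bool:
    """True if n is within the range of MessagePack's int64/uint64 types."""
    return INT64_MIN <= n <= UINT64_MAX


if __name__ == "__main__":
    for n in (2**53, 2**53 + 1, 2**64 - 1, 2**64):
        print(n, survives_float64(n), fits_msgpack_int(n))
```

A value such as 2**53 + 1 fits comfortably in a MessagePack integer but would be altered by a float64-based JSON parser, so a converted document may not mean the same thing to every client.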

enhancement patches-welcome


flimzy

All 8 comments

Previously rejected by the CouchDB team, although what we explicitly said was only "no BSON". Worth reconsideration, but probably not until after 4.0, given team bandwidth.

wohali on 24 Sep 2019

👍1

Years ago someone on my team hacked up a MessagePack serialization as an experiment. The one place where it ran into problems was our chunked responses, where we start streaming without knowing the full size of the response body ahead of time. MessagePack (and BSON, and many other binary serializations) like to know the size ahead of time so they can pre-allocate appropriately-sized data structures.

kocolosk on 24 Sep 2019

That said, I'm quite happy to see discussion on this front (and the chunked thing is not an insurmountable issue). JSON is nice and easy but definitely has its weaknesses ...

kocolosk on 24 Sep 2019

> The one place where it ran into problems was our chunked responses, where we start streaming without knowing the full size of the response body ahead of time.

Interesting. My understanding was that streaming was supposed to be one of MP's core features/strengths.

flimzy on 25 Sep 2019

@flimzy I think the distinction is between streaming a set of individual objects (like our continuous _changes feed) versus streaming serialization of one really large object (_all_docs or _view or normal _changes feeds). When MessagePack talks about streaming it's referring to the former.
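To make that distinction concrete, here is a small sketch of how a MessagePack array header carries the element count. The header byte values (fixarray, array 16, array 32) come from the MessagePack format; the helper names and the idea of passing rows as pre-encoded bytes are assumptions made for illustration.

```python
import struct
from typing import Iterable, Iterator, List


def array_header(count: int) -> bytes:
    """Encode a MessagePack array header; the element count is part of it."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count <= 15:
        return bytes([0x90 | count])            # fixarray
    if count <= 0xFFFF:
        return b"\xdc" + struct.pack(">H", count)  # array 16
    if count <= 0xFFFFFFFF:
        return b"\xdd" + struct.pack(">I", count)  # array 32
    raise ValueError("too many elements for a MessagePack array")


def as_one_array(encoded_rows: List[bytes]) -> bytes:
    """One large array: len() is needed before the first byte can be written."""
    return array_header(len(encoded_rows)) + b"".join(encoded_rows)


def as_object_stream(encoded_rows: Iterable[bytes]) -> Iterator[bytes]:
    """A sequence of independent objects: each row is sent as soon as it exists."""
    for row in encoded_rows:
        yield row
```

`as_object_stream` works with any iterator, which suits a continuous feed, while `as_one_array` needs the full list (or at least its length) up front, which is the situation a chunked _all_docs or _view response cannot easily satisfy.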

kocolosk on 25 Sep 2019

I recently learned about CBOR, which aims for pure JSON compatibility (unlike MessagePack), so may be a better candidate for this sort of feature.

flimzy on 6 May 2020

👀2

I am interested in developing CouchDB/NoSQL databases for storing hierarchical scientific data, for example imaging data of different binary types and dimensions, with or without compression. I've developed specifications that enable JSON to encode common scientific data structures (http://openjdata.org/, https://github.com/fangq/jdata/blob/master/JData_specification.md#data-annotation-keywords), but to support strongly typed binary data I need to use base64 within JSON, which increases the data file size by ~33%.
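The ~33% figure follows from base64 mapping every 3 input bytes to 4 output characters. A quick check, using an all-zero buffer as a stand-in for real imaging data (the function name is hypothetical):

```python
import base64


def base64_overhead(raw: bytes) -> float:
    """Fractional size increase from base64-encoding raw (0.33 means +33%)."""
    if not raw:
        raise ValueError("raw must not be empty")
    return len(base64.b64encode(raw)) / len(raw) - 1.0


if __name__ == "__main__":
    payload = bytes(3_000_000)  # stand-in for binary data
    print(f"{base64_overhead(payload):.1%}")
```

For inputs whose length is not a multiple of 3, padding makes the overhead slightly higher, which only matters for very small payloads.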

To mitigate this, I extended the UBJSON spec (https://ubjson.org/) to support binary data types; see the Binary JData (BJData) spec:

https://github.com/fangq/bjdata/blob/master/Binary_JData_Specification.md#type_summary

BJData is similar to MessagePack/CBOR, but much simpler to encode/decode, and is also quasi-human-readable, despite being a binary format. I will be very happy to contribute if there is interest in supporting BJData/UBJSON in CouchDB. I currently have a MATLAB and a Python parser/writer (https://github.com/fangq/pybj, based on py-ubjson, which includes both native Python code and C code).

fangq on 26 Sep 2020

😕1

@flimzy I happened upon the CBOR spec for an entirely different reason last week and sat down to read it through. I think there's a lot to like. It handles everything I could think of from my experience working with CouchDB. I like that the authors paid special attention to round-tripping through JSON, and while it's a larger topic I particularly like the approach to extensibility that allows CBOR to handle things like timestamps and arbitrary-precision decimals. I could see a future where CouchDB allowed any JSON, but also supported a particular spec based on CBOR that would enable us to introduce a richer set of datatypes in the database.

@fangq as it turns out, when @rkalla was first working on UBJSON he had several discussions with @davisp and @kxepal from the CouchDB community and floated the idea of having CouchDB adopt UBJSON, first as an encoding on the wire but possibly also as an on-disk representation if the wire format took off. That was almost ten years ago, but I can imagine that a UBJSON encoding would be achievable given that history.

kocolosk on 4 Jan 2021

👀1
