Couchdb: Streaming API for attachment data

Created on 7 Aug 2018 · 5 Comments · Source: apache/couchdb

@nolanlawson:

It would be nice to have a more efficient method of replicating attachments to/from Couch. Currently we use multipart for uploads and GET /db/doc/att for downloading (see pouchdb/pouchdb#3964 (comment) for why). It'd be nice to be able to stream and restart attachment requests.
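To make "restart" concrete, here is a minimal Python sketch of resuming a partially downloaded attachment with a standard HTTP `Range` request. It is not an existing CouchDB or PouchDB API: the helper names (`range_header`, `parse_content_range`, `resume_download`) are hypothetical, and it assumes the server honors `Range` for the attachment URL. If the server answers with a plain 200 instead, the sketch falls back to a full download.

```python
import os
import re
import urllib.error
import urllib.request


def range_header(offset: int) -> dict:
    """Build request headers asking for bytes from `offset` to the end."""
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def parse_content_range(value: str):
    """Parse 'bytes 100-199/1000' into (start, end, total); total is None for '*'."""
    m = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if m is None:
        raise ValueError(f"unrecognized Content-Range: {value!r}")
    start, end, total = m.groups()
    return int(start), int(end), None if total == "*" else int(total)


def resume_download(url: str, dest: str, chunk_size: int = 64 * 1024) -> int:
    """Fetch whatever part of `url` is missing from `dest`; return the file size."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url, headers=range_header(offset))
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as err:
        if err.code == 416 and offset > 0:
            # Range not satisfiable: assume the local copy is already complete.
            return offset
        raise
    with resp:
        if resp.status == 206:
            # The server honored the range; check it starts where we left off.
            start, _, _ = parse_content_range(resp.headers.get("Content-Range", ""))
            if start != offset:
                raise ValueError(f"server resumed at {start}, expected {offset}")
            mode = "ab"
        else:
            # The server sent the whole body, so overwrite the partial file.
            mode = "wb"
        with open(dest, mode) as out:
            while chunk := resp.read(chunk_size):
                out.write(chunk)
    return os.path.getsize(dest)
```

A call might look like `resume_download("http://localhost:5984/db/doc/att", "att.bin")`, where the host, port and file name are placeholders. This only covers the download side; a restartable upload would need server-side support that this issue is asking for.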

Emerging browser spec for background uploads/downloads: https://github.com/WICG/background-fetch

/cc @daleharvey @janl

api performance roadmap

wohali

👍6 ❤1

Most helpful comment

@cluxter Right now, large attachments (>16MB of attachments per JSON document) aren't a first order design scenario for CouchDB internal storage or so-called "internal replication" between nodes in a cluster. That needs to be resolved before thinking about any sort of "external" replication enhancements that specifically address large files.

The people who get to make that decision are the people who actually develop CouchDB. If you're an Erlang developer and think you have the chops to tackle this, we'd love to see your patches.

wohali on 8 Oct 2018

👍3

All 5 comments

In an ideal situation, I would like to be able to:
1) upload attachments of unlimited size, i.e. limited only by the file system, not by the CouchDB storage system (so nothing like this: https://github.com/apache/couchdb/pull/1253 )
2) have smooth replication of these attachments between CouchDB instances, i.e. replicating huge attachments won't clog up CouchDB in any way (which doesn't mean the replication wouldn't be slowed down, obviously; we don't have unlimited bandwidth).

This desire implies that:
1) being able to store huge attachments in a database is not seen as bad practice. I'm certain some people will come along and say "Hey, storing files of thousands of gigabytes in a database is silly; it means your storage design is wrong, so go fix that instead of using CouchDB as a file system". Well, in 10 or 15 years, files of hundreds of gigabytes might be normal for some activities, and I would like CouchDB to scale by design, not just because of the hardware that becomes available over time. The idea here is _not_ to use CouchDB as a file system, but to have one place in which _all_ the data of a software system can fit. I don't like the idea of having to use one storage system for small files (CouchDB) and another for big files, especially when the size limit is arbitrary and depends on the available bandwidth/CPU (or some vague notion). Basically, putting a maximum size limit on attachments means that we don't want to deal with this issue and that we leave it for another system to fix. Or worse: we make people believe that they can use attachments but... not really, actually.
2) we need a strong, resilient and reliable replication system which can operate under bad conditions. This would align with the strong resiliency CouchDB already offers with regard to unexpected shutdowns. My instinct tells me that a P2P system similar to Kazaa/eMule/BitTorrent (I'm looking at the multi-source P2P paradigm, not the protocols per se) would be ideal because it's fast, efficient and resilient. But maybe this is not well suited to CouchDB. Or maybe we are using this already (not what I have understood so far, though). I'm pretty sure this would require _a lot_ of work, but I would at least like to know that it's somewhere on the long-term roadmap.

Now, this is a personal vision of what CouchDB should look like, and maybe it is not shared by many other people. Or maybe it is. Please don't hesitate to (respectfully and constructively) criticize my views and argue against them; I'm eager to learn more about why this should or should not be done.

cluxter on 8 Oct 2018

👍1

The people who get to make that decision are the people who actually develop CouchDB. If you're an Erlang developer and think you have the chops to tackle this, we'd love to see your patches.

wohali on 8 Oct 2018

👍3

Is there any progress on this?
Or any roadmap in this direction?

I couldn't agree more with these requirements.
In today's world, streaming should be a first-class feature of any database.