Postgrest: Streaming results

Created on 2 Sep 2015  路  23Comments  路  Source: PostgREST/postgrest

Hi there,

we have a Java API which in principle does the same thing as postgrest, returns results from a query as JSON or text/csv. We'd like to move away from our existing API and use postgrest instead.

I've been doing comparisons against our current API vs postgrest and I'm seeing the performance of postgrest for our needs is around 5x slower than our existing API.

The calls to our API in some cases return millions of rows and our current API streams this data back rather than a download. Running some wgets against postgrest it looks like postgrest returns the full result set to the user rather than streaming, am I correct?

If my assumption is correct, would it be possible to add in a feature to stream the results back?

All 23 comments

Thanks for running the benchmark. The client can request partial results in a range, but optimization is needed there.

Here is some relevant discussion
https://github.com/begriffs/postgrest/issues/254
https://github.com/begriffs/postgrest/issues/273

Can you tell me more about how your java code works and what makes it fast? Are you saying the client opens a web socket to the server and waits for updates and the server uses a postgres cursor to crawl through the database?

@begriffs @gmiddleton: I understood this to imply that postgrest sends the entire result at once and that the java code streams it to the socket as it goes? I realize at the implementation level this probably wasn't output buffering intentionally, however, that may be how it's presenting itself to @gmiddleton.

It would be awesome to be able to listen to a table (does the java code do that ?) and stream new content as it goes. That's exactly how I would like to use postgrest actually so I can test this.

I'm working on something that uses logical decoding to do this but it'd be Node.js based and send just what's updated in the form of diffs or the entire object for newly created objects and ones that do not exist in local storage or memory on the client.

I consider this to be completely separate from an HTTP REST API. (I do however believe that you should be issuing REST like requests over websockets since you already have a connection open.) This way you can use the same deflate compression window and benefit from less HTTP headers (cookies) being sent over the wire, lower latency, and reduced gzip and SSL overhead on the server.

Not to discourage anyone from adding streaming to postgrest, it's a killer feature, I just think without abusing triggers and notify/listen or supporting logical decoding it's a stretch. @begriffs offering all of postgrest over websockets and implementing a logical decoding listener isn't too bad of an idea, deflate compression takes up a good deal of ram though depending on your window size.

The only thing is you'd have to enforce RLS on the diffs... which would be an added layer of complexity. On my end I'm locking that down elsewhere.

It seems to me that two different features are being conflated in this discussion.

The first one, as suggested by @gmiddleton is that result of a query should be streamed to the client as it is being received from the database. This will only be noticeable when the result set is very big.

The second one is listening to changes on a table and sending the diffs over the wire.

We should probably focus on the first one and open a new issue for the other.

Feel free to correct me if I misunderstood anything.

We have something like this:

Statement st = conn.createStatement();
st.setFetchSize(50);
ResultSet rs = st.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
System.out.print("a row was returned.");
}
rs.close();

the results of each row are passed in to an output stream which are handled by

javax.ws.rs.core.StreamingOutput

A simple example of this can be found at http://wpcertification.blogspot.se/2011/11/returning-binary-file-from-rest-service.html

however, we're not streaming files, we're streaming rows from PostGres in to the OutputStream.

When returning millions of rows, we're returning text/csv, not JSON.

@PierreR no we're not listening for changes. It's simply returning results from queries.

@joaoportela Hm, you said what I was trying to say in much fewer words, that's what I got out of this too :-)

@joaoportela @gmiddleton @jmealo Thanks for your comment. My bad for hijacking this issue.

It looks like what I want is already on the list of the future features. Watching endpoint changes with sockets and Postgres pubsub.

So let's keep this issue be about query should be streamed to the client as it is being received from the database.

@begriffs I think this may be helpful (unless the buffering is happening on the on the HTTP end):
http://hackage.haskell.org/package/pgstream-0.1.0.3/docs/Database-PostgreSQL-Stream.html

This would certainly be super useful for us for text/csv results. I am not comfortable with haskell but if you need any more information on our implementation in Java let me know. Do you think this would be a worthy feature?

I want to add to the discussion that this could also be useful for JSON parsers accepting line delimeted JSON.

Further reading:

I've done some testing on this with ~10,000 rows and the tests show no significant slowdown and PostgREST automatically doing some streaming. @gmiddleton could we have more details about your tests?

@calebmer the test was to simple pull back some columns from a single table in csv format, the result set was a lot larger at around 3 million rows

Not sure if this is affecting your test but having latency between your database and the postgrest server causes responses to go slower. Postgrest currently makes multiple calls to the db per request and that compounds the problem. I've got a branch in progress where those four or five calls are consolidated into one.

I'm not sure if this helps anyone, but this has been use in production for 3 months and works well if anyone needs something in the meantime:
https://github.com/JarvusInnovations/lapidus

I can provide some insight into real world usage and just how useful this is beyond the most simple of scenarios if anyone is interested.

For this to be truly useful you need a way to combine multi-table changes, apply transformations and apply RLS/ACL.

Simply streaming a table will work in some cases, I just wanted to be upfront in that it's not a silver bullet and may be best handled in a scripting language. If Haskell can use LuaJIT for the transformations that'd be a really great way to do it.

If you were programming your own API there's a bunch of places where at request time it'd be better to emit an event (into your pubsub, websocket, or wherever it needs to go).

The equivalent to this in postgrest would be using triggers, perhaps with redis-fdw or NOTIFY.

PostgREST as written is unable to stream rows as they are received from the db. This is because it has the db build the whole multi-row json (or csv) response internally. PostgREST is returning a big fully-formed string that the db produces.

Sorry I recognize that this is a drawback about using PostgREST on big responses, but it's out of scope to change this behavior I think.

I followed this discussion and at the end I decided to write postgresql2websocket; it is a working prototype based on triggers and pg_notify that can be used independently from the PostgREST (thanks begriffs for this very nice piece of software).

@frafra looks great, I added it to the ecosystem page of the postgrest docs!

Correct me if I am wrong but I believe what @frafra has implemented is the conflated feature discussed in this thread (listening for changes).

Anyhow this recent attempt looks quite interesting: https://github.com/ocharles/streaming-postgresql-simple

Maybe it could be adapted to hasql ?

I classified postgresql2websocket under the External Notification category of the ecosystem. Seem accurate? https://postgrest.readthedocs.io/en/v0.4/intro.html#external-notification

Also you could open an issue on Hasql to discuss the streaming feature.

@begriffs no need to open issues with hasql
https://github.com/nikita-volkov/hasql-cursor-query
https://github.com/nikita-volkov/hasql-cursor-transaction

the ball is in postgrest court :)

Ah nice.

I am cautious about holding cursors open for long-running connections though because each one holds server resources and can interfere with vacuuming the db.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

cuducos picture cuducos  路  45Comments

waghanza picture waghanza  路  28Comments

daurnimator picture daurnimator  路  57Comments

isairz picture isairz  路  24Comments

nicklasaven picture nicklasaven  路  22Comments