With at least postgres, you're able to open a connection and request results in specified-sized chunks of rows -- could we have a Stream-like interface for Repo.all() that returns rows one at a time even before the query's done?
I looked a bit into it and I think postgrex might need patching for this to be possible, too (https://github.com/ericmj/postgrex/blob/master/lib/postgrex/protocol.ex#L254 seems to be the kind of thing that would need a bit of extension).
@zkat we could definitely support this! The first step is to support it in postgrex itself. Can you start a discussion on postgrex issues tracker? Pull requests are also welcome!
I will! And I might end up getting around to doing this PR myself. I did the same thing for https://github.com/zkat/epgsql_sane a while back.
EDIT: See ericmj/postgrex#7
Sweet, let us know if we can help with anything! Thank you!
@zkat I am closing this right now because we can't act on it. Please ping us to reopen it or open up a new issue once we have support it in postgrex. Thank you!
Looks like this got added! See #7.
I'm not sure how likely it is I'll do this PR myself anytime soon, but at the very least this is within the realm of possibility for y'all now.
Yes, definitely. Now that we have the socket in the client, it is easier to do this, but on the other hand, I am unsure if this will yield actual benefits because receiving socket data is always going to be faster than whatever processing your are doing and we don't want to hold the socket for all this time. It would likely be better to do it batches (fetch batches of 1000 elements, let's say).
I just want to clarify that postgrex doesn't support this currently, although @fishcakez has done pretty much all of the necessary ground-lying work.
Yes, definitely. Now that we have the socket in the client, it is easier to do this, but on the other hand, I am unsure if this will yield actual benefits because receiving socket data is always going to be faster than whatever processing your are doing and we don't want to hold the socket for all this time. It would likely be better to do it batches (fetch batches of 1000 elements, let's say).
You need to hold the connection for the entire team because there is no guarantee you get the same connection back if you check it in after fetching a subset of rows.
The benefits are orthogonal, this functionality will allow you to fetch a large number of rows without waiting for fetching and decoding all before you get the first result.
Postgrex now supports streaming query results, if anyone would like to do a proposal please feel free to start a discussion on mailing list.
Would be awesome to see this feature in Ecto!
We have an API at work that may be a potential candidate for an Elixir migration at some point, but as it sometimes needs to return results on the order of hundreds of thousands up to millions of rows, streaming is a prerequisite for any alternative approach we might pick.
While one could of course just use Postgrex, the composability of Ecto queries would be great to have.
I needed this feature so I wrote the following method within my Phoenix project. Would love to submit a pull request & it would be great if someone could confirm that I'm on the right track here & point me to the right place in the Ecto codebase to insert something like this.
Thanks!
def stream(statement, rows, callback, timeout) do
{db_pool,_} = Repo.__pool__
:poolboy.transaction db_pool, fn pid ->
Postgrex.transaction pid, fn conn ->
query = Postgrex.prepare!(conn, "", statement)
stream = Postgrex.stream(conn, query, [], max_rows: rows)
callback.(stream)
end, timeout: timeout
end
end
Hi @realdoug. I would recommend sending a proposal to the Ecto mailing list before working on a PR because we would need to consider a path to supporting this for multiple adapters.
To support this nicely in Ecto.Adapters.SQL, and this is not to say we will, I think we would want to extend DBConnection to support streams. Currently Postgrex hacks around DBConnection to make it work. Then Mariaex and Postgrex drivers would need to use the new stream API in DBConnection. Then call the drivers directly from Ecto. It is a substantial piece of work.
Ok, will do, thanks.
Most helpful comment
Postgrex now supports streaming query results, if anyone would like to do a proposal please feel free to start a discussion on mailing list.