Make a database query returning a stream of results, as opposed to current implementation that returns an all-in-one in-memory array.
Original description
I'm not very experienced with LoopBack yet, but I tried to investigate various methods for streaming the result of a simple query to a client (instead of sending the whole result over HTTP at once).
I found that
Did I miss anything else, or does LoopBack not yet support this feature, leaving me to implement it myself?
Just to highlight my use case: I currently have an Express.js app running on Heroku that I'm trying to rewrite in LoopBack. Heroku has a 30-second request timeout out of the box (a hard limit), and my application fetches potentially tens of thousands of records, so some requests may take longer than 30 seconds to respond. A simple way to avoid the timeout is to stream the result over sockets; it's also more user-friendly, since the user can see the loading progress.
@Krisa : I'm looking into this issue. Will update you when I have something to share. Thanks.
@Krisa : Hi, I think using server-sent events is close to your use case. Let me know your thoughts.
It would be great if you could share any sample implementation you might have seen on the web for your use case.
Even using that, I would still be left with the implementation myself.
My current implementation uses Express.js, with Mongoose handling the connection to the MongoDB server (and fetching records) and node ws streaming the result to the client. It's completely bespoke, so it's not really worth sharing more than that. I could probably do the same in LoopBack in a hacky, non-LoopBack way, likely bypassing any ACLs, etc., but I would be happy to have something more or less supported natively.
I didn't want to reference any other framework here, but for the sake of avoiding any misunderstanding, this is exactly what I'm looking for...however with Loopback :-) => 1- the client starts a request, which is 2- processed by the server and 3- the server streams the output to the client.
@Krisa : Hey, it looks like LB doesn't have out-of-the-box streaming for load events. I was able to implement streaming using a loaded operation hook and by registering a remote method.
Take a look at PR above that I sent to my sandbox repo. I believe that's something close to what you want.
Try cloning the repo, switching to the branch, launching the server, and running:
curl -X GET --header "Accept: application/json" "http://localhost:3000/api/MyModels"
This sounds like a feature request for an internal component to add for streaming data (i.e. websockets). @ritch Do you have any opinion here? Should this remain in userland, or be something that we should be looking at implementing as part of LB?
@bajtos : PTAL and share your views as well.
> Heroku has a 30s timeout out of the box (hard limit) and my application is fetching potentially 10s of thousands of records so that some requests may take longer than 30s to respond. Simple solution to avoid the timeout is to stream the result over sockets - it's also more user friendly since the user is able to see the loading going on.
Why aren't you batching / paging the requests? I think making several smaller requests would be a simple solution to this problem.
OTOH I think the feature request is valid and something I've wanted for a while. The ability to create a cursor against a loopback datasource and incrementally respond with data.
We do have support for responding with data incrementally, as @gunjpan pointed out, using server-sent events. You can use this without ChangeStreams. Here is a basic example that batches the queries into pages of 10 and streams them to the client using server-sent events.
var PassThrough = require('stream').PassThrough;
var clone = require('lodash').clone;
var async = require('async');

module.exports = function(User) {
  var DEFAULT_LIMIT = 10;

  User.stream = function(limit, filter, cb) {
    var stream = new PassThrough({ objectMode: true });
    var clonedFilter = clone(filter) || {};
    var page = 0;
    var isDone = false;

    limit = limit || DEFAULT_LIMIT;
    clonedFilter.limit = limit;

    // Hand the stream back to the caller before the paging loop starts.
    cb(null, stream);

    async.whilst(function() {
      return !isDone;
    }, function(next) {
      clonedFilter.skip = page * limit;
      User.find(clonedFilter, function(err, users) {
        if (err) return next(err);
        stream.write(users);
        page++;
        // A short page means we have reached the end of the result set.
        if (users.length < limit) isDone = true;
        next();
      });
    }, done);

    function done(err) {
      stream.write({ end: true, error: err || null });
      stream.end();
    }
  };

  User.remoteMethod('stream', {
    description: 'Create a get stream.',
    accessType: 'READ',
    http: [
      { verb: 'get', path: '/stream' }
    ],
    accepts: [{
      arg: 'limit',
      type: 'number'
    }, {
      arg: 'filter',
      type: 'object'
    }],
    returns: {
      arg: 'stream',
      type: 'ReadableStream',
      json: true
    }
  });
};
> OTOH I think the feature request is valid and something I've wanted for a while. The ability to create a cursor against a loopback datasource and incrementally respond with data.
:+1:
Thank you @ritch, this makes sense and answers some of my initial problems (also, I didn't know about stream.PassThrough, it looks interesting...). Yet this solution may not be fully optimal performance-wise, since it opens a new cursor on the database for every page (i.e. db request => records streamed => db request => etc.). Is there eventually a way to make a stream from the request like below?
My current solution (mongoose/ws) indeed takes advantage of the stream operator on mongo itself. Very simplified example:
var _stream = mongo.Collection.find({...}, 'field1 field2 etc.').lean().batchSize(_batchSize).stream();
_stream.on('data', streamData(_stream));
_stream.on('error', streamError);
_stream.on('end', streamEnd);

function streamData(myStream) {
  return function(data) {
    ws.send(data);
  };
}
The feature request makes sense to me too. I think it has two parts:
1) How to make a database query returning a stream of results, as opposed to current implementation that returns an all-in-one in-memory array.
2) How to get the stream of results to the HTTP client (web browser).
As @ritch pointed out in his comment, 2) should be already available.
The remaining part is 1), which is something that needs to get implemented in loopback-datasource-juggler and then (eventually) in all connectors.
Having written that, I'm afraid we don't have the bandwidth to work on this feature in the near future (the next 3 months at least).
@bajtos agreed. I could get the example above working (with a few amendments to the code), so streaming (of anything) is definitely working well.
Regarding 1), I wanted to see whether issuing a new query every time is really slower:
I ran each sample 5 times:
The example using Collection.find(...) takes between 5 and 6 seconds (code similar to the one proposed by Ritch).
The example using Collection.find(...)....stream() takes between 2 and 2.5 seconds (similar to the code shown above).
Definitely looking forward to some support for query streaming.
EDIT: a simple Collection.find({}) (returning my 21,477 records in just one query) using LoopBack is still slower than mongodb#stream, for some reason I don't understand. It takes roughly 3 to 3.5 seconds.
@Krisa : Meanwhile, did you have a chance to look at this PR: https://github.com/gunjpan/sandbox/issues/2 . It should help you to move forward while we wait for this feature implementation. Thanks.
Thanks @gunjpan for taking the time to share that. It's an elegant solution for streaming from the server to the client, but it's similar to the one Ritch proposed earlier and doesn't help further with streaming from the database to the server.
I have not experimented beyond the relatively simple performance test I shared in my previous post, but I'm concerned that the built-in adapter is noticeably slower than the native MongoDB one, even with any streaming excluded. I understand streaming is not supported yet, but when you implement it, you may want to double-check whether there are underlying raw-performance issues.
Any update on this? It's been 14 months since any activity. I find it very frustrating that LoopBack doesn't offer any sort of streaming solution. How are we supposed to work with large data sets?
Is there any update on this for version 2.x?
Is it possible that we'll see this in loopback 3.x?
Definitely an interesting feature.
I suppose it's not going to happen for LoopBack 2.
Is it in the pipeline for LoopBack 3 / loopback-next?
LoopBack version 3 is in LTS, we won't be adding any new features.
Feel free to open a new issue in loopback-next if you are interested in this feature.