Disclaimer: this may be a gcloud-node problem or a GCS problem - not really sure.
I'm working on a project where I download a lot of images and then upload them in parallel to GCS. When the number of simultaneous operations against GCS trends toward 100-200, I get random connection reset failures on some of the requests:
[Error: Could not authenticate request read ECONNRESET]
When I trend towards 300 simultaneous operations, the app just crashes outright with this error:
stream.js:74
      throw er; // Unhandled stream error in pipe.
      ^

Error: socket hang up
    at createHangUpError (_http_client.js:200:15)
    at Socket.socketOnEnd (_http_client.js:292:23)
    at emitNone (events.js:72:20)
    at Socket.emit (events.js:166:7)
    at endReadableNT (_stream_readable.js:905:12)
    at nextTickCallbackWith2Args (node.js:442:9)
    at process._tickCallback (node.js:356:17)
I boiled this down to a fairly small repro:
var request = require('request');
var uuid = require('node-uuid');
var util = require('util');
var gcloud = require('gcloud')({
  projectId: 'YOUR PROJECT ID',
  keyFilename: 'keyfile.json'
});

var storage = gcloud.storage();
var bucket = storage.bucket('scaletest');

var count = 0;
for (var i = 0; i < 200; i++) {
  (function() {
    var idx = count++;
    var name = uuid.v4();
    var file = bucket.file(name);
    console.log('Request #' + idx);
    request("http://jbeckwith.com/images/head.png")
      .pipe(file.createWriteStream())
      .on('finish', () => {
        console.log("Success! #" + idx);
        file.delete();
      }).on('error', (err) => {
        console.error("Fail! #" + idx + "\n\t" + util.inspect(err));
      });
  })();
}

setInterval(() => {}, 1000);
And a package.json:
{
  "name": "maxconn",
  "version": "1.0.0",
  "description": "",
  "main": "test.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "gcloud": "^0.29.0",
    "node-uuid": "^1.4.7",
    "request": "^2.69.0"
  }
}
I suspect there's an endpoint in GCP throttling the number of active connections from a single client, and I'm bumping up against that limit.
Another thing to throw out there: in a more complex scenario where I'm also making calls to Pub/Sub and Cloud Vision, I noticed that those connections start to fail too, not just storage.
Thanks for putting this together. It clearly looks like multiple things become more likely to :boom: once several hundred requests are in flight. Do you think there's something we should do, or is this just a matter of not making so many requests at once? We added instructions on how to fight similar errors here (see the last section): https://googlecloudplatform.github.io/gcloud-node/#/docs/v0.29.0/guides/troubleshooting, which basically uses async to throttle the requests.
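For anyone who wants to see the throttling idea without pulling in a library, here is a minimal, dependency-free sketch. It is not what the troubleshooting guide uses (that relies on async), and `createLimiter`, `push`, `activeCount`, and `pendingCount` are hypothetical names made up for this example, not part of gcloud-node.

```javascript
// Hypothetical helper (not part of gcloud-node): runs at most `limit`
// callback-style tasks at a time and queues the rest in FIFO order.
function createLimiter(limit) {
  var active = 0;
  var pending = [];

  // Start queued tasks until the concurrency limit is reached.
  function drain() {
    while (active < limit && pending.length > 0) {
      start(pending.shift());
    }
  }

  function start(task) {
    var finished = false;
    active++;
    task(function done() {
      if (finished) return; // ignore duplicate completion signals
      finished = true;
      active--;
      drain();
    });
  }

  return {
    push: function (task) { pending.push(task); drain(); },
    activeCount: function () { return active; },
    pendingCount: function () { return pending.length; }
  };
}

// Usage sketch with the upload from the repro (assumes `request`, `file`,
// and the image URL from that script; 20 is an arbitrary limit):
//
// var limiter = createLimiter(20);
// limiter.push(function (done) {
//   request(url)
//     .on('error', done)
//     .pipe(file.createWriteStream())
//     .on('finish', done)
//     .on('error', done);
// });
```

Each task signals completion through `done`, so a failed upload frees its slot just like a successful one. What limit is safe is something you would have to find by experiment; nothing in this thread establishes a specific number.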
Thanks for the async tip! That's a convenient workaround. My intermittent authentication errors are something that folks can detect and work around. I'm way more concerned with the outright crash that can happen from time to time if you push things too far. Would it be possible for us to handle the async request queuing for the user, preventing folks like me from shooting themselves in the foot?
We do use throttling where we can (such as Bucket#deleteFiles). In that case, we know from the start that what the user wants will require many API requests. In the example case here where multiple File objects are created and immediately written to, I think it would be unexpected to have hidden logic that ignores a user's most recent requests.
For the crash error, can you see if adding an error handler on the request stream picks that up?
request("http://jbeckwith.com/images/head.png")
  .on('error', (err) => {
    console.error("Fail! #" + idx + "\n\t" + util.inspect(err));
  })
  // .pipe(...
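The reason this matters: `pipe()` does not forward errors from the source stream to the destination, so a handler attached after `.pipe(...)` only sees errors from the write side. A rough sketch of wiring up both ends, using only Node's built-in streams (`pipeWithErrors` is a hypothetical helper name, not a gcloud-node or request API):

```javascript
var stream = require('stream');

// Hypothetical helper: pipe `source` into `dest` and call `callback` exactly
// once, either with the first error from either stream or with null when
// `dest` finishes. It does not clean up or close the other stream on failure.
function pipeWithErrors(source, dest, callback) {
  var called = false;

  function finish(err) {
    if (called) return; // report only the first outcome
    called = true;
    callback(err || null);
  }

  source.on('error', finish); // e.g. the download failing
  dest.on('error', finish);   // e.g. the upload failing
  dest.on('finish', function () { finish(); });

  source.pipe(dest);
}
```

In the repro this would be called as something like `pipeWithErrors(request(url), file.createWriteStream(), cb)`. Whether the "socket hang up" crash actually originates on the source side is exactly what the test above is meant to find out.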
I ended up using async.parallel as a workaround:
https://github.com/JustinBeckwith/cloudcats/blob/master/worker/analyzer.js#L67
I verified the errors weren't coming from request, rather from the pipe to GCS, sometimes pub/sub, and sometimes the vision API. I think I'm bumping up against some invisible quota limit.
I'm not sure if we can make any changes to this library that would help. Any ideas?
I'm running into what seems to be the exact same problem. I'm working on a project where we scrape images (because we have to, based on an agreement with a data provider) and send them to Cloud Storage. There are also a lot of Pub/Sub messages being published and received, and after a while the process slows and then eventually fails. I notice this issue was closed, but there doesn't seem to be a real resolution?
@edclement Have you tried throttling your requests?
Sorry you're running into this @edclement. Back when I was digging into this issue, I was able to reproduce the error; however, it didn't come from our streams. It came from the request stream.
If it's possible to put together a repo that we can clone and make crash, I'm happy to dive back in and find a resolution once and for all.
Yes, we solved this by implementing our own flow control to throttle requests to a level that didn't run into errors. Overall our process takes quite a bit longer now, but at least it doesn't fail.