We have hit an issue where the stream exposed by createReadStream() appears to end prematurely, before all data has been delivered to downstream listeners, without ever emitting an error.
I have a repro case that I spent some time building.
A few caveats:
'use strict';

// Currently using 2.1.50
var AWS = require('aws-sdk');
var util = require('util');
var stream = require('stream');

var s3 = new AWS.S3({accessKeyId: '<AWSACCESSKEYID>', secretAccessKey: '<AWSSECRETACCESSKEY>', region: '<AWSREGION>'});

// Test bucket to get data from
var testBucket = "<bucket>";
// Large file
// I have tested with 1.8GB Uncompressed
// I have also tested with 271MB compressed by adding zlib.createGunzip() in
// pipe() stream below
var testPath = "<path>";

var _chunkSize = 1000000;

function SlowStream(options) {
  // Keep track of how many bytes have passed through
  this.bytesParsed = 0;
  // Keep track of how many _chunkSize chunks we have seen
  this.chunkNumber = 0;

  options = options || {};
  options.objectMode = true;
  stream.Writable.call(this, options);

  // Output info when done about how many megabytes were parsed
  this.once('finish', function() {
    console.info('SlowStream finish');
    console.info('Megabytes Parsed: ' + this.bytesParsed / (1024 * 1024));
  });
}
util.inherits(SlowStream, stream.Writable);

SlowStream.prototype._write = function _write(row, enc, done) {
  // Track bytes
  this.bytesParsed += row.length;
  // Output bytes seen
  console.info('Mb Parsed: ' + this.bytesParsed / (1024 * 1024));

  // Every ~1 MB pause for 5 seconds to simulate writing to a slow
  // target and create the need for back pressure
  if (Math.floor(this.bytesParsed / _chunkSize) > this.chunkNumber) {
    this.chunkNumber = Math.floor(this.bytesParsed / _chunkSize);
    console.info('paused');
    setTimeout(done, 5000);
  }
  else {
    setImmediate(done);
  }
};

// Get stream to read from S3
var readStream = s3.getObject({Bucket: testBucket, Key: testPath}).createReadStream();

// Create slow writable stream
var slowStream = new SlowStream();

// Indicate done
slowStream.once('finish', function() {
  console.info('Done');
  process.exit(0);
});

// Do the piping
readStream.pipe(slowStream);
Ultimately I need to get to a place where this reads the whole file. If I swap the S3 s3.getObject().createReadStream() for a local file stream, everything works fine, but that doesn't help given why we are using S3.
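For reference, this is roughly the local-file variant I used to rule out SlowStream itself (the path is just a placeholder):

var fs = require('fs');

// Same SlowStream as above, fed from a local copy of the file instead of S3.
// This variant runs to completion without issue.
var fileStream = fs.createReadStream('/path/to/local/copy/of/file');
fileStream.pipe(new SlowStream());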
@chriskinsman
Can you share what version of node you're testing with as well?
I will definitely take a look at your reproduction case. Have you tried the latest version of the SDK, and do you still see the same issue?
@chrisradek I have tested the above on node 0.10.21 and 0.10.40.
Happy to try the latest SDK tonight. I didn't see anything in the changelog that would indicate a change in behavior.
@chrisradek Tried 2.2.21 with no luck
@chrisradek Looks like it works on node v4.2.2 and node 0.12.8
@chriskinsman
Thanks for testing further. Is upgrading your version of node a usable workaround for the time being? I will still look into the issue with node 0.10.
The upgrade was rocky, but we have upgraded just the piece of code that is impacted by this. Moving the whole stack was a non-starter.
@chriskinsman
Just wanted to give you an update on my investigation so far.
I haven't seen this issue present itself in version 0.12.x of nodejs or higher. Worth mentioning is that Streams3 is available in 0.12.x and higher, but not in 0.10.x. I believe the main difference is that Streams3 supports 'push' and 'pull' of data at the same time, whereas Streams2 only supports one at a time. Right now I'm still figuring out what we can reasonably do from the SDK to resolve this issue.
Just wanted to chime in and say I'm suddenly experiencing this on 0.10.30 as well.
https://github.com/aws/aws-sdk-js/pull/612 fixes this issue. We have patched our existing SDK version until we can upgrade.
That was not my experience. I was using that version and still had the issue on 0.10.XX
I've been struggling with similar issues.
Not sure if this might help or not, but you might try s3-streams: https://github.com/izaakschroeder/s3-streams
I am now seeing this again with node v4.2.2 and SDK version 2.2.48
@chriskinsman
Can you update to at least version 2.4.12 of the SDK?
2.4.12 added a content-length check when using streams. If the amount of data downloaded is less than the content-length reported by S3, the stream will emit an error.
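For example (a minimal sketch reusing the bucket/key placeholders from the repro above), you can listen for that error on the stream:

var readStream = s3.getObject({Bucket: testBucket, Key: testPath}).createReadStream();

readStream.on('error', function(err) {
  if (err.code === 'StreamContentLengthMismatch') {
    // The download was truncated; decide here whether to retry or fail the job
    console.error(err.message);
  }
});

readStream.pipe(slowStream);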
Just updated to 2.5.3 and running some tests now...
@chrisradek
I now see this error:
StreamContentLengthMismatch: Stream content length mismatch. Received 24405573 of 1628095990 bytes.
It's great that you warn me, but what is the recovery strategy at this point? I have read 24MB of a 1.6GB file.
In my case this is gzipped data, so I can't seek to an arbitrary point in the stream. Thoughts?
Also, how do I find out why the stream was closed early so I can try to prevent it? This is an EC2 instance in the same region as the S3 object.
@chriskinsman
I made a module recently that might be helpful for you.
https://github.com/tilfin/s3-block-read-stream
@tilfin Hilarious! I built something similar:
https://www.npmjs.com/package/s3-stream-download
I was waiting to attach it to this thread until we had been using it for a week or so. It has fixed the issue for us. I will check out yours also...
Also running into this issue with aws-sdk 2.6.4.
I'm attempting to download ~3k files, and with every run (concurrently streaming 20 at a time) between 2 and 10 streams fail, never on the same files. No error events are triggered on the stream.
To rule out slow writable streams while piping, I skipped piping and attached a no-op 'data' handler to each S3 stream so the data is consumed as quickly as possible.
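A minimal sketch of that drain-only check (bucket and key are placeholders):

var readStream = s3.getObject({Bucket: bucket, Key: key}).createReadStream();

// No piping: just drain the stream as fast as possible and log what happens
readStream.on('data', function() { /* discard */ });
readStream.on('error', function(err) { console.error('error', err); });
readStream.on('end', function() { console.log('end'); });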
Running synchronously (i.e. one download stream at a time) I still hit the issue.
@chriskinsman's lib seems to work though!
I have the same issue
aws-sdk version is 2.6.5
node version is 6.2.1
I'm building a list of keys and trying to download their contents in series (no concurrency at all).
I don't see errors when the list has 1000 entries.
With a list of several thousand, the program always fails with this error:
===> (1249/55858) Start processing 'archiveHistory/heroes-fb/type/2016/05/21/type-.account.registration.log.zst' of size 312117 ...
ERROR: { StreamContentLengthMismatch: Stream content length mismatch. Received 273853 of 312117 bytes.
    at PassThrough.checkContentLengthAndEmit (/home/alex/workspace/stash/utils/node_modules/aws-sdk/lib/request.js:599:15)
    at emitNone (events.js:91:20)
    at PassThrough.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:926:12)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickDomainCallback (internal/process/next_tick.js:122:9)
  message: 'Stream content length mismatch. Received 273853 of 312117 bytes.',
  code: 'StreamContentLengthMismatch',
  time: 2016-10-03T19:16:34.504Z }
My code:
async.eachSeries(plan, (planItem, next) => {
  ...
  async.waterfall([
    s3_loadObject.bind(null, planItem.Key, loadedFilePath),
    ...
  ], next);
}, ...);

function s3_loadObject(key, outFileName, callback) {
  const readerParams = {
    Key: key
  };

  var reader = s3.getObject(readerParams).createReadStream();
  var writer = fs.createWriteStream(outFileName);
  reader.pipe(writer);

  writer.on("error", callback);
  reader.on("error", callback);
  reader.on("end", () => {
    console.log(`Stored as: '${outFileName}'`);
    callback();
  });
}
It seems that it was my fault.
I replaced the code
reader.on("end", () => {
console.log(`Stored as: '${outFileName}'`);
callback();
});
with
writer.on("finish", () => {
console.log(`Stored as: '${outFileName}'`);
callback();
});
and my script processed 5120/46536 objects without any errors
RESOLVED
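For anyone else who trips over this, the corrected helper (same names as in my snippet above, with the reader/writer error handlers kept) ends up looking like:

function s3_loadObject(key, outFileName, callback) {
  const reader = s3.getObject({ Key: key }).createReadStream();
  const writer = fs.createWriteStream(outFileName);

  reader.on("error", callback);
  writer.on("error", callback);

  // Wait for the writer to flush everything to disk, not for the reader to end
  writer.on("finish", () => {
    console.log(`Stored as: '${outFileName}'`);
    callback();
  });

  reader.pipe(writer);
}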
Having this problem with node 6.8 and aws-sdk 2.7.7. Happens occasionally downloading larger files (videos) and writing them to the EBS on an EC2 machine.
I too am getting this error with node 6 - large files are failing with StreamContentLengthMismatch...any ideas how to resolve this?
[email protected] and [email protected] throw the same error with large files
Hi all - I'm not sure if this will help others, but it seems my issue had nothing to do with the code or framework versions and everything to do with the infrastructure. I was trying to download multiple files from S3 from an EC2 instance (actually from AWS ECS containers), and throughput was bottlenecked because I was accessing S3 via a NAT instance that was quite small and therefore not letting all the downloads/uploads succeed. So please do look at your network and make sure you have enough bandwidth.
Same error. Sounds like a network failure.
I tried to investigate how easy it would be to retry a failing chunk. It is a bit out of my league right now; this library is pretty broad, and the error is raised in a fairly abstracted, general request class.
+1 on this issue.
+1
We ended up splitting our files into a bunch of smaller files on S3 and reduced the number of parallel downloads we do with a queue
(i.e. streaming 10x1GB instead of 10GB).
That didn't remove the issue, but:
Any update on this? @chrisradek ?
We're seeing this on a regular basis, when downloading files from S3 to lambda. The files are ~100MB.
+1
I'm seeing this error on a semi-regular basis. Node v8.9.0
An update on our situation: we're seeing this maybe 20-30 times per day, but that's out of a total of around 25,000 downloads per day.
Our "solution" has been to compare the size of the downloaded file to the size of the file on S3, and retry if they don't match. It would be nice if the JS SDK could do this for us...
+1
Is the waitTime parameter the total time to completion or a timeout between bytes or other?
Closing old issues. The SDK correctly checks the object's content-length against the number of bytes S3 actually sends, and raises an error when a download is truncated.
To get around this issue, consider using getObject with ranges or parts to request smaller pieces, with retries.
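For anyone looking for a concrete starting point, here is a rough sketch of that range-based approach (the chunk size, retry count, and helper names are arbitrary; it assumes a recent Node and the SDK's .promise() support):

const CHUNK_SIZE = 8 * 1024 * 1024; // 8 MB parts, tune as needed

async function downloadRange(params, start, end, retries) {
  for (let attempt = 1; ; attempt++) {
    try {
      // Request just this byte range; a failure only costs us one part
      const res = await s3.getObject(Object.assign({}, params, {
        Range: 'bytes=' + start + '-' + end
      })).promise();
      return res.Body;
    } catch (err) {
      if (attempt >= retries) throw err;
    }
  }
}

async function downloadObject(params, outFile) {
  const head = await s3.headObject(params).promise();
  const size = head.ContentLength;
  const fd = fs.openSync(outFile, 'w');
  try {
    for (let start = 0; start < size; start += CHUNK_SIZE) {
      const end = Math.min(start + CHUNK_SIZE, size) - 1;
      const body = await downloadRange(params, start, end, 3);
      fs.writeSync(fd, body);
    }
  } finally {
    fs.closeSync(fd);
  }
}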
@chrisradek Why close this when people are still having this issue? There's no resolution for this, and making multiple getObject requests in small parts is not workable, as it results in multiple streams, which become unmanageable when piping to a process or another file.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.