I'm seeing very slow uploads when sending large numbers of medium-sized files to storage: on the order of 10x the time it takes to upload the same files to S3. I also see excessive CPU usage during this time.
This Gist is a repro in the form of a side-by-side comparison, uploading the same set of 750 medium-sized random text files to S3 using knox and to GCS using gcloud-node.
gsutil can upload the same number of files just as quickly as anything to S3, so I'm pretty sure it's not the service itself.
I've tried limiting the request module's default maxSockets to 5 or 10 for the request pool but that didn't seem to help. I have a hunch it's the sheer number of outstanding requests or streams that's causing node to spin its wheels and that maybe some form of global queue could fix it, but I haven't been able to validate that yet.
Any help getting this into a usable state would be much appreciated.
Thanks for the detailed breakdown. For me, the gcloud uploads have consistently completed in 88-90 seconds. However, there are two things that are on by default with every write stream which could be contributing to higher memory usage and slower speeds overall:
Shutting these off improved my times from 88-90 seconds to 65-67 seconds. I'm not sure if the others (knox or gsutil) use these techniques as well.
Ok, so it looks like you're right and a lot of the issue is that the upload is CPU-bound while computing the hash. I'm testing on multi-core machines that don't have huge individual processors, and I get a 5x improvement from turning off the validate functionality. It's still not as fast as the knox implementation, but it certainly is a good start for now.
Should we consider turning validation to false by default?
I think that's an issue for discussion.
I'd also like to get an idea of what gsutil is doing and how it compares from a performance standpoint. If the only differences are runtime and validation, then I think our work is done and we just need to make a call about whether to validate hashes by default...
@jgeewax gsutil uses the hashlib.md5 (standard library) and a third-party crcmod which uses a C extension for speedups.
Also, in terms of turning off checks, @thobrla filed GoogleCloudPlatform/gcloud-python#547 (via @silvolu) with us and said:
We should allow to override the configuration for the checks (e.g. use only
crc32c, do not perform checks), but it should be strongly discouraged.
It looks to me like gsutil is hashing the entire file at the end, whereas we hash the content as it comes in. You would think that doing it at the end would have worse performance, but doing it all at once might reduce thrashing and CPU context switching.
Interesting... Might be worth a benchmark...?
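A rough starting point for such a benchmark, as a sketch only: it compares hashing each chunk as it arrives against buffering the file and hashing it once at the end, using Node's built-in crypto module. The file count, the 50 KB file size, the 16 KB chunk size and all function names are assumptions chosen for illustration, not taken from gcloud-node or gsutil. It only measures raw hashing cost in a synchronous loop, so it cannot show the effect of interleaving hashing with network I/O.

```javascript
'use strict';
const crypto = require('crypto');

// Hash each chunk as it "arrives", the way a write stream would.
function hashIncrementally(chunks) {
  const hash = crypto.createHash('md5');
  for (const chunk of chunks) {
    hash.update(chunk);
  }
  return hash.digest('hex');
}

// Buffer everything first, then hash the whole payload in one call.
function hashAtEnd(chunks) {
  return crypto.createHash('md5').update(Buffer.concat(chunks)).digest('hex');
}

// Build `fileCount` fake files of `fileSize` random bytes, each split
// into chunks of at most `chunkSize` bytes.
function makeFiles(fileCount, fileSize, chunkSize) {
  const files = [];
  for (let i = 0; i < fileCount; i++) {
    const data = crypto.randomBytes(fileSize);
    const chunks = [];
    for (let offset = 0; offset < data.length; offset += chunkSize) {
      chunks.push(data.subarray(offset, offset + chunkSize));
    }
    files.push(chunks);
  }
  return files;
}

// Run `fn` once and return the elapsed wall-clock time in milliseconds.
function timeMs(fn) {
  const start = process.hrtime.bigint();
  fn();
  return Number(process.hrtime.bigint() - start) / 1e6;
}

// Assumed workload: 750 files of 50 KB, delivered in 16 KB chunks.
const files = makeFiles(750, 50 * 1024, 16 * 1024);
const incrementalMs = timeMs(() => files.forEach(hashIncrementally));
const atEndMs = timeMs(() => files.forEach(hashAtEnd));
console.log(`incremental: ${incrementalMs.toFixed(1)} ms, at end: ${atEndMs.toFixed(1)} ms`);
```

Both functions produce the same digest for the same bytes, so any difference in timing comes from how the work is batched rather than from the hash itself.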
I ran it on my machine and didn't see a huge spike in CPU (~20% max), but it was certainly more time-consuming to upload with default settings (validation and resumable are on). I'm running against S3 region us-east-1 and I'm located in Ottawa (mid-east Canada). S3 seems really sporadic with performance, as you can see by the fluctuation in numbers with no change in settings. GCS seems a little more stable in that regard.
Here are my numbers:
Finished uploading to s3 in 47.792 seconds
Finished uploading to GCS in 91.161 seconds
Finished uploading to s3 in 41.566 seconds
Finished uploading to GCS in 92.455 seconds
Finished uploading to s3 in 47.247 seconds
Finished uploading to GCS in 93.156 seconds
Finished uploading to s3 in 39.286 seconds
Finished uploading to GCS in 54.476 seconds
Finished uploading to s3 in 57.732 seconds
Finished uploading to GCS in 56.977 seconds
Finished uploading to s3 in 42.917 seconds
Finished uploading to GCS in 52.861 seconds
I haven't tried with gsutil yet or moving hashing to run at the end.
Do we have numbers comparing gsutil to GCS versus gcloud-node to GCS?
I'm working on that now.
So I used the following command to upload the directory of pre-extracted files to a bucket on GCS using gsutil. The -m option runs the copy as a multi-threaded/multi-process operation, which allows the uploads to occur in parallel and greatly speeds up the upload time:
$ time gsutil -m cp -r tmp gs://my-test-bucket/my-test-folder
Results:
22.04s user 2.93s system 71% cpu 34.866 total
21.00s user 2.99s system 67% cpu 35.477 total
21.66s user 2.92s system 70% cpu 35.069 total
When possible, gsutil hashes on the fly. I think there's only one case when using the boto library (which is not the default for interacting with GCS) where it hashes at the end.
Okay, so I won't bother messing with that. I did try and upload up to 10 at a time in parallel and got better results. It uploaded in 46.42s with default settings. It also seems resumable: false greatly speeds things up on my machine, and frankly I don't think resumable on a createWriteStream makes any sense so let's get rid of that. If they want resumable, they will have to use upload.
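For anyone wanting to try the "up to 10 at a time" approach, here is a minimal sketch of a concurrency limiter. It is not the code used for the numbers above; `mapWithLimit` and `fakeUpload` are hypothetical names, and `fakeUpload` is only a timer-based stand-in for a real upload call.

```javascript
'use strict';

// Run `task(item)` for every item with at most `limit` tasks in flight.
// `task` must return a Promise; results are returned in input order.
function mapWithLimit(items, limit, task) {
  if (!(limit >= 1)) {
    throw new RangeError('limit must be at least 1');
  }
  return new Promise((resolve, reject) => {
    const results = new Array(items.length);
    let next = 0;       // index of the next item to start
    let active = 0;     // number of tasks currently in flight
    let failed = false; // set once any task rejects

    function launch() {
      if (failed) return;
      if (next === items.length && active === 0) {
        resolve(results);
        return;
      }
      // Top up the pool until the limit is reached or no items remain.
      while (active < limit && next < items.length) {
        const index = next++;
        active++;
        Promise.resolve()
          .then(() => task(items[index]))
          .then((value) => {
            results[index] = value;
            active--;
            launch();
          })
          .catch((err) => {
            failed = true;
            reject(err);
          });
      }
    }

    launch();
  });
}

// Hypothetical stand-in for a real upload; resolves after a short delay.
function fakeUpload(name) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`uploaded ${name}`), 10);
  });
}

const names = Array.from({ length: 25 }, (_, i) => `file-${i}.txt`);
mapWithLimit(names, 10, fakeUpload).then((results) => {
  console.log(`${results.length} uploads finished`);
});
```

The first rejection fails the whole batch; uploads already in flight are left to finish, but no new ones are started.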
So, after looking at the different ways we can assist here, we believe that tweaking the upload settings can help with this specific use case. #400 will remain open to track simplifying the validation logic, but overall, when uploading relatively small files in bulk, it might be best to turn resumable off (it is on by default). We always recommend you keep validation on.
We have no way of specifying to the library that it should upload an arbitrarily large number of files, so it cannot make these decisions for you.
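Since that decision sits with the caller, one way to make it is per file, based on size. This is a sketch under assumptions: the 5 MB threshold and the `writeStreamOptions` helper are hypothetical, and the only option it sets is the `resumable` flag discussed above. Validation is left untouched so the library default (on) still applies.

```javascript
'use strict';

// Hypothetical cut-off: below this size, restarting a failed upload from
// scratch is assumed to be cheaper than the resumable bookkeeping.
const RESUMABLE_THRESHOLD_BYTES = 5 * 1024 * 1024;

// Build the options object to pass to createWriteStream for one file.
// Validation is deliberately not set, so the library default applies.
function writeStreamOptions(sizeBytes) {
  return { resumable: sizeBytes >= RESUMABLE_THRESHOLD_BYTES };
}

console.log(writeStreamOptions(50 * 1024));         // { resumable: false }
console.log(writeStreamOptions(200 * 1024 * 1024)); // { resumable: true }
```

The returned object would then be passed to createWriteStream for each file, so small bulk uploads skip the resumable path while large ones keep it.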
Also, if you want to try to gain more performance (albeit sacrificing a bit of customisability and idiomatic-ness), you might consider using a leaner, simpler library that is the equivalent of knox but for Google Cloud Storage. I would suggest taking a look at googleapis, Google's official Node.js client library.
Please feel free to re-open if you know of some way that performance could be greatly improved without sacrificing the defaults we have set in place for resumable and validation. Or just send a PR with the fix, that'd be even better and we'd be happy to merge! :smile:
I may not have a fix or a way to drastically improve performance, but that doesn't mean I think this is fixed. Turning off validation did give some significant performance gains; it was 5x faster on my setup without it. However, that's still half the speed of the S3 uploads. Whatever we did (and it wasn't just me, or my setup), we couldn't match how fast knox worked (turning resumable on/off didn't seem to make much difference for me, if any), even though the underlying GCS infrastructure could easily match S3 from the locations we tested using gsutil.
Given all this, and how knox managed to keep the 'customisability and idiomatic-ness' without sacrificing any speed, to me that seems a pretty big indicator there're some significant inefficiencies in this client.
Turning off validation is a very temporary solution to fit the timescale I have to work with for my project, but I really think this should be fixed before any 'Storage Stable' milestone - there's no way that a 10x performance hit is acceptable to anyone switching from S3 in my eyes.
If we're comparing apples to apples, then validation and resumable will need to be turned off, because knox is fast and great when everything works, but if something goes wrong, you're out of luck. There is no validation or resumable. That being said, not a lot can go wrong when you're uploading tiny 50 KB files. You're right, the performance is terrible for this case. Unsure how to proceed. As a Firebase guy, you deal with the tiny tiny micro-transactions a lot, so I understand your pain, but I'm unsure how relevant that will be to the average GCS user. In any case, I'll re-open to acknowledge the issue. I hope in due time we can come up with a compromise (but hopefully a solution).
I spent more time running the tests, including setting up an AWS account, unlike last time.
I remodeled the script a bit to remove duplication. In the process, I used knox.put instead of .putStream. putStream uses put internally.
Here's the new gist, and here are the results:
# 1
Finished uploading to s3 in 84.883 seconds
Finished uploading to google in 65.481 seconds
# 2
Finished uploading to s3 in 87.172 seconds
Finished uploading to google in 74.828 seconds
# 3
Finished uploading to s3 in 94.344 seconds
Finished uploading to google in 64.73 seconds
# 4
Finished uploading to s3 in 102.506 seconds
Finished uploading to google in 69.615 seconds
# 5
Finished uploading to s3 in 87.63 seconds
Finished uploading to google in 66.908 seconds
I can't make s3 work as fast as you reported or gcs as fast as Ryan reported.
I'm consistently getting 60-second uploads when using the googleapis library (at most 10 uploads in parallel). Surprisingly, I'm seeing slightly better performance (uploads between 55 and 60 seconds) from the gcloud-node library, but only when resumable is false. If resumable is true, it's super super slow and actually makes my computer's fan start, which never happens. CPU is at 99%. The entire upload took a whopping 487 seconds... :( Knox ran in 42-46 seconds.
Resumable has to do a sync read and a sync write per file passed through. That usually isn't expensive when it's one file, but it becomes noticeable (and noisy, apparently) with 750, understandably.
Hi all,
The biggest thing I am interested in here is GCS via gsutil, gcloud-node, and [insert another fast library here]. If we are on par with those libraries, I think we have to consider this bug closed.
Comparisons to S3 are great, but they say more about the GCS service speed than the client library, so I would prefer to leave them out of this discussion.
Do we have a snippet showing a performance comparison between gsutil and gcloud-node (and anything else in Node that talks to GCS)? Did I miss it somewhere?
Just as a follow-up on this: a lot has changed for the better since we filed this. A combination of the following factors leads me to consider this issue ready to be closed:
I think we're now seeing speeds in excess of our old system when everything works, and we're tracking individual issues elsewhere. Thank you all for your time.
Awesome -- thanks for all the help on this!