I'm trying aws s3 cp --recursive for the first time, on a local directory containing nearly 3000 files. The progress indicator prints useless messages like this:
Completed 1470 part(s) with ... file(s) remaining
Shouldn't it tell me a number of files remaining?
It will eventually tell you the number of files remaining, but not at first.
For both local and S3 copying, we don't read the entire list of files into memory, so we don't know up front how many files are going to be copied. For a local directory, we start walking the directory and immediately begin copying files. For S3, once we get the first page of results from the server, which returns 1000 objects at a time, we start downloading those objects.
Eventually we'll have walked through all the local files and will know the total number of files queued up to copy. At that point we fill in the "..." with the correct number.
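For illustration, a minimal sketch of that streaming approach in plain Python (not the actual CLI code; copy_file is a hypothetical stand-in for the real transfer logic):

import os

def walk_files(root):
    # Yield files one at a time instead of building the full list up front,
    # so copying can begin before the walk has finished.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def copy_tree(root, copy_file):
    completed = 0
    for path in walk_files(root):
        copy_file(path)  # hypothetical transfer callable
        completed += 1
        # The total is unknown until os.walk finishes, hence the "..." placeholder.
        print(f"Completed {completed} file(s) with ... file(s) remaining")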
I understand that the files remaining count is behaving as designed, but the design is flawed: For any transfer involving thousands of files, which is arguably the kind of transfer that most needs a progress indicator, the program spends the vast majority of its time displaying a completely useless "..." instead of something informative. This leaves the poor user with little idea of whether the transfer will take hours, days, or weeks. Something better really is needed here.
If you're trying to avoid a large local file system scan for performance reasons, consider that anyone transferring thousands of files is probably willing to wait the extra few seconds for the scan to complete if it means they will get useful progress reporting. I would think this should at least be available as an option through a command line switch.
+1 to reopen this task. Displaying the number of parts is irrelevant when uploading thousands or millions of files to S3.
I have no way to guess when my s3 sync tasks will finish...
I think one option we have is to bump our maximum queue (buffer) size from 1000 to something much higher by default (maybe 10000). We can also possibly expose this as a config option for users who are willing to trade off memory for a larger queue size. For local->s3 this is much less of an issue. The real issue is with downloads (s3->local).
For example, to download objects, we need a list of objects. S3's ListObjects only gives you 1000 keys per response. A ListObjects call roughly takes:
$ time aws s3api list-objects --bucket mybucket --no-paginate > /dev/null
real 0m1.160s
user 0m0.482s
sys 0m0.071s
So let's say 1 second. If you have 10 million objects, then just to perform enough ListObjects calls to get an entire list of keys to download (memory issues aside) would take 10000000 / 1000 = 10000 requests. At a rate of 1 second per request, that's about 2.8 hours. Even if you _could_ get perfect parallelism across, say, 10 threads, it would still take around 17 minutes just to know exactly how many keys there are.
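As a rough illustration of that listing cost, a sketch using boto3's paginator (not the CLI's code; "mybucket" is just the example bucket from the timing above). It needs one ListObjects request per 1000 keys before the total is known:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects")

total_keys = 0
requests = 0
for page in paginator.paginate(Bucket="mybucket"):  # example bucket name
    requests += 1
    total_keys += len(page.get("Contents", []))

# At ~1 second per request, 10 million keys means ~10000 requests,
# i.e. roughly 2.8 hours of listing before the exact total is known.
print(f"{total_keys} keys counted in {requests} ListObjects requests")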
That being said, I think there are some things we can do to improve this. I've filed https://github.com/aws/aws-cli/issues/699 to track this and #489 is also related.
Hi.
1st remark: counting local files and storing each file's description in memory don't have the same impact. That said, to share personal experience, I have played many times with my Java apps on hundreds of thousands of file descriptors without any memory problem.
2nd remark: the count of remaining files doesn't need to be absolutely exact at the beginning; it could be an approximation that is updated frequently until the whole list has been retrieved.
Of course listing S3 files takes time, but it's I/O time, not CPU time, and that's why threads were made :)
I.e., you have a thread dedicated to listing the files (locally or remotely) that continuously appends new upload/download tasks to a thread pool of, let's say, 50 workers.
Then the UI could progressively increment the counter (and this would increase the up/download speed significantly; have a look at https://github.com/bloomreach/s4cmd: it is less stable and only works with S3, but it's faster :-))
So the output could look like this:
completed 12,345 of ~46,000 (at least) files remaining.
then when the "listing" thread has finished, you have an accurate counter:
completed 34,567 of 123,456 files remaining.
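For illustration, a minimal sketch of that idea with Python's standard library (list_keys and download are hypothetical placeholders, not AWS CLI internals):

import threading
from concurrent.futures import ThreadPoolExecutor

def transfer_all(list_keys, download, workers=50):
    # list_keys() yields keys as listing pages arrive; download(key) moves one file.
    counts = {"listed": 0, "done": 0}
    lock = threading.Lock()
    listing_done = threading.Event()

    def run_one(key):
        download(key)
        with lock:
            counts["done"] += 1
            done, total = counts["done"], counts["listed"]
        # While the listing is still running, the total is only a lower bound.
        suffix = " (at least)" if not listing_done.is_set() else ""
        print(f"completed {done:,} of ~{total:,}{suffix} files")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for key in list_keys():      # the listing loop feeds the pool continuously
            with lock:
                counts["listed"] += 1
            pool.submit(run_one, key)
        listing_done.set()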
Does that make sense?
I don't think I explained my previous comment very well. Your suggested architecture is how the AWS CLI s3 commands are currently implemented. My comment in #699 about increasing the max queue size (or letting it be user configurable) is due to the fact that we enforce a maximum size on the task queue: the thread that is listing objects puts tasks onto a task queue, and if we hit the queue's maximum size we block until worker threads pull tasks off, thereby opening up slots.
This is done to avoid unbounded growth. It is frequently the case that we can add tasks to the task queue much faster than the worker threads can process them, so given a bucket with enough keys you would eventually run out of memory. For example, even just creating instances of an empty class adds up (a bounded-queue sketch follows these numbers):
>>> class FakeTask(object):
... pass
...
>>> import psutil
>>> p = psutil.Process()
>>> p.get_memory_info().rss / 1024.0 / 1024.0
6.140625 # <--- starting process is ~6MB
>>> t = [FakeTask() for i in xrange(1000000)]
>>> p.get_memory_info().rss / 1024.0 / 1024.0
68.70703125 # Creating 1 million tasks is ~68MB
>>> t = [FakeTask() for i in xrange(5000000)]
>>> p.get_memory_info().rss / 1024.0 / 1024.0
359.16796875 # Creating 5 million tasks is 359 MB
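To make that bounding concrete, here is a small sketch with the standard-library queue (the key source and download callable are illustrative, not the CLI's actual internals):

import queue
import threading

# A maxsize keeps the listing thread from outrunning the workers:
# put() blocks once 1000 tasks are waiting, which caps memory use.
task_queue = queue.Queue(maxsize=1000)

def lister(keys):
    for key in keys:
        task_queue.put(key)       # blocks while the queue is full
    task_queue.put(None)          # sentinel: no more work is coming

def worker(download):
    while True:
        key = task_queue.get()
        if key is None:
            task_queue.put(None)  # re-queue the sentinel for other workers
            break
        download(key)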
I think we're both in agreement: we can certainly improve the state of transfer progress for the high-level s3 commands. I also think adding the approximate count so far would be a great addition.