aws-cli: s3 cp says ... file(s) remaining

Created on 15 Nov 2013 · 7 comments · Source: aws/aws-cli

I'm trying `aws s3 cp --recursive` for the first time, on a local directory containing nearly 3000 files. The progress indicator prints useless messages like this:

```
Completed 1470 part(s) with ... file(s) remaining
```

Shouldn't it tell me a number of files remaining?

All 7 comments

It will eventually tell you the number of files remaining, but not at first.

For both local and S3 copying, we don't read the entire list of files into memory, so we don't know at the outset how many files are going to be copied. For a local directory, we start walking the directory and immediately start copying files. In the case of S3, once we get the first page of results from the server, which returns 1000 objects at a time, we start downloading those objects.

Eventually we'll have walked through all the local files and will know how many total files we've queued up to copy. At that point we fill in the `...` with the correct number.
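
A minimal sketch of that streaming design (hypothetical names, not the actual aws-cli internals): a walker thread enqueues files as it discovers them, and the total is only known once the walk completes.

```python
import os
import queue
import threading

def walk_and_enqueue(root, tasks, state):
    # Enqueue files for copying as soon as they are discovered;
    # the total count is only known after the whole walk finishes.
    count = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            tasks.put(os.path.join(dirpath, name))
            count += 1
    state["total"] = count  # only now can "..." be replaced with a real number

tasks = queue.Queue()
state = {"total": None}  # None renders as "..." in the progress line
walker = threading.Thread(target=walk_and_enqueue, args=("/some/dir", tasks, state))
walker.start()

# A progress reporter would print, e.g.:
#   Completed 1470 part(s) with ... file(s) remaining     (state["total"] is None)
#   Completed 1470 part(s) with 1530 file(s) remaining    (after the walk finishes)
```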

I understand that the files remaining count is behaving as designed, but the design is flawed: For any transfer involving thousands of files, which is arguably the kind of transfer that most needs a progress indicator, the program spends the vast majority of its time displaying a completely useless "..." instead of something informative. This leaves the poor user with little idea of whether the transfer will take hours, days, or weeks. Something better really is needed here.

If you're trying to avoid a large local file system scan for performance reasons, please understand that anyone transferring thousands of files is probably willing to wait the extra few seconds for the scan to complete if it means they will get useful progress reporting. I would think this should at least be available as an option through a command line switch.

+1 to reopen this task. Displaying the number of parts is irrelevant when uploading thousands or millions of files to S3.
I have no way to guess when my `s3 sync` tasks will finish...

I think one option we have is to bump our maximum queue (buffer) size from 1000 to something much higher by default (maybe 10000). We could also expose this as a config option for users who are willing to trade off memory for a larger queue size. For local->s3 this is much less of an issue; the real issue is with downloads (s3->local).

For example, to download objects we need a list of objects, and S3's ListObjects only gives you 1000 keys per response. A ListObjects call takes roughly:

```
$ time aws s3api list-objects --bucket mybucket --no-paginate > /dev/null

real    0m1.160s
user    0m0.482s
sys     0m0.071s
```

So let's say 1 second per call. If you have 10 million objects, then just performing enough ListObjects calls to get the entire list of keys to download (memory issues aside) takes 10,000,000 / 1,000 = 10,000 requests. At a rate of 1 second per request, that's about 2.8 hours. Even if you _could_ get perfect parallelism across, say, 10 threads, it would still take around 17 minutes just to know exactly how many keys there are.
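
As a concrete illustration of that cost, here is a rough sketch using boto3 (assuming boto3 is installed and credentials are configured; `mybucket` is a placeholder) that pages through a listing 1000 keys at a time and measures the per-page latency:

```python
import time

import boto3  # assumes boto3 is installed and AWS credentials are configured

s3 = boto3.client("s3")
# list_objects_v2 pages are capped at 1000 keys, just like ListObjects
paginator = s3.get_paginator("list_objects_v2")

pages = keys = 0
start = time.time()
for page in paginator.paginate(Bucket="mybucket"):  # "mybucket" is a placeholder
    pages += 1
    keys += page.get("KeyCount", 0)
elapsed = time.time() - start

# e.g. 10 million keys => ~10,000 pages; at ~1 s/page that's ~2.8 hours serially
print(f"{keys} keys over {pages} pages in {elapsed:.1f}s "
      f"({elapsed / max(pages, 1):.2f}s per page)")
```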

That being said, I think there are some things we can do to improve this. I've filed https://github.com/aws/aws-cli/issues/699 to track this and #489 is also related.

Hi.
First remark: counting local files and storing each of their descriptions in memory don't have the same impact. To share personal experience, I have worked many times in my Java apps with hundreds of thousands of file descriptors without any memory problem.

Second remark: the count of remaining files doesn't need to be absolutely exact at the beginning; it could be an approximation, updated frequently until the whole list has been retrieved.

Of course listing S3 files takes time, but it's I/O time, not CPU time, and that's why threads were made :)
I.e., if you have a thread dedicated to listing the files (locally or remotely) that continuously appends new upload/download tasks to a thread pool of, say, 50 workers, then the UI can progressively increment the counter (and this would also increase the up/download speed significantly; have a look at https://github.com/bloomreach/s4cmd: it is less stable and only works with S3, but it's faster :-))

So the output could look like this:

`completed 12,345 of ~46,000 (at least) files`

Then, when the "listing" thread has finished, you have an accurate counter:

`completed 34,567 of 123,456 files`

Does it make sense?
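
A minimal sketch of this proposal (illustrative only, not actual aws-cli code): a dedicated lister thread feeds a worker pool while the progress line shows an approximate total until the listing completes.

```python
import concurrent.futures
import threading

progress = {"done": 0, "listed": 0, "listing_finished": False}
lock = threading.Lock()

def transfer(key):
    # A real worker would upload/download here.
    with lock:
        progress["done"] += 1

def lister(keys, pool):
    # Dedicated listing thread: submit work as keys arrive, counting as we go.
    for key in keys:
        with lock:
            progress["listed"] += 1
        pool.submit(transfer, key)
    with lock:
        progress["listing_finished"] = True

def progress_line():
    with lock:
        if progress["listing_finished"]:
            return f"completed {progress['done']:,} of {progress['listed']:,} files"
        return f"completed {progress['done']:,} of ~{progress['listed']:,} (at least) files"

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    keys = (f"key-{i}" for i in range(123456))  # stand-in for a paged S3 listing
    t = threading.Thread(target=lister, args=(keys, pool))
    t.start()
    print(progress_line())  # while listing: usually the "~N (at least)" form
    t.join()
print(progress_line())  # after listing and transfers finish: exact count
```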

I don't think I explained my previous comment very well. Your suggested architecture is how the AWS CLI s3 commands are currently implemented. My comment in #699 about increasing the max queue size (or letting it be user configurable) is due to the fact that we enforce a maximum size on the task queue: the thread that is listing objects puts tasks onto a task queue, and if we hit the max size we block until worker threads pull tasks off the queue, thereby opening up slots.

This is done to avoid unbounded growth. It is frequently the case that we can add tasks to the task queue much faster than the worker threads can process them, so given a bucket with enough keys you will eventually run out of memory. For example, even just creating instances of an empty class:

```python
>>> class FakeTask(object):
...     pass
...
>>> import psutil
>>> p = psutil.Process()
>>> p.get_memory_info().rss / 1024.0 / 1024.0
6.140625      # <--- starting process is ~6MB
>>> t = [FakeTask() for i in xrange(1000000)]
>>> p.get_memory_info().rss / 1024.0 / 1024.0
68.70703125   # Creating 1 million tasks is ~68MB
>>> t = [FakeTask() for i in xrange(5000000)]
>>> p.get_memory_info().rss / 1024.0 / 1024.0
359.16796875  # Creating 5 million tasks is ~359MB
```
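
And here is a minimal sketch of that bounded task queue, using the standard library's `queue.Queue` (illustrative only, not the CLI's actual task code): the lister blocks whenever the queue is full, which caps memory at the cost of stalling the listing.

```python
import queue
import threading
import time

tasks = queue.Queue(maxsize=1000)  # at most 1000 queued tasks held in memory

def lister():
    for i in range(5000):
        tasks.put(i)   # blocks whenever the queue is full, until a worker frees a slot
    tasks.put(None)    # sentinel: the listing is finished

def worker():
    while True:
        item = tasks.get()
        if item is None:
            break
        time.sleep(0.001)  # simulate a (slow) transfer

threading.Thread(target=lister).start()
w = threading.Thread(target=worker)
w.start()
w.join()
```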

#699 talks about raising the max queue size limit for users who are willing to trade off memory for better progress indicators. For some people ~359MB might not be a big deal; for others I imagine it would be.

I think we're both in agreement: we can certainly improve the state of transfer progress for the high-level s3 commands. I also think adding the approximate count so far would be a great addition.
