Dvc: pull: going completely crazy on large dataset

Created on 6 Dec 2019 · 12 Comments · Source: iterative/dvc

Version:

DVC version: 0.73.0
Python version: 3.6.8
Platform: Linux-4.15.0-1056-aws-x86_64-with-Ubuntu-18.04-bionic
Binary: False
Package: None
Filesystem type (cache directory): ('xfs', '/dev/nvme1n1')
Filesystem type (workspace): ('xfs', '/dev/nvme1n1')

Reproduce:

git clone [email protected]:iterative/dataset-registry-private.git
cd dataset-registry-private
dvc pull -j 100 ILSVRC.dvc

Output:

(sorry for the video in this format, was the fastest way to capture it while it was running)

https://www.dropbox.com/s/m821sgc1flk770o/dvc-pull-going-crazy.mov?dl=0

Labels: bug, p1-important, research, ui

All 12 comments

Should be fixed by combining pbars. @casperdcl

-j 100 would be the issue (nested bars scrolling off the screen). I should be able to come up with a temp patch for now. P.S. what happened with the old method? No scrolling but lots of flickering and very slow?

@casperdcl what's the point? Let's go for the actual fix, i.e. combining into a single pbar.

The point would be I can downgrade a p1 to p2/p3 today, but don't think I'll have time for a full fix soon.

@casperdcl p1 is not p0 (fix asap), p1 can wait. TBH, I see combining threads as top priority in pbars (maybe even UI as a whole) now since it's the highest value, so no point doing something else instead/in the meantime unless it makes dvc unusable.

Totally agree with @Suor .

Btw dvc push (without any explicit multithread args) also has this issue.

The problem appears when the number of bars exceeds the number of rows in the terminal.
We create a Tqdm instance in each thread, so when the number of jobs (threads) is greater than the number of rows in the terminal, Tqdm draws the bars on top of each other.

  • I saw the solution proposed in PR https://github.com/iterative/dvc/pull/3453, which hides bars:
    it limits the visible bars to some number and replaces each finished one with one from the background. Unfortunately, that solution is not ideal and interacts with Tqdm internals. If something like this is needed, I think it is better done on the library side.
  • Another way is to return early from self.display when self.pos >= ROWS, but that also runs into problems with the internals.
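To make the "limit visible bars" idea concrete, here is a minimal sketch of just the bookkeeping, independent of Tqdm. It is not the code from the PR; `VisibleSlots` and its methods are hypothetical names. The assumption is that each thread asks for a screen row before drawing, draws only if it gets one (for example by passing the row as Tqdm's `position` argument), and gives the row back when its transfer finishes. How a hidden bar is later promoted to a freed row is left out.

```python
import threading


class VisibleSlots:
    """Hypothetical helper: hands out at most `limit` screen rows."""

    def __init__(self, limit):
        self._free = list(range(limit))  # unused rows, lowest first
        self._lock = threading.Lock()

    def acquire(self):
        """Return a free row index, or None if the bar must stay hidden."""
        with self._lock:
            return self._free.pop(0) if self._free else None

    def release(self, row):
        """Give a row back once its bar has finished."""
        if row is None:
            return  # the bar was hidden, nothing to return
        with self._lock:
            self._free.append(row)
            self._free.sort()  # keep handing out the topmost row first


slots = VisibleSlots(limit=3)
rows = [slots.acquire() for _ in range(5)]  # five jobs, three visible rows
slots.release(rows[1])                      # the second job finishes
reused = slots.acquire()                    # a waiting job takes over its row
```

In practice the limit would probably come from the terminal height, e.g. `shutil.get_terminal_size().lines`, minus a row or two for other output.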

Hi @casperdcl, I know that you are the author of the Tqdm library (awesome tool :+1: :1st_place_medal: ), could you take a look at my PR and share your thoughts, please?

Another approach is to limit the display to a single bar that shows accumulated progress. That requires reworking a few Tqdm calls in dvc pull, dvc push, and dvc add (when adding >1Gb files), and providing a wrapper that receives updates through a Queue and draws a single bar. I like that approach, but it would affect a lot of places in the system.
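A rough sketch of that single-bar idea, using only the standard library so it stays self-contained. The names `worker` and `run_with_single_bar` are made up for illustration, the "files" are just lists of chunk sizes, and `report` stands in for whatever would actually draw the bar (in dvc that would presumably be one Tqdm instance whose `update(n)` is called from the consuming thread only).

```python
import queue
import threading


def worker(chunks, updates):
    """Pretend to transfer one file, reporting each chunk size to the queue."""
    for size in chunks:
        updates.put(size)


def run_with_single_bar(files, report=print):
    """Run one thread per file; only this thread reports progress."""
    total = sum(sum(chunks) for chunks in files)
    updates = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(chunks, updates))
        for chunks in files
    ]
    for t in threads:
        t.start()

    done = 0
    while done < total:
        done += updates.get()  # blocks until some worker reports progress
        report(f"{done}/{total} bytes")

    for t in threads:
        t.join()
    return done


messages = []
transferred = run_with_single_bar([[1, 2, 3], [4, 5]], report=messages.append)
```

Since workers never touch the terminal, the number of threads no longer matters for rendering. The sketch ignores worker failures; a real wrapper would need a way to stop waiting if a thread dies before reporting all of its bytes.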

I should ask - are there any operations where a user would ever benefit from having more than about 5 threads? Surely I/O is a bottleneck for anything more? Maybe we can just set an upper limit on the number of threads?

@casperdcl 5 is not a limit for many applications. Downloading millions of tiny files from s3 benefits from dozens of workers, for example.

@casperdcl limiting the number of threads because of the progress bar is a very weird approach, feels like we are not solving the right issue here :smile:

not because of the progress bar, hah. I'm just asking if we can avoid solving the progress bar problem by solving a different problem.

Apparently not. :)
