DVC: after renaming folders in a 132 GB dataset, dvc push waits for 5 days then crashes

Created on 18 Sep 2019 · 19 Comments · Source: iterative/dvc

Please provide information about your setup
DVC version(i.e. dvc --version), Platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB(Linux), RPM(Linux))

jmollevi@vr-desktop:~$ dvc --version
0.59.2

installed from pip on debian stretch amd64


After renaming folders in a 132 GB dataset, dvc push waits for 5 days and then crashes.

The entire folder data/cloud-mask-training-set-1 is added as one DVC file.

When I rename some of the folders directly below it, dvc push fails after 5 days.

The folders are shown in the output below.

jmollevi@vr-desktop:~/projects/dvctest$ ls data/cloud-mask-training-set-1
final-33UUB-2018-07-04_1  final-34VCM-2018-08-12_1  final-34WDS-2018-06-24_1
final-33VUE-2018-05-18_1  final-34VDM-2018-10-01_1  final-34WDT-2018-07-01_1
final-33VUF-2018-11-09_1  final-34WDS-2018-02-08_1  metadata.csv
final-33VWC-2018-09-26_1  final-34WDS-2018-04-27_1
final-33WXR-2018-03-12_1  final-34WDS-2018-06-01_1


-----

jmollevi@vr-desktop:~/projects/dvctest/data$ time dvc push cloud-mask-training-set-1.dvc 
  1%|          |azure://jmollevid21225/2363465 [34:34<137911:06:44,   212s/file]No handlers could be found for logger "XXX"
ERROR: unexpected error - Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. ErrorCode: AuthenticationFailed
<?xml version="1.0" encoding="utf-8"?><Error><Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:4c2ab6b1-c01e-0050-77ae-6de43f000000
Time:2019-09-17T23:21:19.0002290Z</Message><AuthenticationErrorDetail>Request date header too old: 'Tue, 17 Sep 2019 23:01:05 GMT'</AuthenticationErrorDetail></Error>

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!

real    8093m13.659s
user    8018m2.480s
sys 8m56.291s
jmollevi@vr-desktop:~/projects/dvctest/data$

All 19 comments

@JohanMollevik Thanks for reporting the issue!
Can I ask you for some additional explanation?

  • We have a 132 GB dataset
  • The dataset has been added under DVC control

Now, do you change the sub-folder names during the push?
Or did you change the sub-folder names and then try to commit the changes?

I change subfolder names, do a commit and then run into problems when trying to push that.

@JohanMollevik Have you tried to run dvc push again? It will only try to upload missing files.

@JohanMollevik Also, which remote are you using? azure?

Yes, it was still taking hours (then I had to abort it)
How does it identify files? Hash or path?
Yes I am using Azure.

@JohanMollevik Was it uploading anything when you aborted it? If so, the next dvc push will take less time, because it has fewer files to push.

So the way it works is as follows: when you add your directory to dvc as a whole, dvc saves all individual files to .dvc/cache with keys that correspond to their md5 hash, plus a special .dir cache file for the directory structure itself. When you dvc push, it checks whether it needs to upload any files to the remote or whether they already exist there. After that, it uploads all locally used files that are missing on the remote. That is why, if your dvc push fails, subsequent dvc pushes won't upload the same files again, but will only upload the difference that is left.
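
To illustrate the hash-versus-path question, here is a minimal in-memory sketch of the scheme described above. It is not DVC's actual code: `build_cache` and `push` are hypothetical helpers, the "remote" is a plain dict, and the JSON layout of the `.dir` entry is an assumption made only for illustration.

```python
import hashlib
import json
from typing import Dict, List


def md5_hex(data: bytes) -> str:
    """Return the hex md5 digest used as a cache key."""
    return hashlib.md5(data).hexdigest()


def build_cache(files: Dict[str, bytes]) -> Dict[str, bytes]:
    """Store each file under its md5, plus one '.dir' entry for the listing.

    The '.dir' format here (a JSON list of path/md5 pairs) is an assumption.
    """
    cache: Dict[str, bytes] = {}
    listing = []
    for path in sorted(files):
        key = md5_hex(files[path])
        cache[key] = files[path]
        listing.append({"relpath": path, "md5": key})
    dir_blob = json.dumps(listing).encode()
    cache[md5_hex(dir_blob) + ".dir"] = dir_blob
    return cache


def push(cache: Dict[str, bytes], remote: Dict[str, bytes]) -> List[str]:
    """Upload only the keys the remote does not have yet; return those keys."""
    missing = [key for key in cache if key not in remote]
    for key in missing:
        remote[key] = cache[key]
    return missing


remote: Dict[str, bytes] = {}
before = {"final-A/img1.bin": b"pixels-1", "final-A/img2.bin": b"pixels-2"}
after = {"renamed-A/img1.bin": b"pixels-1", "renamed-A/img2.bin": b"pixels-2"}

first = push(build_cache(before), remote)   # two files + one .dir entry
second = push(build_cache(after), remote)   # only the new .dir entry
print(len(first), len(second))  # 3 1
```

Under these assumptions, renaming a folder changes only the directory listing, so the second push has a single new key to upload. Checking which keys are missing still means looking at every file in the listing, which may be where the time goes with millions of files.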

Btw, how many files does your directory have approx?

@efiop judging by the progress bar it's 100 billion files )

It looks like it was collecting and pushing files for so long that the authentication headers expired. The Azure client failed to handle that situation, so it's their bug. We can most probably work around it by recreating the BlockBlobService instance.

P.S. There is no progress bar around dir cache collection in ._collect_used_dir_cache(). I wonder how long that takes.
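
A rough sketch of what "recreating the client" could look like. This does not use the Azure SDK: `AuthExpiredError`, `make_client`, and `client.upload` are hypothetical stand-ins, and the retry limit is an arbitrary assumption.

```python
from typing import Callable, Iterable


class AuthExpiredError(Exception):
    """Hypothetical stand-in for the SDK's authentication failure."""


def upload_all(keys: Iterable[str], make_client: Callable[[], object],
               max_retries: int = 3) -> int:
    """Upload every key, building a fresh client whenever auth fails.

    Returns the number of times the client had to be recreated.
    """
    client = make_client()
    recreated = 0
    for key in keys:
        for attempt in range(max_retries + 1):
            try:
                client.upload(key)  # hypothetical method on the client object
                break
            except AuthExpiredError:
                if attempt == max_retries:
                    raise  # give up instead of looping forever
                client = make_client()  # fresh client, fresh credentials
                recreated += 1
    return recreated
```

The idea is only that a long-running push should not depend on one client object staying valid for days. Whether a fresh instance actually avoids the "Request date header too old" error would need to be tested against Azure.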

The folders contain roughly 2.5 million files.

I am trying push again, we will see after the weekend if it went any different.

@JohanMollevik 2.5 million files should be bearable. So there are two steps to the push operation:

  • collecting a list of files to upload
  • uploading those files

How long does it take before you see a progress bar? Are all the files about the same size?

@efiop @casperdcl why does the progress bar not show the total number of files? Did we remove it?

  1%|          |azure://jmollevid21225/2363465 [34:34<137911:06:44,   212s/file]

Looks like the total is 2363465.

Also, a 137911:06:44 ETA? That's insane. Even if it didn't crash after 5 days it would take over 15 years to complete.
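
The "over 15 years" figure can be checked from the numbers in the progress bar. A quick sketch, assuming the ETA field is hours:minutes:seconds and using 365.25 days per year:

```python
def hms_to_seconds(text: str) -> int:
    """Convert an 'H:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


SECONDS_PER_YEAR = 86400 * 365.25

# ETA as printed by the progress bar.
eta_years = hms_to_seconds("137911:06:44") / SECONDS_PER_YEAR

# Cross-check: total file count times the reported rate of 212 s/file.
cross_check_years = 2363465 * 212 / SECONDS_PER_YEAR

print(round(eta_years, 1), round(cross_check_years, 1))  # 15.7 15.9
```

The two estimates agree to within about 1%, which matches the bar being roughly 1% complete at that point.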

Yes, Azure upload is very slow here. @JohanMollevik is uploading with the Azure CLI slow for you too?

@Suor I have not tried that; I only set up that Azure data store for DVC and have not used it for anything else. It did manage to push the data initially, though, in about 4 days.

I will try to make time to test azure cli upload speed and get back on this.

@Suor Still no progress bar 2 hours later

@JohanMollevik You can also try adding a single file to a new dvc repo and pushing it, then remove the cache and pull the file back. It should be simpler than setting up azure-cli. A 60 KB file should do.

You can remove local cache by executing the following in the repo root:

rm -rf .dvc/cache

@Suor I have now aborted the second push after letting it run over the weekend with no output and 1 core at 100% CPU.
When I created a new dvc repo and pushed a 60 KB file, it completed in 2 seconds.
Pulling the file back after removing the cache took 1.5 seconds.

@JohanMollevik Any update on this? Are you still experiencing the same issue?

Closing due to inactivity. For the record: I was testing a ~300G dataset with millions of images by pushing/pulling/adding it to S3, and it worked slowly but fine. Please ping us if you are still experiencing this issue. Thank you!

I have been testing this some more and the problem seems to be gone; it now completes in 4 hours.
