Traceback (most recent call last):
File "/Users/ophir/anaconda3/envs/p2/bin/dvc", line 11, in <module>
sys.exit(main())
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/site-packages/dvc/main.py", line 63, in main
Runtime.run(CmdDataSync)
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/site-packages/dvc/runtime.py", line 41, in run
sys.exit(instance.run())
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/site-packages/dvc/command/data_sync.py", line 47, in run
pool.map(cloud.sync, targets)
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
socket.error: [Errno 32] Broken pipe
P.S.
Is syncing the file manually using the aws cli a viable workaround, or are there other things done during the sync (updating a status file or something similar)?
Hi!
Thank you for reporting this to us. Could you please share some additional info? This is dvc 0.8.5 from pip, right? How big of a file are we talking about?
There is additional info (i.e. a hash) being used during the sync, so using the aws cli directly in a naive way probably wouldn't work =(.
it's indeed 0.8.5 from pip
the file is ~8GB
copying the file didn't work, even though I copied it from the cache directory (where the hash is part of the file name)
Hm... I just tried to sync 20G files and it worked just fine. @ophiry could you please try upstream dvc? I.e. clone it somewhere and do "pip uninstall dvc && ./build_package.sh && pip install dist/dvc-0.8.5.tar.gz".
@efiop it looks like a concurrency issue, not a file size issue.
Closed by 059fcc0
Package version: 0.8.6
@ophiry please update the package: pip install -U dvc
Btw... thank you for reporting the bug.
@dmpetrov Were you able to reproduce?
Yes, it was easy to reproduce on Mac. The issue was related to a zero-size file (data/empty).
I guess Mac OS has a special version of some library (multiprocessing probably).
Oh, those Macs... Thanks for the info.
still have the same problem after upgrading to 0.8.6
(1/10): [ ] 0% data.mdb_ded3948489cTraceback (most recent call last):
File "/Users/ophir/anaconda3/envs/p2/bin/dvc", line 11, in <module>
sys.exit(main())
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/site-packages/dvc/main.py", line 64, in main
Runtime.run(CmdDataSync)
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/site-packages/dvc/runtime.py", line 41, in run
sys.exit(instance.run())
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/site-packages/dvc/command/data_sync.py", line 45, in run
map_progress(cloud.sync, targets, self.parsed_args.jobs)
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/site-packages/dvc/utils.py", line 67, in map_progress
p.map(func, targets)
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/Users/ophir/anaconda3/envs/p2/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
socket.error: [Errno 32] Broken pipe
Wow.
Thank you for letting us know. Reopening...
@ophiry As we've discovered with @dmpetrov, the issue looks to be related to the network connection being lost when your laptop goes into standby mode while syncing big files. I just merged a temporary fix https://github.com/dataversioncontrol/dvc/pull/109 that will at least notify us if something goes wrong. A proper fix would be to implement partial download/upload: huge files are not unusual in dvc scenarios, and it would be great if we could continue a download/upload from where we left off: https://github.com/dataversioncontrol/dvc/issues/108.
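For reference, a rough sketch of the resume idea with boto 2 (the S3 library dvc used at the time). This is only an illustration, not dvc's actual code; the function name and the paths are placeholders:

```python
# Illustrative sketch only (not dvc's implementation): resume an interrupted
# S3 download by requesting just the byte range that is still missing locally.
import os

from boto.s3.connection import S3Connection


def resume_download(bucket_name, key_name, local_path):
    key = S3Connection().get_bucket(bucket_name).get_key(key_name)

    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    if offset >= key.size:
        return  # already fully downloaded

    with open(local_path, 'ab') as fobj:
        # The Range header makes S3 send only the remaining bytes.
        key.get_file(fobj, headers={'Range': 'bytes=%d-' % offset})
```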
@ophiry could you please try out new dvc once again so we could confirm that the issue is indeed caused by lost network connection? Thank you.
this is the latest log (installed from master)
Checksum miss-match. Re-uploading is required.
.cache/quality_model/training/rdb.lmdb/data.mdb_ded3948489c: 0.0B transferred out of 8.6GB
(1/10): [ ] 0% data.mdb_ded3948489c.cache/quality_model/training/rdb.lmdb/data.mdb_ded3948489c: 0.0B transferred out of 8.6GB
(1/10): [ ] 0% data.mdb_ded3948489c.cache/quality_model/training/rdb.lmdb/data.mdb_ded3948489c: 0.0B transferred out of 8.6GB
(1/10): [ ] 0% data.mdb_ded3948489c.cache/quality_model/training/rdb.lmdb/data.mdb_ded3948489c: 0.0B transferred out of 8.6GB
(1/10): [ ] 0% data.mdb_ded3948489c.cache/quality_model/training/rdb.lmdb/data.mdb_ded3948489c: 0.0B transferred out of 8.6GB
(1/10): [ ] 0% data.mdb_ded3948489c.cache/quality_model/training/rdb.lmdb/data.mdb_ded3948489c: 0.0B transferred out of 8.6GB
(1/10): [ ] 0% data.mdb_ded3948489c.cache/quality_model/training/rdb.lmdb/data.mdb_ded3948489c: 0.0B transferred out of 8.6GB
(1/10): [ ] 0% data.mdb_ded3948489cFailed to upload ".cache/quality_model/training/rdb.lmdb/data.mdb_ded3948489c": [Errno 32] Broken pipe
@ophiry could you please clarify: did your laptop fall into sleep mode during the download? It looks like the issue happens after sleep mode (most likely when the network is off for 5+ minutes).
If not - when did this issue happen: right after dvc sync ..., or 10-20 minutes later?
it wasn't during sleep, it was a few minutes after the sync started.
the strange thing is that, according to the progress bar, no data was transferred at all
communicating with s3 through the aws cli worked
Hi @ophiry ! Sorry for such a long delay, I'm back on this issue again.
Unfortunately, I'm still not able to reproduce it, but after a bit of googling I found https://github.com/boto/boto/issues/621, which sounds very similar. However, we actually do explicitly specify the host when creating S3Connection, so that bug should not occur. That said, I see that we construct the host as 's3.%s' and not as 's3-%s' (notice the '-' instead of the '.'), which doesn't always formally apply: as you can see in http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region, us-west-2 is only listed with s3-us-west-2 and not s3.us-west-2, though both hosts are ping-able. What is your s3 region in dvc.conf?
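To illustrate the difference (a sketch only; the region value is just an example, not taken from dvc's config):

```python
# Sketch of the two host forms mentioned above; boto's S3Connection accepts an
# explicit host. The region value here is just an example.
from boto.s3.connection import S3Connection

region = 'us-west-2'
conn_dot = S3Connection(host='s3.%s.amazonaws.com' % region)    # 's3.<region>' form
conn_dash = S3Connection(host='s3-%s.amazonaws.com' % region)   # 's3-<region>' form
```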
the region is us-east-1, which is the region of the bucket
not sure it's relevant, but the issue is with an lmdb file
Sorry again for such a delay. I managed to reproduce this issue (only on Mac; other platforms work fine) with files that are >5GB: tested on 8GB and 4GB files, the former reproduced the issue and the latter uploaded just fine. That makes sense, as aws actually mentions this limitation in their docs, but the strange part is that on Linux this limit doesn't seem to cause any problem. So this issue should be fixed by https://github.com/dataversioncontrol/dvc/issues/163. I'm working on implementing it right now and hope to deliver it within 24h =). Thank you for your patience.
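For background, S3 rejects single PUT uploads above 5GB, so files that large have to go through the multipart upload API. A rough boto 2 sketch of the idea (illustrative only, not the actual dvc fix; the function name and part size are arbitrary):

```python
# Illustrative multipart upload with boto 2 (not the actual dvc code).
# Files above 5GB cannot be sent as a single PUT, so they are split into parts.
import math
import os

from boto.s3.connection import S3Connection

PART_SIZE = 50 * 1024 * 1024  # 50MB parts (S3 requires at least 5MB per part)


def multipart_upload(bucket_name, key_name, local_path):
    bucket = S3Connection().get_bucket(bucket_name)
    multipart = bucket.initiate_multipart_upload(key_name)
    try:
        size = os.path.getsize(local_path)
        n_parts = int(math.ceil(size / float(PART_SIZE)))
        with open(local_path, 'rb') as fobj:
            for part_num in range(1, n_parts + 1):
                bytes_left = size - (part_num - 1) * PART_SIZE
                # Reads `size=` bytes from the file's current offset.
                multipart.upload_part_from_file(fobj, part_num,
                                                size=min(PART_SIZE, bytes_left))
        multipart.complete_upload()
    except Exception:
        multipart.cancel_upload()
        raise
```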
@ophiry I merged #178. Tested it on my mac, everything seems fine now. Could you confirm that it works for you too? Also note that you can now use 'dvc push data/DATA' command for pushing data to the cloud ;) . Feel free to reopen this issue if anything is still wrong.
Oh, actually, just a second. Seems like I broke it.
Looks like md5 got screwed. Reverted #178 . Reopening this issue for now.
do you mean that the etag no longer contains the md5 of the full file?
A possible workaround is to store the md5 in git (in the state file) when importing a file, and use this value as the "ground truth" for md5
do you mean that the etag no longer contains the md5 of the full file?
Yes, precisely, multipart uploads add a giant hassle with the md5.
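To spell out the hassle: for a multipart upload, S3's ETag is not the md5 of the whole file but the md5 of the concatenated per-part md5 digests with a '-<part count>' suffix, so it can only be reproduced locally if you know the exact part size that was used. A small sketch (my illustration, not dvc code):

```python
# Sketch: reproduce a multipart ETag locally, given the part size used for upload.
import hashlib


def multipart_etag(local_path, part_size):
    digests = []
    with open(local_path, 'rb') as fobj:
        for chunk in iter(lambda: fobj.read(part_size), b''):
            digests.append(hashlib.md5(chunk).digest())
    # e.g. '9b2cf535f27731c974343645a3985328-3' for a 3-part upload;
    # note this differs from the plain md5 of the whole file.
    return '%s-%d' % (hashlib.md5(b''.join(digests)).hexdigest(), len(digests))
```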
A possible workaround is to store the md5 in git (in the state file) when importing a file, and use this value as the "ground truth" for md5
That's a brilliant idea! Thank you! I will implement it shortly.
Actually, the problem with storing the md5 in the state file is that even though it will help us determine whether the local data has changed, we will still have to download the data from the cloud to verify it, because getting the md5 of a multi-part object stored in the cloud is still a hassle.
A better solution would be to store the original md5 in the object's metadata when uploading the file, so we have easy access to it without actually having to download the full file. Ok, I will try this out and will get back soon =)
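A minimal sketch of that idea with boto 2 (illustrative only; the 'dvc-md5' metadata key, the bucket name, and the paths are made up for this example):

```python
# Sketch: store the original md5 as user metadata at upload time, then read it
# back later without downloading the object. The 'dvc-md5' metadata key, bucket
# name and paths are hypothetical.
import hashlib

from boto.s3.connection import S3Connection


def file_md5(path):
    md5 = hashlib.md5()
    with open(path, 'rb') as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b''):
            md5.update(chunk)
    return md5.hexdigest()


bucket = S3Connection().get_bucket('my-bucket')

# Upload: attach the md5 as metadata before sending the data. For multipart
# uploads the same dict can be passed via initiate_multipart_upload(metadata=...).
key = bucket.new_key('.cache/data.mdb_ded3948489c')
key.set_metadata('dvc-md5', file_md5('data.mdb'))
key.set_contents_from_filename('data.mdb')

# Later: verify against the remote md5 without downloading the whole file.
remote_md5 = bucket.get_key('.cache/data.mdb_ded3948489c').get_metadata('dvc-md5')
```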
Thank you!
Right, storing md5 in metadata worked out great. We should be set now.
Feel free to reopen this issue if anything is still wrong.