Dvc: Got doubled data with `dvc add`

Created on 16 Dec 2018 · 3 Comments · Source: iterative/dvc

Information about the environment

dvc --version
0.22.0

Installed from the macOS package on macOS Mojave.

Bug report

I followed this tutorial and ran into a problem.
According to the tutorial, the dvc add command should not copy cached data. But as you can see below, the total folder size has doubled.

aiml/tutorial_dvc/classify  master
▶ du -sh data
 41M    data

aiml/tutorial_dvc/classify  master
▶ du -sh .dvc/cache
 41M    .dvc/cache

aiml/tutorial_dvc/classify  master
▶ du -sh .
 82M    .

The inode numbers of the two files are different, which may be the cause. Let me know if you need more information about my environment.

aiml/tutorial_dvc/classify  master
▶ ls -i data/Posts.xml.zip
4690717 data/Posts.xml.zip

aiml/tutorial_dvc/classify  master
▶ ls -i .dvc/cache/ec/
4688793 1d2935f811b77cc49b031b999cbf17

All 3 comments

I've checked issue #942, but the fix in 0.14.0 does not seem to be working.

Hi @TezRomacH !

Your system supports reflinks, so dvc used them to link the file from the cache into your workspace. No data duplication has occurred. Unlike a hardlink, a reflink to a file has a different inode, so it is a bit harder to see it working for yourself. Also, the du utility still counts them as two separate full-size files, even though there is no duplication at the filesystem level. We should definitely make this clearer in the documentation. Created https://github.com/iterative/dvc.org/issues/139 .
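To make the inode point concrete, here is a small sketch that compares inode numbers for a hardlink and for a plain copy. The helper name `inode_of` and the temporary file names are made up for illustration and are not part of dvc. It does not create a reflink, because the command for that depends on the platform.

```shell
# Print the inode number of a file (the first column of `ls -i`).
inode_of() {
    ls -i "$1" | awk '{ print $1 }'
}

# Scratch directory with hypothetical file names, used only for this demo.
workdir=$(mktemp -d)
printf 'example data\n' > "$workdir/original"

ln "$workdir/original" "$workdir/hardlink"   # hardlink: same inode as the original
cp "$workdir/original" "$workdir/copy"       # separate file: gets its own inode

echo "original: $(inode_of "$workdir/original")"
echo "hardlink: $(inode_of "$workdir/hardlink")"
echo "copy:     $(inode_of "$workdir/copy")"
```

The hardlink reports the same inode as the original, while the copy reports a different one. A reflinked file also reports a different inode, so `ls -i` alone cannot tell a reflink from a full copy.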

Also, to be sure that no duplication occurs, you can check the free space on your drive with the df utility: it will show that your free space did not drop by the file's size a second time after you ran dvc add on it.
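One way to run that df check is sketched below. The helper names `free_kb` and `free_delta_kb` are hypothetical, and the result is only approximate because any other process writing to the same drive changes the numbers too.

```shell
# Print the available space, in KiB, on the filesystem that holds the given path.
# `df -P` selects the portable POSIX output format, where column 4 is "Available".
free_kb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# Run a command and print how many KiB of free space it consumed.
# The body runs in a subshell so its variables do not leak into the caller.
free_delta_kb() (
    before=$(free_kb .)
    "$@" >&2 || exit 1        # the command's own output goes to stderr
    after=$(free_kb .)
    echo $((before - after))
)

# Example usage (not executed here):
#   free_delta_kb dvc add data/Posts.xml.zip
```

If the file was reflinked, the printed value should be far smaller than the file size. A value close to the file size would suggest that a real copy was made.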

Thanks for the feedback!

Oh, thanks, now it's clearer!
