DVC uses reflinks for the cache, but it does not seem to use reflinks for local remotes. In the following example, I will create a large file, add it to DVC, and push it to a local remote. Disk space is consumed twice: when the file is created and when it is pushed to a local remote. When I remove the content of the local remote and manually reflink the content of the cache to the local remote, the disk space is reclaimed:
```sh
$ dvc --version
0.40.2
$ mkdir test remote_cache
$ cd test
$ dvc init --no-scm
```

```sh
$ df -h
/dev/sdb6 932G 756G 172G 82% /home
$ dd if=/dev/zero of=file.txt count=$((10*1024)) bs=1048576
$ ls -lh file.txt
-rw-r--r-- 1 witiko witiko 10G May 27 19:03 file.txt
$ df -h
/dev/sdb6 932G 766G 162G 83% /home
```

```sh
$ df -h
/dev/sdb6 932G 766G 162G 83% /home
$ dvc add file.txt
$ df -h
/dev/sdb6 932G 766G 162G 83% /home
```

```sh
$ df -h
/dev/sdb6 932G 766G 162G 83% /home
$ dvc remote add -d local ../remote_cache
$ dvc push
$ df -h
/dev/sdb6 932G 776G 152G 84% /home
```

```sh
$ df -h
/dev/sdb6 932G 776G 152G 84% /home
$ rm -rf ../remote_cache/*
$ cp -r --reflink=always .dvc/cache/* ../remote_cache
$ df -h
/dev/sdb6 932G 766G 162G 83% /home
```

I can circumvent the issue by setting `cache.dir` to `../../remote_cache`, but that will affect users who download Git repositories with my experiments. Therefore, my preferred workaround is to symlink `.dvc/cache` to `../../remote_cache`:

```sh
$ cd .dvc
$ rm -rf cache
$ ln -s ../../remote_cache cache
```

However, this workaround does not work when you have several local remotes, in which case you would need to symlink in the other direction (from the local remotes to `.dvc/cache`). The workaround also makes it impossible to defer publishing the latest changes in your cache.
In conclusion, it would be useful if the local remotes supported reflinks, like the cache does.
Cc @michal-ruzicka.
Hi @Witiko !
Great suggestion! It should be quite easy to implement: we'll just need to try using `System.reflink()` in the `download`/`upload` methods ([1], [2]) of `RemoteLOCAL` before falling back to `copyfile`. Maybe you would be willing to contribute a PR for it? 🙂 We will be happy to help you get started.
[1] https://github.com/iterative/dvc/blob/master/dvc/remote/local/__init__.py#L298
[2] https://github.com/iterative/dvc/blob/master/dvc/remote/local/__init__.py#L257
@Witiko could using `--local` or even `--global` with `dvc cache dir` be a workaround for you? This way you could override the cache location only in your local environment.
Also, you can consider enabling an optimization with hardlinks/symlinks for the whole project. Please check these two links:
https://dvc.org/doc/user-guide/cache-file-linking
https://dvc.org/doc/user-guide/update-tracked-file
It works very well if your data is mostly immutable - you don't edit _files_ manually (you can still add/remove files in directories) and you don't overwrite them in your scripts.
Please let me know if any of these workarounds can work for you.
> @Witiko could using `--local` or even `--global` with `dvc cache dir` be a workaround for you? This way you could override the cache location only in your local environment.

These are equivalent to symlinking `.dvc/cache` to `../../remote_cache`, i.e. no more than one local remote can be configured. For my current use case (publishing my DVC cache to the web using Apache), using `--global` is actually ideal, since all projects can share a single HTTPS remote, but it is not a general solution.

> Also, you can consider enabling an optimization with hardlinks/symlinks for the whole project. Please check these two links:
> https://dvc.org/doc/user-guide/cache-file-linking
> https://dvc.org/doc/user-guide/update-tracked-file
>
> It works very well if your data is mostly immutable - you don't edit _files_ manually (you can still add/remove files in directories) and you don't overwrite them in your scripts.
> Please let me know if any of these workarounds can work for you.

I use reflinks for the whole project (they are enabled by default), but this optimization does not extend to local remotes, i.e. files are always copied on `dvc push`, not symlinked, hardlinked, or reflinked.
@Witiko sure, this is more or less equivalent. You didn't mention that you need multiple local remotes ... Could you elaborate a little on this use case? How are you going to use them in your workflow?
Btw, HTTPS remotes do not support publishing - you can only `fetch`/`pull` from them.
> @Witiko sure, this is more or less equivalent. You didn't mention that you need multiple local remotes ...
> Could you elaborate a little on this use case? How are you going to use them in your workflow?

Even with a single local remote, using `dvc cache dir --local` ties the cache directly to the local remote, i.e. any screw-ups (such as editing hardlinked files) will be immediately published.

For my current workflow, using `dvc cache dir --global /www/some/directory` is sufficient. However, a user may want to set up several local remotes, and they will expect that the copying is optimized like the cache access. At the very least, the absence of these optimizations should be documented, so that it does not bite the unsuspecting user.

> Btw, HTTPS remotes do not support publishing - you can only `fetch`/`pull` from them.

I include an HTTPS remote in the DVC config, so that others can pull my data. I publish to a local remote, which is an Apache directory that corresponds to the HTTPS remote. The remote contains hundreds of gigabytes of data, so optimized copying is not an option, but a necessity.

> Hi @Witiko !
> Great suggestion! It should be quite easy to implement: we'll just need to try using `System.reflink()` in the `download`/`upload` methods ([1], [2]) of `RemoteLOCAL` before falling back to `copyfile`. Maybe you would be willing to contribute a PR for it? 🙂 We will be happy to help you get started.
>
> [1] https://github.com/iterative/dvc/blob/master/dvc/remote/local/__init__.py#L298
> [2] https://github.com/iterative/dvc/blob/master/dvc/remote/local/__init__.py#L257

Thank you for the response, @efiop. I will take a look at the code and get back to you!
@Witiko sure, please don't get me wrong. I'm not arguing against supporting this optimization for local remotes or documenting the absence of certain optimizations! And thank you for your feedback on this. We really appreciate it! I'm just trying to better understand your feedback and find some workaround.
> Even with a single local remote, using `dvc cache dir --local` ties the cache directly to the local remote, i.e. any screw-ups (such as editing hardlinked files) will be immediately published.

That's true! I'm not sure, though, that it can prevent screw-ups like manually editing hardlinked files (it won't be immediately published, but `dvc push` won't detect the problem as far as I remember; @efiop can correct me if I'm wrong). If hardlinks or symlinks are used, we recommend keeping your data read-only with protected mode.
Again, thank you for your feedback and valuable discussion!
> sure, please don't get me wrong. I'm not arguing against supporting this optimization for local remotes or documenting the absence of certain optimizations! And thank you for your feedback on this. We really appreciate it! I'm just trying to better understand your feedback and find some workaround.

@shcheklein I understand that. The DVC tool is a godsend for me, and I appreciate that the contributors of the project are so active! 🙂

> That's true! I'm not sure, though, that it can prevent screw-ups like manually editing hardlinked files (it won't be immediately published, but `dvc push` won't detect the problem as far as I remember; @efiop can correct me if I'm wrong). If hardlinks or symlinks are used, we recommend keeping your data read-only with protected mode.

Even if `dvc push` does not detect the problem, the users will not have to suffer the consequences until the data is pushed. It gives a useful buffer between the workspace and what is published.

> Again, thank you for your feedback and valuable discussion!

And to you.
@shcheklein `dvc push` will detect cache corruption. It goes and checks every file, so that you don't accidentally push bad cache files.
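
To illustrate what such a check could look like, here is a minimal Python sketch. It is not DVC's actual code. It assumes a cache laid out as `<cache_dir>/<first two hex digits>/<remaining digits>`, where the path is derived from the MD5 of the file content. The helper names `file_md5` and `find_corrupted` are hypothetical, and entries with a suffix (such as `.dir`) are skipped for simplicity.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(cache_dir):
    """Return paths whose content hash does not match their cache path.

    Assumption: the checksum is the two-character directory name
    concatenated with the file name inside that directory.
    """
    corrupted = []
    for prefix in sorted(os.listdir(cache_dir)):
        subdir = os.path.join(cache_dir, prefix)
        if len(prefix) != 2 or not os.path.isdir(subdir):
            continue
        for name in sorted(os.listdir(subdir)):
            if "." in name:
                # Skip suffixed entries (e.g. ".dir"); not handled here.
                continue
            path = os.path.join(subdir, name)
            if file_md5(path) != prefix + name:
                corrupted.append(path)
    return corrupted
```

Running `find_corrupted(".dvc/cache")` before a push would list the files that no longer match their checksum, under the layout assumption above.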
We might want to discuss whether we should be trying to symlink and hardlink to the local remote when the filesystem does not support reflinks. I think we should not, because even protected symlinks and hardlinks can be easily deleted and modified, which creates a dependency between the remote and the project. If we would, then we might want to make this behavior configurable. WDYT?
> We might want to discuss whether we should be trying to symlink and hardlink to the local remote when the filesystem does not support reflinks. I think we should not, because even protected symlinks and hardlinks can be easily deleted and modified, which creates a dependency between the remote and the project. If we would, then we might want to make this behavior configurable.

@Witiko I agree 100% with that. We should not use hardlink/symlink there, because they don't protect the link from corruption. We should only use `reflink` if it is available and fall back to `copy` if it is not. That could be done by default, without any config options to enable/disable that.
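
The dependency that a hardlink creates can be shown with a small, self-contained Python sketch. The function name `demo_link_isolation` and the file names are hypothetical and only stand in for a cache file and its counterpart in a local remote.

```python
import os
import shutil


def demo_link_isolation(workdir):
    """Edit a 'cache' file and report what a hardlink and a copy then contain."""
    cache_file = os.path.join(workdir, "cache_file")
    remote_hardlink = os.path.join(workdir, "remote_hardlink")
    remote_copy = os.path.join(workdir, "remote_copy")

    with open(cache_file, "w") as f:
        f.write("original")

    os.link(cache_file, remote_hardlink)      # same inode as the cache file
    shutil.copyfile(cache_file, remote_copy)  # independent data

    # Overwrite the cache file in place, as an accidental edit would.
    with open(cache_file, "w") as f:
        f.write("edited")

    with open(remote_hardlink) as f:
        hardlink_content = f.read()
    with open(remote_copy) as f:
        copy_content = f.read()
    return hardlink_content, copy_content
```

Called on an empty directory, this returns `("edited", "original")`: the hardlinked "remote" file changes together with the cache file, while the copy stays intact. A reflink behaves like the copy in this respect, because the filesystem copies shared blocks on write.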
Basically, we just need to implement our own `copyfile` that would try `reflink` inside. It could be used by default in any other part of our code base later.
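
A rough sketch of such a helper, assuming Linux and the `FICLONE` ioctl, might look like the following. This is an illustration only, not DVC's `System.reflink()` or its eventual implementation; `try_reflink` and `copyfile_with_reflink` are hypothetical names, and other platforms simply take the plain-copy path here.

```python
import os
import shutil
import sys

try:
    import fcntl  # POSIX only
except ImportError:
    fcntl = None

# Linux ioctl request number for FICLONE (assumption: Linux only).
FICLONE = 0x40049409


def try_reflink(src, dst):
    """Try to create dst as a reflink of src; return True on success."""
    if fcntl is None or not sys.platform.startswith("linux"):
        return False
    with open(src, "rb") as fsrc:
        try:
            with open(dst, "wb") as fdst:
                fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
        except OSError:
            # Filesystem does not support reflinks (or another error):
            # remove the empty destination so the caller can fall back.
            if os.path.exists(dst):
                os.remove(dst)
            return False
    return True


def copyfile_with_reflink(src, dst):
    """Reflink if possible, otherwise copy; return the method used."""
    if try_reflink(src, dst):
        return "reflink"
    shutil.copyfile(src, dst)
    return "copy"
```

On the command line, GNU `cp --reflink=auto` has the same try-then-fall-back behavior, in contrast to the `--reflink=always` used in the original report.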
@efiop Great! I use Btrfs whenever possible, so reflinking works for me and is extremely convenient. Thanks @Witiko!