Dvc: Separate workspace and cache directory to different hard drives

Created on 4 Apr 2018 · 5 Comments · Source: iterative/dvc

In some scenarios, the dvc-cache directory is large (and has to stay large) and should be moved to a separate hard drive (HDD), while the workspace should stay on the main drive (SSD) for performance reasons.

The current hardlink-based implementation cannot support this, because a hardlink cannot point to a file on a different filesystem.
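
Whether a hardlink between the cache and the workspace is possible at all can be checked up front by comparing the device IDs that `os.stat` reports for the two directories. A minimal sketch, assuming both paths already exist; the helper name `same_filesystem` is hypothetical and not part of DVC:

```python
import os
import tempfile


def same_filesystem(path_a, path_b):
    """Return True if both existing paths report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Demo: a directory and a fresh subdirectory normally share one device.
with tempfile.TemporaryDirectory() as workspace:
    cache = os.path.join(workspace, "cache")
    os.mkdir(cache)
    hardlinks_possible = same_filesystem(workspace, cache)
```

If the check returns False, `os.link` between the two locations would fail, so another link type has to be used.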

Open questions:

Can DVC support git-lfs-like "copy" semantics instead of hardlinks for this specific case?
The copy semantics will affect checkout performance. Is there a better option?
How does this align with reflinks?

More details: https://www.reddit.com/r/MachineLearning/comments/89a2jn/p_data_version_control_machine_learning_time/dwqek0n/

question

dmpetrov

All 5 comments

Eventually we will support all of these:

reflink
hardlink
symlink
copy

We will auto-detect the best option but will also allow users to choose one themselves in the config. In the suggested scenario, we will have to resort to symlinks, as they work across drives and perform well.
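
The "try the best option first, then fall back" idea could be sketched roughly as follows. This is only an illustration, not DVC's implementation: the names `LINKERS` and `link_from_cache` are hypothetical, and reflink is left out because the Python standard library has no portable call for it.

```python
import os
import shutil
import tempfile

# Hypothetical strategy table mapping a link type to a callable(src, dst).
LINKERS = {
    "hardlink": os.link,
    # Use an absolute target so the symlink resolves from any directory.
    "symlink": lambda src, dst: os.symlink(os.path.abspath(src), dst),
    "copy": shutil.copyfile,
}


def link_from_cache(cache_path, work_path, order=("hardlink", "symlink", "copy")):
    """Try each strategy in turn and return the name of the one that worked."""
    for name in order:
        try:
            LINKERS[name](cache_path, work_path)
            return name
        except OSError:
            # For example a cross-drive hardlink, or symlinks not being permitted.
            continue
    raise OSError(f"no link strategy worked for {work_path}")


# Demo: place a cached file into a workspace, preferring a symlink.
with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "cached.bin")
    with open(cached, "wb") as fobj:
        fobj.write(b"data")
    used = link_from_cache(cached, os.path.join(tmp, "data.bin"), order=("symlink", "copy"))
```

Passing a different `order` is how a user-chosen setting from the config could override the default preference.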

efiop on 4 Apr 2018

I like the idea of generalizing the data file layer so that we can support all of these.

Also, we should think about how to expose the cache directory at the config level. A CacheDir=.dvc/cache parameter and a DataFileType={reflink, hardlink, symlink, copy} parameter should be enough.
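
To make the proposal concrete, here is a rough sketch of how two such settings might be read and validated. The option names CacheDir and DataFileType come from the comment above; the `[Cache]` section name, the `/mnt/hdd/dvc-cache` path, the `hardlink` default and the function `load_cache_settings` are assumptions made for illustration, not DVC's actual config format.

```python
import configparser

LINK_TYPES = ("reflink", "hardlink", "symlink", "copy")

# Hypothetical config text: cache on a separate drive, linked via symlinks.
EXAMPLE = """
[Cache]
CacheDir = /mnt/hdd/dvc-cache
DataFileType = symlink
"""


def load_cache_settings(text):
    """Return (cache_dir, link_type), falling back to assumed defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    cache_dir = parser.get("Cache", "CacheDir", fallback=".dvc/cache")
    link_type = parser.get("Cache", "DataFileType", fallback="hardlink")
    if link_type not in LINK_TYPES:
        raise ValueError(f"unknown DataFileType: {link_type!r}")
    return cache_dir, link_type


cache_dir, link_type = load_cache_settings(EXAMPLE)
```

Rejecting unknown values early keeps a typo in the config from silently selecting the wrong link type.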

dmpetrov on 4 Apr 2018

Yeah, we will figure out the exact naming scheme as we go. I have a patch for our config coming soon to support the 'remote' concept, and those changes will be useful here as well.

efiop on 4 Apr 2018

https://github.com/dataversioncontrol/dvc/issues/676

efiop on 24 Apr 2018

Fixed with #681.

efiop on 28 Apr 2018
