Dvc: Separate workspace and cache directory to different hard drives

Created on 4 Apr 2018 · 5 Comments · Source: iterative/dvc

In some scenarios, the dvc-cache directory is large (and has to stay large) and should be moved to a separate hard drive (HDD), while the workspace should stay on the main drive (SSD) for performance reasons.

The current hardlink-based implementation cannot support this, because hardlinks cannot cross filesystem boundaries.
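A quick way to check whether a given workspace and cache pair is affected is to compare the device IDs of the two directories. This is only a sketch: the function names are made up for illustration and are not part of DVC.

```python
import os


def on_same_device(path_a, path_b):
    """Return True if both existing paths live on the same filesystem device."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def can_hardlink(cache_dir, workspace_dir):
    # os.link() fails with an OSError (EXDEV on POSIX) when the source and
    # destination are on different filesystems, so a matching device ID is a
    # necessary condition for hardlinking between the two directories.
    return on_same_device(cache_dir, workspace_dir)
```

If `can_hardlink(".dvc/cache", ".")` returns `False`, the cache and workspace are on different devices and hardlinks are not an option.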

Open questions:

  1. Can DVC support git-lfs-like "copy" semantics instead of hardlinks for this case specifically?
  2. The copy semantics will affect checkout performance. Is there a better option?
  3. How does this align with reflinks?

More details: https://www.reddit.com/r/MachineLearning/comments/89a2jn/p_data_version_control_machine_learning_time/dwqek0n/

question

All 5 comments

Eventually we will support all of these:

  1. reflink
  2. hardlink
  3. symlink
  4. copy

We will auto-detect the best option but will also let users choose one themselves in the config. In the suggested scenario we will have to resort to symlinks, as they work across drives and still perform well.
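The auto-detection described above could look roughly like the fallback chain below. It is a minimal sketch, not DVC's actual code: `link_from_cache`, `LINKERS` and the default order are hypothetical, and reflink is left out because it would need a platform-specific call.

```python
import os
import shutil


def _symlink(src, dst):
    # Use an absolute target so the link stays valid regardless of where dst lives.
    os.symlink(os.path.abspath(src), dst)


def _copy(src, dst):
    shutil.copyfile(src, dst)


# Hypothetical registry of link strategies, keyed by the names used in this thread.
LINKERS = {
    "hardlink": os.link,
    "symlink": _symlink,
    "copy": _copy,
}


def link_from_cache(src, dst, order=("hardlink", "symlink", "copy")):
    """Try each link type in order and return the name of the first that works."""
    last_error = None
    for name in order:
        try:
            LINKERS[name](src, dst)
            return name
        except OSError as exc:
            # For example, a cross-device hardlink: fall through to the next type.
            last_error = exc
    raise OSError(f"could not link {src!r} to {dst!r}") from last_error
```

A user-chosen type from the config would simply be passed as a one-element `order`, for example `order=("symlink",)`.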

I like the idea of the data file layer generalization when we can support all of these.

Also, we should think about how to expose the cache directory at the config level. A CacheDir=.dvc/cache parameter and a DataFileType={reflink, hardlink, symlink, copy} parameter should be enough.
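To make the proposal concrete, here is a sketch of how two such parameters could be read from an INI-style config. Everything in it is an assumption for illustration: the `[Cache]` section name, the `CacheDir` spelling of the first key, the `hardlink` default, and the `read_cache_settings` helper. The naming scheme was still undecided at this point in the thread.

```python
import configparser

LINK_TYPES = ("reflink", "hardlink", "symlink", "copy")


def read_cache_settings(text):
    """Parse hypothetical CacheDir / DataFileType settings from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["Cache"] if parser.has_section("Cache") else {}

    # Defaults mirror the behaviour described in this issue:
    # cache inside the workspace, linked via hardlinks.
    cache_dir = section.get("CacheDir", ".dvc/cache")
    link_type = section.get("DataFileType", "hardlink")

    if link_type not in LINK_TYPES:
        raise ValueError(f"unknown DataFileType: {link_type!r}")
    return cache_dir, link_type


# Example: cache on a separate drive (the path is made up), linked via symlinks.
example = """
[Cache]
CacheDir = /mnt/hdd/dvc-cache
DataFileType = symlink
"""
cache_dir, link_type = read_cache_settings(example)
```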

Yeah, we will figure out the exact naming scheme as we go. I have a patch for our config coming soon to support the 'remote' concept, and those changes will be useful here as well.

Fixed with #681.
