In some scenarios, the dvc-cache directory is large (and has to stay large) and should be moved to a separate hard drive (HDD), while the workspace should stay on the main drive (SSD) for performance reasons.
The current, hardlink-based implementation cannot support this, because hardlinks do not work across filesystem boundaries.
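A minimal sketch of the failure mode (paths are hypothetical placeholders): `os.link` raises `EXDEV` when source and destination live on different filesystems, so a cache on a separate HDD cannot be hardlinked into a workspace on the SSD.

```python
import errno
import os

def try_hardlink(src, dst):
    """Attempt to hardlink a cache file into the workspace."""
    try:
        os.link(src, dst)
        return True
    except OSError as exc:
        if exc.errno == errno.EXDEV:
            # EXDEV: src and dst are on different filesystems,
            # e.g. a cache on /mnt/hdd and a workspace on the SSD root.
            return False
        raise

# Hypothetical paths; only works on a machine with two drives mounted:
# try_hardlink("/mnt/hdd/dvc-cache/1a2b3c", "data/file.csv")
```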
Open questions:
Checkout performance: is there a better option? More details: https://www.reddit.com/r/MachineLearning/comments/89a2jn/p_data_version_control_machine_learning_time/dwqek0n/
Eventually we will support all of these: reflink, hardlink, symlink, and copy.
We will auto-detect the best option but will also allow users to choose one themselves in the config. For the scenario above, we would have to fall back to symlinks, since they work across drives with negligible performance overhead. A sketch of such a fallback is below.
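A rough sketch of what the auto-detection could look like (names and structure are hypothetical, not DVC's actual API, and the reflink path assumes Linux with btrfs/XFS): try the cheapest method first and step down whenever the filesystem refuses.

```python
import fcntl
import os
import shutil

FICLONE = 0x40049409  # Linux ioctl to clone (reflink) a file on btrfs/XFS

def reflink(src, dst):
    """Reflink src to dst; Linux-only, raises OSError where unsupported."""
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
        except OSError:
            os.unlink(dst)  # drop the empty file left by the failed clone
            raise

def checkout(src, dst, preferred=None):
    """Place a cache file at dst, trying the cheapest method first.

    `preferred` mirrors the proposed DataFileType option: if the user
    picked a type explicitly, use only that one; otherwise auto-detect
    by falling through reflink -> hardlink -> symlink -> copy.
    """
    methods = {
        "reflink": reflink,
        "hardlink": os.link,
        "symlink": os.symlink,
        "copy": shutil.copyfile,
    }
    order = [preferred] if preferred else ["reflink", "hardlink", "symlink", "copy"]
    for name in order:
        try:
            methods[name](src, dst)
            return name  # report which method actually worked
        except OSError:
            if preferred:
                raise  # an explicit choice should fail loudly, not fall back
    raise RuntimeError("all checkout methods failed for %s" % src)
```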
I like the idea of generalizing the data file layer so that we can support all of these.
Also, we should think about how to expose the cache directory at the config level. CacheDir=.dvc/cache and a DataFileType={reflink, hardlink, symlink, copy} parameter should be enough.
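For illustration, a straw-man config fragment using the option names suggested above (the section header and exact spelling are placeholders, not a final scheme):

```ini
[cache]                            ; hypothetical section name
CacheDir = /mnt/hdd/dvc-cache      ; cache moved to the big HDD
DataFileType = symlink             ; one of: reflink, hardlink, symlink, copy
```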
Yeah, we will figure out the exact naming scheme as we go. I have a patch for our config coming soon to support the 'remote' concept, and those changes will be useful here as well.
Fixed with #681.