I am tracking versions of a set of datasets in this repository.
The directory structure is as shown below.
.
|- datasets // Root directory for all datasets
| |
| |- dataset1 // Directories are named after the datasets
| | |
| | |- TRAIN.tsv
| | |- ...
| |
| |- ...
Each new version of the datasets adds subdirectories to the datasets directory.
When checking out an older version (say v1) using git checkout v1 followed by dvc checkout, DVC leaves empty subdirectories behind instead of removing them.
Specifically, if dataset1 was present in v1 but dataset2 was added by v2, checking out v1 leaves behind an empty dataset2 directory.
.
|- datasets // Root directory for all datasets
| |
| |- dataset1 // Directories are named after the datasets
| | |
| | |- TRAIN.tsv // Train/Test sets are actually from the correct version
| | |- ...
| |
| |- dataset2 // Empty directory
| |
| |- ...
Output of dvc version:
$ dvc version -v
DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-4.9.0-0.bpo.6-amd64-x86_64-with-glibc2.10
Supports: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache types: hardlink, symlink
Repo: dvc, git
2020-08-06 00:01:05,317 DEBUG: Analytics is enabled.
2020-08-06 00:01:05,410 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmppp9x3ynw']'
2020-08-06 00:01:05,411 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmppp9x3ynw']'
I was able to reproduce it with this:
mkdir datasets
mkdir datasets/dir1
echo "file1" > datasets/dir1/file1
dvc add datasets
git add .
git commit -a -m "add datasets v1"
mkdir datasets/dir2
echo "file2" > datasets/dir2/file2
dvc add datasets
git add .
git commit -a -m "add datasets v2"
git checkout HEAD^
dvc checkout
tree datasets
outputs:
datasets
├── dir1
│   └── file1
└── dir2
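Until this is fixed, one possible workaround is to prune the leftover empty directories by hand after dvc checkout. The sketch below is not a DVC feature; it only simulates the leftover state from the reproduction above in a scratch directory (the workdir variable is made up for illustration) and relies on the -mindepth, -empty and -delete options of GNU/BSD find, which are not part of POSIX. Note that it would also remove any directory that is empty on purpose.

```shell
# Hypothetical workaround, not part of DVC: prune directories that are left
# empty after `git checkout <rev>` followed by `dvc checkout`.

# Simulate the leftover state from the reproduction in a scratch directory.
workdir="$(mktemp -d)"
mkdir -p "$workdir/datasets/dir1" "$workdir/datasets/dir2"
echo "file1" > "$workdir/datasets/dir1/file1"

# First list the empty directories to review what would be removed.
find "$workdir/datasets" -mindepth 1 -type d -empty

# Then remove them. -delete processes contents before their parents,
# so nested chains of empty directories are removed as well.
find "$workdir/datasets" -mindepth 1 -type d -empty -delete

# Only dir1 should remain.
ls "$workdir/datasets"
```

In a real repository the same two find commands would be run against datasets directly, after dvc checkout has finished.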