DVC: Empty subdirectories left in place when checking out a previous version of the datasets

Created on 6 Aug 2020  ·  1 Comment  ·  Source: iterative/dvc

Bug Report

I am tracking versions of a set of datasets in this repository.
The directory structure is as shown below.

.
|- datasets             // Root directory for all datasets
|  |
|  |- dataset1          // Directories are named after the datasets
|  |  |
|  |  |- TRAIN.tsv
|  |  |- ...
|  |
|  |- ...

Each new version of the datasets adds subdirectories to the datasets directory.
When I check out an older version (say v1) with git checkout v1 followed by dvc checkout, DVC removes the files that do not belong to that version but leaves their now-empty subdirectories behind instead of removing them.

Specifically, if dataset1 was present in v1 but dataset2 was added by v2, checking out v1 leaves behind an empty dataset2 directory.

.
|- datasets             // Root directory for all datasets
|  |
|  |- dataset1          // Directories are named after the datasets
|  |  |
|  |  |- TRAIN.tsv      // Train/Test sets are actually from the correct version
|  |  |- ...
|  |
|  |- dataset2          // Empty directory
|  |
|  |- ...

Output of dvc version:

$ dvc version -v

DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-4.9.0-0.bpo.6-amd64-x86_64-with-glibc2.10
Supports: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache types: hardlink, symlink
Repo: dvc, git
2020-08-06 00:01:05,317 DEBUG: Analytics is enabled.
2020-08-06 00:01:05,410 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmppp9x3ynw']'
2020-08-06 00:01:05,411 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmppp9x3ynw']'

Labels: bug, p2-medium

Most helpful comment

I was able to reproduce it with this:

mkdir datasets
mkdir datasets/dir1
echo "file1" > datasets/dir1/file1
dvc add datasets
git add .
git commit -a -m "add datasets v1"
mkdir datasets/dir2
echo "file2" > datasets/dir2/file2
dvc add datasets
git add .
git commit -a -m "add datasets v2"
git checkout HEAD^
dvc checkout
tree datasets

outputs:

datasets
├── dir1
│   └── file1
└── dir2
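
Until the behaviour is fixed, the leftover directories can be cleaned up by hand after each checkout. The following is a minimal workaround sketch, not a DVC feature: prune_empty_dirs is a hypothetical helper name, and it assumes a find implementation that supports the -mindepth, -empty and -delete extensions (GNU and BSD find both do). It removes every empty directory under the given root, so it should only be used where no empty directory is meant to be kept.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): remove empty directories left under
# a DVC-tracked directory after `git checkout <rev>` followed by `dvc checkout`.
prune_empty_dirs() {
    root="$1"
    # -mindepth 1  : never delete the root directory itself
    # -type d      : consider directories only, so tracked files are untouched
    # -empty       : match directories that have no entries
    # -delete      : remove matches; it implies depth-first traversal, so a
    #                directory that held only empty directories is removed too
    find "$root" -mindepth 1 -type d -empty -delete
}
```

In the reproduction above, running prune_empty_dirs datasets after dvc checkout should remove the empty dir2 and leave dir1/file1 in place.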
