UPDATE: Skip to https://github.com/iterative/dvc/issues/3538#issuecomment-669603698
$ dvc version
DVC version: 0.90.2
Python version: 3.7.5
Platform: Windows-10-10.0.18362-SP0
Binary: True
Package: exe
Supported remotes: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Filesystem type (workspace): ('NTFS', 'C:\\')
Having data
in .dvcignore makes DVC commands ignore all file names starting with that string e.g data2
, dataset
. https://dvc.org/doc/user-guide/dvcignore links to gitignore patterms but Git doesn't do that.
UPDATE: Skip to https://github.com/iterative/dvc/issues/3538#issuecomment-621947830
@jorgeorpinel So having data
, right? Git does the same thing, hence why we use /data
in .gitignore
when we dvc add data
.
Maybe it behaves differently in distinct OSs? I'm on Windows (Cmder/ Git Bash):
位 git init
位 touch data data2 data3 data4
位 echo data > .gitignore
位 ls
data data2 data3 data4
位 git status
...
Untracked files:
(use "git add <file>..." to include in what will be committed)
.gitignore
data2
data3
data4
位 git add data
The following paths are ignored by one of your .gitignore files:
data
Use -f if you really want to add them.
位 git add .
位 git status
...
new file: .gitignore
new file: data2
new file: data3
new file: data4
@jorgeorpinel Oh, I misunderstood your report. Need to check that, indeed. @pared Maybe you have any ideas?
Will check this, sorry for late response
@jorgeorpinel that was happening on windows right?
following test is passing on my linux machine for 0.90.2
:
def test_ignore_data(tmp_dir, dvc):
tmp_dir.gen({"data": "data", "data1": "data1", "data2": "data2",
"data3":"data3", ".dvcignore": "data"})
files = []
for r, d, fs in dvc.tree.walk("."):
files.extend([os.path.join(r, f) for f in fs])
assert set(files) == set(["./data1", "./data2", "./data3", "./.dvcignore"])
Sorry guys. I don't remember which DVC commands I noticed originally that don't respect files in .dvcignore. It wasn't a great report... But I'm QAing this and I think there's definitely some funny stuff happening:
$ echo data > data
$ echo data > .dvcignore
$ dvc add data
100% Add|鈻堚枅... # Works and puts it in .gitignore. Should ignore? E.g.:
$ git add data
The following paths are ignored by one of your .gitignore files:
data
Use -f if you really want to add them.
I'm on 0.93.0 from exe installer now. Yes, Windows (Git Bash on Cmder)
Also, the first example in https://dvc.org/doc/user-guide/dvcignore#examples doesn't work for me (all files inside data/ dir are added and cached). I didn't check the others.
@jorgeorpinel hmm, that does not seem right.
I think we should behave in a similar way as git does, which is informing the user that his output collides with an entry of some .dvcignore
file.
This functionality can be implemented probably after we introduce something like git check-ignore
for .dvcignore
. I think there was an issue for that already, but I am unable to find it.
Hello! Update here, I'm on DVC 1.3 and this seems solved, although the message I get,
$ dvc add data
Adding...
ERROR: output 'data' does not exist
could be improved, similar to what git add
uses for files in .gitignore patterns:
The following paths are ignored by one of your .gitignore files:
data
Use -f if you really want to add them.
And, should we have a -f
/--force
option for add as well?
Thanks
@jorgeorpinel Thanks for the update! Indeed, that message could be improved.
And, should we have a -f/--force option for add as well?
I would rather not do that. We've seen it causing very odd bugs in git, wouldn't want that for dvc unless there are some very good scenarios. Would wait for someone to ask for it.
It seems the conflation of concepts "exists" and "ignored" leads to confusing error messages. In my case one of "run" outputs was by mistake ignored in .dvcignore, but the respective message was confusingly saying the file "does not exist". Possibly related source: dvc/tree/local.py#L79.
@autayeu hi!
Could you share some info on your setup?
.dvcignore
contained your run
output before you actually dvc run
it, right?
Sure. Attached dvc.message.repro.zip is a minimum setup to reproduce the issue. To reproduce:
Unpack, change into dvc.message.repro directory and run:
dvc run -n preprocess -d preprocess.py -o data.txt -o data.log "python -u preprocess.py | tee -i data.log"
You'll see:
Running stage 'preprocess' with command:
python -u preprocess.py | tee -i data.log
screen content
ERROR: output 'data.log' does not exist
However:
(dvc) zzz:dvc.message.repro zzz$ ls -l
total 24
-rw-r--r-- 1 zzz zzz 15 Aug 12 15:45 data.log
-rw-r--r-- 1 zzz zzz 12 Aug 12 15:45 data.txt
-rw-r--r-- 1 zzz zzz 84 Aug 12 15:39 preprocess.py
So data.log does exist. A bit unexpectedly, dvc also created empty .gitignore file:
(dvc) zzz:dvc.message.repro zzz$ ls -l .gitignore
-rw-r--r-- 1 zzz zzz 0 Aug 12 16:05 .gitignore
Other relevant outputs:
(dvc) zzz:dvc.message.repro zzz$ cat .dvcignore
*.log
(qac) zzz:dvc.message.repro zzz$ dvc version
DVC version: 1.4.0+12c701
---------------------------------
Platform: Python 3.8.5 on macOS-10.15.6-x86_64-i386-64bit
Supports: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache types: reflink, hardlink, symlink
Repo: dvc, git
@autayeu Thank you! It seems to me that we should rethink the message.
@autayeu, @jorgeorpinel Fix for this issue is already on the master, will be available on the next release.
馃憤 Thank you!
Thanks guys.
Most helpful comment
@jorgeorpinel Thanks for the update! Indeed, that message could be improved.
I would rather not do that. We've seen it causing very odd bugs in git, wouldn't want that for dvc unless there are some very good scenarios. Would wait for someone to ask for it.