Please provide information about your setup
Ubuntu 18.04, dvc==0.59.2, pip install with miniconda Python 3.7
For deeply nested dependencies, it looks like dvc does not record their paths correctly in the .dvc files.
The following script reproduces the issue:
#!/bin/bash
set -x
set -e
rm -rf dvc_test
mkdir dvc_test && cd dvc_test
mkdir scripts
mkdir -p data/recommended/dataset1/dataset1_proc
echo bar > data/recommended/dataset1/v1.txt
git init
dvc init
echo -e "import sys\
\nwith open(sys.argv[1], 'w') as f: f.write(sys.argv[2])" > scripts/script.py
# Run works
dvc run -y \
-w $PWD/data/recommended/dataset1/dataset1_proc \
-f ./data/recommended/dataset1/dataset1_proc/v1.dvc \
-d ../../../../scripts/script.py \
-o v1 \
"mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data"
# Inspecting v1.dvc shows that the script.py dependency is missing one ../
cat data/recommended/dataset1/dataset1_proc/v1.dvc
# Because of that, repro does not work
dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc
Inspecting the v1.dvc file shows:
md5: 4ae72e168a0a6a2f1aaadfb5628640f7
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data
deps:
- md5: 791b9c74b1d9308a3226b93a36689dad
  path: ../../../scripts/script.py
outs:
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir
  path: v1
  cache: true
  metric: false
  persist: false
+ dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc
Which indicates that the script.py dependency is indeed missing one ../
Hi @tdeboissiere ! Are you sure this is the same command? I see you've specified -w in dvc run, but there is no wdir: field in the dvc file.
PS: sorry, jumped the gun, trying to reproduce right now.
@efiop : Sorry, this was run with version 0.57 (which still logged wdir in the .dvc file even if it's the default, ".").
If you run it with 0.59, wdir is indeed absent.
@tdeboissiere Sorry, I just missed the -f option in your command and was surprised about wdir missing. But it makes sense now.
Ok, so running your reproduction script I get:
+ cat data/recommended/dataset1/dataset1_proc/v1.dvc
md5: 1f92bcfcf1ef2766da1ea3c373289dca
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data
deps:
- md5: 791b9c74b1d9308a3226b93a36689dad
  path: ../../../../scripts/script.py
outs:
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir
  path: v1
  cache: true
  metric: false
  persist: false
+ dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc
Stage 'data/recommended/dataset1/dataset1_proc/v1.dvc' didn't change.
Data and pipelines are up to date.
which has the correct path in the dvcfile, and dvc repro reports that there is nothing to reproduce, so it sees all the dependencies. So I'm not able to reproduce this issue.
@tdeboissiere Are you still able to reproduce it with your script? Am I missing something?
@efiop Yes, I still get this (version 0.59.2):
ERROR: failed to reproduce 'data/recommended/dataset1/dataset1_proc/v1.dvc': missing dependency: data/scripts/script.py
This is what I get in the dvc_test/data/recommended/dataset1/dataset1_proc/v1.dvc file:
md5: 4ae72e168a0a6a2f1aaadfb5628640f7
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data
deps:
- md5: 791b9c74b1d9308a3226b93a36689dad
  path: ../../../scripts/script.py
outs:
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir
  path: v1
  cache: true
  metric: false
  persist: false
I am still missing one ../ in the path
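To see why the shortened path makes dvc repro fail, here is a minimal sketch of how the recorded dependency path resolves. It assumes, as the error message above suggests, that the path in the .dvc file is interpreted relative to the stage's working directory (here, the directory containing v1.dvc). The helper name resolve_dep is made up for illustration and is not part of DVC.

```python
import posixpath

# Directory containing v1.dvc, relative to the repo root (from the script above).
STAGE_DIR = "data/recommended/dataset1/dataset1_proc"


def resolve_dep(stage_dir, dep_path):
    """Resolve a dependency path from a .dvc file against the stage directory.

    Hypothetical helper for illustration only, not DVC's API.
    """
    return posixpath.normpath(posixpath.join(stage_dir, dep_path))


# Path as passed to `dvc run -d` (four levels up): the real script location.
print(resolve_dep(STAGE_DIR, "../../../../scripts/script.py"))
# Path as written to v1.dvc (three levels up): a location that does not exist.
print(resolve_dep(STAGE_DIR, "../../../scripts/script.py"))
```

With three levels the path lands on data/scripts/script.py, which is the location named in the "missing dependency" error.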
@tdeboissiere That is really weird. Could you please try this one once more as is:
(3.7.0-dvc) ➜ dvc git:(master) ✗ cat ./test_2483.sh
#!/bin/bash
set -x
set -e
pip uninstall -y dvc; pip install dvc==0.59.2
rm -rf dvc_test
mkdir dvc_test
cd dvc_test
mkdir scripts
mkdir -p data/recommended/dataset1/dataset1_proc
echo bar > data/recommended/dataset1/v1.txt
git init
dvc init
echo -e "import sys\
\nwith open(sys.argv[1], 'w') as f: f.write(sys.argv[2])" > scripts/script.py
# Run works
dvc run -y \
-w $PWD/data/recommended/dataset1/dataset1_proc \
-f ./data/recommended/dataset1/dataset1_proc/v1.dvc \
-d ../../../../scripts/script.py \
-o v1 \
"mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data"
# Inspecting v1.dvc shows that the script.py dependency is missing one ../
cat data/recommended/dataset1/dataset1_proc/v1.dvc
# Because of that, repro does not work
dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc
@efiop : Same results, I am still missing a ../ (I also tried it on multiple Ubuntu machines).
Edit: I ran it on a Mac, and this time it worked correctly...
@tdeboissiere That is extremely odd. I've noticed $PWD in your command, which evaluates to nothing on my ubuntu. Maybe you have it defined on yours? How about:
#!/bin/bash
set -x
set -e
pip uninstall -y dvc; pip install dvc==0.59.2
rm -rf dvc_test
mkdir dvc_test
cd dvc_test
mkdir scripts
mkdir -p data/recommended/dataset1/dataset1_proc
echo bar > data/recommended/dataset1/v1.txt
git init
dvc init
echo -e "import sys\nwith open(sys.argv[1], 'w') as f: f.write(sys.argv[2])" > scripts/script.py
# Run works
dvc run \
-w ./data/recommended/dataset1/dataset1_proc \
-f ./data/recommended/dataset1/dataset1_proc/v1.dvc \
-d ../../../../scripts/script.py \
-o v1 \
"mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data"
# Inspecting v1.dvc shows that the script.py dependency is missing one ../
cat data/recommended/dataset1/dataset1_proc/v1.dvc
# Because of that, repro does not work
dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc
NOTE: removed $PWD and -y.
@efiop Nope, still got the same error, on multiple machines
Can you point me to where in the codebase the .dvc file with the deps is filled and I'll have a look ?
@tdeboissiere Asked two more guys to try out, and they are getting the same results as me (one on linux and another one on mac). We are probably missing something here. Are you sure you are running that precise script? Which directory are you running it from? Does it have symlinks as parent dirs?
> Can you point me to where in the codebase the .dvc file with the deps is filled and I'll have a look ?
Sure, this is the entry point https://github.com/iterative/dvc/blob/0.59.2/dvc/stage.py#L471 . Then the wdir and path for the dep gets resolved here https://github.com/iterative/dvc/blob/0.59.2/dvc/output/local.py#L20 and dumped here https://github.com/iterative/dvc/blob/0.59.2/dvc/output/local.py#L53 .
@tdeboissiere Another question: are you using the default shell that comes with Ubuntu? Or something else? bash? zsh? Or maybe something as exotic as fish?
zsh could be an issue.
However, I changed to bash doing:
/bin/bash
# bash shell is activated
bash script_name.sh
but got the same results.
I am running that precise script, from the same directory as that same script.
No symlink as parent dirs
@tdeboissiere Thanks for the info! zsh is ok. I'm using it too.
> I am running that precise script, from the same directory as that same script.
What do you mean "from the same directory"? The script is location agnostic. Are you running it from your original project dir?
@tdeboissiere I've tried to run it from zsh, still the same result for me - Data and pipelines are up to date. But on the other hand, we have an ongoing issue on Discord with zsh that I can't reproduce and it's clear that it's related to env management - https://discordapp.com/channels/485586884165107732/485596304961962003/620980290978054158
@efiop : I meant that in my working dir, I create a script bug.sh containing the above shell script. Then, from the same working dir, I call bash bug.sh
@tdeboissiere maybe run env and dvc run -o test "env > test" and compare if it preserves it?
@shcheklein : can you elaborate on what you mean by "if it preserves it"?
@tdeboissiere my understanding is that when you run a command with dvc run/dvc repro, it should keep your environment variables unchanged (an obvious example is $PATH, which is used to find python and other binaries). That's not what I see in some cases (that link on Discord). I don't have an explanation yet. Is it DVC, some zsh settings, some specific machine settings? We don't know yet. But my thinking was: can that be the case here as well, i.e. some changes to the environment when you run commands with DVC?
@tdeboissiere Btw, if you can provide us with temporary access to a box where you are able to reproduce your issue, we will be happy to take a closer look ourselves.
@tdeboissiere do you have .zshrc or .zsh_*** files by chance that change any stuff?
I ran the following script on an Ubuntu 18.04 laptop in /home/user/debug with bash debug.sh:
#!/bin/bash
set -x
set -e
rm -rf dvc_test
mkdir dvc_test
cd dvc_test
env > env_before.txt
git init
dvc init
dvc run -o env_after.txt "env > env_after.txt"
The only line which is different between env_before.txt and env_after.txt is
OLDPWD=/home/user/debug # before
OLDPWD=/home/user/debug/dvc_test # after
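A comparison like this can be automated. The sketch below parses two env dumps and reports the variables whose values differ. It assumes one KEY=VALUE pair per line (multi-line values are not handled), and the sample PATH value is an invented placeholder; only the OLDPWD values come from the report above.

```python
def parse_env(text):
    """Parse `env` output into a dict (assumes one KEY=VALUE pair per line)."""
    pairs = (line.split("=", 1) for line in text.splitlines() if "=" in line)
    return {key: value for key, value in pairs}


def diff_env(before, after):
    """Return {name: (before_value, after_value)} for variables that differ."""
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }


# Sample data modeled on the report above; PATH is an invented placeholder.
before = parse_env("PATH=/usr/bin\nOLDPWD=/home/user/debug\n")
after = parse_env("PATH=/usr/bin\nOLDPWD=/home/user/debug/dvc_test\n")
print(diff_env(before, after))
```

To use it on the real files, pass the contents of env_before.txt and env_after.txt to parse_env instead of the sample strings.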
@tdeboissiere OLDPWD is also set to that second value for me during run and repro, that is normal behavior for shells. I'll try to reproduce with docker today. Btw, are you launching that docker image from the same machine you have issues on?
@efiop Yes
@tdeboissiere Ok, I am able to reproduce this on docker with:
docker pull python
docker run --rm -v $(pwd):/test -w /test python ./test_2483.sh
Investigating. Thank you for your patience :slightly_smiling_face:
EDIT: An interesting detail is that the dvcfile has even fewer ../ now:
+ cat data/recommended/dataset1/dataset1_proc/v1.dvc
md5: 3e62014bd6e65b4e9de50e642c13bd19
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data
deps:
- md5: 791b9c74b1d9308a3226b93a36689dad
  path: ../../scripts/script.py
outs:
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir
  path: v1
  cache: true
  metric: false
  persist: false
+ dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc
WARNING: Dependency 'data/recommended/scripts/script.py' of 'data/recommended/dataset1/dataset1_proc/v1.dvc' changed because it is 'deleted'.
WARNING: Stage 'data/recommended/dataset1/dataset1_proc/v1.dvc' changed.
Running command:
mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data
ERROR: failed to reproduce 'data/recommended/dataset1/dataset1_proc/v1.dvc': missing dependency: data/recommended/scripts/script.py
Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
EDIT2: with --no-scm everything stays the same, so it is unlikely Gitpython's fault.
EDIT3: confirmed pretty old regression, investigating closer...
@tdeboissiere FYI: Found a regression in our code, working on a fix right now.
@tdeboissiere Ok, the patch is taking a bit longer, because the bug is quite deep and the proper solution breaks other parts of the code temporarily. Basically, the issue is os.path.relpath, which we are using in PathInfo.__str__, which in turn gets used in PathInfo.as_posix() when we are dumping the dvc file after dvc run. So depending on where you are located, it might resolve a relative path differently. E.g. if you are in /home/user and run os.path.relpath("../path") you'll get ../path, but if you are in / then you'll get path. That is where your ../ went missing. The difference between my machine and yours is that you were running from /home/user/subdir and I was running from /home/user/git/dvc/subdir. So a workaround would be to simply move your root directory a few levels deeper.
ETA for a fixed release is tomorrow.
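The relpath behavior described above can be sketched without changing the real working directory. This is only a simulation under stated assumptions: relpath_from is a hypothetical helper that mimics os.path.relpath(path) on POSIX for a given current directory, the dependency string is assumed to be resolved while the process sits in the project root, and the directory names are illustrative. It is not DVC's actual code.

```python
import posixpath


def relpath_from(cwd, path):
    """Mimic os.path.relpath(path) on POSIX as if the process were in `cwd`.

    Hypothetical helper for illustration; it never touches the real
    working directory.
    """
    # os.path.relpath first makes the path absolute against the cwd.
    # normpath silently drops any ".." that would climb above "/".
    absolute = posixpath.normpath(posixpath.join(cwd, path))
    return posixpath.relpath(absolute, start=cwd)


dep = "../../../../scripts/script.py"

# The simple case from the explanation above.
print(relpath_from("/home/user", "../path"))
print(relpath_from("/", "../path"))

# Project roots at different depths (directory names are illustrative).
for root in ("/test/dvc_test", "/home/user/dvc_test", "/home/user/git/dvc/dvc_test"):
    print(root, "->", relpath_from(root, dep))
```

In this sketch the number of ../ that survive is capped by the depth of the current directory: a root two levels deep keeps two, three levels deep keeps three, and only a root at least four levels deep keeps all four. That is consistent with the docker run under /test producing ../../scripts/script.py.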
@tdeboissiere Merged a fix for this into master, will release a new dvc version with it ASAP. In the meantime, you could try installing from master to check if that works for you too, i.e.
pip uninstall -y dvc; pip install git+https://github.com/iterative/dvc
Thank you so much for reporting this issue and helping us investigate it! We really appreciate that :slightly_smiling_face:
@efiop My pleasure, it's always a treat to get my problems solved here !