dvc run : possible bug with deeply nested dependencies

Created on 10 Sep 2019 · 29 Comments · Source: iterative/dvc

Please provide information about your setup

ubuntu 18.04, dvc==0.59.2, pip install with miniconda python 3.7

For deeply nested dependencies, it looks like dvc is not tracking them properly in the .dvc files

The following script reproduces the issue:

#!/bin/bash

set -x
set -e

rm -rf dvc_test
mkdir dvc_test && cd dvc_test
mkdir scripts
mkdir -p data/recommended/dataset1/dataset1_proc
echo bar > data/recommended/dataset1/v1.txt
git init
dvc init
echo -e "import sys\
\nwith open(sys.argv[1], 'w') as f: f.write(sys.argv[2])" > scripts/script.py

# Run works
dvc run -y \
    -w $PWD/data/recommended/dataset1/dataset1_proc\
    -f ./data/recommended/dataset1/dataset1_proc/v1.dvc\
    -d ../../../../scripts/script.py \
    -o v1 \
    "mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data"

# Inspecting v1.dvc shows that the script.py dependency is missing one ../
cat data/recommended/dataset1/dataset1_proc/v1.dvc

# Because of that, repro does not work
dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc

Inspecting the v1.dvc file shows:

md5: 4ae72e168a0a6a2f1aaadfb5628640f7
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data
deps:
- md5: 791b9c74b1d9308a3226b93a36689dad
  path: ../../../scripts/script.py
outs:
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir
  path: v1
  cache: true
  metric: false
  persist: false
+ dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc

Which indicates that the script.py dependency is indeed missing one ../

p0-critical

All 29 comments

Hi @tdeboissiere ! Are you sure this is the same command? I see you've specified -w in dvc run, but there is no wdir: field in the dvc file.

PS: sorry, jumped the gun, trying to reproduce right now.

@efiop : Sorry, this was run with version 0.57 (which still logged wdir in the .dvc file even if it's the default .).

If you run it with 0.59, wdir is indeed absent.

@tdeboissiere Sorry, I just missed the -f option in your command and was surprised about wdir missing. But it makes sense now.

Ok, so running your reproduction script I get:

+ cat data/recommended/dataset1/dataset1_proc/v1.dvc                      
md5: 1f92bcfcf1ef2766da1ea3c373289dca                                     
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data 
deps:                                                                     
- md5: 791b9c74b1d9308a3226b93a36689dad                                   
  path: ../../../../scripts/script.py                                     
outs:                                                                     
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir                               
  path: v1                                                                
  cache: true                                                             
  metric: false                                                           
  persist: false                                                          
+ dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc                
Stage 'data/recommended/dataset1/dataset1_proc/v1.dvc' didn't change.     
Data and pipelines are up to date.                                        

This has the correct path in the dvcfile, and dvc repro reports that there is nothing to reproduce, so it sees all the dependencies. So I'm not able to reproduce this issue.

@tdeboissiere Are you still able to reproduce it with your script? Am I missing something?

@efiop Yes, I still get (version 0.59.2)

ERROR: failed to reproduce 'data/recommended/dataset1/dataset1_proc/v1.dvc': missing dependency: data/scripts/script.py

This is what I get in the dvc_test/data/recommended/dataset1/dataset1_proc/v1.dvc file:

md5: 4ae72e168a0a6a2f1aaadfb5628640f7
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data
deps:
- md5: 791b9c74b1d9308a3226b93a36689dad
  path: ../../../scripts/script.py
outs:
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir
  path: v1
  cache: true
  metric: false
  persist: false

I am still missing one ../ in the path
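A small sketch of why the shortened path breaks dvc repro: resolving each recorded path against the directory that holds v1.dvc shows where the dependency is looked up. It uses only Python's posixpath module; the helper name resolve_dep is made up for illustration and is not part of DVC.

```python
import posixpath

# Directory (relative to the repo root) that contains v1.dvc.
STAGE_DIR = "data/recommended/dataset1/dataset1_proc"


def resolve_dep(stage_dir, dep_path):
    """Resolve a dependency path recorded in a stage file against its directory."""
    return posixpath.normpath(posixpath.join(stage_dir, dep_path))


# Path passed to `dvc run -d`: four levels up reaches the repo root.
print(resolve_dep(STAGE_DIR, "../../../../scripts/script.py"))  # scripts/script.py

# Path actually written to v1.dvc: three levels up stops at data/.
print(resolve_dep(STAGE_DIR, "../../../scripts/script.py"))  # data/scripts/script.py
```

The second result matches the "missing dependency: data/scripts/script.py" error reported above.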

@tdeboissiere That is really weird. Could you please try this one once more as is:

(3.7.0-dvc) ➜  dvc git:(master) ✗ cat ./test_2483.sh                         
#!/bin/bash                                                                  

set -x                                                                       
set -e                                                                       

pip uninstall -y dvc; pip install dvc==0.59.2                                

rm -rf dvc_test                                                              
mkdir dvc_test                                                               
cd dvc_test                                                                  
mkdir scripts                                                                
mkdir -p data/recommended/dataset1/dataset1_proc                             
echo bar > data/recommended/dataset1/v1.txt                                  
git init                                                                     
dvc init                                                                     
echo -e "import sys\                                                         
\nwith open(sys.argv[1], 'w') as f: f.write(sys.argv[2])" > scripts/script.py

# Run works                                                                  
dvc run -y \                                                                 
    -w $PWD/data/recommended/dataset1/dataset1_proc\                         
    -f ./data/recommended/dataset1/dataset1_proc/v1.dvc \                    
    -d ../../../../scripts/script.py \                                       
    -o v1 \                                                                  
    "mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data"   

# Inspecting v1.dvc shows that the script.py dependency is missing one ../
cat data/recommended/dataset1/dataset1_proc/v1.dvc

# Because of that, repro does not work
dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc                     

@efiop : Same results, am still missing a ../

(I also tried it on multiple ubuntu machines)

Edit:

I ran it on a Mac, and this time it worked correctly...

@tdeboissiere That is extremely odd. I've noticed $PWD in your command, which evaluates to nothing on my ubuntu. Maybe you have it defined on yours? How about:

#!/bin/bash                                                                  

set -x                                                                       
set -e                                                                       

pip uninstall -y dvc; pip install dvc==0.59.2                                

rm -rf dvc_test                                                              
mkdir dvc_test                                                               
cd dvc_test                                                                  
mkdir scripts                                                                
mkdir -p data/recommended/dataset1/dataset1_proc                             
echo bar > data/recommended/dataset1/v1.txt                                  
git init                                                                     
dvc init                                                                     
echo -e "import sys\nwith open(sys.argv[1], 'w') as f: f.write(sys.argv[2])" > scripts/script.py

# Run works                                                                  
dvc run \
    -w ./data/recommended/dataset1/dataset1_proc \
    -f ./data/recommended/dataset1/dataset1_proc/v1.dvc \
    -d ../../../../scripts/script.py \
    -o v1 \
    "mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data" 

# Inspecting v1.dvc shows that the script.py dependency is missing one ../
cat data/recommended/dataset1/dataset1_proc/v1.dvc

# Because of that, repro does not work
dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc  

NOTE: removed $PWD and -y.

@efiop Nope, still got the same error, on multiple machines

Can you point me to where in the codebase the .dvc file with the deps is filled, and I'll have a look?

@tdeboissiere Asked two more guys to try out, and they are getting the same results as me (one on linux and another one on mac). We are probably missing something here. Are you sure you are running that precise script? Which directory are you running it from? Does it have symlinks as parent dirs?

Can you point me to where in the codebase the .dvc file with the deps is filled, and I'll have a look?

Sure, this is the entry point https://github.com/iterative/dvc/blob/0.59.2/dvc/stage.py#L471 . Then the wdir and path for the dep gets resolved here https://github.com/iterative/dvc/blob/0.59.2/dvc/output/local.py#L20 and dumped here https://github.com/iterative/dvc/blob/0.59.2/dvc/output/local.py#L53 .

@tdeboissiere Another question, are you using default shell that comes with ubuntu? Or something else? bash? zsh? or maybe something as exotic as fish?

zsh could be an issue.

However, I changed to bash doing:

/bin/bash
# bash shell is activated
bash script_name.sh

but got the same results.

I am running that precise script, from the same directory as that same script.
No symlinks as parent dirs.

@tdeboissiere Thanks for the info! zsh is ok. I'm using it too.

I am running that precise script, from the same directory as that same script.

What do you mean "from the same directory"? The script is location agnostic. Are you running it from your original project dir?

@tdeboissiere I've tried to run it from zsh, still the same result for me: "Data and pipelines are up to date." On the other hand, we have an ongoing issue on Discord with zsh that I can't reproduce, and it's clear that it's related to env management - https://discordapp.com/channels/485586884165107732/485596304961962003/620980290978054158

@efiop : I meant that in my working dir, I create a script bug.sh containing the above shell script. Then, from the same working dir, I call bash bug.sh

@tdeboissiere maybe run env and dvc run -o test "env > test", and compare the two outputs to see whether the environment is preserved?

@shcheklein : can you elaborate on what you mean by "whether it is preserved"?

@tdeboissiere my understanding is that when you run a command with dvc run/dvc repro it should keep your environment variables unchanged (obvious example $PATH that is used to find python and other binaries). It's not what I see in some cases (that link on Discord). I don't have an explanation yet - is it DVC, some zsh settings, some specific machine settings - we don't know yet. But my thinking was - can it be the case here as well? some changes to the environment when you run commands with DVC.

@tdeboissiere Btw, if you can provide us with temporary access to a box where you are able to reproduce your issue, we will be happy to take a closer look ourselves.

@tdeboissiere do you have .zshrc or .zsh_*** files by chance that change any stuff?

  • Ran the same code in a basic Docker image on my laptop (miniconda3 Python 3.7, Ubuntu 18.04, no zsh): same error
  • Ran the same code in a basic Docker image on my laptop (system python3, Ubuntu 18.04, no zsh): same error
  • Ran the same code in an Ubuntu 18.04 Azure VM (no zsh, system Python): same error

Ran the following script on an Ubuntu 18.04 laptop in /home/user/debug with bash debug.sh:

#!/bin/bash                                                                  

set -x                                                                       
set -e                                                                       

rm -rf dvc_test
mkdir dvc_test
cd dvc_test
env > env_before.txt
git init
dvc init
dvc run -o env_after.txt "env > env_after.txt"

The only line which is different between env_before.txt and env_after.txt is

OLDPWD=/home/user/debug # before
OLDPWD=/home/user/debug/dvc_test # after
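For anyone repeating this check, here is a minimal sketch of how the two env dumps could be compared programmatically. The function names parse_env and diff_env are hypothetical, the HOME line is made-up sample data, and the parser assumes one KEY=VALUE pair per line (multi-line values are not handled).

```python
def parse_env(text):
    """Parse `env` output (one KEY=VALUE per line) into a dict."""
    variables = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            variables[key] = value
    return variables


def diff_env(before, after):
    """Return {name: (old, new)} for variables that differ; None marks a missing one."""
    names = set(before) | set(after)
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(names)
        if before.get(name) != after.get(name)
    }


# Sample data: OLDPWD values are from the report above, HOME is illustrative.
before = parse_env("HOME=/home/user\nOLDPWD=/home/user/debug\n")
after = parse_env("HOME=/home/user\nOLDPWD=/home/user/debug/dvc_test\n")
print(diff_env(before, after))
# {'OLDPWD': ('/home/user/debug', '/home/user/debug/dvc_test')}
```

With real files, the inputs would come from reading env_before.txt and env_after.txt instead of the inline strings.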

@tdeboissiere OLDPWD is also set to that second value for me during run and repro, that is normal behavior for shells. I'll try to reproduce with docker today. Btw, are you launching that docker image from the same machine you have issues on?

@efiop Yes

@tdeboissiere Ok, I am able to reproduce this on docker with:

docker pull python
docker run --rm -v $(pwd):/test -w /test python ./test_2483.sh

Investigating. Thank you for your patience :slightly_smiling_face:

EDIT: Interesting detail is that dvcfile has even less ../ now:

+ cat data/recommended/dataset1/dataset1_proc/v1.dvc                                                                                 
md5: 3e62014bd6e65b4e9de50e642c13bd19                                                                                                
cmd: mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data                                                            
deps:                                                                                                                                
- md5: 791b9c74b1d9308a3226b93a36689dad                                                                                              
  path: ../../scripts/script.py                                                                                                      
outs:                                                                                                                                
- md5: 188ed6cb603658d01ef7ba8fb7c434fe.dir                                                                                          
  path: v1                                                                                                                           
  cache: true                                                                                                                        
  metric: false                                                                                                                      
  persist: false                                                                                                                     
+ dvc repro data/recommended/dataset1/dataset1_proc/v1.dvc                                                                           
WARNING: Dependency 'data/recommended/scripts/script.py' of 'data/recommended/dataset1/dataset1_proc/v1.dvc' changed because it is 'deleted'.
WARNING: Stage 'data/recommended/dataset1/dataset1_proc/v1.dvc' changed.                                                             
Running command:                                                                                                                     
        mkdir v1 && python ../../../../scripts/script.py v1/v1.txt proc_data                                                         
ERROR: failed to reproduce 'data/recommended/dataset1/dataset1_proc/v1.dvc': missing dependency: data/recommended/scripts/script.py  

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!                                             

EDIT2: with --no-scm everything stays the same, so it is unlikely Gitpython's fault.

EDIT3: confirmed pretty old regression, investigating closer...

@tdeboissiere FYI: Found a regression in our code, working on a fix right now.

@tdeboissiere Ok, the patch is taking a bit longer, because the bug is quite deep and the proper solution breaks other parts of the code temporarily. Basically, the issue is os.path.relpath, which we are using in PathInfo.__str__, which in turn gets used in PathInfo.as_posix() when we are dumping the dvc file after dvc run. So depending on where you are located, it might resolve a relative path differently. E.g. if you are in /home/user and run os.path.relpath("../path") you'll get ../path, but if you are in / then you'll get path, because there is nothing above the filesystem root for the ../ to climb to. That is where your ../ went missing. The difference between my and your machines is that you were running from /home/user/subdir and I was running from /home/user/git/dvc/subdir. So a workaround would be to simply move your root directory a few levels deeper.

ETA for a fixed release is tomorrow.
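The os.path.relpath behavior described in the comment above can be sketched without changing directories. The helper roundtrip below is hypothetical (not DVC code); it reproduces what os.path.relpath(rel) computes on POSIX when the process runs in cwd. The assumption that the dependency path was resolved against the project root as the current directory is a reading of this thread, not something confirmed from the DVC source; /test/dvc_test is inferred from the Docker command and the script's mkdir.

```python
import posixpath


def roundtrip(cwd, rel):
    """Resolve `rel` against `cwd`, then express the result relative to `cwd` again.

    Equivalent to os.path.relpath(rel) on POSIX with the process running in `cwd`.
    """
    absolute = posixpath.normpath(posixpath.join(cwd, rel))
    return posixpath.relpath(absolute, start=cwd)


# The two cases from the comment above.
print(roundtrip("/home/user", "../path"))  # ../path
print(roundtrip("/", "../path"))           # path

# The dependency from the reproduction script, for project roots of different depth.
dep = "../../../../scripts/script.py"
for cwd in ("/test/dvc_test", "/home/user/subdir", "/home/user/git/dvc/subdir"):
    print(cwd, "->", roundtrip(cwd, dep))
# /test/dvc_test -> ../../scripts/script.py
# /home/user/subdir -> ../../../scripts/script.py
# /home/user/git/dvc/subdir -> ../../../../scripts/script.py
```

Any ../ that would climb above the filesystem root is silently dropped during normalization, so shallower working directories lose more of them. Under the stated assumption, this lines up with the two-level path seen in Docker and the three-level path in the original report.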

@tdeboissiere Merged a fix for this into master, will release a new dvc version with it ASAP. In the meanwhile, you could try installing from master to check if that works for you too. I.e.

pip uninstall -y dvc; pip install git+https://github.com/iterative/dvc

Thank you so much for reporting this issue and helping us investigate it! We really appreciate that :slightly_smiling_face:

@efiop My pleasure, it's always a treat to get my problems solved here !
