Dvc: Duplicate hash computation

Created on 16 Jan 2020  ·  13 Comments  ·  Source: iterative/dvc

Please provide information about your setup
DVC 0.81.3
Installed via pip
Ubuntu 16.04


Does this look right? I don't think it should be computing hashes here the second time I run dvc status. Also, it takes ~25 minutes each time to compute the hashes (so this is a non-trivial issue):

$ dvc status
Computing hashes (only done once) ...
$ dvc commit some.dvc
...
$ dvc status
Computing hashes (only done once) ...

All 13 comments

@colllin, could you give more details about some.dvc? Does it track a directory as an output? How did you create it?

Did you change any output of some.dvc before dvc commit?

I tried writing a script to replicate this issue but didn't get far. It would be great if you could provide something to help us tinker with the problem.

Here's what I have so far:

$ dvc init --no-scm

$ # Create a large file, it takes ~10 seconds to compute the checksum
$ dvc run -o file "fallocate -l 5G file"
Computing hashes (only done once) ...

$ dvc status
file.dvc:
        always changed

$ dvc commit file.dvc
$ dvc status
file.dvc:
        always changed

Thank you for investigating. Yes, it is a directory.

some.dvc:

md5: 0512850123456789de7043
wdir: ..
outs:
- md5: f537260123456789dac386.dir
  path: datasets/big_dir_of_imgs
  cache: true
  metric: false
  persist: false

I did not change the directory on disk (or anything within it) between the status/commit/status commands I showed.

Note that I reported a similar issue in the past, in case it's related at all.
https://github.com/iterative/dvc/issues/1991

Also, every time I reboot the machine, it re-computes the hash... in case that's a clue.

@colllin, I tried to reproduce it with the following script, but again saw no recomputation :thinking:

$ dvc init --no-scm

$ # Create a directory with some large files.
$ # It should take ~10 seconds to compute the checksums
$ mkdir data

$ for i in {1..5}; do
  file="data/$i.txt"
  fallocate -l 1G "$file"
  echo "$i" >> "$file"
done

$ dvc add data/
Computing md5 for a large file 'data/1.txt'. This is only done once.
Computing md5 for a large file 'data/2.txt'. This is only done once.
Computing md5 for a large file 'data/3.txt'. This is only done once.
Computing md5 for a large file 'data/4.txt'. This is only done once.
Computing md5 for a large file 'data/5.txt'. This is only done once.

$ dvc status
Data and pipelines are up to date.

$ dvc commit

$ dvc status
Data and pipelines are up to date.

Not sure if I'm missing something. Could you confirm?

> Also, every time I reboot the machine, it re-computes the hash... in case that's a clue.

That's pretty weird :thinking:

@colllin, as @pared mentioned in a private conversation, it could be that the mtime is _somehow_ getting updated, thus triggering the recomputation.
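To illustrate why a changed mtime would trigger recomputation, here is a minimal sketch of a checksum cache keyed on file metadata. It is not DVC's actual code: `HashCache`, `demo`, and the choice of `(mtime, size)` as the cache key are assumptions made purely for illustration.

```python
import hashlib
import os
import tempfile


class HashCache:
    """Toy checksum cache keyed on (mtime_ns, size). Hypothetical, not DVC's implementation."""

    def __init__(self):
        self._entries = {}  # path -> ((mtime_ns, size), md5 hex digest)
        self.computations = 0  # number of times a file was actually hashed

    def md5(self, path):
        st = os.stat(path)
        signature = (st.st_mtime_ns, st.st_size)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == signature:
            # Metadata unchanged: trust the stored hash, skip reading the file.
            return cached[1]
        digest = hashlib.md5()
        with open(path, "rb") as fobj:
            for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
                digest.update(chunk)
        self.computations += 1
        self._entries[path] = (signature, digest.hexdigest())
        return digest.hexdigest()


def demo():
    cache = HashCache()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        with open(path, "wb") as fobj:
            fobj.write(b"some text\n")
        first = cache.md5(path)   # hashed
        second = cache.md5(path)  # served from the cache
        # Move the mtime forward by one second without touching the content.
        st = os.stat(path)
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
        third = cache.md5(path)   # hashed again, although the bytes are identical
    return first, second, third, cache.computations
```

In this sketch the third call rehashes the file even though its content never changed, which is the kind of behavior an unexpectedly updated mtime would produce.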

@colllin we still don't know enough about your setup. Could you run this inside your DVC repo:

pip install psutil
dvc version

Then please paste the output of dvc version here. Anything else you can tell us about your setup, especially about your filesystem, might also help.

To check whether the mtimes stay constant, you can use the following:

echo "some text" > test.txt
python -c 'import os;print(os.stat("test.txt").st_mtime)'

# wait a bit then
python -c 'import os;print(os.stat("test.txt").st_mtime)'

# reboot
python -c 'import os;print(os.stat("test.txt").st_mtime)'

Compare the printed mtimes of the same file. It would be crazy to have them different, but we've seen crazy things with fs before. At least it will help us diagnose your case.
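For a large directory, the same check could be scripted instead of done file by file. The following is a rough sketch, not part of DVC; `snapshot`, `save_snapshot`, and `changed_files` are hypothetical helper names. It records mtime, size, and inode for every file, so the values can be compared before and after a reboot.

```python
import json
import os


def snapshot(root):
    """Map each file under root (relative path) to [mtime_ns, size, inode]."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            rel = os.path.relpath(full, root)
            result[rel] = [st.st_mtime_ns, st.st_size, st.st_ino]
    return result


def save_snapshot(root, out_path):
    """Write the current snapshot of root to out_path as JSON."""
    with open(out_path, "w") as fobj:
        json.dump(snapshot(root), fobj)


def changed_files(root, snapshot_path):
    """Return paths whose metadata differs from the saved snapshot."""
    with open(snapshot_path) as fobj:
        before = json.load(fobj)
    after = snapshot(root)
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

One way to use it: call `save_snapshot("datasets/big_dir_of_imgs", "before.json")` with the JSON file stored outside the tracked directory, reboot, then call `changed_files("datasets/big_dir_of_imgs", "before.json")`. An empty list would suggest the file metadata is stable across reboots.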

DVC version: 0.81.3
Python version: 3.7.3
Platform: Linux-4.4.0-1100-aws-x86_64-with-Ubuntu-16.04-xenial
Binary: False
Package: pip
Cache: reflink - False, hardlink - True, symlink - True
Filesystem type (cache directory): ('ext4', '/dev/xvda1')
Filesystem type (workspace): ('ext4', '/dev/xvda1')
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ echo "some text" > test.txt
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ python -c 'import os;print(os.stat("test.txt").st_mtime)'
1579551441.1852922
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ sleep 3; python -c 'import os;print(os.stat("test.txt").st_mtime)'
1579551441.1852922
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ sudo reboot
# ...
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ python -c 'import os;print(os.stat("test.txt").st_mtime)'
1579551441.1852922
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ 

Like that? 👆

@colllin Btw, let's try to run dvc status some.dvc instead of dvc status to narrow down the issue. Could you please run

dvc status some.dvc
dvc status some.dvc # sanity check
dvc commit some.dvc
dvc status some.dvc -v

and show us the full log?

I tried to reproduce the issue on an Amazon instance with the following setup:

DVC version: 0.81.3
Python version: 3.7.6
Platform: Linux-4.4.0-1092-aws-x86_64-with-Ubuntu-16.04-xenial
Binary: False
Package: pip
Cache: reflink - False, hardlink - True, symlink - True
Filesystem type (cache directory): ('ext4', '/dev/xvda1')
Filesystem type (workspace): ('ext4', '/dev/xvda1')

After a reboot, dvc status shows:
Data and pipelines are up to date.

Thanks for following up on this. I think you have given me enough tools to
diagnose this further. I might not get a chance to look at it until Monday.
Please don’t spend any more time on it until I can try to narrow it down
and get back to you with what I find. Thank you for the thoughtful comments
and ideas!

I'm not seeing this issue lately. Maybe there was something weird going on with my system. Thanks again for helping me out.
