Please provide information about your setup
DVC 0.81.3
Installed via pip
Ubuntu 16.04
Does this look right? I don't think it should be computing hashes here the second time I run dvc status. Also, it takes ~25 minutes each time to compute the hashes (so this is a non-trivial issue):
$ dvc status
Computing hashes (only done once) ...
$ dvc commit some.dvc
...
$ dvc status
Computing hashes (only done once) ...
@colllin, could you give more details about some.dvc? Does it track any directory as output? How did you create it?
Did you change any output of some.dvc before dvc commit?
I tried writing a script to replicate this issue but couldn't get far. It would be great if you could provide something that would help us tinker with the problem.
Here's what I have so far:
$ dvc init --no-scm
$ # Create a large file, it takes ~10 seconds to compute the checksum
$ dvc run -o file "fallocate -l 5G file"
Computing hashes (only done once) ...
$ dvc status
file.dvc:
always changed
$ dvc commit file.dvc
$ dvc status
file.dvc:
always changed
Thank you for investigating. Yes, it is a directory.
some.dvc:
md5: 0512850123456789de7043
wdir: ..
outs:
- md5: f537260123456789dac386.dir
path: datasets/big_dir_of_imgs
cache: true
metric: false
persist: false
I did not change the directory on disk (or anything within it) between the status/commit/status commands I showed.
Note that I reported a similar issue in the past, in case it's related at all.
https://github.com/iterative/dvc/issues/1991
Also, every time I reboot the machine, it re-computes the hash... in case that's a clue.
@colllin , I tried to reproduce it with the following script but again no recomputation :thinking:
$ dvc init --no-scm
$ # Create a directory with some large files.
$ # It should take ~10 seconds to compute the checksums
$ mkdir data
$ for i in {1..5}; do
    file="data/$i.txt"
    fallocate -l 1G "$file"
    echo $i >> "$file"
done
$ dvc add data/
Computing md5 for a large file 'data/1.txt'. This is only done once.
Computing md5 for a large file 'data/2.txt'. This is only done once.
Computing md5 for a large file 'data/3.txt'. This is only done once.
Computing md5 for a large file 'data/4.txt'. This is only done once.
Computing md5 for a large file 'data/5.txt'. This is only done once.
$ dvc status
Data and pipelines are up to date.
$ dvc commit
$ dvc status
Data and pipelines are up to date.
Not sure if I'm missing something. Could you confirm?
Also, every time I reboot the machine, it re-computes the hash... in case that's a clue.
That's pretty weird :thinking:
@colllin, as @pared mentioned in a private conversation, it could be that _somehow_ the mtime is getting updated, thus triggering the recomputation.
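To illustrate why a changed mtime would trigger recomputation, here is a minimal sketch of a checksum cache keyed on file metadata. It is not DVC's actual implementation: the names `_cache`, `file_signature` and `cached_md5` are hypothetical, and the choice of (inode, mtime, size) as the signature is an assumption made for illustration.

```python
import hashlib
import os

# Hypothetical in-memory cache: path -> ((inode, mtime_ns, size), md5 hex digest).
_cache = {}


def file_signature(path):
    """Return cheap-to-read metadata used to decide whether a file changed."""
    st = os.stat(path)
    return (st.st_ino, st.st_mtime_ns, st.st_size)


def cached_md5(path):
    """Return (md5, recomputed); the file is hashed only if its signature changed."""
    sig = file_signature(path)
    entry = _cache.get(path)
    if entry is not None and entry[0] == sig:
        return entry[1], False  # metadata unchanged: reuse the stored hash

    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    result = digest.hexdigest()
    _cache[path] = (sig, result)
    return result, True
```

With a cache like this, touching a file (or anything else that rewrites its mtime) forces a full rehash even though the contents, and therefore the checksum, are identical.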
@colllin we still don't know enough about your setup. Can you perform this inside your dvc repo:
pip install psutil
dvc version
And paste the output of dvc version here, please. Also, if you have anything to say about your setup, especially about your fs, that might help too.
To check whether your mtimes stay constant, you can use the following:
echo "some text" > test.txt
python -c 'import os;print(os.stat("test.txt").st_mtime)'
# wait a bit then
python -c 'import os;print(os.stat("test.txt").st_mtime)'
# reboot
python -c 'import os;print(os.stat("test.txt").st_mtime)'
Compare the printed mtimes of the same file. It would be crazy to have them different, but we've seen crazy things with fs before. At least it will help us diagnose your case.
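The same check can be extended from a single test file to the whole tracked directory. The sketch below is only a suggestion: the helper names `snapshot`, `save_snapshot` and `changed_files` are hypothetical, and recording mtime, size and inode is an assumption about which metadata is worth comparing.

```python
import json
import os


def snapshot(root):
    """Map each file under root (by relative path) to [mtime_ns, size, inode]."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            rel = os.path.relpath(path, root)
            result[rel] = [st.st_mtime_ns, st.st_size, st.st_ino]
    return result


def save_snapshot(root, out_path):
    """Write the current snapshot of root to a JSON file."""
    with open(out_path, "w") as f:
        json.dump(snapshot(root), f)


def changed_files(root, snapshot_path):
    """Return sorted relative paths whose metadata differs from the saved snapshot."""
    with open(snapshot_path) as f:
        old = json.load(f)
    new = snapshot(root)
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

For example, call `save_snapshot("datasets/big_dir_of_imgs", "/tmp/before.json")` before the reboot and `changed_files("datasets/big_dir_of_imgs", "/tmp/before.json")` afterwards. Keep the JSON file outside the directory being scanned so it does not show up in its own snapshot. An empty list means no file's metadata moved.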
DVC version: 0.81.3
Python version: 3.7.3
Platform: Linux-4.4.0-1100-aws-x86_64-with-Ubuntu-16.04-xenial
Binary: False
Package: pip
Cache: reflink - False, hardlink - True, symlink - True
Filesystem type (cache directory): ('ext4', '/dev/xvda1')
Filesystem type (workspace): ('ext4', '/dev/xvda1')
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ echo "some text" > test.txt
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ python -c 'import os;print(os.stat("test.txt").st_mtime)'
1579551441.1852922
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ sleep 3; python -c 'import os;print(os.stat("test.txt").st_mtime)'
1579551441.1852922
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ sudo reboot
# ...
(proj1) ubuntu@ip-172-31-41-72:~/proj1$ python -c 'import os;print(os.stat("test.txt").st_mtime)'
1579551441.1852922
(proj1) ubuntu@ip-172-31-41-72:~/proj1$
Like that? 👆
@colllin Btw, let's try to run dvc status some.dvc instead of dvc status to narrow down the issue. Could you please run
dvc status some.dvc
dvc status some.dvc # sanity check
dvc commit some.dvc
dvc status some.dvc -v
and show us the full log?
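If it helps, the four commands above can be run and timed in one go, so the log also shows which invocation spends time rehashing. This is a rough sketch: `run_timed`, `run_all` and the log file name are hypothetical, and it only wraps the exact commands listed above.

```python
import subprocess
import time

# The command sequence requested above, unchanged.
DVC_COMMANDS = [
    ["dvc", "status", "some.dvc"],
    ["dvc", "status", "some.dvc"],  # sanity check
    ["dvc", "commit", "some.dvc"],
    ["dvc", "status", "some.dvc", "-v"],
]


def run_timed(cmd):
    """Run cmd and return (exit code, elapsed seconds, combined stdout/stderr)."""
    start = time.monotonic()
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return proc.returncode, time.monotonic() - start, proc.stdout


def run_all(commands, log_path):
    """Run each command in order and append its timing and output to log_path."""
    with open(log_path, "w") as log:
        for cmd in commands:
            code, elapsed, output = run_timed(cmd)
            log.write(f"$ {' '.join(cmd)}  (exit {code}, {elapsed:.1f}s)\n")
            log.write(output)
            log.write("\n")
```

Calling `run_all(DVC_COMMANDS, "dvc_status_debug.log")` from inside the repo would produce a single file to paste here.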
Tried to reproduce the issue on an Amazon instance with the following setup:
DVC version: 0.81.3
Python version: 3.7.6
Platform: Linux-4.4.0-1092-aws-x86_64-with-Ubuntu-16.04-xenial
Binary: False
Package: pip
Cache: reflink - False, hardlink - True, symlink - True
Filesystem type (cache directory): ('ext4', '/dev/xvda1')
Filesystem type (workspace): ('ext4', '/dev/xvda1')
After reboot, status shows
Data and pipelines are up to date.
Thanks for following up on this. I think you have given me enough tools to
diagnose this further. I might not get a chance to look at it until Monday.
Please don’t spend any more time on it until I can try to narrow it down
and get back to you with what I find. Thank you for the thoughtful comments
and ideas!
I'm not seeing this issue lately. Maybe there was something weird going on with my system. Thanks again for helping me out.