If we add file sizes to DVC-files (when we calculate the checksum, so no extra reads), it will help us show this info in dvc diff/dvc list and other commands with no I/O or computational overhead.
Related to #2982
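To illustrate the "no extra reads" point, here is a minimal sketch (not DVC's actual code) that counts bytes in the same loop that feeds the hash. The helper name `md5_and_size` and the chunk size are assumptions made for this example.

```python
import hashlib
import io


def md5_and_size(fobj, chunk_size=1024 * 1024):
    """Return (hex md5, size in bytes) from a single pass over a binary file object.

    Hypothetical helper for illustration only; not part of DVC.
    """
    md5 = hashlib.md5()
    size = 0
    # iter() with a sentinel stops at the first empty read, i.e. at EOF.
    for chunk in iter(lambda: fobj.read(chunk_size), b""):
        md5.update(chunk)
        size += len(chunk)  # the size comes for free from the chunks already read
    return md5.hexdigest(), size


# Usage with an in-memory stream; a real file opened with open(path, "rb") works the same way.
digest, size = md5_and_size(io.BytesIO(b"hello world"))
print(digest, size)
```

Since every byte already passes through the hashing loop, recording the size adds only an integer addition per chunk.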
I like the idea! On the other hand, it might complicate merges, and diffs will become bigger. Also, are there any other fields that could potentially be useful (names? modes? type?)?
Just a thought to consider - does it make sense to revisit the discussion around taking md5 file hashes out? Potentially all this meta information could go into the same place.
Yes :slightly_smiling_face:
does it make sense to revisit the discussion around taking md5 file hashes out? Potentially all this meta information can go into the same place.
💯
For the record, this might be solved in https://github.com/iterative/dvc/issues/1871.
From #2982
We should store file size in addition to hashes. It will give us the ability to show file sizes in dvc diff.
I've noticed that the size of the whole tracked dir is registered in the .dvc file, but there is no data for the individual files inside the tracked directory. I would like to have this information so I can display a list of files and their sizes inside the directory without downloading any file.
Is it possible to add the size to the JSON file generated to track the contents of the directory?
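As a rough sketch of what this would enable, assuming each entry in the directory JSON gained a `size` key next to the existing `md5` and `relpath` keys (the `size` key, the sample data, and the `list_sizes` helper are all hypothetical):

```python
import json

# Hypothetical .dir contents: "md5" and "relpath" are the existing keys,
# "size" (bytes) is the assumed addition. Hashes are placeholders.
DIR_JSON = """
[
  {"md5": "0cc175b9c0f1b6a831c399e269772661", "relpath": "data/a.csv", "size": 2048},
  {"md5": "92eb5ffee6ae2fec3ad71c777531578f", "relpath": "data/b.csv"}
]
"""


def list_sizes(dir_json):
    """Return (relpath, size) pairs; size is None for entries written without it."""
    entries = json.loads(dir_json)
    return [(entry["relpath"], entry.get("size")) for entry in entries]


listing = list_sizes(DIR_JSON)
for relpath, size in listing:
    print(relpath, "unknown" if size is None else f"{size} bytes")
```

Using `entry.get("size")` keeps the listing working for entries written before the field existed.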
Hi @MetalBlueberry!
Great question! Indeed, we are thinking about adding size to the .dir cache file, but adding it right now would make older dvc versions register it as cache corruption, and we would not be able to self-validate .dir files without filtering them first (their md5 shouldn't depend on size fields): https://github.com/iterative/dvc/issues/4841
We will also add support for these in dvc diff/list/status.
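To make the "filter first" point concrete, here is a minimal sketch of a checksum that ignores extra metadata such as size. It is not DVC's actual serialization or validation code; `HASHED_FIELDS` and `dir_checksum` are names assumed for this example.

```python
import hashlib
import json

# Assumption: only these keys take part in the checksum; anything else
# (e.g. a future "size" field) is treated as non-hashed metadata.
HASHED_FIELDS = ("md5", "relpath")


def dir_checksum(entries):
    """Compute an md5 over the entries after dropping non-hashed fields."""
    filtered = [
        {key: entry[key] for key in HASHED_FIELDS if key in entry}
        for entry in entries
    ]
    # Sort so the result does not depend on the order of the entries.
    filtered.sort(key=lambda entry: entry["relpath"])
    payload = json.dumps(filtered, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


without_size = [{"md5": "0cc175b9c0f1b6a831c399e269772661", "relpath": "a.csv"}]
with_size = [{"md5": "0cc175b9c0f1b6a831c399e269772661", "relpath": "a.csv", "size": 2048}]

# Both listings produce the same checksum because "size" is filtered out.
print(dir_checksum(without_size) == dir_checksum(with_size))
```

With this kind of filtering, adding or changing size metadata leaves the checksum untouched, while any change to a file hash or path still changes it.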