Dvc: Incorporate file sizes in dvc file

Created on 30 Jan 2020 · 7 Comments · Source: iterative/dvc

If we store file sizes in DVC-files (recorded while we calculate the checksum, so no extra reads), we can show this info in dvc diff, dvc list, and other commands with no I/O or computational overhead.
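A minimal sketch of the "no extra reads" idea: the size is accumulated from the same chunks that feed the hash. The function name `md5_and_size` and the chunk size are illustrative assumptions, and this is not DVC's actual hashing code.

```python
import hashlib


def md5_and_size(path, chunk_size=1024 * 1024):
    """Return (md5 hex digest, size in bytes) from a single pass over the file.

    Hypothetical helper for illustration; not part of DVC's API.
    """
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as fobj:
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # feed the checksum
            size += len(chunk)    # count bytes from the same read
    return digest.hexdigest(), size
```

Both values come out of one read loop, so recording the size next to the checksum adds no additional I/O.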

Related to #2982

discussion enhancement product research

All 7 comments

I like the idea! On the other hand, it might complicate merges, and diffs will become bigger? Also, are there any other fields that could potentially be useful (names? modes? type?).

Just a thought to consider: does it make sense to revisit the discussion around taking the md5 file hashes out? Potentially all this meta information can go into the same place.

> Just a thought to consider: does it make sense to revisit the discussion around taking the md5 file hashes out? Potentially all this meta information can go into the same place.

Yes :slightly_smiling_face:

> does it make sense to revisit the discussion around taking the md5 file hashes out? Potentially all this meta information can go into the same place.

💯

For the record, this might be solved in https://github.com/iterative/dvc/issues/1871.

From #2982:

We should store file sizes in addition to hashes. That will give us the ability to show file sizes in dvc diff.

  • [ ] store file size together with checksums (it might be a separate issue)
  • [ ] display file size in diff (like metrics diff). It might be disabled by default.

I've noticed that the size of the whole tracked dir is registered in the .dvc file, but there is no data for the individual files inside the tracked directory. I would like to have this information so I can display a list of files and their sizes inside the directory without downloading any file.

Is it possible to add the size to the JSON file generated to track the contents of the directory?
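To illustrate the request, here is a minimal sketch. It assumes the directory-tracking JSON is a list of entries with `md5` and `relpath` keys and that a `size` key is added per entry. The `size` field, the sample values, and the `list_files` helper are all hypothetical.

```python
import json

# Hypothetical contents of the directory-tracking JSON file.
# The per-entry "size" field is the proposed addition, not an existing one.
DIR_FILE_TEXT = """
[
  {"md5": "5eb63bbbe01eeed093cb22bb8f5acdc3", "relpath": "a.txt", "size": 11},
  {"md5": "d41d8cd98f00b204e9800998ecf8427e", "relpath": "sub/b.txt", "size": 0}
]
"""


def list_files(dir_file_text):
    """Return (relpath, size) pairs; size is None for entries without one."""
    entries = json.loads(dir_file_text)
    return [(entry["relpath"], entry.get("size")) for entry in entries]


for relpath, size in list_files(DIR_FILE_TEXT):
    print(relpath, size)
```

Using `entry.get("size")` keeps the listing usable for files written before such a field existed.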

Hi @MetalBlueberry !

Great question! Indeed, we are thinking about adding size to the .dir cache file, but adding it right now would have two problems: older dvc versions would register it as a cache corruption, and we would not be able to self-validate .dir files without filtering them first (their md5 shouldn't depend on size fields). See https://github.com/iterative/dvc/issues/4841

We will also add support for these to dvc diff/list/status.
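The "filter before validating" point can be sketched as follows. This assumes entries are dicts with `md5` and `relpath` keys plus an optional `size`, and that the checksum is taken over a sorted JSON dump. The serialization and the `dir_md5` helper are illustrative assumptions, not DVC's actual format or code.

```python
import hashlib
import json


def dir_md5(entries, ignored_fields=("size",)):
    """Hash directory entries while ignoring metadata-only fields.

    Hypothetical helper: the JSON serialization here is an assumption.
    """
    filtered = [
        {key: value for key, value in entry.items() if key not in ignored_fields}
        for entry in entries
    ]
    filtered.sort(key=lambda entry: entry["relpath"])
    payload = json.dumps(filtered, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


with_size = [{"md5": "abc", "relpath": "a.txt", "size": 11}]
without_size = [{"md5": "abc", "relpath": "a.txt"}]
print(dir_md5(with_size) == dir_md5(without_size))
```

With the size fields stripped before hashing, adding or omitting them does not change the checksum, while a changed file hash still does.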
