Dvc: remote storage auto-compression

Created on 18 Oct 2018 · 7 comments · Source: iterative/dvc

Have you considered adding an auto-compression feature when sending data to remote storage?
It could save a lot of storage space and speed up pushing files to the remote.
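A rough sketch of the idea (not something DVC does today; the file and directory names are made up): compress each file, e.g. with gzip, right before it is staged for upload.

```python
import gzip
import shutil
from pathlib import Path

def compress_for_upload(src: Path, staging_dir: Path) -> Path:
    """Gzip one file into a staging area before sending it to the remote.

    Hypothetical helper for illustration only -- not part of DVC.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    dst = staging_dir / (src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst

# Text-like data (CSV, JSON) typically shrinks a lot; already-compressed
# formats (JPEG, ZIP) barely change, so the benefit depends on the dataset.
compressed = compress_for_upload(Path("data/train.csv"), Path("push-staging"))
print(compressed, compressed.stat().st_size, "bytes")
```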

enhancement feature request p2-medium

All 7 comments

Hi @AratorField !

Great idea! We are thinking about ways to optimize cache size (e.g. https://github.com/iterative/dvc/issues/829), but we just haven't gotten to working on it yet. Let's leave this issue open as a reminder.

Ivan and I were just talking about this while reviewing the current https://dvc.org/doc/get-started/example-versioning, which uses dvc get to download a couple of different ZIP files containing images: data.zip with 1000 images and new-labels.zip with another 1000 images of the same kind (so they're actually parts of the same dataset).

For a new document we're writing, we will reuse this image dataset but as an uncompressed directory with two versions: the first with the contents of data.zip only (1000 images), and the second with data.zip + new-labels.zip (2000 images).

Besides the fact that compression probably doesn't help much with these images anyway (though it would with tabular data, e.g. CSV or JSON), this enables DVC's file deduplication between directory versions, so that when you download the new revision you only get the delta.

Anyway, this is all to say that automatic compression would be a very natural feature to have (I would call this p2-important), and then we should also discourage the tracking of compressed archives all over our docs (except for backups perhaps).
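For context, here is a simplified illustration (not DVC internals; the data-v1/data-v2 paths are placeholders) of why uncompressed directories deduplicate well: each file is hashed individually, unchanged files map to the same cache object across versions, and only new objects need to be pushed, whereas a re-built archive gets a brand-new hash every time.

```python
import hashlib
from pathlib import Path

def cache_objects(directory: Path) -> dict[str, Path]:
    """Map content hash -> file, the way a content-addressed cache would."""
    return {
        hashlib.md5(path.read_bytes()).hexdigest(): path
        for path in directory.rglob("*")
        if path.is_file()
    }

v1 = cache_objects(Path("data-v1"))  # e.g. the 1000 original images
v2 = cache_objects(Path("data-v2"))  # the same images plus 1000 new ones

# Only objects that did not exist in v1 would have to be uploaded.
delta = set(v2) - set(v1)
print(f"{len(delta)} new objects to push out of {len(v2)}")
```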

Thoughts, @iterative/engineering ?

Sorry @jorgeorpinel, I didn't notice this before.
What do you mean by:

this enables the DVC feature of file deduplication between directory versions, so that when you download the new revision it will get only the delta

Isn't it already happening for directories?

Yes, that's my point, deduplication already happens but not if you track/version compressed datasets. (My comment above has a long preamble, sorry.) The main idea/suggestion is in the last paragraph:

...this is all to say that automatic compression would be a very natural feature to have (I would call this p2-important), and then we should also discourage the tracking of compressed archives all over our docs.

I would like to try to rephrase this issue to make sure I understand it.

Given that large directories can take a long time to upload, should they be automatically compressed without the user's intervention? So the process would be something like:

  1. User does dvc add my_big_folder

  2. Some DVC settings determine under what circumstances a folder should be compressed and how (a rough sketch of this idea follows the list).

  3. User runs dvc push my_big_folder, which either uploads the individual files of my_big_folder or my_big_folder.zip, depending on the settings in step 2.

  4. If the user modifies my_big_folder, the md5 hash will be different, so the whole folder will need to be compressed again before updating? Is there some way to do incremental compression with some sort of processing/efficiency trade-off?
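A hypothetical sketch of step 2, assuming settings like a minimum file size and a list of extensions that are not worth compressing (none of these options exist in DVC today; they are made up for illustration):

```python
from pathlib import Path

# Hypothetical settings -- not real DVC config options.
COMPRESS_MIN_SIZE = 1 * 1024 * 1024                # only bother above 1 MiB
SKIP_EXTENSIONS = {".jpg", ".png", ".gz", ".zip"}  # already-compressed formats

def should_compress(path: Path) -> bool:
    """Decide per file whether compression is worth the CPU time before pushing."""
    if path.suffix.lower() in SKIP_EXTENSIONS:
        return False
    return path.stat().st_size >= COMPRESS_MIN_SIZE

for path in Path("my_big_folder").rglob("*"):
    if path.is_file():
        action = "compress" if should_compress(path) else "upload as-is"
        print(f"{path}: {action}")
```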

Is there some way to do incremental compression with some sort of processing/efficiency trade-off?

It turns out that modifying a zip file without re-compressing everything is documented in the zip man page.
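Python's standard library exposes the same incremental behaviour: opening an existing archive with zipfile in append mode adds new members without recompressing the ones already stored (the file names below are made up):

```python
import zipfile

# Append new files to an existing archive; members already stored in
# data.zip are left untouched, so nothing is recompressed.
with zipfile.ZipFile("data.zip", mode="a", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write("new-labels/0001.xml", arcname="labels/0001.xml")
    archive.write("new-labels/0002.xml", arcname="labels/0002.xml")
```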

  3. User runs dvc push my_big_folder, which either uploads the individual files of my_big_folder or my_big_folder.zip, depending on the settings in step 2.

DVC actually uploads files individually, so it would probably not compress entire directories, but your two previous steps seem good.

  4. If the user modifies my_big_folder, the md5 hash will be different, so the whole folder will need to be compressed again before updating?

Again, since files are checked and down/uploaded individually, this shouldn't be such a big deal. But thanks for the reference on updating zipped content.

I'm not sure what compression algorithm would be best, but TBH I have the impression ZIP won't be the top pick.
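For a rough feel of why, here is a quick comparison using only standard-library codecs (the input path is a placeholder and ratios vary a lot by data); deflate, which ZIP uses, tends to trail bzip2 and LZMA on text-like data, though it is faster:

```python
import bz2
import gzip
import lzma
from pathlib import Path

payload = Path("data/train.csv").read_bytes()  # placeholder file

# Compare compressed sizes; speed/ratio trade-offs differ between codecs.
for name, compress in [("gzip (deflate)", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("lzma", lzma.compress)]:
    size = len(compress(payload))
    print(f"{name:15} {size:>12,} bytes ({size / len(payload):.1%} of original)")
```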
