Dvc: remote storage auto-compression

Created on 18 Oct 2018 · 7 comments · Source: iterative/dvc

Have you considered adding an auto-compression feature when sending data to remote storage?
It could save a lot of storage space and speed up pushing files to the remote.
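A rough sketch of the idea (not something DVC does today; the file and directory names are made up): compress each file, e.g. with gzip, right before it is staged for upload.

```python
import gzip
import shutil
from pathlib import Path

def compress_for_upload(src: Path, staging_dir: Path) -> Path:
    """Gzip one file into a staging area before sending it to the remote.

    Hypothetical helper for illustration only -- not part of DVC.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    dst = staging_dir / (src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst

# Text-like data (CSV, JSON) typically shrinks a lot; already-compressed
# formats (JPEG, ZIP) barely change, so the benefit depends on the dataset.
compressed = compress_for_upload(Path("data/train.csv"), Path("push-staging"))
print(compressed, compressed.stat().st_size, "bytes")
```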

enhancement feature request p2-medium

All 7 comments

Hi @AratorField !

Great idea! We are thinking about ways to optimize cache size (e.g. https://github.com/iterative/dvc/issues/829), but we just haven't gotten to working on it yet. Let's leave this issue open as a reminder.

Ivan and I were just talking about this while reviewing the current https://dvc.org/doc/get-started/example-versioning, which uses dvc get to download a couple of different ZIP files containing images: data.zip with 1000 images and new-labels.zip with another 1000 images of the same kind (so they're actually parts of the same dataset).

For a new document we're writing, we will reuse this image dataset but as an uncompressed directory with two versions: the first with the contents of data.zip only (1000 images), and the second with data.zip + new-labels.zip (2000 images).

Besides the fact that compression probably doesn't help much with these images anyway (though it would with tabular data, e.g. CSV or JSON), this enables DVC's file deduplication between directory versions, so that when you download the new revision you only get the delta.

Anyway, this is all to say that automatic compression would be a very natural feature to have (I would call this p2-important), and then we should also discourage the tracking of compressed archives all over our docs (except for backups perhaps).
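For context, here is a simplified illustration (not DVC internals; the data-v1/data-v2 paths are placeholders) of why uncompressed directories deduplicate well: each file is hashed individually, unchanged files map to the same cache object across versions, and only new objects need to be pushed, whereas a re-built archive gets a brand-new hash every time.

```python
import hashlib
from pathlib import Path

def cache_objects(directory: Path) -> dict[str, Path]:
    """Map content hash -> file, the way a content-addressed cache would."""
    return {
        hashlib.md5(path.read_bytes()).hexdigest(): path
        for path in directory.rglob("*")
        if path.is_file()
    }

v1 = cache_objects(Path("data-v1"))  # e.g. the 1000 original images
v2 = cache_objects(Path("data-v2"))  # the same images plus 1000 new ones

# Only objects that did not exist in v1 would have to be uploaded.
delta = set(v2) - set(v1)
print(f"{len(delta)} new objects to push out of {len(v2)}")
```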

Thoughts, @iterative/engineering ?

Sorry @jorgeorpinel, I didn't notice this before.
What do you mean by:

this enables the DVC feature of file deduplication between directory versions, so that when you download the new revision it will get only the delta

Isn't it already happening for directories?

Yes, that's my point, deduplication already happens but not if you track/version compressed datasets. (My comment above has a long preamble, sorry.) The main idea/suggestion is in the last paragraph:

...this is all to say that automatic compression would be a very natural feature to have (I would call this p2-important), and then we should also discourage the tracking of compressed archives all over our docs.

I would like to try to rephrase this issue to make sure I understand it.

Given that large directories can take a long time to upload, should they be automatically compressed without the user's intervention? So the process would be something like:

  1. User does dvc add my_big_folder

  2. Some DVC settings determine under what circumstances a folder should be compressed and how (a rough sketch of this idea follows the list).

  3. User runs dvc push my_big_folder, which either uploads the individual files of my_big_folder or my_big_folder.zip, depending on the settings in step 2.

  4. If the user modifies my_big_folder, the md5 hash will be different, so the whole folder will need to be compressed again before updating? Is there some way to do incremental compression with some sort of processing/efficiency trade-off?
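A hypothetical sketch of step 2, assuming settings like a minimum file size and a list of extensions that are not worth compressing (none of these options exist in DVC today; they are made up for illustration):

```python
from pathlib import Path

# Hypothetical settings -- not real DVC config options.
COMPRESS_MIN_SIZE = 1 * 1024 * 1024                # only bother above 1 MiB
SKIP_EXTENSIONS = {".jpg", ".png", ".gz", ".zip"}  # already-compressed formats

def should_compress(path: Path) -> bool:
    """Decide per file whether compression is worth the CPU time before pushing."""
    if path.suffix.lower() in SKIP_EXTENSIONS:
        return False
    return path.stat().st_size >= COMPRESS_MIN_SIZE

for path in Path("my_big_folder").rglob("*"):
    if path.is_file():
        action = "compress" if should_compress(path) else "upload as-is"
        print(f"{path}: {action}")
```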

Is there some way to do incremental compression with some sort of processing/efficiency trade-off?

It turns out that modifying a zip file without re-compressing everything is documented in the zip man page.
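Python's standard library exposes the same incremental behaviour: opening an existing archive with zipfile in append mode adds new members without recompressing the ones already stored (the file names below are made up):

```python
import zipfile

# Append new files to an existing archive; members already stored in
# data.zip are left untouched, so nothing is recompressed.
with zipfile.ZipFile("data.zip", mode="a", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write("new-labels/0001.xml", arcname="labels/0001.xml")
    archive.write("new-labels/0002.xml", arcname="labels/0002.xml")
```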

  3. User runs dvc push my_big_folder, which either uploads the individual files of my_big_folder or my_big_folder.zip, depending on the settings in step 2.

DVC actually uploads files individually, so it would probably not compress entire directories, but your two previous steps seem good.

  4. If the user modifies my_big_folder, the md5 hash will be different, so the whole folder will need to be compressed again before updating?

Again, since files are checked and down/uploaded individually, this shouldn't be such a big deal. But thanks for the reference on updating zipped content.

I'm not sure what compression algorithm would be best, but TBH I have the impression ZIP won't be the top pick.
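For a rough feel of why, here is a quick comparison using only standard-library codecs (the input path is a placeholder and ratios vary a lot by data); deflate, which ZIP uses, tends to trail bzip2 and LZMA on text-like data, though it is faster:

```python
import bz2
import gzip
import lzma
from pathlib import Path

payload = Path("data/train.csv").read_bytes()  # placeholder file

# Compare compressed sizes; speed/ratio trade-offs differ between codecs.
for name, compress in [("gzip (deflate)", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("lzma", lzma.compress)]:
    size = len(compress(payload))
    print(f"{name:15} {size:>12,} bytes ({size / len(payload):.1%} of original)")
```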
