dvc push: OOM when uploading 1.5 TB (3.5M files) dataset to a local remote

Created on 30 Jun 2020 · 8 comments · Source: iterative/dvc

Bug Report

Output of dvc version:

DVC version: 1.0.1
Python version: 3.7.1
Platform: Linux-4.19.0-9-amd64-x86_64-with-debian-10.4
Binary: True
Package: deb
Supported remotes: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache: reflink - supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('xfs', '/dev/mapper/vg_data-lv_data')
Repo: dvc, git
Filesystem type (workspace): ('xfs', '/dev/mapper/vg_data-lv_data')

Please provide information about your setup

We've got a single server with one big 2 TB partition.

We are using git + dvc to manage our datasets, training, etc.

For now, we are trying to push 1.5 TB (3.18M files) to a single repo to create our first original dataset.

We are using reflink and xfs for storage.

SSH + local usage for: repo / git / training

But when we push, dvc crashes:

2020-06-29 21:36:34,255 DEBUG: Uploading '.dvc/cache/db/4b681a2fcb2d45436c2bc9420cb116' to '../../../../datasets/upper/idl/db/4b681a2fcb2d45436c2bc9420cb116'
2020-06-29 21:36:34,260 DEBUG: Uploading '.dvc/cache/8f/9081ae7243d5d9b2f63c6d18b27678' to '../../../../datasets/upper/idl/8f/9081ae7243d5d9b2f63c6d18b27678'
2020-06-29 21:36:34,603 DEBUG: Uploading '.dvc/cache/fd/7a74c5c5bc38a23845b0e79671b81f' to '../../../../datasets/upper/idl/fd/7a74c5c5bc38a23845b0e79671b81f'
2020-06-29 21:36:35,078 DEBUG: Uploading '.dvc/cache/9c/b4e48c89859fe5ded008d8f1e0e74a' to '../../../../datasets/upper/idl/9c/b4e48c89859fe5ded008d8f1e0e74a'
2020-06-29 21:36:35,095 DEBUG: Uploading '.dvc/cache/f7/6bd859740457c7b561d5fd5425083d' to '../../../../datasets/upper/idl/f7/6bd859740457c7b561d5fd5425083d'
2020-06-29 21:36:35,226 DEBUG: Uploading '.dvc/cache/42/41c18d87d54a4d6600119da7696261' to '../../../../datasets/upper/idl/42/41c18d87d54a4d6600119da7696261'
2020-06-29 21:36:35,227 DEBUG: Uploading '.dvc/cache/4c/c093f27f05614413e86628c6eec537' to '../../../../datasets/upper/idl/4c/c093f27f05614413e86628c6eec537'
2020-06-29 21:36:35,366 DEBUG: Uploading '.dvc/cache/77/d4e0c59a0be6de0d095593d2866710' to '../../../../datasets/upper/idl/77/d4e0c59a0be6de0d095593d2866710'
2020-06-29 21:36:35,366 DEBUG: Uploading '.dvc/cache/2e/387c8af8732930d3717037c1ce1826' to '../../../../datasets/upper/idl/2e/387c8af8732930d3717037c1ce1826'
2020-06-29 21:36:35,408 DEBUG: Uploading '.dvc/cache/7f/838c5ede1e6e430a786e5e735f01de' to '../../../../datasets/upper/idl/7f/838c5ede1e6e430a786e5e735f01de'
Killed
Jun 29 21:36:43 server kernel: [299632.013388] Out of memory: Kill process 31523 (dvc) score 962 or sacrifice child
Jun 29 21:36:43 server kernel: [299632.013456] Killed process 31523 (dvc) total-vm:12977932kB, anon-rss:7808560kB, file-rss:0kB, shmem-rss:0kB
Jun 29 21:36:43 server kernel: [299632.564083] oom_reaper: reaped process 31523 (dvc), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Labels: p2-medium, performance, research

All 8 comments

Maybe some forced garbage collection is needed in the worker processes?
https://stackoverflow.com/questions/32167386/force-garbage-collection-in-python-to-free-memory
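
For context, the kind of explicit collection being suggested is simple to wire in. This is only a minimal sketch, using a hypothetical `upload_one` callable rather than DVC's actual upload code, to show where a forced collection could go:

```python
import gc
from typing import Callable, Sequence


def upload_in_batches(paths: Sequence[str],
                      upload_one: Callable[[str], None],
                      batch_size: int = 1000) -> None:
    """Upload files in batches, forcing a GC pass between batches.

    `upload_one` is a hypothetical per-file upload callable; the point
    is only to show where an explicit gc.collect() could be placed.
    """
    for start in range(0, len(paths), batch_size):
        for path in paths[start:start + batch_size]:
            upload_one(path)
        # Collect cyclic garbage left over from the finished batch so the
        # resident set does not keep growing across millions of files.
        gc.collect()
```

Note that gc.collect() only reclaims cyclic garbage; it would not help if the memory is held by live data structures, which is what the explanation below suggests.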

Hi,
Does this mean that this issue is now put on hold?
Thanks

@dmidge8 Regretfully, we don't have the capacity to take care of this issue during this sprint. We don't intend to abandon it, but we need to postpone it. So yes, it is on hold as of today.

To clarify what is going on here: the status stage in push works out which hashes are missing by loading the locally present hashes and the remote ones and comparing them in memory, which seems to take a lot of memory (IIRC the user has ~8 GB of RAM). Then we start uploading and use a bit more RAM, which results in the user running out of memory. So far it doesn't look like we are leaking memory, but we could definitely optimize the usage.

The current workaround is to run dvc push again (it will pick up from where it left off), or to dvc push particular dirs/subdirs individually to reduce the RAM footprint.
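
To make that memory profile concrete, here is a rough sketch of the shape of the comparison described above (not DVC's actual code): with ~3.18M cache objects, both hash listings are resident at once before the first upload starts.

```python
def missing_on_remote(local_hashes, remote_hashes):
    """Rough sketch of the pre-push status check, not DVC's real code.

    Both listings are materialised as sets and diffed in memory, so peak
    usage grows with the total number of cache objects, independent of
    how many of them actually need uploading.
    """
    local = set(local_hashes)    # every object hash under .dvc/cache
    remote = set(remote_hashes)  # every object hash already on the remote
    return local - remote        # objects that still need to be uploaded
```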

I fear that the same error occurs during a dvc fetch with a similar dataset, using dvc 1.2.0. :(
However, it is not possible to fetch file by file to bypass that issue...

@dmidge8 How much RAM do you have, and what size of dataset are we talking about?

However, it is not possible to fetch file by file to bypass that issue...

You could specify targets for dvc pull. Even subdirs in your dataset.
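
A hedged sketch of that workaround, driving dvc pull target by target from Python so each invocation only has to reconcile a fraction of the hashes (the subdirectory names below are placeholders, not real paths):

```python
import subprocess

# Placeholder targets; substitute the actual subdirectories of the dataset.
TARGETS = ["data/part-00", "data/part-01", "data/part-02"]

for target in TARGETS:
    # Each invocation only compares the hashes under one target,
    # keeping the in-memory status comparison much smaller.
    subprocess.run(["dvc", "pull", target], check=True)
```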

@efiop Thank you for the info about dvc pull.
The dataset is exactly the same as @lenrys29's. I have about 16 GB of RAM, shared with other running tasks, so effectively closer to 10 GB, plus the same amount of swap.
