dvc push: OOM when uploading 1.5 TB (3.5M files) dataset to a local remote

Created on 30 Jun 2020 · 8 comments · Source: iterative/dvc

Bug Report

Output of dvc version:

DVC version: 1.0.1
Python version: 3.7.1
Platform: Linux-4.19.0-9-amd64-x86_64-with-debian-10.4
Binary: True
Package: deb
Supported remotes: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache: reflink - supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('xfs', '/dev/mapper/vg_data-lv_data')
Repo: dvc, git
Filesystem type (workspace): ('xfs', '/dev/mapper/vg_data-lv_data')

Please provide information about your setup

We've got a single server with one big 2 TB partition.

We are using git + dvc to manage our datasets, training, etc.

For now, we are trying to push 1.5 TB (3.18M files) to a single repo to create our first original dataset.

We are using reflink and xfs for storage.

SSH + local usage for: repo / git / training

But when we push, dvc crashes:

2020-06-29 21:36:34,255 DEBUG: Uploading '.dvc/cache/db/4b681a2fcb2d45436c2bc9420cb116' to '../../../../datasets/upper/idl/db/4b681a2fcb2d45436c2bc9420cb116'
2020-06-29 21:36:34,260 DEBUG: Uploading '.dvc/cache/8f/9081ae7243d5d9b2f63c6d18b27678' to '../../../../datasets/upper/idl/8f/9081ae7243d5d9b2f63c6d18b27678'
2020-06-29 21:36:34,603 DEBUG: Uploading '.dvc/cache/fd/7a74c5c5bc38a23845b0e79671b81f' to '../../../../datasets/upper/idl/fd/7a74c5c5bc38a23845b0e79671b81f'
2020-06-29 21:36:35,078 DEBUG: Uploading '.dvc/cache/9c/b4e48c89859fe5ded008d8f1e0e74a' to '../../../../datasets/upper/idl/9c/b4e48c89859fe5ded008d8f1e0e74a'
2020-06-29 21:36:35,095 DEBUG: Uploading '.dvc/cache/f7/6bd859740457c7b561d5fd5425083d' to '../../../../datasets/upper/idl/f7/6bd859740457c7b561d5fd5425083d'
2020-06-29 21:36:35,226 DEBUG: Uploading '.dvc/cache/42/41c18d87d54a4d6600119da7696261' to '../../../../datasets/upper/idl/42/41c18d87d54a4d6600119da7696261'
2020-06-29 21:36:35,227 DEBUG: Uploading '.dvc/cache/4c/c093f27f05614413e86628c6eec537' to '../../../../datasets/upper/idl/4c/c093f27f05614413e86628c6eec537'
2020-06-29 21:36:35,366 DEBUG: Uploading '.dvc/cache/77/d4e0c59a0be6de0d095593d2866710' to '../../../../datasets/upper/idl/77/d4e0c59a0be6de0d095593d2866710'
2020-06-29 21:36:35,366 DEBUG: Uploading '.dvc/cache/2e/387c8af8732930d3717037c1ce1826' to '../../../../datasets/upper/idl/2e/387c8af8732930d3717037c1ce1826'
2020-06-29 21:36:35,408 DEBUG: Uploading '.dvc/cache/7f/838c5ede1e6e430a786e5e735f01de' to '../../../../datasets/upper/idl/7f/838c5ede1e6e430a786e5e735f01de'
Killed
Jun 29 21:36:43 server kernel: [299632.013388] Out of memory: Kill process 31523 (dvc) score 962 or sacrifice child
Jun 29 21:36:43 server kernel: [299632.013456] Killed process 31523 (dvc) total-vm:12977932kB, anon-rss:7808560kB, file-rss:0kB, shmem-rss:0kB
Jun 29 21:36:43 server kernel: [299632.564083] oom_reaper: reaped process 31523 (dvc), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Labels: p2-medium, performance, research

All 8 comments

Maybe some forced garbage collection is needed in the worker processes?
https://stackoverflow.com/questions/32167386/force-garbage-collection-in-python-to-free-memory
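
For context, the kind of explicit collection being suggested is simple to wire in. This is only a minimal sketch, using a hypothetical `upload_one` callable rather than DVC's actual upload code, to show where a forced collection could go:

```python
import gc
from typing import Callable, Sequence


def upload_in_batches(paths: Sequence[str],
                      upload_one: Callable[[str], None],
                      batch_size: int = 1000) -> None:
    """Upload files in batches, forcing a GC pass between batches.

    `upload_one` is a hypothetical per-file upload callable; the point
    is only to show where an explicit gc.collect() could be placed.
    """
    for start in range(0, len(paths), batch_size):
        for path in paths[start:start + batch_size]:
            upload_one(path)
        # Collect cyclic garbage left over from the finished batch so the
        # resident set does not keep growing across millions of files.
        gc.collect()
```

Note that gc.collect() only reclaims cyclic garbage; it would not help if the memory is held by live data structures, which is what the explanation below suggests.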

Hi,
Does this mean that this issue is now put on hold?
Thanks

@dmidge8 Regretfully, we don't have the capacity to take care of this issue during this sprint. We don't intend to abandon it, but we need to postpone it. So yes, it is on hold as of today.

To clarify what is going on here: the status stage in push works out which hashes are missing by loading the locally present hashes and the remote ones and comparing them in memory, which seems to take a lot of memory (IIRC the user has ~8 GB of RAM). Then we start uploading and use a bit more RAM, which results in the user running out of memory. So far it doesn't look like we are leaking memory, but we could definitely optimize the usage.

The current workaround is to run dvc push again (it will pick up from where it left off), or to dvc push particular dirs/subdirs individually to reduce the RAM footprint.
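
To make that memory profile concrete, here is a rough sketch of the shape of the comparison described above (not DVC's actual code): with ~3.18M cache objects, both hash listings are resident at once before the first upload starts.

```python
def missing_on_remote(local_hashes, remote_hashes):
    """Rough sketch of the pre-push status check, not DVC's real code.

    Both listings are materialised as sets and diffed in memory, so peak
    usage grows with the total number of cache objects, independent of
    how many of them actually need uploading.
    """
    local = set(local_hashes)    # every object hash under .dvc/cache
    remote = set(remote_hashes)  # every object hash already on the remote
    return local - remote        # objects that still need to be uploaded
```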

I fear that the same error occurs during a dvc fetch with a similar dataset, using dvc 1.2.0. :(
However, it is not possible to fetch file by file to bypass that issue...

@dmidge8 How much RAM do you have, and what size of dataset are we talking about?

However, it is not possible to fetch file by file to bypass that issue...

You could specify targets for dvc pull. Even subdirs in your dataset.
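
A hedged sketch of that workaround, driving dvc pull target by target from Python so each invocation only has to reconcile a fraction of the hashes (the subdirectory names below are placeholders, not real paths):

```python
import subprocess

# Placeholder targets; substitute the actual subdirectories of the dataset.
TARGETS = ["data/part-00", "data/part-01", "data/part-02"]

for target in TARGETS:
    # Each invocation only compares the hashes under one target,
    # keeping the in-memory status comparison much smaller.
    subprocess.run(["dvc", "pull", target], check=True)
```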

@efiop Thank you for the info about dvc pull.
The dataset is exactly the same as @lenrys29's. I have about 16 GB of RAM, shared with other running tasks, so effectively closer to 10 GB, plus the same amount of swap.
