Detectron: Process gets killed with large dataset

Created on 12 Mar 2018 · 9 Comments · Source: facebookresearch/Detectron

Hello,
I am trying to train a network with a large dataset (1M images), and the Python process gets killed while loading the dataset. Both RAM and swap go to 100%, and then it just outputs "Killed".
If I set USE_FLIPPED = False it works, but I would like to use the flipped images too.

I am using Python 2.7.14 (Anaconda 2) and Ubuntu 14.

What is the best way to approach this? Are there any changes I could make to loader.py?

Thanks


dmasmont

Most helpful comment

Hello @ir413. I believe I have a memory leak while loading the JsonDataset. I think that in roidb.py a large amount of memory is allocated and never released. I will check whether this also happens with the COCO dataset JSON files, and I will update you if I find where the problem is located.

dmasmont on 16 Mar 2018

👍3

All 9 comments

Hi @dmasmont, I assume you're training with pre-computed proposals? Proposals for train datasets are loaded in memory and stored as part of the roidb, which can cause memory issues for very large datasets. The easiest way to overcome this would be to switch to end-to-end training.

ir413 on 12 Mar 2018

Hi @ir413, I am using end-to-end training; I don't use any proposal file (I only set TRAIN.DATASETS and leave TRAIN.PROPOSAL_FILES at its default value).

P.S.: I'm trying to train Faster R-CNN-FPN-50.

dmasmont on 12 Mar 2018

👍1

Hi @ir413, I have realized it also happens when training multiple networks on different GPUs. Memory and swap go to 100% and the process gets killed. Could it be caused by a memory leak?

edit: The images folder is 29GB and the JSON file is 9.4GB. After starting the training process on 2 GPUs, this is the state of the memory:
                     total    used    free  shared buffers  cached
Mem:                   62G     55G    6.9G    115M    273M    923M
-/+ buffers/cache:             54G    8.1G
Swap:                 9.3G    238M    9.1G

dmasmont on 13 Mar 2018

Hi @dmasmont, thanks for the update. We were able to successfully perform end-to-end training using Detectron on datasets of similar size (with a memory budget of at most 220 GB and no swap). Unfortunately, I'm unable to check what the actual memory usage was.

Have you tried using one of the memory debugging/profiling tools to diagnose the problem? If you could reproduce anything suspicious using COCO, that would be great as I could then help out with debugging from our side too.

ir413 on 13 Mar 2018
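
One lightweight way to follow the profiling suggestion above is to log the process's peak resident memory around the dataset-loading step. The snippet below is a minimal sketch that uses only the standard `resource` module (Unix only, available in both Python 2.7 and 3). `load_fake_roidb` is a hypothetical stand-in for the real loading call, not a Detectron function, and the entry layout is invented for illustration.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024.0 * 1024.0)
    return peak / 1024.0


def measure_peak_growth(step, *args, **kwargs):
    """Run step(*args, **kwargs); return (result, growth of peak RSS in MiB)."""
    before = peak_rss_mb()
    result = step(*args, **kwargs)
    return result, peak_rss_mb() - before


def load_fake_roidb(num_entries):
    # Hypothetical stand-in for the real dataset-loading call.
    return [{"id": i, "boxes": [0.0] * 64} for i in range(num_entries)]


if __name__ == "__main__":
    roidb, growth = measure_peak_growth(load_fake_roidb, 200000)
    print("entries: %d, peak RSS grew by %.1f MiB" % (len(roidb), growth))
```

Wrapping each stage of loading separately (parsing the JSON, building the roidb, adding flipped entries) in `measure_peak_growth` should show which stage accounts for most of the growth. Because the value is a peak, it never decreases, so it shows where memory was claimed but not whether it was later released.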


@ir413 Hi, I met the same problem as @dmasmont.

I am trying to train Faster R-CNN on COCO with the config e2e_faster_rcnn_X-101-64x4d-FPN_1x.yaml. NUM_GPUS is set to 4, since I have access to only 4 GPUs. The training process was killed after 89980 iterations and then, after being resumed, again after 176260 iterations, because it ran out of RAM and swap space.

BTW, I am using: Ubuntu 14.04, 64G memory, 64G swap, 12G Titan GPU.

II-Matto on 22 Mar 2018

After updating the caffe2 and detectron versions and regenerating the JSON files, the initial memory leak when loading the datasets is gone.

Now I'm seeing a small memory increase (around 5-10% every 24h on a machine with 126G of RAM). This will probably kill my process after several days of training. Is this the same problem you are facing, @II-Matto?

dmasmont on 18 May 2018
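
For slow growth like this, one possible check is to compare counts of live Python objects between two points in training and see which types keep accumulating. The snippet below is a minimal sketch using only the standard `gc` module. The `leaked` list is a hypothetical leak created for the demonstration. Note the assumption that the leak lives in Python objects: memory held by native code (for example, inside caffe2 operators) would not show up here.

```python
import gc
from collections import Counter


def count_objects_by_type():
    """Count live objects tracked by the garbage collector, keyed by type name."""
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def top_growth(before, after, limit=5):
    """Return the types whose live-object count grew most between snapshots."""
    growth = {name: after[name] - before.get(name, 0) for name in after}
    grown = [(name, delta) for name, delta in growth.items() if delta > 0]
    return sorted(grown, key=lambda item: item[1], reverse=True)[:limit]


if __name__ == "__main__":
    leaked = []  # Hypothetical leak: references that are never released.
    before = count_objects_by_type()
    for _ in range(1000):
        leaked.append({"blob": [0] * 10})
    after = count_objects_by_type()
    for name, delta in top_growth(before, after):
        print("%-20s +%d" % (name, delta))
```

Taking a snapshot every few thousand iterations and diffing consecutive snapshots should make a steadily growing type stand out. If the counts stay flat while RSS keeps rising, the growth is more likely outside the Python heap.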

@ir413: Hi, I'm training a model using COCO and I face a similar issue. After some iterations, the process gets killed in my Google Colab. This is the message:
2018-05-23 20:24:18.458363: W tensorflow/core/framework/allocator.cc:101] Allocation of 27648000 exceeds 10% of system memory.
Killed.

Any help on this asap is appreciated.
Thanks