Insightface: MXNetError:cudaMalloc failed: out of memory

Created on 19 Jun 2018 · 9 Comments · Source: deepinsight/insightface

I followed your steps, but I ran into this problem. Can anybody suggest a solution?

Traceback (most recent call last):
  File "train_softmax.py", line 485, in <module>
    main()
  File "train_softmax.py", line 482, in main
    train_net(args)
  File "train_softmax.py", line 476, in train_net
    epoch_end_callback = epoch_cb )
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py", line 512, in fit
    self.update()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/module.py", line 651, in update
    self._kvstore, self._exec_group.param_names)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/model.py", line 134, in _update_params_on_kvstore
    kvstore.push(name, grad_list, priority=-index)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/kvstore.py", line 232, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:41:11] src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: out of memory

vanhelsing18

All 9 comments

Perhaps you can reduce the batch size. Decreasing it to 64 may fix the problem.

jackytu256 on 19 Jun 2018

@jackytu256 Thank you. I have tried that, but it does not help.

vanhelsing18 on 19 Jun 2018

May I know how much GPU memory you have, and which of the algorithms you are trying to train?

jackytu256 on 19 Jun 2018

decrease your batch size until it can run successfully.

nttstar on 19 Jun 2018

👍5
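
For anyone who wants to automate the advice above, here is a minimal sketch of "halve the batch size until one step fits". It does not call MXNet at all: `try_step` is a hypothetical callable that you would write yourself to run a single training step, and `oom_error` is whatever exception class your framework raises (the traceback in this thread shows `mxnet.base.MXNetError`). The `fake_step` function only simulates an out-of-memory failure so the example runs anywhere.

```python
def find_fitting_batch_size(try_step, start, oom_error=RuntimeError, minimum=1):
    """Halve the batch size until try_step(batch_size) no longer runs out of memory.

    try_step is a hypothetical callable that runs one training step and
    raises oom_error with "out of memory" in its message on failure.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except oom_error as err:
            # Only retry on out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("no batch size >= %d fits in memory" % minimum)


def fake_step(batch_size):
    # Stand-in for a real training step: pretend anything above 48 overflows.
    if batch_size > 48:
        raise RuntimeError("cudaMalloc failed: out of memory")


print(find_fitting_batch_size(fake_step, start=128))
```

In a real run it may be more reliable to restart the training script with the smaller value than to retry inside the same process, since memory from the failed attempt is not guaranteed to be released.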

@jackytu256

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   45C    P0   130W / 250W |   3709MiB / 12193MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   28C    P0    32W / 250W |     10MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   34C    P0    34W / 250W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    33W / 250W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

vanhelsing18 on 19 Jun 2018
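
Reading the nvidia-smi output above, GPU 0 already has 3709MiB in use, so it has noticeably less headroom than the other three. If the job uses the same per-GPU batch on every visible device (an assumption about how the script is launched), the most constrained GPU sets the limit. A small sketch that computes free memory per GPU from the "used / total" columns; the sample text is copied from the output above and the function name is made up for illustration.

```python
import re

# Memory rows copied from the nvidia-smi output in this thread (used / total, in MiB).
SAMPLE = """\
| N/A 45C P0 130W / 250W | 3709MiB / 12193MiB | 83% Default |
| N/A 28C P0 32W / 250W | 10MiB / 12193MiB | 0% Default |
| N/A 34C P0 34W / 250W | 10MiB / 16276MiB | 0% Default |
| N/A 34C P0 33W / 250W | 10MiB / 16276MiB | 0% Default |
"""


def free_memory_mib(smi_text):
    """Return free memory in MiB for each GPU row, in the order the rows appear."""
    pairs = re.findall(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_text)
    return [int(total) - int(used) for used, total in pairs]


free = free_memory_mib(SAMPLE)
for gpu, mib in enumerate(free):
    print("GPU %d: %d MiB free" % (gpu, mib))
```

If another process is holding memory on one GPU, restricting the job to the idle ones through the standard `CUDA_VISIBLE_DEVICES` environment variable (for example `CUDA_VISIBLE_DEVICES=1,2,3`) may be worth trying before shrinking the batch further.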

decrease your batch size until it can run successfully.

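@nttstar
Hi, nttstar.
After reducing the batch size, it does work now.
But I still have a question for you:
Why does reducing the batch size make it work? Is it because, with the default batch size of 128, loading the training set into GPU memory takes up too much space? Or is it a GPU resource scheduling problem?
Thanks!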

clhne on 9 Jan 2019

👎11
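
Nobody in the thread answers the question above, so here is a rough sketch of the usual explanation, offered as a simplified model and not as a measurement of insightface. During training the whole dataset is normally not copied to the GPU; only the current batch is. Memory use is roughly a fixed part (weights, gradients, optimizer state) plus a part that grows with the batch size (the activations kept for the backward pass). The two constants below are hypothetical numbers picked only to make the arithmetic visible.

```python
def estimated_memory_mib(batch_size, fixed_mib, per_sample_mib):
    """Simplified model: fixed cost plus a cost proportional to batch size."""
    return fixed_mib + batch_size * per_sample_mib


def largest_batch_that_fits(free_mib, fixed_mib, per_sample_mib):
    """Largest batch size whose estimated memory stays within free_mib."""
    if free_mib <= fixed_mib:
        return 0  # even a single sample cannot fit
    return int((free_mib - fixed_mib) // per_sample_mib)


FIXED_MIB = 2000       # hypothetical: weights, gradients, optimizer state
PER_SAMPLE_MIB = 90    # hypothetical: activations stored per training sample

for batch_size in (128, 64, 32):
    need = estimated_memory_mib(batch_size, FIXED_MIB, PER_SAMPLE_MIB)
    print("batch %3d -> about %d MiB" % (batch_size, need))

print(largest_batch_that_fits(12183, FIXED_MIB, PER_SAMPLE_MIB))
```

Under this model, halving the batch roughly halves the variable part, which is why it often helps. It also suggests why a very small batch can still fail: if the fixed part alone exceeds the free memory, or another process already occupies the GPU, no batch size will fit.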

I saw elsewhere that someone tried to use memonger to solve the memory issue. That might be an option, but I haven't tried it. Just FYI.

skyuuka on 20 Nov 2019

I had the same issue. Decreasing the batch size fixed the problem

hyderit on 23 Apr 2020

I still get the same error, even after decreasing the batch size from 32 to 2. I don't think this is the solution to the problem.

aliyevorkhan on 6 Oct 2020
