I followed your steps, but I ran into this problem. Can anybody suggest a solution?
Traceback (most recent call last):
  File "train_softmax.py", line 485, in <module>
    main()
  File "train_softmax.py", line 482, in main
    train_net(args)
  File "train_softmax.py", line 476, in train_net
    epoch_end_callback = epoch_cb )
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py", line 512, in fit
    self.update()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/module.py", line 651, in update
    self._kvstore, self._exec_group.param_names)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/model.py", line 134, in _update_params_on_kvstore
    kvstore.push(name, grad_list, priority=-index)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/kvstore.py", line 232, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:41:11] src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: out of memory
Perhaps you can reduce the batch size. Decreasing it to 64 may fix the problem.
@jackytu256 Thank you. I have tried that, but it does not help.
May I know how much GPU memory you have, and which of the algorithms you are trying to train?
decrease your batch size until it can run successfully.
@jackytu256
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   45C    P0   130W / 250W |   3709MiB / 12193MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   28C    P0    32W / 250W |     10MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   34C    P0    34W / 250W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    33W / 250W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
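Reading the nvidia-smi output above, GPU 0 is already busy (3709MiB used, 83% utilization) while GPUs 1 to 3 are idle. Here is a small sketch that works out the free memory per GPU from those figures. The numbers are copied by hand from the table, and the function name is made up for illustration.

```python
# (gpu index, used MiB, total MiB), copied by hand from the nvidia-smi output above
GPUS = [
    (0, 3709, 12193),
    (1, 10, 12193),
    (2, 10, 16276),
    (3, 10, 16276),
]


def gpu_with_most_free_memory(gpus):
    """Return (index, free MiB) of the GPU with the most free memory.

    Ties go to the lowest index because only a strictly larger value replaces the best.
    """
    best_index, best_free = None, -1
    for index, used, total in gpus:
        free = total - used
        if free > best_free:
            best_index, best_free = index, free
    return best_index, best_free


index, free = gpu_with_most_free_memory(GPUS)
print("GPU %d has the most free memory: %d MiB" % (index, free))
```

CUDA_VISIBLE_DEVICES is the standard CUDA environment variable for restricting which GPUs a process can see, so launching with it set to an idle GPU may be worth a try. I have not checked how train_softmax.py itself selects devices.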
decrease your batch size until it can run successfully.
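A minimal sketch of "decrease until it runs", assuming you can wrap one short training attempt in a function. `try_train` and `find_working_batch_size` are hypothetical names, not part of train_softmax.py or MXNet, and the error is matched on the "out of memory" text seen in the traceback above.

```python
def find_working_batch_size(try_train, start=128, minimum=1):
    """Halve the batch size until try_train(batch_size) no longer fails with out of memory.

    try_train is a hypothetical callable that runs a short training attempt
    and raises an exception whose message contains "out of memory" on failure.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            try_train(batch_size)
            return batch_size
        except Exception as err:
            # Only swallow out-of-memory failures; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("out of memory even at batch size %d" % minimum)


def fake_train(batch_size):
    """Stand-in for a real training attempt: pretend anything above 32 does not fit."""
    if batch_size > 32:
        raise RuntimeError("cudaMalloc failed: out of memory")


print(find_working_batch_size(fake_train, start=128))
```

If the loop reaches the minimum and still fails, the batch size is probably not the limiting factor.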
@nttstar
Hi, nttstar
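After reducing the batch size, it does work.
But I have another question for you:
Why does reducing the batch size make it work? Is it because, with the default batch size of 128, the training data loaded into GPU memory is too large? Or is it a GPU resource scheduling issue?
Thanks!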
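On the question above: in general, only the current batch is held on the GPU, not the whole training set. The activations kept for backpropagation grow roughly linearly with batch size, while the weights, gradients, and optimizer state do not depend on it. A rough sketch of that relationship follows; the two constants are made-up numbers for illustration only, not measurements of this model.

```python
def estimate_training_memory_mib(batch_size, per_sample_mib, fixed_mib):
    """Fixed cost (weights, gradients, optimizer state) plus per-sample activation cost."""
    return fixed_mib + batch_size * per_sample_mib


# Hypothetical numbers, for illustration only.
FIXED_MIB = 2000
PER_SAMPLE_MIB = 90

for batch_size in (128, 64, 32):
    estimate = estimate_training_memory_mib(batch_size, PER_SAMPLE_MIB, FIXED_MIB)
    print("batch size %d: about %d MiB" % (batch_size, estimate))
```

Under this simple model, halving the batch size removes half of the activation cost but none of the fixed cost, so if the fixed part alone exceeds the free memory, no batch size will fit.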
I saw elsewhere that someone tried to use memonger to solve the memory issue. That might be an option, but I haven't tried it. Just FYI.
I had the same issue. Decreasing the batch size fixed the problem.
I still get the same error, even after decreasing the batch size from 32 to 2. I don't think this solves the problem.