With CUDA 9.0 on GTX 980 Ti with 6GB memory:
File "cupy/cuda/memory.pyx", line 828, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 380551680 bytes (total 6127429120 bytes)
Obviously, the recipe was designed for cards with more memory. Any recommendations for running on a lower-end card? I adjusted the batch size to 20 and it runs. Does the learning rate need to be adjusted to compensate? I'm not too optimistic, but I will know in 3 days and 18 hours...
You don't have to adjust the learning rate. You can just adjust the batch size.
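For a sense of scale, the byte counts in the error message can be converted to more readable units. The helpers below are just a sketch (the names `to_mib` and `to_gib` are made up for illustration). I am reading the "total" figure as the amount the CuPy memory pool had accounted for at the time of the failure, which is an assumption I have not checked against the CuPy source for this version.

```python
def to_mib(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return num_bytes / float(2 ** 20)


def to_gib(num_bytes):
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / float(2 ** 30)


# Figures copied from the OutOfMemoryError message above.
requested = 380551680   # size of the allocation that failed
total = 6127429120      # "total" reported by CuPy (interpretation assumed)

print("requested: %.2f MiB" % to_mib(requested))
print("total:     %.2f GiB" % to_gib(total))
```

If that reading is right, the failing request was a few hundred MiB on top of a pool that was already close to the card's 6GB, which fits with a smaller batch size getting past this point.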
Eventually, it ran out of memory again. I attempted to resume from the latest checkpoint:
./run.sh --ngpu 1 --stage 4 --resume exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_adim320_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs20_mli800_mlo150/results/snapshot_iter_14760
But it crashes right away with a seemingly unrelated problem.
# asr_train.py --ngpu 1 --backend chainer --outdir exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_adim320_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs20_mli800_mlo150/results --debugmode 1 --dict data/lang_1char/train_trim_units.txt --debugdir exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_adim320_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs20_mli800_mlo150 --minibatches 0 --verbose 0 --resume exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_adim320_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs20_mli800_mlo150/results/snapshot_iter_14760 --train-json dump/train_trim/deltafalse/data.json --valid-json dump/dev_trim/deltafalse/data.json --etype vggblstmp --elayers 6 --eunits 320 --eprojs 320 --subsample 1_2_2_1_1 --dlayers 1 --dunits 300 --atype location --adim 320 --aconv-chans 10 --aconv-filts 100 --mtlalpha 0.5 --batch-size 20 --maxlen-in 800 --maxlen-out 150 --opt adadelta --epochs 15
# Started at Tue Jun 19 07:09:08 PDT 2018
#
2018-06-19 07:09:08,350 (asr_train:146) WARNING: Skip DEBUG/INFO messages
2018-06-19 07:09:08,384 (asr_train:186) WARNING: CUDA_VISIBLE_DEVICES is not set.
2018-06-19 07:09:08,638 (e2e_asr_attctc:123) WARNING: Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.
Exception in main training loop: 'module' object has no attribute 'pyplot'
Traceback (most recent call last):
File "/home/mdeisher/espnet-20180613/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
entry.extension(self)
File "/home/mdeisher/espnet-20180613/src/asr/asr_utils.py", line 176, in __call__
self._plot_and_save_attention(att_w, filename.format(trainer))
File "/home/mdeisher/espnet-20180613/src/asr/asr_utils.py", line 186, in _plot_and_save_attention
matplotlib.pyplot.imshow(att_w, aspect="auto")
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/home/mdeisher/espnet-20180613/egs/tedlium/asr1/../../../src/bin/asr_train.py", line 224, in <module>
main()
File "/home/mdeisher/espnet-20180613/egs/tedlium/asr1/../../../src/bin/asr_train.py", line 215, in main
train(args)
File "/home/mdeisher/espnet-20180613/src/asr/asr_chainer.py", line 476, in train
trainer.run()
File "/home/mdeisher/espnet-20180613/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/home/mdeisher/espnet-20180613/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
entry.extension(self)
File "/home/mdeisher/espnet-20180613/src/asr/asr_utils.py", line 176, in __call__
self._plot_and_save_attention(att_w, filename.format(trainer))
File "/home/mdeisher/espnet-20180613/src/asr/asr_utils.py", line 186, in _plot_and_save_attention
matplotlib.pyplot.imshow(att_w, aspect="auto")
AttributeError: 'module' object has no attribute 'pyplot'
# Accounting: time=70 threads=1
# Ended (code 1) at Tue Jun 19 07:10:18 PDT 2018, elapsed time 70 seconds
I'm not sure why matplotlib is not working, but is matplotlib really required for resume?
This was solved recently in #240.
Yes, that seems to have corrected it for me too. Thank you for pointing it out.
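For anyone who hits the same AttributeError: I have not checked what #240 actually changed, so take this only as the general Python mechanism that produces this kind of message. Importing a package does not automatically import its submodules, so `matplotlib.pyplot` is only available as an attribute once something has executed `import matplotlib.pyplot`. The sketch below reproduces that behaviour with a throwaway package (`demo_pkg` and `sub` are hypothetical names created just for the demonstration) so it runs without matplotlib installed.

```python
import importlib
import os
import sys
import tempfile


def make_demo_package(root):
    """Create a throwaway package `demo_pkg` with one submodule `sub`."""
    pkg_dir = os.path.join(root, "demo_pkg")
    os.makedirs(pkg_dir)
    # Empty __init__.py: the package does not import its submodule itself.
    open(os.path.join(pkg_dir, "__init__.py"), "w").close()
    with open(os.path.join(pkg_dir, "sub.py"), "w") as f:
        f.write("VALUE = 42\n")


def submodule_visibility():
    """Return (visible_before, visible_after) for the attribute demo_pkg.sub."""
    root = tempfile.mkdtemp()
    make_demo_package(root)
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        import demo_pkg
        # Only the package was imported, so the submodule attribute is missing.
        before = hasattr(demo_pkg, "sub")
        import demo_pkg.sub
        # Importing the submodule explicitly binds it on the package.
        after = hasattr(demo_pkg, "sub")
    finally:
        # Clean up so repeated calls start from the same state.
        sys.path.remove(root)
        sys.modules.pop("demo_pkg.sub", None)
        sys.modules.pop("demo_pkg", None)
    return before, after


if __name__ == "__main__":
    print(submodule_visibility())
```

By analogy, code that calls `matplotlib.pyplot.imshow(...)` needs an explicit `import matplotlib.pyplot` somewhere before that call. Whether the attribute happens to exist otherwise depends on what other modules were imported earlier in the process.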