Espnet: Memory limitations

Created on 18 Jun 2018 · 4 Comments · Source: espnet/espnet

With CUDA 9.0 on a GTX 980 Ti with 6 GB of memory:

File "cupy/cuda/memory.pyx", line 828, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 380551680 bytes (total 6127429120 bytes)

Obviously, the recipe was designed for cards with more memory. Any recommendations for running it on a lower-end card? I reduced the batch size to 20 and it runs. Does the learning rate need to be adjusted to compensate? I'm not too optimistic, but I will know in 3 days and 18 hours...

mdeisher
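
To put the numbers in the error message above into more readable units, here is a minimal sketch that parses the message and converts the byte counts. The function names `parse_oom` and `to_mib` are hypothetical helpers written for this note, not part of ESPnet or CuPy, and the sketch makes no claim about what CuPy counts in the "total" figure.

```python
import re

# Matches the message format seen in the traceback above.
OOM_PATTERN = re.compile(
    r"out of memory to allocate (\d+) bytes \(total (\d+) bytes\)"
)


def parse_oom(message):
    """Return (requested_bytes, total_bytes), or None if the text does not match."""
    match = OOM_PATTERN.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def to_mib(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return num_bytes / 2 ** 20


message = (
    "cupy.cuda.memory.OutOfMemoryError: out of memory to allocate "
    "380551680 bytes (total 6127429120 bytes)"
)
requested, total = parse_oom(message)
print("requested: %.1f MiB" % to_mib(requested))
print("total:     %.2f GiB" % (total / 2 ** 30))
```

For the message in this report, that works out to a request of about 363 MiB against a reported total of roughly 5.7 GiB.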

All 4 comments

You don't have to adjust the learning rate. You can just adjust the batch size.

sw005320 on 18 Jun 2018

👍2

Eventually, it ran out of memory again. I attempted to resume from the latest checkpoint:

./run.sh --ngpu 1 --stage 4 --resume exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_adim320_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs20_mli800_mlo150/results/snapshot_iter_14760

But it crashes right away with a seemingly unrelated problem.

# asr_train.py --ngpu 1 --backend chainer --outdir exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_adim320_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs20_mli800_mlo150/results --debugmode 1 --dict data/lang_1char/train_trim_units.txt --debugdir exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_adim320_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs20_mli800_mlo150 --minibatches 0 --verbose 0 --resume exp/train_trim_vggblstmp_e6_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_adim320_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs20_mli800_mlo150/results/snapshot_iter_14760 --train-json dump/train_trim/deltafalse/data.json --valid-json dump/dev_trim/deltafalse/data.json --etype vggblstmp --elayers 6 --eunits 320 --eprojs 320 --subsample 1_2_2_1_1 --dlayers 1 --dunits 300 --atype location --adim 320 --aconv-chans 10 --aconv-filts 100 --mtlalpha 0.5 --batch-size 20 --maxlen-in 800 --maxlen-out 150 --opt adadelta --epochs 15 
# Started at Tue Jun 19 07:09:08 PDT 2018
#
2018-06-19 07:09:08,350 (asr_train:146) WARNING: Skip DEBUG/INFO messages
2018-06-19 07:09:08,384 (asr_train:186) WARNING: CUDA_VISIBLE_DEVICES is not set.
2018-06-19 07:09:08,638 (e2e_asr_attctc:123) WARNING: Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.
Exception in main training loop: 'module' object has no attribute 'pyplot'
Traceback (most recent call last):
  File "/home/mdeisher/espnet-20180613/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
    entry.extension(self)
  File "/home/mdeisher/espnet-20180613/src/asr/asr_utils.py", line 176, in __call__
    self._plot_and_save_attention(att_w, filename.format(trainer))
  File "/home/mdeisher/espnet-20180613/src/asr/asr_utils.py", line 186, in _plot_and_save_attention
    matplotlib.pyplot.imshow(att_w, aspect="auto")
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/home/mdeisher/espnet-20180613/egs/tedlium/asr1/../../../src/bin/asr_train.py", line 224, in <module>
    main()
  File "/home/mdeisher/espnet-20180613/egs/tedlium/asr1/../../../src/bin/asr_train.py", line 215, in main
    train(args)
  File "/home/mdeisher/espnet-20180613/src/asr/asr_chainer.py", line 476, in train
    trainer.run()
  File "/home/mdeisher/espnet-20180613/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/home/mdeisher/espnet-20180613/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
    entry.extension(self)
  File "/home/mdeisher/espnet-20180613/src/asr/asr_utils.py", line 176, in __call__
    self._plot_and_save_attention(att_w, filename.format(trainer))
  File "/home/mdeisher/espnet-20180613/src/asr/asr_utils.py", line 186, in _plot_and_save_attention
    matplotlib.pyplot.imshow(att_w, aspect="auto")
AttributeError: 'module' object has no attribute 'pyplot'
# Accounting: time=70 threads=1
# Ended (code 1) at Tue Jun 19 07:10:18 PDT 2018, elapsed time 70 seconds

I'm not sure why matplotlib is not working, but is matplotlib really required for resuming?

mdeisher on 20 Jun 2018
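
A note for readers who hit the same traceback: in Python, a plain `import matplotlib` does not load the `pyplot` submodule, so `matplotlib.pyplot` only resolves if something else has already imported it. The sketch below shows one way to make the import explicit. It is an illustration only: the helper names are hypothetical, the choice of the `Agg` backend is an assumption for headless machines, and it is not necessarily what the upstream fix does.

```python
import importlib


def load_pyplot():
    """Return matplotlib.pyplot, or None if matplotlib is not installed."""
    try:
        import matplotlib
    except ImportError:
        return None
    # Assumption: a non-interactive backend, so no display is needed.
    matplotlib.use("Agg")
    # Import the submodule explicitly instead of relying on attribute access.
    return importlib.import_module("matplotlib.pyplot")


def plot_and_save_attention(att_w, filename):
    """Save a 2-D attention matrix as an image; return False if plotting is unavailable."""
    plt = load_pyplot()
    if plt is None:
        return False
    plt.imshow(att_w, aspect="auto")
    plt.savefig(filename)
    plt.close()
    return True
```

Calling `plot_and_save_attention([[0.1, 0.9], [0.8, 0.2]], "att.png")` writes the image when matplotlib is available and returns `False` otherwise, so a missing plotting library does not have to abort the run.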

This was fixed recently in #240.

sw005320 on 20 Jun 2018

👍1

Yes, that seems to have corrected it for me too. Thank you for pointing it out.

mdeisher on 20 Jun 2018
