RNN op gradient computation on CPU is broken for source-built mxnet.
I was running a language model training script on my EC2 instance and tested it with the latest source-built mxnet. During training, I ran into the following log:
ubuntu@ip-172-31-23-26:~/bug_fix/gluon-nlp/scripts/language_model$ python word_language_model.py --log-interval=1
/home/ubuntu/clean_mxnet/incubator-mxnet/python/mxnet/optimizer/optimizer.py:167: UserWarning: WARNING: New optimizer gluonnlp.optimizer.lamb.LAMB is overriding existing optimizer mxnet.optimizer.optimizer.LAMB
Optimizer.opt_registry[name].__name__))
Namespace(alpha=2, batch_size=80, beta=1, bptt=70, clip=0.25, dropout=0.4, dropout_e=0.1, dropout_h=0.2, dropout_i=0.65, emsize=400, epochs=750, eval_only=False, gpu=None, log_interval=1, lr=30, lr_update_factor=0.1, lr_update_interval=30, model='lstm', nhid=1150, nlayers=3, ntasgd=False, optimizer='sgd', save='model.params', test_mode=False, tied=False, wd=1.2e-06, weight_dropout=0.5)
Use AWDRNN
AWDRNN(
(embedding): HybridSequential(
(0): Embedding(33278 -> 400, float32)
(1): Dropout(p = 0.65, axes=(0,))
)
(encoder): HybridSequential(
(0): LSTM(400 -> 1150, TNC)
(1): LSTM(1150 -> 1150, TNC)
(2): LSTM(1150 -> 1150, TNC)
)
(decoder): HybridSequential(
(0): Dense(None -> 33278, linear)
)
)
word_language_model.py:382: UserWarning: nan or inf is detected. Clipping results will be undefined.
gluon.utils.clip_global_norm(grads, args.clip)
[Epoch 0 Batch 1/372] current loss 20.83, ppl 1107330721.37, throughput 0.71 samples/s, lr 29.14
[Epoch 0 Batch 2/372] current loss 10.41, ppl 33276.90, throughput 1.39 samples/s, lr 29.57
[Epoch 0 Batch 3/372] current loss 10.41, ppl 33276.58, throughput 1.42 samples/s, lr 28.71
[Epoch 0 Batch 4/372] current loss 10.41, ppl 33276.17, throughput 1.33 samples/s, lr 30.43
[Epoch 0 Batch 5/372] current loss 10.41, ppl 33276.86, throughput 1.40 samples/s, lr 29.14
The loss value no longer changes. If I instead use the mxnet build installed via pip install https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-15/dist/mxnet_cu100-1.6.0b20191215-py2.py3-none-manylinux1_x86_64.whl, the log is normal: there is no gradient inf warning and the loss keeps changing:
python word_language_model.py --log-interval=1
/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/optimizer/optimizer.py:167: UserWarning: WARNING: New optimizer gluonnlp.optimizer.lamb.LAMB is overriding existing optimizer mxnet.optimizer.optimizer.LAMB
Optimizer.opt_registry[name].__name__))
Namespace(alpha=2, batch_size=80, beta=1, bptt=70, clip=0.25, dropout=0.4, dropout_e=0.1, dropout_h=0.2, dropout_i=0.65, emsize=400, epochs=750, eval_only=False, gpu=None, log_interval=1, lr=30, lr_update_factor=0.1, lr_update_interval=30, model='lstm', nhid=1150, nlayers=3, ntasgd=False, optimizer='sgd', save='model.params', test_mode=False, tied=False, wd=1.2e-06, weight_dropout=0.5)
Use AWDRNN
AWDRNN(
(embedding): HybridSequential(
(0): Embedding(33278 -> 400, float32)
(1): Dropout(p = 0.65, axes=(0,))
)
(encoder): HybridSequential(
(0): LSTM(400 -> 1150, TNC)
(1): LSTM(1150 -> 1150, TNC)
(2): LSTM(1150 -> 1150, TNC)
)
(decoder): HybridSequential(
(0): Dense(None -> 33278, linear)
)
)
[Epoch 0 Batch 1/372] current loss 20.50, ppl 796093229.98, throughput 2.13 samples/s, lr 29.14
[Epoch 0 Batch 2/372] current loss 9.57, ppl 14283.55, throughput 4.20 samples/s, lr 29.57
[Epoch 0 Batch 3/372] current loss 17.85, ppl 56261658.19, throughput 4.32 samples/s, lr 28.71
[Epoch 0 Batch 4/372] current loss 9.50, ppl 13370.27, throughput 4.08 samples/s, lr 30.43
[Epoch 0 Batch 5/372] current loss 14.46, ppl 1903888.17, throughput 4.26 samples/s, lr 29.14
The training script can be found at https://github.com/dmlc/gluon-nlp/blob/master/scripts/language_model/word_language_model.py. To reproduce the log message, I ran the script with the following command:
python word_language_model.py --log-interval=1
The problem occurs when computing gradients (https://github.com/dmlc/gluon-nlp/blob/master/scripts/language_model/word_language_model.py#L381): some gradient values are on the order of 10^34, whereas they should normally lie within roughly [-10, 10].
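For reference, here is a minimal debugging sketch (hypothetical, not part of the training script) that prints any parameter whose gradient magnitude blows up before the clipping call; the names model, context, and args are assumed to follow word_language_model.py:

# Hypothetical debugging snippet: report parameters whose gradients explode
# before gluon.utils.clip_global_norm is applied. `model`, `context`, and
# `args` are assumed to follow the naming in word_language_model.py.
import mxnet as mx
from mxnet import gluon

params = [p for p in model.collect_params().values() if p.grad_req != 'null']
grads = [p.grad(context) for p in params]
for p, g in zip(params, grads):
    max_abs = float(mx.nd.max(mx.nd.abs(g)).asscalar())
    if max_abs > 10.0:  # healthy gradients here stay roughly within [-10, 10]
        print('suspicious gradient in {}: max |grad| = {:.3e}'.format(p.name, max_abs))
gluon.utils.clip_global_norm(grads, args.clip)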
Thanks to @leezu, who found that the error is introduced by the MKLDNN option. If mxnet is built from source with MKLDNN turned on, i.e., -DUSE_MKLDNN=ON, the gradient error appears, whereas the problem disappears with MKLDNN turned off. Therefore the issue can be traced back to MKLDNN. @zixuanweeei @ciyongch @pengzhao-intel
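As a side note, one way to confirm which backend a given mxnet build actually enables is to query the runtime feature flags; a small sketch is below (verify the feature name against your build):

# Sketch: check whether the current MXNet build was compiled with MKLDNN,
# which helps distinguish the pip wheel from the source build.
import mxnet as mx
from mxnet.runtime import Features

features = Features()
print('MXNet version:', mx.__version__)
print('MKLDNN enabled:', features.is_enabled('MKLDNN'))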
My environment specs can be found here
----------Python Info----------
Version : 3.6.6
Compiler : GCC 7.2.0
Build : ('default', 'Jun 28 2018 17:14:51')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 19.2.3
Directory : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version : 1.6.0
Directory : /home/ubuntu/clean_mxnet/incubator-mxnet/python/mxnet
Num GPUs : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-4.15.0-1056-aws-x86_64-with-debian-buster-sid
system : Linux
node : ip-172-31-23-26
release : 4.15.0-1056-aws
version : #58-Ubuntu SMP Tue Nov 26 15:14:34 UTC 2019
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2702.241
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.12
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Thanks for reporting this issue. We will take a look.
@liuzh91 I can reproduce the result. The gradients of the biases in LSTM are very large as you mentioned, while the gradients of the weights in LSTM look normal. Currently, you can export MXNET_USE_MKLDNN_RNN=0 as a workaround. It would be highly appreciated if you could give it a try on 52c9a45abc8d1f8228ec9ed1bad7f137137fd96b.
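For anyone else hitting this, a minimal sketch of applying the workaround from Python (the variable has to be set before mxnet is imported; export MXNET_USE_MKLDNN_RNN=0 in the shell works the same way):

# Workaround sketch: disable the MKLDNN RNN implementation so MXNet falls
# back to its reference CPU RNN kernels. Set the variable before importing
# mxnet to be safe.
import os
os.environ['MXNET_USE_MKLDNN_RNN'] = '0'

import mxnet as mx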
@zixuanweeei I previously confirmed the issue still happens with 897f4fae075c5261958de1a07cded0b5b3058a7a
Thanks for the reply. Can MXNET_USE_MKLDNN_RNN=0 work with 897f4fa and 52c9a45?
Yes, MXNET_USE_MKLDNN_RNN=0 on 52c9a45 is a workaround for the issue.
52c9a45 is applied to max pooling. I think it has nothing to do with RNN?
Nah. It didn't fix the gradient explosion issue.
Hi, @liuzh91 @szhengac. We have posted https://github.com/apache/incubator-mxnet/pull/17183 to fix the gradient explosion issue in RNN backward. Thanks again for reporting this issue. It would be greatly appreciated if you could test this patch. Thanks.
BTW, we got the below training log:
❯ python word_language_model.py --log-interval=1
/path/to/mxnet/python/mxnet/optimizer/optimizer.py:167: UserWarning: WARNING: New optimizer gluonnlp.optimizer.lamb.LAMB is overriding existing optimizer mxnet.optimizer.optimizer.LAMB
Optimizer.opt_registry[name].__name__))
Namespace(alpha=2, batch_size=80, beta=1, bptt=70, clip=0.25, dropout=0.4, dropout_e=0.1, dropout_h=0.2, dropout_i=0.65, emsize=400, epochs=750, eval_only=False, gpu=None, log_interval=1, lr=30, lr_update_factor=0.1, lr_update_interval=30, model='lstm', nhid=1150, nlayers=3, ntasgd=False, optimizer='sgd', save='model.params', test_mode=False, tied=False, wd=1.2e-06, weight_dropout=0.5)
Use AWDRNN
AWDRNN(
(embedding): HybridSequential(
(0): Embedding(33278 -> 400, float32)
(1): Dropout(p = 0.65, axes=(0,))
)
(encoder): HybridSequential(
(0): LSTM(400 -> 1150, TNC)
(1): LSTM(1150 -> 1150, TNC)
(2): LSTM(1150 -> 1150, TNC)
)
(decoder): HybridSequential(
(0): Dense(None -> 33278, linear)
)
)
[Epoch 0 Batch 1/372] current loss 20.50, ppl 796977445.38, throughput 18.37 samples/s, lr 30.86
[Epoch 0 Batch 2/372] current loss 9.51, ppl 13511.50, throughput 39.56 samples/s, lr 28.29
[Epoch 0 Batch 3/372] current loss 17.53, ppl 41003388.51, throughput 40.65 samples/s, lr 27.43
[Epoch 0 Batch 4/372] current loss 9.45, ppl 12761.47, throughput 40.39 samples/s, lr 27.43
[Epoch 0 Batch 5/372] current loss 14.34, ppl 1695623.66, throughput 35.59 samples/s, lr 31.71
[Epoch 0 Batch 6/372] current loss 9.40, ppl 12113.46, throughput 35.10 samples/s, lr 32.14
[Epoch 0 Batch 7/372] current loss 8.56, ppl 5232.00, throughput 37.62 samples/s, lr 30.00
[Epoch 0 Batch 8/372] current loss 9.32, ppl 11163.67, throughput 42.00 samples/s, lr 26.57
[Epoch 0 Batch 9/372] current loss 8.44, ppl 4642.37, throughput 61.95 samples/s, lr 17.14
[Epoch 0 Batch 10/372] current loss 8.92, ppl 7494.76, throughput 41.39 samples/s, lr 27.00
Thank you for the commit. We will double check whether the bug still exists.
@liuzh91 Did you have any chance to try the patch? Happy new year~
I haven't yet. I'll give it a try this week.
@zixuanweeei @TaoLv I can confirm the new patch works correctly on the language model script. Thanks for the patch.
Thanks for your kindness. Very happy to hear the good news. Happy new year~
@liuzh91 Nice to hear that, and thank you for trying it out. Would you mind approving #17183 if it looks good to you? Also, please say so if you feel the issue is critical and would like the fix to be included in the 1.6 release. Thanks!
@TaoLv I checked, and this issue affects 1.6. As it was recently decided to distribute the MKL builds by default, this fix must be backported.
CC @ptrendx