Incubator-mxnet: [MKLDNN] RNN Op gradient computation is broken

Created on 16 Dec 2019 · 16 Comments · Source: apache/incubator-mxnet

Description

RNN op gradient computation on CPU is broken for MXNet built from source.

I was running a language model training script on my EC2 instance and tested it with the latest MXNet built from source. During training, I ran into the following log:

Log Message

ubuntu@ip-172-31-23-26:~/bug_fix/gluon-nlp/scripts/language_model$ python word_language_model.py --log-interval=1
/home/ubuntu/clean_mxnet/incubator-mxnet/python/mxnet/optimizer/optimizer.py:167: UserWarning: WARNING: New optimizer gluonnlp.optimizer.lamb.LAMB is overriding existing optimizer mxnet.optimizer.optimizer.LAMB
  Optimizer.opt_registry[name].__name__))
Namespace(alpha=2, batch_size=80, beta=1, bptt=70, clip=0.25, dropout=0.4, dropout_e=0.1, dropout_h=0.2, dropout_i=0.65, emsize=400, epochs=750, eval_only=False, gpu=None, log_interval=1, lr=30, lr_update_factor=0.1, lr_update_interval=30, model='lstm', nhid=1150, nlayers=3, ntasgd=False, optimizer='sgd', save='model.params', test_mode=False, tied=False, wd=1.2e-06, weight_dropout=0.5)
Use AWDRNN
AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): HybridSequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 1150, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(None -> 33278, linear)
  )
)
word_language_model.py:382: UserWarning: nan or inf is detected. Clipping results will be undefined.
  gluon.utils.clip_global_norm(grads, args.clip)
[Epoch 0 Batch 1/372] current loss 20.83, ppl 1107330721.37, throughput 0.71 samples/s, lr 29.14
[Epoch 0 Batch 2/372] current loss 10.41, ppl 33276.90, throughput 1.39 samples/s, lr 29.57
[Epoch 0 Batch 3/372] current loss 10.41, ppl 33276.58, throughput 1.42 samples/s, lr 28.71
[Epoch 0 Batch 4/372] current loss 10.41, ppl 33276.17, throughput 1.33 samples/s, lr 30.43
[Epoch 0 Batch 5/372] current loss 10.41, ppl 33276.86, throughput 1.40 samples/s, lr 29.14

The loss value does not change any more. If I instead use the MXNet build installed via pip (pip install https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-15/dist/mxnet_cu100-1.6.0b20191215-py2.py3-none-manylinux1_x86_64.whl), the log is normal: there is no gradient inf warning and the loss keeps changing:

python word_language_model.py --log-interval=1
/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/optimizer/optimizer.py:167: UserWarning: WARNING: New optimizer gluonnlp.optimizer.lamb.LAMB is overriding existing optimizer mxnet.optimizer.optimizer.LAMB
  Optimizer.opt_registry[name].__name__))
Namespace(alpha=2, batch_size=80, beta=1, bptt=70, clip=0.25, dropout=0.4, dropout_e=0.1, dropout_h=0.2, dropout_i=0.65, emsize=400, epochs=750, eval_only=False, gpu=None, log_interval=1, lr=30, lr_update_factor=0.1, lr_update_interval=30, model='lstm', nhid=1150, nlayers=3, ntasgd=False, optimizer='sgd', save='model.params', test_mode=False, tied=False, wd=1.2e-06, weight_dropout=0.5)
Use AWDRNN
AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): HybridSequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 1150, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(None -> 33278, linear)
  )
)
[Epoch 0 Batch 1/372] current loss 20.50, ppl 796093229.98, throughput 2.13 samples/s, lr 29.14
[Epoch 0 Batch 2/372] current loss 9.57, ppl 14283.55, throughput 4.20 samples/s, lr 29.57
[Epoch 0 Batch 3/372] current loss 17.85, ppl 56261658.19, throughput 4.32 samples/s, lr 28.71
[Epoch 0 Batch 4/372] current loss 9.50, ppl 13370.27, throughput 4.08 samples/s, lr 30.43
[Epoch 0 Batch 5/372] current loss 14.46, ppl 1903888.17, throughput 4.26 samples/s, lr 29.14

To Reproduce

The training script can be found at https://github.com/dmlc/gluon-nlp/blob/master/scripts/language_model/word_language_model.py. To reproduce the log message, I ran the script with the following command:
python word_language_model.py --log-interval=1

What have you tried to solve it?

The problem occurs when computing the gradients (https://github.com/dmlc/gluon-nlp/blob/master/scripts/language_model/word_language_model.py#L381).

Some gradient values are on the order of 10^34, whereas normally the gradient values should be within [-10, 10].
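
To illustrate how the bad gradients were observed, here is a minimal sketch (not part of the original script; the names model, loss, data, target and hidden stand for the AWDRNN network, its training loss and one training batch, and the forward signature is assumed) that prints the largest absolute gradient entry of each parameter after the backward pass:

import mxnet as mx

with mx.autograd.record():
    output, hidden = model(data, hidden)   # AWDRNN forward pass (signature assumed)
    L = loss(output, target).mean()
L.backward()

# Print the largest absolute gradient entry per parameter; on the broken
# MKLDNN build the LSTM bias gradients come out around 1e34.
for name, param in model.collect_params().items():
    if param.grad_req != 'null':
        g = param.grad()
        print(name, float(mx.nd.max(mx.nd.abs(g)).asscalar()))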

Thanks to @leezu, who found that the error is introduced by the MKLDNN build option: with MXNet built from source with MKLDNN enabled, i.e. -DUSE_MKLDNN=ON, the gradient error appears, whereas the problem is gone with -DUSE_MKLDNN=OFF. The issue can therefore be traced back to MKLDNN. @zixuanweeei @ciyongch @pengzhao-intel
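
As a quick way to confirm which build is in use, the compiled-in feature flags can be queried from Python (a small sketch; mxnet.runtime.Features is available in recent 1.x releases, and the feature name MKLDNN is assumed):

from mxnet.runtime import Features

# Reports True for the source build configured with -DUSE_MKLDNN=ON,
# and False when MKLDNN support was compiled out.
print('MKLDNN enabled:', Features().is_enabled('MKLDNN'))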

Environment

My environment specs are listed below:

----------Python Info----------
Version      : 3.6.6
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 17:14:51')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.2.3
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/ubuntu/clean_mxnet/incubator-mxnet/python/mxnet
Num GPUs     : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.15.0-1056-aws-x86_64-with-debian-buster-sid
system       : Linux
node         : ip-172-31-23-26
release      : 4.15.0-1056-aws
version      : #58-Ubuntu SMP Tue Nov 26 15:14:34 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             2702.241
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.12
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Labels: Bug, MKLDNN

Most helpful comment

@TaoLv I checked and this issue affects 1.6. As it was recently decided to distribute the MKL builds by default, this fix must be backported.
CC @ptrendx

All 16 comments

Thanks for reporting this issue. We will take a look.

@liuzh91 I can reproduce the result. The gradients of the biases in the LSTM are very large, as you mentioned, while the gradients of the weights in the LSTM look normal. For now, you can export MXNET_USE_MKLDNN_RNN=0 as a workaround. It would be highly appreciated if you could give it a try on 52c9a45abc8d1f8228ec9ed1bad7f137137fd96b.
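
For reference, a minimal sketch of applying the workaround from Python (assuming the MXNET_USE_MKLDNN_RNN variable is read at runtime before the RNN op first executes; setting it in the shell before launching the script works as well):

import os

# Disable the MKLDNN RNN kernels so the RNN op falls back to the
# native CPU implementation (the workaround mentioned above).
os.environ['MXNET_USE_MKLDNN_RNN'] = '0'

import mxnet as mx  # import MXNet after setting the flag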

@zixuanweeei I previously confirmed the issue still happens with 897f4fae075c5261958de1a07cded0b5b3058a7a

Thanks for the reply. Can MXNET_USE_MKLDNN_RNN=0 work with 897f4fa and 52c9a45?

Yes, MXNET_USE_MKLDNN_RNN=0 on 52c9a45 is a workaround for the issue.

52c9a45 is applied to max pooling. I think it has nothing to do with RNN?

Nah. It didn't fix the gradient explosion issue.

@szhengac @liuzh91 The gradient explosion issue is not fixed yet; we're working on it. For now, MXNET_USE_MKLDNN_RNN=0 is the workaround after commit 897f4fa, and it is also applicable to 52c9a45.

Hi @liuzh91 @szhengac. We have posted https://github.com/apache/incubator-mxnet/pull/17183 to fix the gradient explosion issue in the RNN backward pass. Thanks again for reporting this issue. It would be greatly appreciated if you could test the patch. Thanks.

BTW, we got the below training log:

โฏ python word_language_model.py --log-interval=1
/path/to/mxnet/python/mxnet/optimizer/optimizer.py:167: UserWarning: WARNING: New optimizer gluonnlp.optimizer.lamb.LAMB is overriding existing optimizer mxnet.optimizer.optimizer.LAMB
  Optimizer.opt_registry[name].__name__))
Namespace(alpha=2, batch_size=80, beta=1, bptt=70, clip=0.25, dropout=0.4, dropout_e=0.1, dropout_h=0.2, dropout_i=0.65, emsize=400, epochs=750, eval_only=False, gpu=None, log_interval=1, lr=30, lr_update_factor=0.1, lr_update_interval=30, model='lstm', nhid=1150, nlayers=3, ntasgd=False, optimizer='sgd', save='model.params', test_mode=False, tied=False, wd=1.2e-06, weight_dropout=0.5)
Use AWDRNN
AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): HybridSequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 1150, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(None -> 33278, linear)
  )
)
[Epoch 0 Batch 1/372] current loss 20.50, ppl 796977445.38, throughput 18.37 samples/s, lr 30.86
[Epoch 0 Batch 2/372] current loss 9.51, ppl 13511.50, throughput 39.56 samples/s, lr 28.29
[Epoch 0 Batch 3/372] current loss 17.53, ppl 41003388.51, throughput 40.65 samples/s, lr 27.43
[Epoch 0 Batch 4/372] current loss 9.45, ppl 12761.47, throughput 40.39 samples/s, lr 27.43
[Epoch 0 Batch 5/372] current loss 14.34, ppl 1695623.66, throughput 35.59 samples/s, lr 31.71
[Epoch 0 Batch 6/372] current loss 9.40, ppl 12113.46, throughput 35.10 samples/s, lr 32.14
[Epoch 0 Batch 7/372] current loss 8.56, ppl 5232.00, throughput 37.62 samples/s, lr 30.00
[Epoch 0 Batch 8/372] current loss 9.32, ppl 11163.67, throughput 42.00 samples/s, lr 26.57
[Epoch 0 Batch 9/372] current loss 8.44, ppl 4642.37, throughput 61.95 samples/s, lr 17.14
[Epoch 0 Batch 10/372] current loss 8.92, ppl 7494.76, throughput 41.39 samples/s, lr 27.00

Thank you for the commit. We will double check whether the bug still exists.

@liuzh91 Did you have any chance to try the patch? Happy new year~

I haven't yet. I'll give it a try this week.

@zixuanweeei @TaoLv I can confirm the new patch works correctly on the language model script. Thanks for the patch.

Thanks for your kindness. Very happy to receive the good news. Happy new year~

@liuzh91 Nice to hear that, and thank you for trying it out. Would you mind approving #17183 if it looks good to you? Also, please say so if you feel the issue is critical and would like the fix to be included in the 1.6 release. Thanks!

@TaoLv I checked and this issue affects 1.6. As it was recently decided to distribute the MKL builds by default, this fix must be backported.
CC @ptrendx
