Bert: run_pretraining.py - clip gradient error: Found Inf or NaN global norm: Tensor had NaN values

Created on 8 Nov 2018 · 21 Comments · Source: google-research/bert

hi, I get an InvalidArgumentError when running run_pretraining.py, it shows:

Using my own data, I set the parameters as follows:
train batch size: 32
max seq length: 64 (99% of articles are at most 46 words)
max predictions per seq: 10
learning rate: 2e-5

At first I googled it, and someone suggested using a smaller learning rate, but I found that it only delays the InvalidArgumentError, so I don't think the learning rate is the key reason. I also tried tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) as suggested; sadly, I still get the same error.

Tracing the error: (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) -> clip_ops.py line 259, which shows that the global_norm calculation fails.
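For context, the global norm that `tf.clip_by_global_norm` verifies is the square root of the sum of squared entries over all gradient tensors, so a single NaN or Inf in any one gradient makes the whole norm non-finite. Below is a minimal pure-Python sketch of that arithmetic. The helper names and the use of flat lists are assumptions for illustration; this is not TensorFlow's implementation.

```python
import math

def global_norm(grads):
    """Square root of the sum of squares over every entry of every gradient."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))

def clip_by_global_norm(grads, clip_norm=1.0):
    """Rescale all gradients so their global norm is at most clip_norm."""
    norm = global_norm(grads)
    if not math.isfinite(norm):
        # One NaN/Inf entry anywhere is enough to end up here.
        raise ValueError("Found Inf or NaN global norm.")
    scale = clip_norm / max(norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], norm

# Usage: the second call raises because one entry is NaN.
clipped, norm = clip_by_global_norm([[3.0], [4.0]])
# clip_by_global_norm([[1.0, 2.0], [float("nan")]])  # ValueError
```

This is why the error message alone does not say which variable went bad: the check only sees the combined norm.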

What do you think causes the error? Have you run into it yourself?

xwzhong

Most helpful comment

I think I just realized what the problem might be: are you guys using a different vocabulary but the same bert_config.json file? The vocabulary size is specified in this file, so if a larger vocabulary is used then this will do out-of-bounds lookups (which are unchecked on the GPU or TPU).

If you are using the same vocabulary, are you starting from the BERT checkpoint or from scratch?

jacobdevlin-google on 11 Nov 2018

👍12
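A quick way to test this hypothesis before training is to compare the number of entries in the vocab file with `vocab_size` in the config. The sketch below assumes a one-token-per-line vocab file and a JSON config with a `vocab_size` key, as in the config posted later in this thread; the function name `check_vocab_size` is made up for illustration.

```python
import json

def check_vocab_size(vocab_path, config_path):
    """Return (tokens_in_vocab_file, config_vocab_size, is_consistent)."""
    # Every line gets an id, so count all lines, not just non-empty ones.
    with open(vocab_path, encoding="utf-8") as f:
        n_tokens = sum(1 for _ in f)
    with open(config_path, encoding="utf-8") as f:
        vocab_size = json.load(f)["vocab_size"]
    # Token ids run from 0 to n_tokens - 1, so they stay in bounds
    # only if the embedding table has at least n_tokens rows.
    return n_tokens, vocab_size, n_tokens <= vocab_size
```

If the last element of the result is False, the vocab file is larger than the embedding table the config describes, which is the mismatch described above.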

All 21 comments

Set a lower batch_size and it will run OK.

zkailinzhang on 9 Nov 2018

@zkl99999 do you know what causes the error?

xwzhong on 9 Nov 2018

I am seeing the same error.
I tried bumping up the batch size to 8192; however, that just delays the error.
Lowering the batch size makes it happen faster.
Any idea what's happening?

mleonrivas on 11 Nov 2018

If you are using the same vocabulary, are you starting from the BERT checkpoint or from scratch?

jacobdevlin-google on 11 Nov 2018

👍12

Fantastic, you are right. When I used the same bert_config.json but changed the vocab file (which creates a gap between the vocab size in bert_config.json and the true vocab size), the error happened; after fixing that, it is gone. Thanks a lot.

xwzhong on 12 Nov 2018

Cool, I will make sure to add this in bold font in the pre-training section of the README.

jacobdevlin-google on 12 Nov 2018

Hi! I get exactly the same error after global_step=110000 (and therefore, I guess, problem with misconfiguration is very unlikely).

I did shrink my vocabulary to 16k tokens. However, I did fix bert_config.json appropriately and still get the error.

```
Caused by op u'VerifyFinite/CheckNumerics', defined at:
  File "/yt/ssd1/hahn-data/slots/0/sandbox/d/in/script/0_script_unpacked/bert_fp16/run_pretraining.py", line 547, in <module>
    tf.app.run()
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/yt/ssd1/hahn-data/slots/0/sandbox/d/in/script/0_script_unpacked/bert_fp16/run_pretraining.py", line 509, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
    saving_listeners=saving_listeners
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1211, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2186, in _call_model_fn
    features, labels, mode, config)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2470, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1250, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1524, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/d/in/script/0_script_unpacked/bert_fp16/run_pretraining.py", line 192, in model_fn
    loss_scale)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/d/in/script/0_script_unpacked/bert_fp16/optimization.py", line 86, in create_optimizer
    (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/ops/clip_ops.py", line 259, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/ops/numerics.py", line 45, in verify_tensor_all_finite
    verify_input = array_ops.check_numerics(t, message=msg)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[{{node VerifyFinite/CheckNumerics}} = CheckNumericsT=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
```

ohwe on 12 Jan 2019

I have the same error.
I use my own vocab with size 51722, and I updated the size in the config.
When I use mixed-precision floats from the NVIDIA PR, this error happens. When I don't use mixed precision, the error does not happen!

eunicechen1987 on 16 Jan 2019

👍2
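One plausible explanation for the mixed-precision case, offered as an assumption and not something confirmed in this thread: float16 only represents magnitudes up to 65504, so a gradient multiplied by a large loss scale can overflow to Inf, and Inf then turns into NaN in later arithmetic. The sketch below simulates that in pure Python; `to_fp16_saturating` is a crude stand-in for a real float16 cast (it ignores rounding and underflow), and the loss scale values are made up.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value

def to_fp16_saturating(x):
    """Crude float16 cast: values beyond the fp16 range become +/-inf."""
    if math.isnan(x) or abs(x) <= FP16_MAX:
        return x
    return math.copysign(math.inf, x)

def scaled_grads_overflow(grads, loss_scale):
    """True if any gradient overflows fp16 after loss scaling."""
    return any(math.isinf(to_fp16_saturating(g * loss_scale)) for g in grads)

# With a hypothetical loss scale of 128, a gradient of 600 becomes
# 76800, which is past the fp16 range.
print(scaled_grads_overflow([0.5, 2.0], 128.0))
print(scaled_grads_overflow([0.5, 600.0], 128.0))
```

If this is what is happening, lowering the loss scale or skipping steps whose gradients are non-finite are the usual remedies in mixed-precision training.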

It's odd. In my experiment, after I fixed the config, the error did not happen again (I have trained for more than 6 million steps).

xwzhong on 16 Jan 2019

> Fantastic, you are right. When I used the same bert_config.json but changed the vocab file (which creates a gap between the vocab size in bert_config.json and the true vocab size), the error happened; after fixing that, it is gone. Thanks a lot.

I recently ran into this problem too. How did you fix it? Did you change the size of vocab.txt, or something else?

minmummax on 18 Jan 2019

> > Fantastic, you are right. When I used the same bert_config.json but changed the vocab file (which creates a gap between the vocab size in bert_config.json and the true vocab size), the error happened; after fixing that, it is gone. Thanks a lot.

> I recently ran into this problem too. How did you fix it? Did you change the size of vocab.txt, or something else?

I changed the "vocab_size" in bert_config.json.

xwzhong on 18 Jan 2019

But I still get this problem after changing the vocab size in the JSON file to match the size of the vocab file.

minmummax on 18 Jan 2019

> But I still get this problem after changing the vocab size in the JSON file to match the size of the vocab file.

For now, I can't tell you why it happens. I will check my code, and if I find something, I will reply here.

xwzhong on 18 Jan 2019

> > But I still get this problem after changing the vocab size in the JSON file to match the size of the vocab file.

> For now, I can't tell you why it happens. I will check my code, and if I find something, I will reply here.

All right, thanks!

minmummax on 18 Jan 2019

I still don't understand! Which parameter did you change?

{
"attention_probs_dropout_prob": 0.1,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128
}

yunchaosuper on 20 Jan 2019

@yunchaosuper I changed "vocab_size".

xwzhong on 20 Jan 2019

@xwzhong so what did you change vocab_size from 21128 to? Kindly help with that.

yunchaosuper on 20 Jan 2019

> I think I just realized what the problem might be: are you guys using a different vocabulary but the same bert_config.json file? The vocabulary size is specified in this file, so if a larger vocabulary is used then this will do out-of-bounds lookups (which are unchecked on the GPU or TPU).
>
> If you are using the same vocabulary, are you starting from the BERT checkpoint or from scratch?

Hi Jacob, I am using pretrained BERT together with other networks, but during fine-tuning I also ran into this NaN global norm problem. What do you mean by out-of-bounds lookups? The dataset I use does have OOV words, but what causes the NaN global norm? Does it only happen when all the tokens in the sentence are unknown words?

Thanks in advance.

PeterPanUnderhill on 23 Aug 2019
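On the OOV question, a hedged note: BERT's WordPiece tokenizer maps words it cannot split to the `[UNK]` token, which has an ordinary in-range id, so OOV words by themselves should not cause out-of-bounds lookups. An out-of-bounds lookup means a token id that is greater than or equal to `vocab_size`, i.e. an index past the last row of the embedding table. A minimal sketch for scanning a batch of ids follows; the function name and the example ids are hypothetical, and 21128 is the `vocab_size` from the config posted above.

```python
def find_out_of_bounds_ids(input_ids, vocab_size):
    """Return (position, token_id) pairs that fall outside [0, vocab_size)."""
    return [(pos, tok) for pos, tok in enumerate(input_ids)
            if not 0 <= tok < vocab_size]

# Hypothetical sequences: the first is fine, the second contains ids
# that no row of a 21128-row embedding table corresponds to.
ok_ids = [101, 100, 21127, 102]
bad_ids = [101, 21128, 30000, 102]
print(find_out_of_bounds_ids(ok_ids, 21128))
print(find_out_of_bounds_ids(bad_ids, 21128))
```

Running a check like this over the training data on the CPU is cheap and rules the vocabulary mismatch in or out before looking at other causes.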

@ohwe were you able to solve the problem? After 110000 steps, I got the NaN error too.

brightmart on 4 Sep 2019

I am pre-training BERT on a large amount of data; after 110000 steps, the loss is around 1.4.
But after I stopped training and tried to resume (by setting the init point and restoring from the checkpoint at step 110000), the NaN error happened. I ran it several times; in other runs, the NaN came from other layers, not layer_0.

Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
Gradient for bert/encoder/layer_0/output/dense/bias:0 is NaN : Tensor had NaN values
[[{{node CheckNumerics_18}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run_pretraining.py", line 497, in <module>
tf.app.run()

Can someone help, or does anyone have any idea?

@xwzhong @zkl99999 @mleonrivas @jacobdevlin-google @ohwe

brightmart on 4 Sep 2019
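Since the failing variable changes from run to run, it can help to list every gradient that is non-finite at the failing step instead of stopping at the first one. The sketch below shows the idea on plain Python data: `named_grads` is a hypothetical dict from variable name to a flat list of gradient values, which would have to be fetched from the real training graph separately, and the second variable name is made up.

```python
import math

def non_finite_gradients(named_grads):
    """Return the names of gradients containing any NaN or Inf value."""
    return [name for name, values in named_grads.items()
            if any(not math.isfinite(v) for v in values)]

# Hypothetical snapshot of two gradients at one step.
snapshot = {
    "bert/encoder/layer_0/output/dense/bias:0": [0.01, float("nan")],
    "bert/embeddings/word_embeddings:0": [0.02, -0.03],
}
print(non_finite_gradients(snapshot))
```

If every gradient goes bad at the same step, the cause is more likely upstream (loss, inputs, or learning rate) than in any single layer.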

I faced the same error (Tensor had NaN values) at the beginning of training.
It was caused by an invalid learning rate.
I wrongly set it to 5e5 instead of 5e-5.

So it might be related to the learning rate.
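A sign typo like this is easy to catch with a sanity check before training starts. The sketch below is one way to do it; the function name and the bounds `1e-7` and `1e-2` are arbitrary assumptions chosen for illustration, not values recommended anywhere in this thread.

```python
def check_learning_rate(lr, low=1e-7, high=1e-2):
    """Raise early if the learning rate is outside a plausible range."""
    if not low <= lr <= high:
        raise ValueError(
            "learning_rate=%g is outside [%g, %g]; "
            "check for a missing minus sign in the exponent" % (lr, low, high))
    return lr

check_learning_rate(5e-5)   # passes
# check_learning_rate(5e5)  # raises ValueError
```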