BERT: Plan to release 'run_scorer.py'?

Created on 7 Nov 2018 · 18 Comments · Source: google-research/bert

Your script run_classifier.py runs perfectly well.
However, some datasets (for example STS-B) use a score as output (relatedness of sentences, with a score from 0 to 5) rather than classes.

Are you going to release a run_scorer.py script for this kind of dataset?
If not, how can I change run_classifier.py to reproduce the STS-B results with fine-tuning?

All 18 comments

We don't plan to release more code, but it's a pretty trivial change. First change the label type to a float instead of an int, then pass in the label. Then we just trained with mean squared error against the labels, i.e.:

per_example_loss = tf.square(logits - label_scores)
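
To make the arithmetic concrete, here is a minimal sketch in plain Python of the loss described above: a squared error per example, averaged over the batch. The function names and the sample values are illustrative assumptions and are not part of run_classifier.py, where the same computation would be done on tensors.

```python
def per_example_squared_error(predictions, targets):
    """Squared difference between each predicted score and its float label."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return [(p - t) ** 2 for p, t in zip(predictions, targets)]


def mean_squared_error(predictions, targets):
    """Batch loss: the mean of the per-example squared errors."""
    errors = per_example_squared_error(predictions, targets)
    if not errors:
        raise ValueError("cannot average over an empty batch")
    return sum(errors) / len(errors)


# Hypothetical batch: two predicted similarity scores and their gold scores.
errors = per_example_squared_error([2.5, 4.0], [3.0, 5.0])
loss = mean_squared_error([2.5, 4.0], [3.0, 5.0])
```

Note that the predictions and the labels must have the same shape (one score per example) for the element-wise difference to mean what is intended.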

Should I also modify the shape?
I changed:

   "label_ids":
         tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),

to

   "label_ids":
         tf.constant(all_label_ids, shape=[num_examples, 0], dtype=tf.float32),

(among other changes).

However, if I do this I get a NaN loss:

tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

_Note: I had to change the shape because, without this change, I got a shape error when computing the loss:_
per_example_loss = tf.square(logits - labels)


Also, in your code snippet you are using label_scores, but that is just the labels variable, right? Since it's already a float, I don't need to transform it. (For the classifier, the transformation was: one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32))

In the end, I should not change the shape ^^

After a day I found the problem (really dumb...): I was setting num_labels to 0 (because there are no classes anymore). Doing this messed up the tensors and gave NaN after the loss computation.


However, my results are not good _at all_ on the STS-B dataset: 9.2% accuracy... (with a loss of 2.657)
Any idea where this could come from, @jacobdevlin-google?

Accuracy is not a sensible measure for this set. The easiest way is to write out the prediction scores (use the existing new prediction code in run_classifier.py as a guide) and then compute the correlation metric outside of TensorFlow as described by the GLUE website.
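
As an illustration of computing the metric outside of TensorFlow, here is a minimal sketch in plain Python. GLUE scores STS-B with Pearson and Spearman correlation; the helpers below are hypothetical, take two equal-length lists of floats (predicted scores and gold scores), and assume the predictions have already been read back from whatever file they were written to.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant input
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

A model that outputs nearly the same score for every pair has almost no variance in its predictions, so its Pearson value will sit near zero (or be undefined if the outputs are exactly constant).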

My results still don't make sense:

INFO:tensorflow: Eval results
INFO:tensorflow: MSE = 2.757229
INFO:tensorflow: global_step = 538
INFO:tensorflow: loss = 2.7695546
INFO:tensorflow: pearson = -0.03796309

Here is how I changed the metric function:

  def metric_fn(per_example_loss, label_ids, logits):        
      pearson = tf.contrib.metrics.streaming_pearson_correlation(logits, label_ids)
      mse = tf.metrics.mean(per_example_loss)       
      return {'pearson': pearson, 'MSE': mse}

EDIT

"The easiest way is to write out the prediction scores"

I managed to do this with the tf.contrib.metrics.streaming_concat metric, and here is my result:

INFO:tensorflow: MSE = 2.757229
INFO:tensorflow: global_step = 538
INFO:tensorflow: label_ids = [5. 4.75 5. ... 2. 0. 0. ]
INFO:tensorflow: logits = [3.0696278 3.066192 3.0656776 ... 3.0800402 3.0793345 3.0729654]
INFO:tensorflow: loss = 2.7695546
INFO:tensorflow: pearson = -0.03796309

It looks like the model is always predicting a score of ~3... Where did I mess up? x)


EDIT 2

It turned out I was still using the checkpoints of the previous (non-working) model. Fine-tuning from scratch (lol, or rather from the pre-trained model) with the changes made it work perfectly:

INFO:tensorflow: Eval results
INFO:tensorflow: MSE = 0.54966164
INFO:tensorflow: global_step = 538
INFO:tensorflow: label_ids = [5. 4.75 5. ... 2. 0. 0. ]
INFO:tensorflow: logits = [4.9498835 4.8293953 4.8901978 ... 2.251821 0.45117265 1.3045875 ]
INFO:tensorflow: loss = 0.5502653
INFO:tensorflow: pearson = 0.8859678

@jacobdevlin-google But my results seem higher than the ones reported in the paper (86.5 for BERT-Large) and on the GLUE leaderboard (87.6 / 86.5). I am using, mere mortal that I am, BERT-Base.

Is this higher score significant? Does it mean I made a mistake somewhere?

Maybe you should run it multiple times. https://github.com/google-research/bert/issues/113#issuecomment-438540380
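
If several fine-tuning runs are available, their dev scores can be summarized before comparing against a published number. This is a small sketch using only the standard library; the function name is hypothetical, and it simply assumes you pass in one correlation score per run.

```python
import statistics


def summarize_runs(scores):
    """Mean, sample standard deviation, and range of per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }
```

A difference from a reference number that is smaller than the run-to-run standard deviation is hard to interpret as a real improvement.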

That's dev, which might be higher than test. But everything looks good so I'll close.

@Colanim Could you please share the code for STS-B? Thank you!

Please refer to #160, I uploaded the code there

@Colanim Thank you very much!

I am trying to make this switch from classification to regression with BERT, but I basically get the same output no matter what (e.g. I'm trying to predict scores on a range from 1-10, and everything is given 5.5). Does anybody know why this may be happening?
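
One way to confirm this symptom before digging into the training setup is to compare the model against a constant baseline. A model that outputs the same score for every input behaves like "always predict the mean label", so its MSE is close to the variance of the labels and the spread of its predictions is near zero. The helper below is a hypothetical sketch in plain Python, not code from this thread, and the 0.1 spread ratio is an arbitrary assumption. Earlier in this thread, the same symptom was traced to a stale checkpoint being reloaded.

```python
import statistics


def collapse_report(predictions, labels, spread_ratio=0.1):
    """Compare prediction spread and MSE against a predict-the-mean baseline."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty, equal-length sequences")
    pred_std = statistics.pstdev(predictions)
    label_std = statistics.pstdev(labels)
    mse = sum((p - t) ** 2 for p, t in zip(predictions, labels)) / len(labels)
    # MSE of a model that always outputs the mean label is the label variance.
    baseline_mse = statistics.pvariance(labels)
    return {
        "pred_std": pred_std,
        "label_std": label_std,
        "mse": mse,
        "baseline_mse": baseline_mse,
        # Flag predictions whose spread is tiny relative to the labels' spread.
        "collapsed": pred_std < spread_ratio * label_std,
    }
```

If `collapsed` is true and `mse` is roughly equal to `baseline_mse`, the regression head has probably not learned anything beyond the average score.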

From EDIT to EDIT 2, what did you change? I have the same trouble you described in EDIT: all my predictions are around 2.7. How should I fine-tune it? The size of STS-B is just 5000. I tried 60 epochs last night, but it didn't work.

@fudanchenjiahao
From EDIT to EDIT 2 I just removed the previous checkpoint: it was loading this non-working checkpoint whenever I trained. Be sure you train from scratch.
I uploaded the working code as mentioned previously.
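
A small guard can make this failure mode harder to hit. The sketch below assumes the usual TensorFlow Estimator layout, where the output directory holds a `checkpoint` index file plus `model.ckpt-*` files and training resumes from whatever is found there. The helper name and the directory path are hypothetical; the function only reports leftover files and does not delete anything.

```python
import glob
import os


def find_stale_checkpoints(output_dir):
    """Return checkpoint-related files already present in output_dir."""
    if not os.path.isdir(output_dir):
        return []
    found = glob.glob(os.path.join(output_dir, "model.ckpt-*"))
    index_file = os.path.join(output_dir, "checkpoint")
    if os.path.isfile(index_file):
        found.append(index_file)
    return sorted(found)


# Hypothetical output directory; check it before launching a new training run.
leftovers = find_stale_checkpoints("./stsb_output")
if leftovers:
    print("Existing checkpoints found; training would resume from them:")
    for path in leftovers:
        print("  " + path)
```

Pointing each experiment at a fresh output directory avoids the problem without having to delete anything.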

I created a PR. It outputs a single float value. In my use case, I use it to calculate the similarity of two sentences.

@Colanim can you point me to your run script for reproducing 88.5 pearson on STS-B dataset?

@smr97 I uploaded it here.

However, it's really old code; I haven't updated it or used it in a long time...
You might need to modify it to make it work (?).

Thanks @Colanim for the quick response! This is really helpful. Could you also provide the hyperparams? Learning rate, batch size, number of epochs and sequence length?

@smr97 Sorry, I don't remember at all...

Try your luck with the default params or the ones in the README.
