Your script run_classifier.py runs perfectly well.
However, some datasets (for example STS-B) use a score as output (relatedness of sentences, with a score from 1 to 5) rather than classes.
Are you going to release a run_scorer.py script for this kind of dataset?
If not, how can I change run_classifier.py to reproduce the STS-B results with fine-tuning?
We don't plan to release more code, but it's a pretty trivial change. First change the label type to a float instead of an int, then pass in the label. Then we just trained with mean squared error against the labels, i.e.:
per_example_loss = tf.square(logits - label_scores)
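As a framework-free illustration of what that line computes, here is a minimal sketch in plain Python. The function names and the toy values are hypothetical and are not part of run_classifier.py; in the real script the same arithmetic would be done on TensorFlow tensors.

```python
def per_example_squared_error(logits, label_scores):
    """Squared difference between each predicted score and its float label."""
    if len(logits) != len(label_scores):
        raise ValueError("logits and label_scores must have the same length")
    return [(p - y) ** 2 for p, y in zip(logits, label_scores)]


def mean_loss(per_example_loss):
    """Batch loss: the mean of the per-example squared errors (i.e. MSE)."""
    return sum(per_example_loss) / len(per_example_loss)


# Toy values, for illustration only.
toy_logits = [2.5, 4.0]
toy_labels = [3.0, 4.0]
toy_per_example = per_example_squared_error(toy_logits, toy_labels)
toy_loss = mean_loss(toy_per_example)
```

Note that both sequences must line up one-to-one: each example contributes exactly one predicted score and one label.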
Should I also modify the shape ?
I changed :
"label_ids":
tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
to
"label_ids":
tf.constant(all_label_ids, shape=[num_examples, 0], dtype=tf.float32),
(among other changes).
However, when I do this I get a NaN loss:
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
_Note : I had to change the shape, because without this change I had a shape error when computing the loss :_
per_example_loss = tf.square(logits - labels)
Also, in your code snippet you use label_scores, but that is just the labels variable, right? Since it's already a float, I don't need any further processing. (For the classifier, the processing was: one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32))
In the end, I should not have changed the shape ^^
After a day I found the problem (really dumb..): I was setting num_labels to 0 (because there are no labels anymore). Doing this messed up the tensors and gave NaN after the loss computation.
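A plausible way to picture this (a sketch, not a confirmed diagnosis): the output layer in run_classifier.py has weights of shape [num_labels, hidden_size], so num_labels decides how many values the head emits per example. A regression head needs exactly one output; with zero there is nothing to compare against the label. The plain-Python helpers below are hypothetical stand-ins for that layer, with made-up weights.

```python
def make_head(num_labels, hidden_size):
    """Build toy weights of shape [num_labels][hidden_size] and a bias per output."""
    weights = [[0.1] * hidden_size for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weights, bias


def dense_head(hidden, weights, bias):
    """Project one pooled vector to len(weights) output values."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


hidden = [1.0, 2.0, 3.0, 4.0]  # toy pooled representation

w1, b1 = make_head(num_labels=1, hidden_size=4)
logits_one = dense_head(hidden, w1, b1)   # one score per example

w0, b0 = make_head(num_labels=0, hidden_size=4)
logits_zero = dense_head(hidden, w0, b0)  # empty: no score at all
```

With num_labels=1 the head yields a single score; with num_labels=0 it yields an empty output, and averaging a loss over an empty tensor is one way to end up with NaN.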
However, my results are not good _at all_ on the STS-B dataset : 9.2% accuracy... (with a loss of 2.657)
Any idea where this could come from, @jacobdevlin-google?
Accuracy is not a sensible measure for this set. The easiest way is to write out the prediction scores (use the existing new prediction code in run_classifier.py as a guide) and then compute the correlation metric outside of TensorFlow as described by the GLUE website.
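For reference, computing the correlation outside of TensorFlow could look roughly like the sketch below. It is written in plain Python with no dependencies; the function names and toy values are hypothetical, and it assumes the predicted scores and gold labels have already been loaded into two equal-length lists. GLUE reports both Pearson and Spearman correlation for STS-B, so both are shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of floats."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy values, for illustration only (not real model output).
toy_predictions = [1.0, 2.0, 3.0]
toy_gold = [1.0, 4.0, 9.0]
toy_pearson = pearson(toy_predictions, toy_gold)
toy_spearman = spearman(toy_predictions, toy_gold)
```

A model that predicts nearly the same score for every example will have a Pearson value close to zero (or undefined if the predictions are exactly constant), which is why this metric is more informative here than accuracy.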
My results still don't make sense:
INFO:tensorflow: Eval results
INFO:tensorflow: MSE = 2.757229
INFO:tensorflow: global_step = 538
INFO:tensorflow: loss = 2.7695546
INFO:tensorflow: pearson = -0.03796309
Here is how I changed the metric function:
def metric_fn(per_example_loss, label_ids, logits):
    pearson = tf.contrib.metrics.streaming_pearson_correlation(logits, label_ids)
    mse = tf.metrics.mean(per_example_loss)
    return {'pearson': pearson, 'MSE': mse}
EDIT
> The easiest way is to write out the prediction scores
I managed to do this with the tf.contrib.metrics.streaming_concat metric, and here is my result :
INFO:tensorflow: MSE = 2.757229
INFO:tensorflow: global_step = 538
INFO:tensorflow: label_ids = [5. 4.75 5. ... 2. 0. 0. ]
INFO:tensorflow: logits = [3.0696278 3.066192 3.0656776 ... 3.0800402 3.0793345 3.0729654]
INFO:tensorflow: loss = 2.7695546
INFO:tensorflow: pearson = -0.03796309
Looks like the model is always predicting a score of ~3... Where did I mess up? x)
EDIT 2
It turns out I was using the checkpoints of the previous (non-working) model. Fine-tuning from scratch (lol, or rather from the pre-trained checkpoint) with the changes made it work perfectly:
INFO:tensorflow: Eval results
INFO:tensorflow: MSE = 0.54966164
INFO:tensorflow: global_step = 538
INFO:tensorflow: label_ids = [5. 4.75 5. ... 2. 0. 0. ]
INFO:tensorflow: logits = [4.9498835 4.8293953 4.8901978 ... 2.251821 0.45117265 1.3045875 ]
INFO:tensorflow: loss = 0.5502653
INFO:tensorflow: pearson = 0.8859678
@jacobdevlin-google But my results seem higher than the ones reported in the paper (86.5 for BERT-Large) and on the GLUE leaderboard (87.6 / 86.5). I am using, mere mortal that I am, BERT-Base.
Is this higher score significant? Does it mean I made a mistake somewhere?
Maybe you should run it multiple times. https://github.com/google-research/bert/issues/113#issuecomment-438540380
That's dev, which might be higher than test. But everything looks good so I'll close.
@Colanim Could you please share the code for STS-B? Thank you!
Please refer to #160, I uploaded the code there
@Colanim Thank you very much!
I am trying to make this switch from classification to regression with BERT, but I basically get the same output no matter what (e.g. I'm trying to predict scores on a range from 1-10, and everything is given 5.5). Does anybody know why this may be happening?
What did you change between EDIT and EDIT 2? I have the same problem you described in EDIT: all my predictions are around 2.7. How should I fine-tune it? The size of STS-B is just 5000. I tried 60 epochs last night, but it didn't work.
@fudanchenjiahao
From EDIT to EDIT 2 I just removed the previous checkpoint: it was loading this non-working checkpoint whenever I trained. Be sure you train from scratch.
I uploaded the working code as mentioned previously.
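For anyone hitting the same issue: the Estimator used by run_classifier.py resumes from the latest checkpoint it finds in the output directory, so stale checkpoints from a broken run get picked up silently. A minimal sketch of clearing that directory before re-running fine-tuning is shown below. The helper name and the demo paths are hypothetical; point it at whatever you pass as the output directory, and only if you really want to discard what is in there.

```python
import os
import shutil
import tempfile


def reset_output_dir(output_dir):
    """Delete any previous run's files so training cannot resume from them."""
    if os.path.isdir(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir)
    return output_dir


# Demo in a throwaway temp directory (hypothetical file name).
demo_root = tempfile.mkdtemp()
demo_out = os.path.join(demo_root, "stsb_output")
os.makedirs(demo_out)
with open(os.path.join(demo_out, "model.ckpt-538.index"), "w") as f:
    f.write("stale checkpoint placeholder")

reset_output_dir(demo_out)
remaining = os.listdir(demo_out)  # nothing left to resume from

shutil.rmtree(demo_root)  # clean up the demo directory
```

Using a fresh output directory for each experiment achieves the same thing without deleting anything.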
I created a PR. It outputs a single float value. In my use case, I use it to calculate the similarity of two sentences.
@Colanim can you point me to your run script for reproducing 88.5 pearson on STS-B dataset?
@smr97 I uploaded it here.
However, it's really old code; I haven't updated it or used it in a long time...
You might need to modify it to make it work (?).
Thanks @Colanim for the quick response! This is really helpful. Could you also provide the hyperparams? Learning rate, batch size, number of epochs and sequence length?
@smr97 Sorry, I don't remember at all...
Try your luck with the default parameters or the ones in the README.