Transformers: Wrong paraphrase in the TF2/PyTorch README example.

Created on 25 Nov 2019 · 6 Comments · Source: huggingface/transformers

🐛 Bug

Model I am using (Bert, XLNet....): TFBertForSequenceClassification

Language I am using the model on (English, Chinese....): English

The problem arises when using:

The task I am working on is:

  • [x] an official GLUE/SQuAD task: Sequence Classification
  • [ ] my own task or dataset: (give details)

To Reproduce

Steps to reproduce the behavior:

  1. Run the attached script.
  2. Observe
$ /Users/igor/projects/ml-venv/bin/python /Users/igor/projects/transformers-experiments/paraphrasing_issue.py
2019-11-25 08:58:53.985213: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fed57a2be00 executing computations on platform Host. Devices:
2019-11-25 08:58:53.985243: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset glue (/Users/igor/tensorflow_datasets/glue/mrpc/0.0.2)
INFO:absl:Constructing tf.data.Dataset for split None, from /Users/igor/tensorflow_datasets/glue/mrpc/0.0.2
Train for 115 steps, validate for 7 steps
Epoch 1/2
  4/115 [>.............................] - ETA: 1:22:04 - loss: 0.6936
  5/115 [>.............................] - ETA: 1:18:44 - loss: 0.6876
  6/115 [>.............................] - ETA: 1:16:01 - loss: 0.6760
115/115 [==============================] - 4587s 40s/step - loss: 0.5850 - accuracy: 0.7045 - val_loss: 0.4695 - val_accuracy: 0.8137
Epoch 2/2
115/115 [==============================] - 4927s 43s/step - loss: 0.3713 - accuracy: 0.8435 - val_loss: 0.3825 - val_accuracy: 0.8358
sentence_1 is a paraphrase of sentence_0
sentence_2 is a paraphrase of sentence_0
  3. Wonder why.

The script (paraphrasing_issue.py):
import tensorflow as tf
import tensorflow_datasets
from transformers import *

# Load dataset, tokenizer, model from pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
data = tensorflow_datasets.load('glue/mrpc')

# Prepare dataset for GLUE as a tf.data.Dataset instance
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)

# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule 
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
                    validation_data=valid_dataset, validation_steps=7)

# Load the TensorFlow model in PyTorch for inspection
model.save_pretrained('./save/')
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)

# Quickly test a few predictions - MRPC is a paraphrasing task, let's see if our model learned the task
sentence_0 = "This research was consistent with his findings."
sentence_1 = "His findings were compatible with this research."
sentence_2 = "His findings were not compatible with this research."
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')

pred_1 = pytorch_model(inputs_1['input_ids'], token_type_ids=inputs_1['token_type_ids'])[0].argmax().item()
pred_2 = pytorch_model(inputs_2['input_ids'], token_type_ids=inputs_2['token_type_ids'])[0].argmax().item()

print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")

Expected behavior

sentence_1 is a paraphrase of sentence_0
sentence_2 is not a paraphrase of sentence_0

Environment

  • OS: MacOS
  • Python version: 3.7.5
  • PyTorch version: 1.3.1
  • PyTorch Transformers version (or branch): last commit afaa33585109550f9ecaaee4e47f187aaaefedd0 as of Sat Nov 23 11:34:45 2019 -0500.
  • Using GPU? Nope
  • Distributed or parallel setup? No, single machine
  • Any other relevant information: TF version is 2.0.0

All 6 comments

Hi, I'm investigating. For now, I can confirm the issue you observe. I've tested on both CPU and GPU, and it gives the same result. I've tested with PyTorch and TF models too; same result. Now, let's track down the cause!

Hi again,
OK, I've retrained a PyTorch model using run_glue.py on MRPC to check.
The final metrics are:

***** Eval results  *****
acc = 0.8382608695652174
acc_and_f1 = 0.8608840882272851
f1 = 0.8835073068893529

So it's not crazy high but not near random either.
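
For reference, a rough sketch of how such a manual re-check can be run against the fine-tuned model; the './mrpc-output/' directory is hypothetical and stands in for whatever --output_dir was passed to run_glue.py:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the model fine-tuned by run_glue.py ('./mrpc-output/' is a placeholder path).
tokenizer = BertTokenizer.from_pretrained('./mrpc-output/')
model = BertForSequenceClassification.from_pretrained('./mrpc-output/')
model.eval()

def same_meaning(sent_a, sent_b):
    inputs = tokenizer.encode_plus(sent_a, sent_b, add_special_tokens=True,
                                   return_tensors='pt')
    with torch.no_grad():
        logits = model(inputs['input_ids'],
                       token_type_ids=inputs['token_type_ids'])[0]
    return bool(logits.argmax().item())  # True -> predicted paraphrase

print(same_meaning("This research was consistent with his findings.",
                   "His findings were compatible with this research."))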

Then I've retested:

Is "This research was consistent with his findings" same as:

"His findings were compatible with this research." ?
TRUE -> 😄

"His findings were not compatible with this research." ?
TRUE -> 😢

I've taken a more complex sentence from the training set:

Is 'Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.' same as:

"Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence." ?
TRUE -> 😄

"Referring to him as only "the witness", Amrozi accused his brother of not deliberately distorting his evidence." ?
TRUE -> 😢

"platypus to him as only "the platypus", platypus accused his platypus of deliberately platypus his evidence." ?
TRUE -> 😭

"platypus to him as only "the platypus", platypus accused his platypus of deliberately platypus his platypus." ?
FALSE -> 🎉

Here we see that it's not robust to "not", just as in the original case. It's also not robust to replacing words with "platypus" until I replace 6 of them (which is admittedly quite disappointing performance from the model, it's true).

I've taken sentences from the test set:

Is "A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night." same as:

"A tropical storm rapidly developed in the Gulf of Mexico on Sunday and could have hurricane-force winds when it hits land somewhere along the Louisiana coast Monday night." ?
TRUE -> 😢
----------------------------------------------------------------------------------------
Is "The broader Standard & Poor's 500 Index <.SPX> was 0.46 points lower, or 0.05 percent, at 997.02." same as:

"The technology-laced Nasdaq Composite Index .IXIC was up 7.42 points, or 0.45 percent, at 1,653.44." ?
FALSE -> 😄
--------------------------------------------------------------------------------------------
Is "NASA plans to follow-up the rovers' missions with additional orbiters and landers before launching a long-awaited sample-return flight." same as:

"NASA plans to explore the Red Planet with ever more sophisticated robotic orbiters and landers."
FALSE -> 😄
----------------------------------------------------------------------------------------
Is "We are piloting it there to see whether we roll it out to other products." same as:

"Macromedia is piloting this product activation system in Contribute to test whether to roll it out to other products."
TRUE -> 😄

Here we see that sometimes it works, sometimes it doesn't. I might be wrong, but I haven't seen anything in the code that could explain this issue (83% is the final accuracy on the dev set... OK, but that still means 1 error in 5 cases). A priori, I'd say that basic BERT trained like that on this tiny dataset is simply not that robust for this task in a general setting, and would need more data or at least more data augmentation.
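
To make the data-augmentation idea a bit more concrete, here is a minimal, purely illustrative sketch; the negate_copula helper is hypothetical and only handles the trivial "was"/"were" case; a real augmentation pipeline would need proper parsing:

# Hypothetical negation-based augmentation: insert "not" after a simple
# copula in the second sentence and flip the label to 0 (not a paraphrase).
def negate_copula(sentence):
    for aux in (" was ", " were "):
        if aux in sentence:
            return sentence.replace(aux, aux.rstrip() + " not ", 1)
    return None  # no trivially negatable verb found

pair = ("This research was consistent with his findings.",
        "His findings were compatible with this research.")
negated = negate_copula(pair[1])
if negated is not None:
    augmented_example = (pair[0], negated, 0)  # label 0 = not a paraphrase
    print(augmented_example)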

Do you share my conclusion or see something different?

Thanks for the investigation. Was the performance any different back when that example was put into the README?

TBH, personally I wasn't there, so I don't know...
Maybe someone at huggingface can answer this question?
I've been looking at the MRPC leaderboard https://gluebenchmark.com/leaderboard/ and BERT's score there is around my training above, so it looks like a normal score.

MRPC is a very small dataset (the smallest in the GLUE benchmark, which is why we use it as an example). It should not be expected to generalize well or be usable in real-life settings.
The performance you got, @mandubian, is indeed a normal score.

Sounds like we don't think there's an actionable issue here.

