Transformers: Bug in MiniLM-L12-H384-uncased modelhub model files

Created on 15 Jul 2020 · 7 comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...): microsoft/MiniLM-L12-H384-uncased

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

The task I am working on is:

  • [x] an official GLUE/SQuAD task: SQuAD v2
  • [ ] my own task or dataset: (give details below)

Problem: The vocab for microsoft/MiniLM-L12-H384-uncased is missing a token => wrong tokenization => bad performance for SQuAD fine-tuning
Potential fix: Upload the original vocab published in Microsoft's repository (https://github.com/microsoft/unilm/tree/master/minilm)
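
As a quick way to pin down which token is missing, one could diff the model hub tokenizer's vocab against the vocab.txt shipped in Microsoft's release. This is only a minimal sketch: it assumes the original release has already been downloaded into the `<github_minilm_model_directory>` placeholder directory used in the reproduction steps below.

```
from transformers import AutoTokenizer

tokenizer_modelhub = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")

# vocab.txt from the downloaded Microsoft release (local path is an assumption)
with open("<github_minilm_model_directory>/vocab.txt", encoding="utf-8") as f:
    original_vocab = [line.rstrip("\n") for line in f]

modelhub_vocab = set(tokenizer_modelhub.get_vocab().keys())
missing = [token for token in original_vocab if token not in modelhub_vocab]
print("tokens missing from the model hub vocab:", missing)
```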

To reproduce

Steps to reproduce the behavior:

  1. Tokenize a sample English sentence with the MiniLM model downloaded from the model hub.
  2. Compare the model hub tokenizer's vocab size with the model hub model's embedding vocab size:
```
from transformers import AutoTokenizer, AutoModel

tokenizer_modelhub = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")
model_modelhub = AutoModel.from_pretrained("microsoft/MiniLM-L12-H384-uncased")

assert tokenizer_modelhub.vocab_size == model_modelhub.embeddings.word_embeddings.num_embeddings, \
    "tokenizer vocab_size {} doesn't match embedding vocab size {}".format(
        tokenizer_modelhub.vocab_size, model_modelhub.embeddings.word_embeddings.num_embeddings)
```

Output:

```
AssertionError: tokenizer vocab_size 30521 doesn't match embedding vocab size 30522
```
  1. Download "original" MiniLM model from Microsoft's MiniLM GitHub Repo (https://1drv.ms/u/s!AjHn0yEmKG8qixAYyu2Fvq5ulnU7?e=DFApTA)
  2. Comparing the modelhub MiniLM tokenizer and "original" MiniLM tokenizer token ids
```
import torch
from transformers import AutoConfig, AutoModelForQuestionAnswering

input_ids_modelhub = torch.tensor([tokenizer_modelhub.encode("Let's see all hidden-states and attentions on this text")])

config_github = AutoConfig.from_pretrained("<github_minilm_model_directory>")
tokenizer_github = AutoTokenizer.from_pretrained("<github_minilm_model_directory>", config=config_github)
model_github = AutoModelForQuestionAnswering.from_pretrained("<github_minilm_model_directory>", config=config_github)

# The QA head wraps the encoder, so the embeddings live on the base model.
assert tokenizer_github.vocab_size == model_github.base_model.embeddings.word_embeddings.num_embeddings, \
    "tokenizer vocab_size {} doesn't match embedding vocab size {}".format(
        tokenizer_github.vocab_size, model_github.base_model.embeddings.word_embeddings.num_embeddings)

input_ids_github = torch.tensor([tokenizer_github.encode("Let's see all hidden-states and attentions on this text")])
```

```
print(input_ids_github)
tensor([[ 101, 2292, 1005, 1055, 2156, 2035, 5023, 1011, 2163, 1998, 3086, 2015,
         2006, 2023, 3793,  102]])

print(input_ids_modelhub)
tensor([[ 100, 2291, 1004, 1054, 2155, 2034, 5022, 1010, 2162, 1997, 3085, 2014,
         2005, 2022, 3792,  101]])
```
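
The two outputs above differ by exactly one at every position, which is consistent with a single entry missing near the top of the model hub vocab.txt shifting all subsequent ids down by one. A minimal sketch of that check, reusing the tensors from the snippet above:

```
# Every model hub id should be exactly one lower than the corresponding original id.
shift = input_ids_github - input_ids_modelhub
print(shift)                     # expected: a tensor of all ones
print(bool((shift == 1).all()))  # expected: True
```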

  5. Fine-tune the model hub MiniLM model on SQuAD v2:

```
python examples/question-answering/run_squad.py \
  --model_type bert \
  --model_name_or_path microsoft/MiniLM-L12-H384-uncased \
  --output_dir finetuned_modelhub_minilm \
  --data_dir data/squad20 \
  --train_file train-v2.0.json \
  --predict_file dev-v2.0.json \
  --learning_rate 4e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --per_gpu_train_batch_size 12 \
  --per_gpu_eval_batch_size 12 \
  --gradient_accumulation_steps 4 \
  --version_2_with_negative \
  --do_lower_case \
  --verbose_logging \
  --do_train \
  --do_eval \
  --seed 42 \
  --save_steps 5000 \
  --overwrite_output_dir \
  --overwrite_cache
```

Results:

```
{'exact': 59.681630590415224, 'f1': 63.78250778488946, 'total': 11873, 'HasAns_exact': 49.73009446693657, 'HasAns_f1': 57.94360913123985, 'HasAns_total': 5928, 'NoAns_exact': 69.60470984020185, 'NoAns_f1': 69.60470984020185, 'NoAns_total': 5945, 'best_exact': 59.690053061568264, 'best_exact_thresh': 0.0, 'best_f1': 63.79093025604285, 'best_f1_thresh': 0.0}
```

  6. Fine-tune the "original" GitHub MiniLM model on SQuAD v2:

```
python examples/question-answering/run_squad.py \
  --model_type bert \
  --model_name_or_path <github_minilm_model_directory> \
  --output_dir finetuned_github_minilm \
  --data_dir data/squad20 \
  --train_file train-v2.0.json \
  --predict_file dev-v2.0.json \
  --learning_rate 4e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --per_gpu_train_batch_size 12 \
  --per_gpu_eval_batch_size 12 \
  --gradient_accumulation_steps 4 \
  --version_2_with_negative \
  --do_lower_case \
  --verbose_logging \
  --do_train \
  --do_eval \
  --seed 42 \
  --save_steps 5000 \
  --overwrite_output_dir \
  --overwrite_cache
```

Results:

```
{'exact': 76.23178640613156, 'f1': 79.57013365427773, 'total': 11873, 'HasAns_exact': 78.50877192982456, 'HasAns_f1': 85.1950399590485, 'HasAns_total': 5928, 'NoAns_exact': 73.96131202691338, 'NoAns_f1': 73.96131202691338, 'NoAns_total': 5945, 'best_exact': 76.23178640613156, 'best_exact_thresh': 0.0, 'best_f1': 79.57013365427775, 'best_f1_thresh': 0.0}
```
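
To quantify the gap between the two runs, here is a tiny sketch using the reported numbers above:

```
# SQuAD v2 gap: original GitHub checkpoint vs. the (broken-vocab) model hub checkpoint.
em_gap = 76.23178640613156 - 59.681630590415224   # ~16.55 exact-match points
f1_gap = 79.57013365427773 - 63.78250778488946    # ~15.79 F1 points
print(round(em_gap, 2), round(f1_gap, 2))
```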

Expected behavior

  1. Both assertions should pass.
  2. input_ids_modelhub and input_ids_github should be identical (see the sketch after this list).
  3. Fine-tuning the model hub MiniLM files should reproduce the downstream results mentioned in the MiniLM model card.
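
A hedged sketch of the check implied by item 2, reusing the tokenizers loaded in the reproduction steps; once the hub vocab is fixed, this should pass:

```
sentence = "Let's see all hidden-states and attentions on this text"
ids_modelhub = tokenizer_modelhub.encode(sentence)
ids_github = tokenizer_github.encode(sentence)
assert ids_modelhub == ids_github, "tokenizers still disagree: {} vs {}".format(ids_modelhub, ids_github)
```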

Environment info

  • transformers version: 3.0.2
  • Platform: Ubuntu 18.04.4 LTS
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.1 (Yes)
  • Tensorflow version (GPU?): Not using
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

All 7 comments

I will be off for the next two weeks - maybe @sshleifer @sgugger @julien-c can take a look?

@JetRunner Do you know who from @microsoft uploaded the MiniLM model?

@patrickvonplaten did it if I remember correctly. He's on vacation, so I'll take a look.

Here's the diff:

[image: diff between the model hub vocab.txt and the original vocab.txt]

@patrickvonplaten when you are back, please check why this happened.
I'll re-upload vocab.txt to resolve the problem for now.

@julien-c I've re-uploaded it. However, CDN seems to have cached the incorrect version (https://cdn.huggingface.co/microsoft/MiniLM-L12-H384-uncased/vocab.txt).

Yes, the CDN caches files for up to 24 hours on each POP. However, AFAIK the library doesn't load tokenizer files from the CDN anyway.
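
For anyone still seeing the stale file on their own machine, a possible client-side workaround (it does not purge the CDN itself) is to force transformers to re-fetch the tokenizer files; `force_download` is a standard `from_pretrained` argument:

```
from transformers import AutoTokenizer

# Re-fetch the tokenizer files instead of reusing a locally cached copy.
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/MiniLM-L12-H384-uncased", force_download=True
)
print(tokenizer.vocab_size)  # should print 30522 once the fixed vocab.txt is served
```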

The model is working now
