Transformers: Bug in MiniLM-L12-H384-uncased modelhub model files

Created on 15 Jul 2020 · 7 comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...): microsoft/MiniLM-L12-H384-uncased

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

The task I am working on is:

  • [x] an official GLUE/SQuAD task: SQuAD v2
  • [ ] my own task or dataset: (give details below)

Problem: The vocab for microsoft/MiniLM-L12-H384-uncased is missing a token => wrong tokenization => bad performance for SQuAD fine-tuning
Potential fix: Upload the original vocab published in Microsoft's repository (https://github.com/microsoft/unilm/tree/master/minilm)
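
As a quick way to pin down which token is missing, one could diff the model hub tokenizer's vocab against the vocab.txt shipped in Microsoft's release. This is only a minimal sketch: it assumes the original release has already been downloaded into the `<github_minilm_model_directory>` placeholder directory used in the reproduction steps below.

```
from transformers import AutoTokenizer

tokenizer_modelhub = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")

# vocab.txt from the downloaded Microsoft release (local path is an assumption)
with open("<github_minilm_model_directory>/vocab.txt", encoding="utf-8") as f:
    original_vocab = [line.rstrip("\n") for line in f]

modelhub_vocab = set(tokenizer_modelhub.get_vocab().keys())
missing = [token for token in original_vocab if token not in modelhub_vocab]
print("tokens missing from the model hub vocab:", missing)
```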

To reproduce

Steps to reproduce the behavior:

  1. Tokenize a sample English sentence with the MiniLM model downloaded from the model hub.
  2. Compare the model hub tokenizer's vocab size with the model hub model's embedding vocab size:
```
from transformers import AutoTokenizer, AutoModel

tokenizer_modelhub = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")
model_modelhub = AutoModel.from_pretrained("microsoft/MiniLM-L12-H384-uncased")

assert tokenizer_modelhub.vocab_size == model_modelhub.embeddings.word_embeddings.num_embeddings, \
    "tokenizer vocab_size {} doesn't match embedding vocab size {}".format(
        tokenizer_modelhub.vocab_size, model_modelhub.embeddings.word_embeddings.num_embeddings)
```

Output:

```
AssertionError: tokenizer vocab_size 30521 doesn't match embedding vocab size 30522
```
  1. Download "original" MiniLM model from Microsoft's MiniLM GitHub Repo (https://1drv.ms/u/s!AjHn0yEmKG8qixAYyu2Fvq5ulnU7?e=DFApTA)
  2. Comparing the modelhub MiniLM tokenizer and "original" MiniLM tokenizer token ids
```
import torch
from transformers import AutoConfig, AutoModelForQuestionAnswering

input_ids_modelhub = torch.tensor([tokenizer_modelhub.encode("Let's see all hidden-states and attentions on this text")])

config_github = AutoConfig.from_pretrained("<github_minilm_model_directory>")
tokenizer_github = AutoTokenizer.from_pretrained("<github_minilm_model_directory>", config=config_github)
model_github = AutoModelForQuestionAnswering.from_pretrained("<github_minilm_model_directory>", config=config_github)

# The QA head wraps the encoder, so the embeddings live on the base model.
assert tokenizer_github.vocab_size == model_github.base_model.embeddings.word_embeddings.num_embeddings, \
    "tokenizer vocab_size {} doesn't match embedding vocab size {}".format(
        tokenizer_github.vocab_size, model_github.base_model.embeddings.word_embeddings.num_embeddings)

input_ids_github = torch.tensor([tokenizer_github.encode("Let's see all hidden-states and attentions on this text")])
```

```
print(input_ids_github)
tensor([[ 101, 2292, 1005, 1055, 2156, 2035, 5023, 1011, 2163, 1998, 3086, 2015,
         2006, 2023, 3793,  102]])

print(input_ids_modelhub)
tensor([[ 100, 2291, 1004, 1054, 2155, 2034, 5022, 1010, 2162, 1997, 3085, 2014,
         2005, 2022, 3792,  101]])
```
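
The two outputs above differ by exactly one at every position, which is consistent with a single entry missing near the top of the model hub vocab.txt shifting all subsequent ids down by one. A minimal sketch of that check, reusing the tensors from the snippet above:

```
# Every model hub id should be exactly one lower than the corresponding original id.
shift = input_ids_github - input_ids_modelhub
print(shift)                     # expected: a tensor of all ones
print(bool((shift == 1).all()))  # expected: True
```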

  5. Fine-tune the model hub MiniLM model on SQuAD v2:

```
python examples/question-answering/run_squad.py \
  --model_type bert \
  --model_name_or_path microsoft/MiniLM-L12-H384-uncased \
  --output_dir finetuned_modelhub_minilm \
  --data_dir data/squad20 \
  --train_file train-v2.0.json \
  --predict_file dev-v2.0.json \
  --learning_rate 4e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --per_gpu_train_batch_size 12 \
  --per_gpu_eval_batch_size 12 \
  --gradient_accumulation_steps 4 \
  --version_2_with_negative \
  --do_lower_case \
  --verbose_logging \
  --do_train \
  --do_eval \
  --seed 42 \
  --save_steps 5000 \
  --overwrite_output_dir \
  --overwrite_cache
```

Results:

```
{'exact': 59.681630590415224, 'f1': 63.78250778488946, 'total': 11873, 'HasAns_exact': 49.73009446693657, 'HasAns_f1': 57.94360913123985, 'HasAns_total': 5928, 'NoAns_exact': 69.60470984020185, 'NoAns_f1': 69.60470984020185, 'NoAns_total': 5945, 'best_exact': 59.690053061568264, 'best_exact_thresh': 0.0, 'best_f1': 63.79093025604285, 'best_f1_thresh': 0.0}
```

  6. Fine-tune the "original" GitHub MiniLM model on SQuAD v2:

```
python examples/question-answering/run_squad.py \
  --model_type bert \
  --model_name_or_path <github_minilm_model_directory> \
  --output_dir finetuned_github_minilm \
  --data_dir data/squad20 \
  --train_file train-v2.0.json \
  --predict_file dev-v2.0.json \
  --learning_rate 4e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --per_gpu_train_batch_size 12 \
  --per_gpu_eval_batch_size 12 \
  --gradient_accumulation_steps 4 \
  --version_2_with_negative \
  --do_lower_case \
  --verbose_logging \
  --do_train \
  --do_eval \
  --seed 42 \
  --save_steps 5000 \
  --overwrite_output_dir \
  --overwrite_cache
```

Results:

```
{'exact': 76.23178640613156, 'f1': 79.57013365427773, 'total': 11873, 'HasAns_exact': 78.50877192982456, 'HasAns_f1': 85.1950399590485, 'HasAns_total': 5928, 'NoAns_exact': 73.96131202691338, 'NoAns_f1': 73.96131202691338, 'NoAns_total': 5945, 'best_exact': 76.23178640613156, 'best_exact_thresh': 0.0, 'best_f1': 79.57013365427775, 'best_f1_thresh': 0.0}
```
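
To quantify the gap between the two runs, here is a tiny sketch using the reported numbers above:

```
# SQuAD v2 gap: original GitHub checkpoint vs. the (broken-vocab) model hub checkpoint.
em_gap = 76.23178640613156 - 59.681630590415224   # ~16.55 exact-match points
f1_gap = 79.57013365427773 - 63.78250778488946    # ~15.79 F1 points
print(round(em_gap, 2), round(f1_gap, 2))
```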

Expected behavior

  1. Both assertions should pass.
  2. input_ids_modelhub and input_ids_github should be identical (see the sketch after this list).
  3. Fine-tuning the model hub MiniLM files should reproduce the downstream results mentioned in the MiniLM model card.
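
A hedged sketch of the check implied by item 2, reusing the tokenizers loaded in the reproduction steps; once the hub vocab is fixed, this should pass:

```
sentence = "Let's see all hidden-states and attentions on this text"
ids_modelhub = tokenizer_modelhub.encode(sentence)
ids_github = tokenizer_github.encode(sentence)
assert ids_modelhub == ids_github, "tokenizers still disagree: {} vs {}".format(ids_modelhub, ids_github)
```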

Environment info

  • transformers version: 3.0.2
  • Platform: Ubuntu 18.04.4 LTS
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.1 (Yes)
  • Tensorflow version (GPU?): Not using
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

All 7 comments

I will be off for the next two weeks - maybe @sshleifer @sgugger @julien-c can take a look?

@JetRunner Do you know who from @microsoft uploaded the MiniLM model?

@patrickvonplaten did it if I remember correctly. He's on vacation, so I'll take a look.

Here's the diff:

[image: diff between the model hub vocab.txt and the original vocab.txt]

@patrickvonplaten when you are back, please check why this happened.
I'll re-upload vocab.txt to resolve the problem for now.

@julien-c I've re-uploaded it. However, CDN seems to have cached the incorrect version (https://cdn.huggingface.co/microsoft/MiniLM-L12-H384-uncased/vocab.txt).

Yes, the CDN caches files for up to 24 hours on each POP. However, AFAIK the library doesn't load tokenizer files from the CDN anyway.
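
For anyone still seeing the stale file on their own machine, a possible client-side workaround (it does not purge the CDN itself) is to force transformers to re-fetch the tokenizer files; `force_download` is a standard `from_pretrained` argument:

```
from transformers import AutoTokenizer

# Re-fetch the tokenizer files instead of reusing a locally cached copy.
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/MiniLM-L12-H384-uncased", force_download=True
)
print(tokenizer.vocab_size)  # should print 30522 once the fixed vocab.txt is served
```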

The model is working now
