Model I am using (Bert, XLNet ...): microsoft/MiniLM-L12-H384-uncased
Language I am using the model on (English, Chinese ...): English
The problem arises when using: the official example scripts (examples/question-answering/run_squad.py)
The task I am working on is: SQuAD v2 fine-tuning
Problem: The vocab for microsoft/MiniLM-L12-H384-uncased on the model hub is missing a token. This shifts the token ids, produces wrong tokenization, and leads to bad performance for SQuAD fine-tuning.
Potential fix: Upload the original vocab that was published in the original Microsoft Repository (https://github.com/microsoft/unilm/tree/master/minilm)
Steps to reproduce the behavior:
from transformers import AutoTokenizer, AutoModel
tokenizer_modelhub = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")
model_modelhub = AutoModel.from_pretrained("microsoft/MiniLM-L12-H384-uncased")
assert tokenizer_modelhub.vocab_size == model_modelhub.embeddings.word_embeddings.num_embeddings, "tokenizer vocab_size {} doesn't match embedding vocab size {} ".format(tokenizer_modelhub.vocab_size, model_modelhub.embeddings.word_embeddings.num_embeddings)
Output
AssertionError: tokenizer vocab_size 30521 doesn't match embedding vocab size 30522
import torch
from transformers import AutoConfig, AutoModelForQuestionAnswering
input_ids_modelhub = torch.tensor([tokenizer_modelhub.encode("Let's see all hidden-states and attentions on this text")])
config_github = AutoConfig.from_pretrained("<github_minilm_model_directory>")
tokenizer_github = AutoTokenizer.from_pretrained(
    "<github_minilm_model_directory>", config=config_github)
model_github_finetuned = AutoModelForQuestionAnswering.from_pretrained(
    "<github_minilm_model_directory>", config=config_github)
# get_input_embeddings() returns the word-embedding layer regardless of the task head
github_embedding_size = model_github_finetuned.get_input_embeddings().num_embeddings
assert tokenizer_github.vocab_size == github_embedding_size, "tokenizer vocab_size {} doesn't match embedding vocab size {} ".format(tokenizer_github.vocab_size, github_embedding_size)
input_ids_github = torch.tensor([tokenizer_github.encode("Let's see all hidden-states and attentions on this text")])
```
print(input_ids_github)
tensor([[ 101, 2292, 1005, 1055, 2156, 2035, 5023, 1011, 2163, 1998, 3086, 2015,
2006, 2023, 3793, 102]])
print(input_ids_modelhub)
tensor([[ 100, 2291, 1004, 1054, 2155, 2034, 5022, 1010, 2162, 1997, 3085, 2014,
2005, 2022, 3792, 101]])
5. Fine-tune modelhub MiniLM model for SQuAD ver 2
python examples/question-answering/run_squad.py --model_type bert \
  --model_name_or_path microsoft/Multilingual-MiniLM-L12-H384 \
  --output_dir finetuned_modelhub_minilm \
  --data_dir data/squad20 \
  --train_file train-v2.0.json \
  --predict_file dev-v2.0.json \
  --learning_rate 4e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --per_gpu_train_batch_size 12 \
  --per_gpu_eval_batch_size 12 \
  --gradient_accumulation_steps 4 \
  --version_2_with_negative \
  --do_lower_case \
  --verbose_logging \
  --do_train \
  --do_eval \
  --seed 42 \
  --save_steps 5000 \
  --overwrite_output_dir \
  --overwrite_cache
Results:
{'exact': 59.681630590415224, 'f1': 63.78250778488946, 'total': 11873, 'HasAns_exact': 49.73009446693657, 'HasAns_f1': 57.94360913123985, 'HasAns_total': 5928, 'NoAns_exact': 69.60470984020185, 'NoAns_f1': 69.60470984020185, 'NoAns_total': 5945, 'best_exact': 59.690053061568264, 'best_exact_thresh': 0.0, 'best_f1': 63.79093025604285, 'best_f1_thresh': 0.0}
6. Fine-tune original MiniLM model for SQuAD ver 2
python examples/question-answering/run_squad.py --model_type bert \
  --model_name_or_path \
  --output_dir finetuned_github_minilm \
  --data_dir data/squad20 \
  --train_file train-v2.0.json \
  --predict_file dev-v2.0.json \
  --learning_rate 4e-5 \
  --num_train_epochs 4 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --per_gpu_train_batch_size 12 \
  --per_gpu_eval_batch_size 12 \
  --gradient_accumulation_steps 4 \
  --version_2_with_negative \
  --do_lower_case \
  --verbose_logging \
  --do_train \
  --do_eval \
  --seed 42 \
  --save_steps 5000 \
  --overwrite_output_dir \
  --overwrite_cache
Results:
{'exact': 76.23178640613156, 'f1': 79.57013365427773, 'total': 11873, 'HasAns_exact': 78.50877192982456, 'HasAns_f1': 85.1950399590485, 'HasAns_total': 5928, 'NoAns_exact': 73.96131202691338, 'NoAns_f1': 73.96131202691338, 'NoAns_total': 5945, 'best_exact': 76.23178640613156, 'best_exact_thresh': 0.0, 'best_f1': 79.57013365427775, 'best_f1_thresh': 0.0}
```
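The two id tensors printed above differ by exactly one at every position, which is the pattern you would expect if a single line were missing near the top of vocab.txt. Here is a toy sketch of that effect. The token lists are made up for illustration and are not the real MiniLM vocab, and dropping the first line is only an assumption about where the gap might be.

```python
# Toy illustration only: these token lists are made up, not the real MiniLM vocab.
def build_vocab(tokens):
    """Map each token to its line index, as a BERT-style vocab.txt loader does."""
    return {token: index for index, token in enumerate(tokens)}


def encode(words, vocab, unk_token="[UNK]"):
    """Look up each word's id, falling back to the unknown-token id."""
    return [vocab.get(word, vocab[unk_token]) for word in words]


complete_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "let", "see", "text"]
# Hypothetical damage: the first line of the file was dropped.
truncated_tokens = complete_tokens[1:]

complete_vocab = build_vocab(complete_tokens)
truncated_vocab = build_vocab(truncated_tokens)

words = ["[CLS]", "let", "see", "text", "[SEP]"]
ids_complete = encode(words, complete_vocab)
ids_truncated = encode(words, truncated_vocab)

print(ids_complete)   # [2, 4, 5, 6, 3]
print(ids_truncated)  # [1, 3, 4, 5, 2]
```

Every id from the truncated vocab is one lower than its counterpart, so the pretrained embedding rows no longer line up with the tokens they were trained on.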
input_ids_modelhub and input_ids_github should produce the same results.
transformers version: 3.0.2
I will be off for the next two weeks - maybe @sshleifer @sgugger @julien-c can take a look?
@JetRunner Do you know who from @microsoft uploaded the MiniLM model?
@patrickvonplaten did it if I remember it right. He's on vacation so I'll take a look.
Here's the diff
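For anyone who wants to reproduce this kind of comparison locally, here is a minimal sketch using Python's difflib. The two token lists are tiny made-up stand-ins for the hub vocab.txt and the GitHub one, and `vocab_diff` is a hypothetical helper name. In practice each list would come from reading a vocab.txt line by line.

```python
import difflib


def vocab_diff(reference_tokens, candidate_tokens):
    """Return unified-diff lines describing how candidate differs from reference."""
    return list(
        difflib.unified_diff(
            reference_tokens,
            candidate_tokens,
            fromfile="reference/vocab.txt",  # illustrative label only
            tofile="candidate/vocab.txt",    # illustrative label only
            lineterm="",
            n=0,  # no context lines, only the changed entries
        )
    )


# Made-up stand-ins for the two vocab files.
reference = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello"]
candidate = ["[UNK]", "[CLS]", "[SEP]", "hello"]

for line in vocab_diff(reference, candidate):
    print(line)
```

A line starting with `-` marks a token that is present in the reference list but missing from the candidate.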

@patrickvonplaten when u are back to work, pls check why this happened.
I'll re-upload vocab.txt to resolve the problem for now.
@julien-c I've re-uploaded it. However, the CDN seems to have cached the incorrect version (https://cdn.huggingface.co/microsoft/MiniLM-L12-H384-uncased/vocab.txt).
Yes, the CDN caches files for up to 24 hours on each POP. However AFAIK the library doesn't load tokenizer files from the CDN anyways.
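One way to tell whether a served copy is still the stale one is to compare checksums of the served bytes and the re-uploaded file. This is a rough sketch: `sha256_hex` and `is_stale` are hypothetical helper names, the byte strings are made-up placeholders rather than real vocab contents, and fetching the file from the CDN is left out.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(payload).hexdigest()


def is_stale(served: bytes, uploaded: bytes) -> bool:
    """True when the served copy does not match the uploaded file."""
    return sha256_hex(served) != sha256_hex(uploaded)


# Placeholder contents; real use would read the local file and the downloaded copy.
uploaded_vocab = b"[PAD]\n[UNK]\n[CLS]\n[SEP]\n"
cached_vocab = b"[UNK]\n[CLS]\n[SEP]\n"

print(is_stale(cached_vocab, uploaded_vocab))    # True
print(is_stale(uploaded_vocab, uploaded_vocab))  # False
```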
The model is working now