As far as I know, transformers can automatically download models via the from_pretrained() function.
The pre-trained BERT/RoBERTa models are stored under the path
./cach/.pytorch/.transformer/....
But all the downloaded models have names like this:
d9fc1956a01fe24af529f239031a439661e7634e6e931eaad2393db3ae1eff03.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.json
That is not readable, and it is hard to tell which model is the one I want.
In other words, if I want to find the pretrained model 'uncased_L-12_H-768_A-12', I can't tell which file it is.
Thanks for your answer.
Hi, they are named this way because it's a clean way to make sure the model on S3 is the same as the model in the cache. The name is created from the etag of the file hosted on S3.
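To illustrate the naming scheme, here is a simplified sketch: hash the download URL, then append a dot and the hash of the etag. This mirrors the general idea behind the cache filenames above, but it is not guaranteed to match the library's exact implementation, and the URL and etag below are made-up examples.

```python
import hashlib

def url_to_filename(url: str, etag: str) -> str:
    """Sketch: derive a cache filename from a URL and its etag.

    Produces "<sha256(url)>.<sha256(etag)>", which resembles the
    unreadable names seen in the cache directory.
    """
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    etag_hash = hashlib.sha256(etag.encode("utf-8")).hexdigest()
    return f"{url_hash}.{etag_hash}"

# Hypothetical URL and etag, for illustration only
name = url_to_filename(
    "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin",
    '"0123456789abcdef"',
)
print(name)  # two 64-character hex digests joined by a dot
```

Because the name is a pure function of the URL and etag, the cache can detect that the remote file changed (new etag, new filename) without storing any human-readable metadata.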
If you want to save it with a given name, you can save it as such:
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")
model.save_pretrained("cased_L-12_H-768_A-12")
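Once saved under a readable directory name, the model can be loaded back from that local path instead of the hub. A quick sketch (the directory name is the one from the snippet above; this downloads bert-base-cased on first run):

```python
from transformers import BertModel

# Download once and save under a readable directory name
model = BertModel.from_pretrained("bert-base-cased")
model.save_pretrained("cased_L-12_H-768_A-12")

# Later: load from the local directory, no hub lookup needed
reloaded = BertModel.from_pretrained("cased_L-12_H-768_A-12")
```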
@LysandreJik, following up on the question above and your answer, I ran this first:
from transformers import RobertaModel
model = RobertaModel.from_pretrained("roberta-large")
model.save_pretrained("./roberta-large-355M")
I assume config.json, the vocab files, and all the other necessary files are expected to be saved in the roberta-large-355M directory.
Then I ran:
python ./examples/run_glue.py --model_type roberta --model_name_or_path ./roberta-large-355M --task_name MRPC --do_train --do_eval --do_lower_case --data_dir $GLUE_DIR/$TASK_NAME --max_seq_length 128 --per_gpu_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 2.0 --output_dir ./results/mrpc/
and I am getting:
OSError: Model name './roberta-large-355M' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed './roberta-large-355M' was a path or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url
I checked roberta-large-355M, and it only contains:
config.json
pytorch_model.bin
The files ['vocab.json', 'merges.txt'] are missing.
The same issue occurs with XLNet:
../workspace/transformers/xlnet_base# ls
config.json pytorch_model.bin
What am I missing here? Why aren't all the files downloaded properly?
Thanks.
You also have to save the tokenizer into the same directory:
tokenizer.save_pretrained("./roberta-large-355M")
Let me know if this solves your issue.
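Putting the two saves together, a complete sketch looks like this (it downloads roberta-large, so the first run is large; save_pretrained on both objects writes the weights, config.json, vocab.json, and merges.txt into one directory that run_glue.py can then consume via --model_name_or_path):

```python
from transformers import RobertaModel, RobertaTokenizer

save_dir = "./roberta-large-355M"

# Save the model weights and config.json
model = RobertaModel.from_pretrained("roberta-large")
model.save_pretrained(save_dir)

# Save vocab.json and merges.txt alongside them, so that the
# tokenizer can also be loaded from the same local directory
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
tokenizer.save_pretrained(save_dir)
```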
OSError: Model name 'roberta-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'roberta-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
I got the error above even after saving the tokenizer, config, and model in the same directory.
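When this error persists, it's worth double-checking exactly which files landed in the directory. A small stdlib-only sketch (the helper name and the required-files list for a RoBERTa-style checkpoint are my own, for illustration):

```python
import os

# Files a RoBERTa-style local checkpoint directory typically needs
REQUIRED = ["config.json", "pytorch_model.bin", "vocab.json", "merges.txt"]

def missing_files(model_dir):
    """Return the required files that are absent from model_dir."""
    present = set(os.listdir(model_dir)) if os.path.isdir(model_dir) else set()
    return [name for name in REQUIRED if name not in present]

# Demo: a dummy directory containing only two of the four files
os.makedirs("demo_model_dir", exist_ok=True)
for name in ["config.json", "pytorch_model.bin"]:
    open(os.path.join("demo_model_dir", name), "w").close()

print(missing_files("demo_model_dir"))  # ['vocab.json', 'merges.txt']
```

If the vocabulary files show up as missing, re-run tokenizer.save_pretrained on that directory before pointing --model_name_or_path at it.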