Transformers: access to the vocabulary

Created on 25 Nov 2019  ·  2 comments  ·  Source: huggingface/transformers

❓ Questions & Help

Is there any way to get access to the vocabulary in GPT-2? Like a list: [subtoken1, subtoken2, ...subtoken 10000...]

Thank you in advance!

Most helpful comment

You can obtain the 50,257 different tokens with the following code:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# `encoder` is the tokenizer's internal {token: id} dict for the GPT-2 BPE vocabulary
vocab = list(tokenizer.encoder.keys())
assert len(vocab) == tokenizer.vocab_size  # the assertion holds

Close the issue if you've resolved your problem! ;)


All 2 comments



Thank you!
