Transformers: access to the vocabulary

Created on 25 Nov 2019  ·  2 comments  ·  Source: huggingface/transformers

❓ Questions & Help

Is there any way to get access to the vocabulary in GPT-2? Like a list: [subtoken1, subtoken2, ...subtoken 10000...]

Thank you in advance!

Most helpful comment

You can obtain the 50,257 different tokens with the following code:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# `encoder` is the tokenizer's internal {token: id} dict for the GPT-2 BPE vocabulary
vocab = list(tokenizer.encoder.keys())
assert len(vocab) == tokenizer.vocab_size  # the assertion holds

Close the issue if you've resolved your problem! ;)


All 2 comments



Thank you!
