Transformers: Character level models?

Created on 30 Apr 2020 · 12 Comments · Source: huggingface/transformers

Hi, are any character-level language models available? Transformer-XL mentions in their paper that they did both word level and character level stuff, yet here it seems only the word level one is available? Is that correct?

All 12 comments

I don't think we have any character-level language models available on huggingface.co/models, but they should be pretty straightforward to train.

BTW I'm sure you saw it already, but PyTorch's nn.Transformer tutorial is a char-level language model.

Hi, that's actually a word-level model.
Do you know of any pretrained bidirectional transformer-based character-level language models that I can use?

No I'm not aware of any pretrained one.

We will soon have a pretrained ReformerLM model at the character level

@patrickvonplaten Just out of curiosity, would this be the model trained on "Crime and Punishment" from their Colab (where the vocab is not char-only), or do you have your own trained model?

Yeah, that was my bad definition of char-only, I guess :D. The vocab has 320 tokens, so it's more like a "very" small word-unit level.

Example:

from transformers import ReformerTokenizer

tok = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
tokens = tok.encode("This is a test sentence")  # [108, 265, 24, 111, 4, 3, 249, 7, 76, 25, 69]
print([tok.decode(token) for token in tokens])  # ['T', 'h', 'is', 'is', 'a', 't', 'est', 's', 'ent', 'en', 'ce']

"True" char-level is cool because obviously the tokenizer is then pretty trivial :)

Btw, we now have a "True" char-level reformer model here: https://huggingface.co/google/reformer-enwik8 :-)
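Since a "true" char-level model needs no subword vocabulary, the mapping from text to ids can be done by hand. A minimal sketch, assuming each UTF-8 byte maps to its value plus 2 with ids 0 and 1 reserved for special tokens (as suggested by the reformer-enwik8 model card):

```python
def encode(text):
    # Map each UTF-8 byte to byte value + 2; ids 0 and 1 are
    # assumed reserved for padding/special tokens.
    return [b + 2 for b in text.encode("utf-8")]

def decode(ids):
    # Inverse mapping; skip the assumed reserved ids 0 and 1.
    return bytes(i - 2 for i in ids if i > 1).decode("utf-8")

print(encode("Hi"))          # [74, 107]
print(decode(encode("Hi")))  # Hi
```

This is exactly why the tokenizer is "pretty trivial": encoding and decoding are simple arithmetic on bytes, with no learned vocabulary involved.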

Any chance of having a TF version of the Reformer model?

Yes, in ~2 months I would guess

@patrickvonplaten Is there any notebook/doc on how to fine-tune the char-level model using the reformer-enwik8 model? A doc showing how to pass training data for fine-tuning would be helpful.

You should be able to leverage the code shown on the model card of reformer-enwik8 here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8 . It shows how data is passed to the model.
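To pass training data to a char-level model, the raw text just needs to be split into fixed-length id sequences. A minimal sketch with a hypothetical helper (`make_char_chunks` is not part of transformers), assuming the byte + 2 id scheme from the model card:

```python
def make_char_chunks(text, seq_len=128):
    # Hypothetical helper: split raw text into fixed-length sequences
    # of character ids for language-model training.
    ids = [b + 2 for b in text.encode("utf-8")]  # byte + 2 id scheme (assumption)
    # Drop the trailing remainder shorter than seq_len.
    return [ids[i:i + seq_len] for i in range(0, len(ids) - seq_len + 1, seq_len)]

chunks = make_char_chunks("some long training text " * 20, seq_len=64)
print(len(chunks), len(chunks[0]))  # 7 64
```

Each chunk can then serve as both `input_ids` and `labels` for causal LM training, which is the usual setup for a char-level language model.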
