Hi, are any character-level language models available? Transformer-XL mentions in their paper that they did both word level and character level stuff, yet here it seems only the word level one is available? Is that correct?
I don't think we have any character-level language models available on huggingface.co/models, but they should be pretty straightforward to train.
BTW I'm sure you saw it already, but PyTorch's nn.Transformer tutorial is a char-level language model.
Hi, that's actually a word-level model.
Do you know of any pretrained bidirectional transformer-based character level language models that I can use?
No I'm not aware of any pretrained one.
We will soon have a pretrained ReformerLM model at the character level.
@patrickvonplaten just out of curiosity, would this be the model trained on "Crime and Punishment" from their Colab (but the vocab is not char-only), or do you have your own trained model? 🤔
Yeah, that was my loose definition of char-only, I guess :D. The vocab has 320 tokens, so it's more like a "very" small subword-unit level.
Example:
```python
from transformers import ReformerTokenizer

tok = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
tokens = tok.encode("This is a test sentence")  # [108, 265, 24, 111, 4, 3, 249, 7, 76, 25, 69]
print([tok.decode(token) for token in tokens])  # ['T', 'h', 'is', 'is', 'a', 't', 'est', 's', 'ent', 'en', 'ce']
```
"True" char-level is cool because obviously the tokenizer is then pretty trivial :)
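To illustrate how trivial a char-level tokenizer can be, here is a minimal sketch. This is purely hypothetical (not the actual tokenizer of any HF model); the `OFFSET` reserving ids 0/1 for special tokens like padding/EOS is an assumption for illustration.

```python
# Minimal char-level "tokenizer" sketch: each character maps to its
# Unicode code point plus an offset. The offset reserving ids 0/1 for
# special tokens (e.g. pad/EOS) is an assumption, not a real model's scheme.
OFFSET = 2

def char_encode(text):
    return [ord(c) + OFFSET for c in text]

def char_decode(ids):
    return "".join(chr(i - OFFSET) for i in ids)

ids = char_encode("hello")
print(ids)               # [106, 103, 110, 110, 113]
print(char_decode(ids))  # hello
```

The whole "vocabulary" is just the Unicode code space, so there is no vocab file to train, ship, or keep in sync with the model.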
Btw, we now have a "True" char-level reformer model here: https://huggingface.co/google/reformer-enwik8 :-)
Any chance of having a TF version of the Reformer model?
Yes, in ~2 months I would guess.
@patrickvonplaten Is there any notebook/doc on how to fine-tune the char-level model using the reformer-enwik8 model? A doc showing how to pass training data for fine-tuning would be helpful.
You should be able to leverage the code shown on the model card of reformer-enwik8 here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8 . It shows how data is passed to the model.
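For the data-passing step, a generic sketch of how a raw text corpus can be sliced into fixed-length character-id sequences for language-model training is below. This is not the code from the model card; `SEQ_LEN` and `make_batches` are illustrative names, and real Reformer training would use a much longer context than 8.

```python
# Hypothetical sketch: slice a raw text corpus into fixed-length
# character-id sequences suitable as LM training examples.
SEQ_LEN = 8  # illustrative only; real char-level Reformer contexts are far longer

def make_batches(text, seq_len=SEQ_LEN):
    ids = [ord(c) for c in text]  # naive char-level ids via code points
    # drop the trailing remainder so every sequence is full-length
    n = len(ids) // seq_len * seq_len
    return [ids[i:i + seq_len] for i in range(0, n, seq_len)]

batches = make_batches("abcdefghijklmnopqr")
print(len(batches))  # 2 full sequences; the trailing "qr" is dropped
print(batches[0])    # ids for "abcdefgh"
```

Each resulting sequence would then be fed to the model as `input_ids` (with labels equal to the inputs for causal LM loss), matching the pattern shown on the model card.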