Hi, are any character-level language models available? Transformer-XL mentions in their paper that they did both word level and character level stuff, yet here it seems only the word level one is available? Is that correct?
I don't think we have any character-level language models available on huggingface.co/models, but they should be pretty straightforward to train.
BTW I'm sure you saw it already, but PyTorch's nn.Transformer tutorial is a char-level language model.
Hi, that's actually a word-level model.
Do you know of any pretrained bidirectional transformer-based character level language models that I can use?
No I'm not aware of any pretrained one.
We will soon have a pretrained ReformerLM model at the character level.
@patrickvonplaten just out of curiosity, would this be the model trained on "Crime and Punishment" from their Colab (but the vocab is not char-only), or do you have your own trained model? 🤔
Yeah, that was my loose definition of char-only, I guess :D. The vocab has 320 tokens, so it's more like a "very" small subword-unit level.
Example:
```python
from transformers import ReformerTokenizer

tok = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
tokens = tok.encode("This is a test sentence")  # [108, 265, 24, 111, 4, 3, 249, 7, 76, 25, 69]
print([tok.decode(token) for token in tokens])  # ['T', 'h', 'is', 'is', 'a', 't', 'est', 's', 'ent', 'en', 'ce']
```
"True" char-level is cool because obviously the tokenizer is then pretty trivial :)
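To illustrate how trivial a char-level tokenizer can be, here is a minimal sketch. This is purely hypothetical (not the actual tokenizer of any HF model); the `OFFSET` reserving ids 0/1 for special tokens like padding/EOS is an assumption for illustration.

```python
# Minimal char-level "tokenizer" sketch: each character maps to its
# Unicode code point plus an offset. The offset reserving ids 0/1 for
# special tokens (e.g. pad/EOS) is an assumption, not a real model's scheme.
OFFSET = 2

def char_encode(text):
    return [ord(c) + OFFSET for c in text]

def char_decode(ids):
    return "".join(chr(i - OFFSET) for i in ids)

ids = char_encode("hello")
print(ids)               # [106, 103, 110, 110, 113]
print(char_decode(ids))  # hello
```

The whole "vocabulary" is just the Unicode code space, so there is no vocab file to train, ship, or keep in sync with the model.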
Btw, we now have a "True" char-level reformer model here: https://huggingface.co/google/reformer-enwik8 :-)
Any chance of having a TF version of the Reformer model?
Yes, in ~2 months I would guess.
@patrickvonplaten Is there any notebook/doc on how to fine-tune the char-level model using the reformer-enwik8 model? A doc showing how to pass training data for fine-tuning would be helpful.
You should be able to leverage the code shown on the model card of reformer-enwik8 here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8 . It shows how data is passed to the model.
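For the data-passing step, a generic sketch of how a raw text corpus can be sliced into fixed-length character-id sequences for language-model training is below. This is not the code from the model card; `SEQ_LEN` and `make_batches` are illustrative names, and real Reformer training would use a much longer context than 8.

```python
# Hypothetical sketch: slice a raw text corpus into fixed-length
# character-id sequences suitable as LM training examples.
SEQ_LEN = 8  # illustrative only; real char-level Reformer contexts are far longer

def make_batches(text, seq_len=SEQ_LEN):
    ids = [ord(c) for c in text]  # naive char-level ids via code points
    # drop the trailing remainder so every sequence is full-length
    n = len(ids) // seq_len * seq_len
    return [ids[i:i + seq_len] for i in range(0, n, seq_len)]

batches = make_batches("abcdefghijklmnopqr")
print(len(batches))  # 2 full sequences; the trailing "qr" is dropped
print(batches[0])    # ids for "abcdefgh"
```

Each resulting sequence would then be fed to the model as `input_ids` (with labels equal to the inputs for causal LM loss), matching the pattern shown on the model card.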