Is there any general strategy for tokenizing text in C++ in a way that's compatible with the existing pretrained BertTokenizer implementation?
I'm looking to use a finetuned BERT model in C++ for inference, and currently the only way seems to be to reproduce the BertTokenizer code manually (or modify it to be compatible with torchscript). Has anyone come up with a better solution than this?
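For reference, the piece that makes a manual port feasible is that WordPiece is just a greedy longest-match-first split against the vocab. Below is a minimal sketch of that core loop (the tiny vocab is hypothetical, and a real port would also have to load BERT's `vocab.txt` and replicate `BasicTokenizer`: lowercasing, punctuation/CJK splitting, accent stripping):

```python
def wordpiece(word, vocab, unk="[UNK]", max_chars=100):
    """Greedy longest-match-first WordPiece split of a single word."""
    if len(word) > max_chars:
        return [unk]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Find the longest vocab entry matching at `start`.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        pieces.append(cur)
        start = end
    return pieces

# Hypothetical toy vocab, just to exercise the loop:
vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", vocab))    # ['play', '##ing']
```

The same loop translates almost line-for-line to C++ with an `std::unordered_set<std::string>` for the vocab, which is essentially what a handwritten port ends up doing.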
You should wait a few days if you can because @n1t0 is working on something that will very likely solve your problem and it should be ready for a first release before the end of the year.
Any update on this? It is already beyond "the end of the year".
I also tried to figure out an alternative to porting the tokenizer manually. Will your approach handle multiple models? I'm looking for a GPT-2 tokenizer in C++.
Check out this repo: https://github.com/huggingface/tokenizers
You can already use it from transformers via BertTokenizerFast.