Is there any general strategy for tokenizing text in C++ in a way that's compatible with the existing pretrained BertTokenizer implementation?
I'm looking to use a finetuned BERT model in C++ for inference, and currently the only way seems to be to reproduce the BertTokenizer code manually (or modify it to be compatible with torchscript). Has anyone come up with a better solution than this?
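For reference, the piece that makes a manual port feasible is that WordPiece is just a greedy longest-match-first split against the vocab. Below is a minimal sketch of that core loop (the tiny vocab is hypothetical, and a real port would also have to load BERT's `vocab.txt` and replicate `BasicTokenizer`: lowercasing, punctuation/CJK splitting, accent stripping):

```python
def wordpiece(word, vocab, unk="[UNK]", max_chars=100):
    """Greedy longest-match-first WordPiece split of a single word."""
    if len(word) > max_chars:
        return [unk]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Find the longest vocab entry matching at `start`.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        pieces.append(cur)
        start = end
    return pieces

# Hypothetical toy vocab, just to exercise the loop:
vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", vocab))    # ['play', '##ing']
```

The same loop translates almost line-for-line to C++ with an `std::unordered_set<std::string>` for the vocab, which is essentially what a handwritten port ends up doing.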
You should wait a few days if you can because @n1t0 is working on something that will very likely solve your problem and it should be ready for a first release before the end of the year.
Any update on this? It is already beyond "the end of the year".
I also tried to figure out an alternative to porting the tokenizer manually. Will your approach handle multiple models? I'm looking for a GPT-2 tokenizer in C++.
Check out this repo: https://github.com/huggingface/tokenizers
You can already use it from transformers via BertTokenizerFast.