Transformers: Tokenization in C++

Created on 11 Dec 2019 · 4 comments · Source: huggingface/transformers

Is there any general strategy for tokenizing text in C++ in a way that's compatible with the existing pretrained BertTokenizer implementation?
I'm looking to use a fine-tuned BERT model in C++ for inference, and currently the only option seems to be reproducing the BertTokenizer code by hand (or modifying it to be TorchScript-compatible). Has anyone come up with a better solution? A rough sketch of what that reimplementation involves is included below.
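For reference, here is a rough C++ sketch of the WordPiece step (greedy longest-match-first) that BertTokenizer applies to each word after its basic pre-tokenization. It is an illustration under simplifying assumptions, not a drop-in replacement: it assumes the model's vocab.txt has already been loaded into a set, it works on raw bytes rather than Unicode code points, and it skips the BasicTokenizer work (lowercasing, accent stripping, punctuation splitting, CJK handling) that the real implementation performs first.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-match-first WordPiece over a single pre-tokenized word.
// `vocab` is assumed to hold the entries of the model's vocab.txt, one per line.
std::vector<std::string> wordpiece(const std::string& word,
                                   const std::unordered_set<std::string>& vocab,
                                   std::size_t max_chars_per_word = 100) {
    if (word.size() > max_chars_per_word) return {"[UNK]"};

    std::vector<std::string> pieces;
    std::size_t start = 0;
    while (start < word.size()) {
        std::size_t end = word.size();
        std::string current;
        bool found = false;
        // Shrink the candidate substring from the right until it is in the vocab.
        while (start < end) {
            std::string sub = word.substr(start, end - start);
            if (start > 0) sub = "##" + sub;  // continuation pieces carry the "##" prefix
            if (vocab.count(sub)) { current = sub; found = true; break; }
            --end;
        }
        // If any piece is missing from the vocab, the whole word maps to [UNK].
        if (!found) return {"[UNK]"};
        pieces.push_back(current);
        start = end;
    }
    return pieces;
}
```

To actually match a pretrained uncased checkpoint you would still need the upstream whitespace/punctuation splitting, lowercasing and accent stripping, plus UTF-8 handling at the code-point level, which is exactly the duplication the question is trying to avoid.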

All 4 comments

You should wait a few days if you can: @n1t0 is working on something that will very likely solve your problem, and it should be ready for a first release before the end of the year.

Any update on this? It is already beyond "the end of the year".

I also tried to find an alternative to a manual tokenizer. Will your approach handle multiple models? I'm looking for a GPT-2 tokenizer in C++.

Check out this repo: https://github.com/huggingface/tokenizers

You can already use it from transformers, using BertTokenizerFast.
