BERT: How to implement BERT tokenization in C++

Created on 14 Oct 2019 · 14 Comments · Source: google-research/bert

Thanks for your work.

If I want to use the TensorFlow C++ API to import the pretrained BERT model, how should I process the text data in C++, including BERT tokenization? Is there a C++ wrapper for BERT? Does the TensorFlow C++ API provide BERT tokenization, or do I need to reimplement tokenization.py in C++?

Thanks for any information.

All 14 comments

Have a look here; we have implemented the BERT tokenizer in C++:
https://github.com/LieluoboAi/radish/blob/master/radish/bert/bert_tokenizer.h
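For readers who want a feel for what such a tokenizer has to do, here is a minimal sketch of the WordPiece step (greedy longest-match-first), which is the core of BERT's tokenization.py. It is not the radish code. The function name `WordPiece`, its parameters, and the tiny in-memory vocabulary are assumptions made for illustration, and the sketch slices by bytes, so it is only correct for ASCII words.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-match-first WordPiece split of one word that has already
// been lower-cased and separated from its neighbours.
// `vocab` is a hypothetical in-memory vocabulary; a real one would be
// loaded from the model's vocab.txt.
// Limitation: slicing is byte-based, so this is only correct for ASCII;
// a full implementation would slice by Unicode code points.
std::vector<std::string> WordPiece(const std::string& word,
                                   const std::unordered_set<std::string>& vocab,
                                   const std::string& unk = "[UNK]",
                                   std::size_t max_chars = 100) {
  if (word.size() > max_chars) return {unk};
  std::vector<std::string> pieces;
  std::size_t start = 0;
  while (start < word.size()) {
    std::size_t end = word.size();
    std::string match;
    // Shrink the candidate from the right until it is in the vocabulary.
    while (start < end) {
      std::string sub = word.substr(start, end - start);
      if (start > 0) sub = "##" + sub;  // continuation pieces carry "##"
      if (vocab.count(sub) > 0) {
        match = sub;
        break;
      }
      --end;
    }
    if (match.empty()) return {unk};  // no piece fits: the whole word is unknown
    pieces.push_back(match);
    start = end;
  }
  return pieces;
}
```

With a vocabulary containing `un`, `##aff` and `##able`, calling `WordPiece("unaffable", vocab)` yields the pieces `un`, `##aff`, `##able`. A word with no matching piece comes back as a single `[UNK]`.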

Have you also finished the integration between the C++ code and the Python version of BERT?

Currently there is no Python integration; C++ is enough 😁

Does that mean you implemented the whole BERT project in C++, without Python?

Yes, with great thanks to https://github.com/huggingface/transformers.

May I ask another question? How do you process accented characters? I didn't find any handling of accented characters in your C++ tokenization code.

Currently, we haven't added accented-character normalization. We will try to add it and keep the output identical to BERT's tokenization results.

Thanks for your reply. The multilingual BERT model does not use accent stripping, so if we want to train a multilingual model, your C++ tokenization is already enough, right? What do you think?

It's OK, but I think normalizing accented characters would be a plus for the BERT model.

We have now added accented-character normalization support.

Many thanks. Could you send me a link to your accented-character code? I couldn't find it in your project.

Another question: how do you implement Unicode normalization in C++? Thanks for any information.

bert_tokenizer.cc has all the related code. In short, we use utf8proc to do Unicode normalization.
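For context, BERT's Python tokenizer strips accents in two steps: it normalizes the text to NFD, then drops every code point whose Unicode category is Mn (nonspacing mark). The sketch below covers only the second step and uses no library. It assumes the text has already been NFD-decomposed (for example by utf8proc) and decoded to code points, and it approximates category Mn by the Combining Diacritical Marks block U+0300 to U+036F. That approximation covers common Latin accents but not every nonspacing mark. The names `IsCombiningMark` and `StripAccents` are made up for this example and are not taken from radish.

```cpp
#include <string>

// Returns true for code points in the Combining Diacritical Marks block.
// Assumption: this block stands in for Unicode category Mn. The real
// category is larger, so full parity with BERT needs a Unicode table.
bool IsCombiningMark(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Drops combining marks from text that is ALREADY NFD-decomposed.
// Precomposed characters (e.g. U+00E9) pass through unchanged, which is
// why the NFD step has to run first.
std::u32string StripAccents(const std::u32string& nfd_text) {
  std::u32string out;
  out.reserve(nfd_text.size());
  for (char32_t cp : nfd_text) {
    if (!IsCombiningMark(cp)) out.push_back(cp);
  }
  return out;
}
```

For example, `StripAccents(U"cafe\u0301")` returns `U"cafe"`, while the precomposed `U"caf\u00E9"` is returned unchanged because it was never decomposed.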

Do you find that an accented character is NULL after utf8proc_NFD? Is that normal? Thanks for any information.

Sorry, I solved the last problem. It was something specific to Windows.

May I ask another question? Why do you use utf8::utf8to16? Is it because of Chinese characters? Could I directly use the conversion between string and wstring in C++ in place of utf8::utf8to16? Any suggestions are welcome. Thanks a lot.
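This question was not answered in the thread, so what follows is only a general note. `wchar_t` is 16 bits wide on Windows and usually 32 bits on Linux, so code that goes through `std::wstring` can behave differently on the two platforms. In UTF-16, characters outside the Basic Multilingual Plane (some emoji, for instance) also take two code units. One portable alternative is to decode UTF-8 straight into 32-bit code points, so that each element is one character for per-character steps such as CJK spacing. The sketch below does this with the standard library only. `DecodeUtf8` is a hypothetical helper, not part of radish or of the utf8 library used there, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into UTF-32 code points.
// Invalid lead bytes, bad continuation bytes and truncated sequences are
// replaced by U+FFFD. Simplification: overlong encodings and surrogate
// code points are not rejected.
std::u32string DecodeUtf8(const std::string& in) {
  std::u32string out;
  std::size_t i = 0;
  while (i < in.size()) {
    unsigned char b = static_cast<unsigned char>(in[i]);
    char32_t cp = 0;
    std::size_t extra = 0;  // number of continuation bytes expected
    if (b < 0x80) {
      cp = b;
    } else if ((b & 0xE0) == 0xC0) {
      cp = b & 0x1F; extra = 1;
    } else if ((b & 0xF0) == 0xE0) {
      cp = b & 0x0F; extra = 2;
    } else if ((b & 0xF8) == 0xF0) {
      cp = b & 0x07; extra = 3;
    } else {
      out.push_back(0xFFFD);  // invalid lead byte
      ++i;
      continue;
    }
    bool ok = i + extra < in.size();  // false if the sequence is truncated
    for (std::size_t k = 1; ok && k <= extra; ++k) {
      unsigned char c = static_cast<unsigned char>(in[i + k]);
      if ((c & 0xC0) != 0x80) { ok = false; break; }
      cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
    }
    if (ok) {
      out.push_back(cp);
      i += extra + 1;
    } else {
      out.push_back(0xFFFD);
      ++i;  // resynchronize on the next byte
    }
  }
  return out;
}
```

Decoding the bytes `61 C3 A9 E4 B8 AD` gives the three code points U+0061, U+00E9 and U+4E2D. The four bytes `F0 9F 98 81` give the single code point U+1F601, which UTF-16 would represent as two code units.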