BERT: how to implement BERT tokenization in C++

Created on 14 Oct 2019 · 14 comments · Source: google-research/bert

Thanks for your work.

If I want to use the TensorFlow C++ API to load the pretrained BERT model, how should I process the text data in C++, including BERT tokenization? Is there a C++ wrapper for BERT? Does the TensorFlow C++ API provide BERT tokenization? Or do I need to reimplement tokenization.py in C++?

Thanks for any information.

lytum

❤1 👍1

All 14 comments

Have a look here; we have implemented the BERT tokenizer in C++: https://github.com/LieluoboAi/radish/blob/master/radish/bert/bert_tokenizer.h

koth on 21 Oct 2019

👍3
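
For readers wondering what a C++ port of tokenization.py has to reproduce, the core of BERT's WordPiece step is a greedy longest-match-first lookup against the vocabulary. The following is only a minimal sketch of that idea and is not taken from the radish code linked above. The function name `WordPiece` and its parameters are hypothetical, and it assumes the word has already been cleaned, lower-cased if needed, and split on whitespace and punctuation. It matches on bytes rather than Unicode characters, which gives the same pieces for a valid UTF-8 vocabulary but makes the length limit a byte count.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not a library API): greedy longest-match-first
// WordPiece split of one pre-split word. Pieces after the first one are
// looked up with the "##" continuation prefix.
std::vector<std::string> WordPiece(
    const std::string& word,
    const std::unordered_set<std::string>& vocab,
    const std::string& unk = "[UNK]",
    std::size_t max_bytes = 100) {  // assumption: limit counted in bytes
  if (word.size() > max_bytes) return {unk};
  std::vector<std::string> pieces;
  std::size_t start = 0;
  while (start < word.size()) {
    std::size_t end = word.size();
    std::string match;
    // Try the longest remaining substring first, then shrink from the right.
    while (end > start) {
      std::string candidate = word.substr(start, end - start);
      if (start > 0) candidate = "##" + candidate;
      if (vocab.count(candidate) > 0) {
        match = candidate;
        break;
      }
      --end;
    }
    // If no prefix of the remainder is in the vocabulary, the whole word
    // becomes the unknown token.
    if (match.empty()) return {unk};
    pieces.push_back(match);
    start = end;
  }
  return pieces;
}
```

With a toy vocabulary containing `un`, `##aff`, and `##able`, calling `WordPiece("unaffable", vocab)` yields those three pieces in order, while a word with no matching prefix collapses to `[UNK]`.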

Have you also finished the integration between the C++ code and the Python version of BERT?

lytum on 21 Oct 2019

Currently there is no Python integration; C++ is enough 😁

koth on 21 Oct 2019

Does that mean you implemented the whole BERT project in C++, without Python?

lytum on 21 Oct 2019

Yes, with great thanks to https://github.com/huggingface/transformers.

koth on 22 Oct 2019

May I ask you another question? How do you handle accented characters? I couldn't find any handling of accented characters in your C++ tokenization code.

lytum on 23 Oct 2019

Currently, we have not added accented character normalization support. We will try to add it and keep the output the same as BERT's tokenization results.

koth on 28 Oct 2019

Thanks for your reply. For the multilingual BERT model, accent normalization is not used, so if we would like to train a multilingual model, your C++ tokenization is enough, right? What do you think?

lytum on 28 Oct 2019

It's OK, but normalizing accented characters would be a plus for the BERT model, I think.

We have now added accented character normalization support.

koth on 28 Oct 2019

Many thanks. Could you send me a link to your code for accented characters? I couldn't find the corresponding code in your project.

lytum on 5 Nov 2019

Another question: how do you implement Unicode normalization in C++? Thanks for any information.

lytum on 6 Nov 2019

bert_tokenizer.cc has all the related code. In short, we use utf8proc to do Unicode normalization.

koth on 6 Nov 2019
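
To make the accent handling concrete: BERT's tokenization.py strips accents by normalizing text to NFD and then dropping every code point in the Unicode category Mn (nonspacing mark). The sketch below shows only the second half of that step and is not the radish implementation. It assumes the input is valid UTF-8 that has already been decomposed to NFD (for example by utf8proc), and it approximates category Mn with the Combining Diacritical Marks block, U+0300 to U+036F. The name `StripCombiningMarks` is hypothetical; a full port would test the real Unicode category instead of a fixed range.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a library API): drops combining diacritical
// marks U+0300..U+036F from UTF-8 text that is ALREADY in NFD form.
// Assumption: the input is valid UTF-8; precomposed letters are untouched.
std::string StripCombiningMarks(const std::string& nfd_utf8) {
  std::string out;
  std::size_t i = 0;
  while (i < nfd_utf8.size()) {
    const unsigned char b0 = static_cast<unsigned char>(nfd_utf8[i]);
    // Sequence length derived from the lead byte.
    std::size_t len = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
    // Guard against a truncated final sequence.
    if (i + len > nfd_utf8.size()) len = nfd_utf8.size() - i;
    bool is_mark = false;
    if (len == 2) {
      // Every code point in U+0300..U+036F encodes as exactly two bytes.
      const unsigned char b1 = static_cast<unsigned char>(nfd_utf8[i + 1]);
      const unsigned cp = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
      is_mark = cp >= 0x0300 && cp <= 0x036F;
    }
    if (!is_mark) out.append(nfd_utf8, i, len);
    i += len;
  }
  return out;
}
```

For the decomposed input `"cafe\xCC\x81"` (the letter e followed by U+0301) this returns `"cafe"`. A precomposed `"caf\xC3\xA9"` passes through unchanged, which is why the NFD step has to run first.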

Have you found that an accented character comes out as NULL after utf8proc_NFD? Is that normal? Thanks for any information.

lytum on 12 Nov 2019

Sorry, I solved the last problem. It was something to do with the Windows system.

May I ask you another question? Why do you use utf8::utf8to16? Is it because of Chinese characters? Could I directly use the C++ conversion between string and wstring to do what utf8::utf8to16 does? Any suggestions are welcome. Thanks a lot.

lytum on 14 Nov 2019
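
The thread leaves this last question unanswered. One point that bears on it is that the width of `wchar_t` is platform-dependent (commonly 16 bits on Windows and 32 bits on Linux), so a string/wstring conversion does not produce UTF-16 everywhere. As a rough illustration of what a UTF-8 to UTF-16 conversion involves, here is a minimal standard-library-only sketch. The name `Utf8ToUtf16` is hypothetical, this is not the code behind `utf8::utf8to16`, and it assumes valid UTF-8 input with no error reporting.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a library API): decodes UTF-8 into UTF-16 code
// units. Assumption: the input is valid UTF-8; no validation is performed.
std::u16string Utf8ToUtf16(const std::string& in) {
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char b0 = static_cast<unsigned char>(in[i]);
    // Sequence length and payload bits of the lead byte.
    std::size_t len = 1;
    std::uint32_t cp = b0;
    if (b0 >= 0xF0) { len = 4; cp = b0 & 0x07u; }
    else if (b0 >= 0xE0) { len = 3; cp = b0 & 0x0Fu; }
    else if (b0 >= 0xC0) { len = 2; cp = b0 & 0x1Fu; }
    // Each continuation byte contributes six more bits.
    for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
      cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3Fu);
    }
    i += len;
    if (cp < 0x10000) {
      // Basic Multilingual Plane: one code unit (covers common CJK text).
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Supplementary planes: encode as a surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FFu)));
    }
  }
  return out;
}
```

With this sketch, the three-byte sequence for U+4E2D becomes a single UTF-16 code unit, so a Chinese character can be indexed as one element, whereas U+1F600 becomes the surrogate pair 0xD83D 0xDE00.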
