Hello everybody, I fine-tuned BERT following this example with a corpus in my language, Vietnamese.
Now I have two questions:
1. I load the tokenizer with the BertTokenizer.from_pretrained classmethod, so it is just the tokenizer of the pretrained BERT model. Does anyone have a better solution?
2. Is the code below the right way to get the hidden states from my fine-tuned model?

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

# output_model_file is the checkpoint saved after fine-tuning; sent is an input sentence
model_state_dict = torch.load(output_model_file)
model = BertModel.from_pretrained('bert-base-multilingual-cased', state_dict=model_state_dict)
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)

input_ids = torch.tensor(tokenizer.encode(sent)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)
```

Sorry about my English, can anyone help me?
1) Not really sure what you mean here, but use whatever tokenizer you used to tokenize your corpus; a tokenizer just converts words into integers anyway.
2) You are pretty much right: if all you want is the hidden states, outputs = model(input_ids) will give you a tuple containing the hidden states. You can then use these vectors as inputs to different classifiers.
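Something along these lines is what I mean (a rough, untested sketch with pytorch_transformers; the example sentence and the two-class linear head are placeholders, not from the original post):

```python
import torch
import torch.nn as nn
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
bert = BertModel.from_pretrained('bert-base-multilingual-cased')
bert.eval()

classifier = nn.Linear(bert.config.hidden_size, 2)  # e.g. a small binary-classification head

sent = 'I am going to school'  # placeholder sentence
input_ids = torch.tensor(tokenizer.encode(sent)).unsqueeze(0)  # batch size 1

with torch.no_grad():  # BERT is used as a frozen feature extractor here
    last_hidden, pooled = bert(input_ids)[:2]  # (1, seq_len, hidden), (1, hidden)

sentence_vec = last_hidden.mean(dim=1)  # simple mean pooling over the token vectors
logits = classifier(sentence_vec)       # only the classifier would be trained on these features
```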
The only thing is that by doing it this way, the BERT model ends up having frozen weights. Now, it might just be that BERT has already found the best representation for your downstream predictions, but more than likely it has not. Instead, it's much better to allow BERT to be fine-tuned.
(Just to let you know, BERT can be fine-tuned on a binary classification problem straight out of the box, and that will more than likely offer better performance than hand-engineering a classifier.)
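For reference, the out-of-the-box route looks roughly like this (untested sketch with pytorch_transformers; train_loader is an assumed DataLoader yielding padded (input_ids, labels) batches):

```python
from pytorch_transformers import AdamW, BertForSequenceClassification

# BertForSequenceClassification already puts a classification head on top of BERT
model = BertForSequenceClassification.from_pretrained(
    'bert-base-multilingual-cased', num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for input_ids, labels in train_loader:  # assumed DataLoader of padded batches
    optimizer.zero_grad()
    loss, logits = model(input_ids, labels=labels)[:2]  # loss is returned when labels are passed
    loss.backward()   # gradients flow through BERT itself, so it gets fine-tuned
    optimizer.step()
```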
@andrewpatterson2018 thank you for your help. My first question is that, given a paragraph, BertTokenizer splits it into word pieces like:
'I am going to school' -> ['I', 'am', 'go', '##ing', 'to', 'school']
But I want it to be:
'I am going to school' -> ['I', 'am', 'going', 'to', 'school']
because the word structure in my language is different from English. I want whitespace splitting only.
Do you have any solution?
Thank you very much!
You shouldn't change the Tokenizer, because the Tokenizer produces the vocabulary that the Embedding layer expects. Considering the example you gave:
'I am going to school' -> ['I', 'am', 'go', '##ing', 'to', 'school']
Whitespace tokenization -> ['I', 'am', 'going', 'to', 'school']
The word "going" was split into "go ##ing" because BERT uses WordPiece embeddings and bert-base-multilingual-cased vocabulary does not contain the word going. You could write your own tokenizer that performs whitespace tokenization, but you would have to map all unknown tokens to the [UNK] token. The final tokenization would be:
['I', 'am', '[UNK]', 'to', 'school']
The performance will most certainly drop, because you would have embeddings for a really small percentage of your tokens.
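To make that concrete, a whitespace-only encoder on top of the existing vocabulary would look roughly like this (untested sketch; it reuses the multilingual vocab and simply skips WordPiece splitting):

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)

def whitespace_encode(sentence):
    tokens = sentence.split()
    # convert_tokens_to_ids falls back to the [UNK] id for anything
    # that is not present as a whole word in vocab.txt (e.g. 'going')
    return tokenizer.convert_tokens_to_ids(tokens)

print(whitespace_encode('I am going to school'))
```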
What you probably want is to change the vocabulary BERT uses. This requires generating a new vocabulary for your corpus and pretraining BERT from scratch (you can initialize with the weights of bert-base-multilingual-cased), replacing the Embedding layer.
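A lighter-weight alternative (not the full re-pretraining described above, just another option) is to register a limited number of frequent words from your corpus as whole tokens and resize the embedding matrix; note that the new embedding rows start out untrained, so they only become useful after fine-tuning on your data. Rough sketch, assuming pytorch_transformers:

```python
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
model = BertModel.from_pretrained('bert-base-multilingual-cased')

new_words = ['going']  # placeholder; in practice, frequent whole words from your corpus
num_added = tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table by num_added rows

print(tokenizer.tokenize('I am going to school'))
# 'going' should now stay a single token instead of being split into 'go', '##ing'
```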
@fabiocapsouza thank you very much!
But now I want to fine-tune BERT with my corpus, so I want to use bert-base-multilingual-cased as the initial weights.
I understand that I shouldn't change BERT's vocabulary. After fine-tuning, I went into the output folder and opened vocab.txt, and vocabulary from my corpus had been added to that file, but those words were tokenized with BERT's BasicTokenizer, while what I want is for them to be tokenized my way. I understand that the output of the tokenizer has to match the BERT encoder. Will I have to re-code all of those functions?
And since the BERT tokenizer also handles masking in addition to tokenizing, will I have to re-code that part to match my tokenization method as well?
Thank you!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Did you make your own tokenizer that does not generate ## tokens in the vocab file?