Hello everybody, I fine-tuned BERT following this example with a corpus in my language, Vietnamese.
Now I have two questions:
1. I load the tokenizer with the BertTokenizer.from_pretrained classmethod, so it is just the tokenizer of the pretrained BERT model. Does anyone have a better solution?
2. Is the code below the right way to get the hidden states from my fine-tuned model?

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

# output_model_file is the checkpoint saved after fine-tuning; sent is an input sentence
model_state_dict = torch.load(output_model_file)
model = BertModel.from_pretrained('bert-base-multilingual-cased', state_dict=model_state_dict)
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)

input_ids = torch.tensor(tokenizer.encode(sent)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)
```

Sorry about my English, can anyone help me?
1) Not really sure what you mean here, but use whatever tokenizer you used to tokenize your corpus; a tokenizer just converts words into integers anyway.
2) You are pretty much right: if all you want is the hidden states, outputs = model(input_ids) will give you a tuple containing the hidden states. You can then use these vectors as inputs to different classifiers.
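Something along these lines is what I mean (a rough, untested sketch with pytorch_transformers; the example sentence and the two-class linear head are placeholders, not from the original post):

```python
import torch
import torch.nn as nn
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
bert = BertModel.from_pretrained('bert-base-multilingual-cased')
bert.eval()

classifier = nn.Linear(bert.config.hidden_size, 2)  # e.g. a small binary-classification head

sent = 'I am going to school'  # placeholder sentence
input_ids = torch.tensor(tokenizer.encode(sent)).unsqueeze(0)  # batch size 1

with torch.no_grad():  # BERT is used as a frozen feature extractor here
    last_hidden, pooled = bert(input_ids)[:2]  # (1, seq_len, hidden), (1, hidden)

sentence_vec = last_hidden.mean(dim=1)  # simple mean pooling over the token vectors
logits = classifier(sentence_vec)       # only the classifier would be trained on these features
```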
The only thing is that by doing it this way, the BERT model ends up having frozen weights. Now, it might just be that BERT has already found the best representation for your downstream predictions, but more than likely it has not. Instead, it's much better to allow BERT to be fine-tuned.
(Just to let you know, BERT can be fine-tuned on a binary classification problem straight out of the box, and that will more than likely offer better performance than hand-engineering a classifier.)
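For reference, the out-of-the-box route looks roughly like this (untested sketch with pytorch_transformers; train_loader is an assumed DataLoader yielding padded (input_ids, labels) batches):

```python
from pytorch_transformers import AdamW, BertForSequenceClassification

# BertForSequenceClassification already puts a classification head on top of BERT
model = BertForSequenceClassification.from_pretrained(
    'bert-base-multilingual-cased', num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for input_ids, labels in train_loader:  # assumed DataLoader of padded batches
    optimizer.zero_grad()
    loss, logits = model(input_ids, labels=labels)[:2]  # loss is returned when labels are passed
    loss.backward()   # gradients flow through BERT itself, so it gets fine-tuned
    optimizer.step()
```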
@andrewpatterson2018 thank you for your help. My first question is that, given a paragraph, BertTokenizer splits it into word pieces like:
'I am going to school' -> ['I', 'am', 'go', '##ing', 'to', 'school']
But I want it to be:
'I am going to school' -> ['I', 'am', 'going', 'to', 'school']
because the word structure in my language is different from English. I want whitespace splitting only.
Do you have any solution?
Thank you very much!
You shouldn't change the Tokenizer, because the Tokenizer produces the vocabulary that the Embedding layer expects. Considering the example you gave:
'I am going to school' -> ['I', 'am', 'go', '##ing', 'to', 'school']
Whitespace tokenization -> ['I', 'am', 'going', 'to', 'school']
The word "going" was split into "go ##ing" because BERT uses WordPiece embeddings and bert-base-multilingual-cased vocabulary does not contain the word going. You could write your own tokenizer that performs whitespace tokenization, but you would have to map all unknown tokens to the [UNK] token. The final tokenization would be:
['I', 'am', '[UNK]', 'to', 'school']
The performance will most certainly drop, because you would have embeddings for a really small percentage of your tokens.
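To make that concrete, a whitespace-only encoder on top of the existing vocabulary would look roughly like this (untested sketch; it reuses the multilingual vocab and simply skips WordPiece splitting):

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)

def whitespace_encode(sentence):
    tokens = sentence.split()
    # convert_tokens_to_ids falls back to the [UNK] id for anything
    # that is not present as a whole word in vocab.txt (e.g. 'going')
    return tokenizer.convert_tokens_to_ids(tokens)

print(whitespace_encode('I am going to school'))
```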
What you probably want is to change the vocabulary BERT uses. This requires generating a new vocabulary for your corpus and pretraining BERT from scratch (you can initialize with the weights of bert-base-multilingual-cased), replacing the Embedding layer.
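A lighter-weight alternative (not the full re-pretraining described above, just another option) is to register a limited number of frequent words from your corpus as whole tokens and resize the embedding matrix; note that the new embedding rows start out untrained, so they only become useful after fine-tuning on your data. Rough sketch, assuming pytorch_transformers:

```python
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
model = BertModel.from_pretrained('bert-base-multilingual-cased')

new_words = ['going']  # placeholder; in practice, frequent whole words from your corpus
num_added = tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table by num_added rows

print(tokenizer.tokenize('I am going to school'))
# 'going' should now stay a single token instead of being split into 'go', '##ing'
```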
@fabiocapsouza thank you very much!
But now I want to fine-tune BERT with my corpus, so I want to use bert-base-multilingual-cased as the initial weights.
I understand that I shouldn't change BERT's vocabulary. After fine-tuning, I went into the output folder and opened vocab.txt, and vocabulary from my corpus had been added to that file, but those words were tokenized with BERT's BasicTokenizer, while what I want is for them to be tokenized my way. I understand that the output of the tokenizer has to match the BERT encoder. Will I have to re-code all of those functions?
And since the BERT tokenizer also handles masking in addition to tokenizing, will I have to re-code that part to match my tokenization method as well?
Thank you!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Did you make your own tokenizer that does not generate ## tokens in the vocab file?