Transformers: What should be the label of sub-word units in Token Classification with Bert

Created on 26 Feb 2019 · 3 comments · Source: huggingface/transformers

Hi,

I'm trying to use BERT for a token-level tagging problem such as NER in German.

This is what I've done so far for input preparation:

from pytorch_pretrained_bert.tokenization import BertTokenizer, WordpieceTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)

sentences = ["Bis 2013 steigen die Mittel aus dem EU-Budget auf rund 120 Millionen Euro ."]
labels = [["O","O","O","O","O","O","O","B-ORGpart","O","O","O","O","B-OTH","O"]]
tokens = tokenizer.tokenize(sentences[0])

When I check the tokens I see that there are now 20 tokens instead of the expected 14, because of the sub-word units.

>>> tokens
['Bis', '2013', 'st', '##eig', '##en', 'die', 'Mittel', 'aus', 'dem', 'EU', '##-', '##B', '##ud', '##get', 'auf', 'rund', '120', 'Millionen', 'Euro', '.']

My question is: how should I modify the labels array? Should I label each sub-word unit with the label of the original word, or should I do something else? As a second question, which of the examples in the repository can be used as a starting point for this purpose? run_classifier.py? run_squad.py?

UPDATE

OK, according to the paper it should be handled as follows (from Section 4.3 of the BERT paper):

To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first sub-token as input to the classifier. Where no prediction is made for X. Since the WordPiece tokenization boundaries are a known part of the input, this is done for both training and test.

Then, for the above example, the correct input/output pair is:

['Bis', '2013', 'st', '##eig', '##en', 'die', 'Mittel', 'aus', 'dem', 'EU', '##-', '##B', '##ud', '##get', 'auf', 'rund', '120', 'Millionen', 'Euro', '.']
['O', 'O', 'O', 'X', 'X', 'O', 'O', 'O', 'O', 'B-ORGpart', 'X', 'X', 'X', 'X', 'O', 'O', 'O', 'O', 'B-OTH', 'O']
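
For completeness, here is a minimal sketch (my own, not code from the repository) of how that alignment could be produced: feed each CoNLL-tokenized word into the WordPiece tokenizer separately, keep the word label on the first sub-token, and put the dummy "X" on the remaining sub-tokens. The align_labels helper is just an illustrative name; tokenizer, sentences and labels are the objects from the snippet above.

def align_labels(words, word_labels, tokenizer):
    tokens, token_labels = [], []
    for word, label in zip(words, word_labels):
        sub_tokens = tokenizer.tokenize(word)
        tokens.extend(sub_tokens)
        # first sub-token keeps the word-level label, the rest get "X"
        token_labels.extend([label] + ["X"] * (len(sub_tokens) - 1))
    return tokens, token_labels

words = sentences[0].split()
tokens, token_labels = align_labels(words, labels[0], tokenizer)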

My question then becomes: how can the sub-tokens be masked during training and testing?
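
One possible way to do this (a sketch under my own assumptions, not taken from the repository examples) is to map the "X" positions to the ignore_index of PyTorch's CrossEntropyLoss (-100 by default), so those positions contribute nothing to the loss during training or evaluation:

import torch
from torch.nn import CrossEntropyLoss

label_map = {"O": 0, "B-ORGpart": 1, "B-OTH": 2}   # toy label set for this example
ignore_id = CrossEntropyLoss().ignore_index        # -100 by default

token_labels = ['O', 'O', 'O', 'X', 'X', 'O', 'O', 'O', 'O', 'B-ORGpart',
                'X', 'X', 'X', 'X', 'O', 'O', 'O', 'O', 'B-OTH', 'O']
label_ids = torch.tensor([label_map[l] if l != "X" else ignore_id
                          for l in token_labels])

# Stand-in for the per-token logits of BertForTokenClassification,
# shape (seq_len, num_labels); positions set to ignore_id are skipped by the loss.
logits = torch.randn(len(token_labels), len(label_map))
loss = CrossEntropyLoss()(logits, label_ids)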

All 3 comments

I have a similar problem. I labeled the sub-word tokens as "X" and then got an error relating to NUM_LABELS: BERT treated "X" as a third label, while I had only specified two labels.

You do not need to introduce an additional tag. This is explained here:

https://github.com/huggingface/pytorch-pretrained-BERT/issues/64#issuecomment-443703063
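
One way to follow that suggestion (as I understand it; the names below are purely illustrative, not from the library) is to keep the original label set and simply remember which positions are the first sub-token of each word, then read predictions off those positions only:

def first_subtoken_positions(words, tokenizer):
    # Record the index of the first sub-token of every original word.
    positions, tokens = [], []
    for word in words:
        sub_tokens = tokenizer.tokenize(word)
        positions.append(len(tokens))
        tokens.extend(sub_tokens)
    return tokens, positions

# At prediction time, read the tag sequence off those positions only,
# e.g. word_tags = [id_to_tag[pred[i]] for i in positions].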

Yes, I've left #64 open to discuss all these questions. Feel free to read the discussion there and ask questions if needed. Closing this issue.
