Bert: Problem with wordpiece tokenization

Created on 8 Apr 2019 · 11 Comments · Source: google-research/bert

I'm doing an NER project and trying to use BERT. BERT uses WordPiece tokenization, which means one word may be broken into several pieces. For NER, how do we find the class label for a word that has been broken into several tokens? For example, if 'London' is broken into 'lon' and '##don', should we give the same label, "location", to both 'lon' and '##don'?

I have seen one example using BERT for NER: https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/ but the entity labels and tokens are mismatched.

yexing99

Most helpful comment

@yexing99

As mentioned in the paper, you can use the corresponding label for the first piece and an 'X' label for the rest.
For example:
https://github.com/dsindex/BERT-BiLSTM-CRF-NER/blob/master/bert_lstm_ner.py#L274

At inference time, you need to skip the 'X'-labeled tokens.
For example:
https://github.com/dsindex/BERT-BiLSTM-CRF-NER/blob/master/bert_lstm_ner.py#L832
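
A minimal sketch of this labeling scheme, not the code from the linked repository. The tiny vocabulary and the `wordpiece`, `align_labels`, and `first_piece_predictions` helpers are hypothetical stand-ins for a real BERT tokenizer; the split is a greedy longest-match-first lookup, which is how WordPiece tokenizes a word against a fixed vocabulary. Each word's label goes to its first piece and 'X' to the rest; at inference, the continuation pieces are dropped.

```python
# Hypothetical toy vocabulary; a real BERT vocab has tens of thousands of entries.
VOCAB = {"jim", "hen", "##son", "lives", "in", "lon", "##don"}


def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                break
            end -= 1
        if end == start:  # nothing matched: treat the whole word as unknown
            return ["[UNK]"]
        pieces.append(candidate)
        start = end
    return pieces


def align_labels(words, labels):
    """Give each word's label to its first piece and 'X' to the other pieces."""
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = wordpiece(word.lower())
        tokens.extend(pieces)
        token_labels.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, token_labels


def first_piece_predictions(tokens, predictions):
    """Keep one prediction per word by skipping '##' continuation pieces."""
    return [p for t, p in zip(tokens, predictions) if not t.startswith("##")]


words = ["Jim", "Henson", "lives", "in", "London"]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC"]
tokens, token_labels = align_labels(words, labels)
print(list(zip(tokens, token_labels)))
```

Dropping the 'X' positions at inference leaves exactly one prediction per original word, so the output lines up with the word-level annotations again.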

dsindex on 8 Apr 2019

All 11 comments

In the original GitHub repository, the authors explain how to deal with this kind of problem:
"If you have a pre-tokenized representation with word-level annotations, you can simply tokenize each input word independently, and deterministically maintain an original-to-tokenized alignment:" (from https://github.com/google-research/bert)

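The alignment described in that passage can be sketched as follows. This is a simplified, self-contained variant: `fake_tokenize` and its lookup table are hypothetical stand-ins for a real WordPiece tokenizer, and `build_alignment` is an illustrative name. `orig_to_tok_map[i]` holds the index of the first piece of word `i`, so word-level labels can be read off token-level outputs at those positions.

```python
# Hypothetical splits; a real tokenizer would derive these from its vocabulary.
FAKE_SPLITS = {
    "John": ["john"],
    "Johanson": ["johan", "##son"],
    "'s": ["'", "s"],
    "house": ["house"],
}


def fake_tokenize(word):
    """Stand-in for a WordPiece tokenizer applied to a single word."""
    return FAKE_SPLITS.get(word, ["[UNK]"])


def build_alignment(orig_tokens, tokenize):
    """Tokenize word by word, recording where each word's first piece lands."""
    bert_tokens = ["[CLS]"]
    orig_to_tok_map = []
    for word in orig_tokens:
        orig_to_tok_map.append(len(bert_tokens))
        bert_tokens.extend(tokenize(word))
    bert_tokens.append("[SEP]")
    return bert_tokens, orig_to_tok_map


orig_tokens = ["John", "Johanson", "'s", "house"]
bert_tokens, orig_to_tok_map = build_alignment(orig_tokens, fake_tokenize)

# Made-up token-level predictions, one per entry in bert_tokens.
token_preds = ["O", "NNP", "NNP", "X", "POS", "X", "NN", "O"]
word_preds = [token_preds[i] for i in orig_to_tok_map]
print(bert_tokens, orig_to_tok_map, word_preds)
```
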
beamind on 8 Apr 2019

@beamind @dsindex Thank you for your responses. I did some searching and found one solution: https://github.com/huggingface/pytorch-pretrained-BERT/issues/64#issuecomment-443703063

yexing99 on 8 Apr 2019

I am using https://github.com/google/sentencepiece in the following way:

Using `sentencepiece` to Extract a WordPiece Vocabulary

BERT needs a WordPiece vocabulary file to run, so we need to decide on a number of tokens and then run sentencepiece to extract a list of valid tokens.

The sentencepiece PyPI package isn't sufficient for our needs, so we need to clone the GitHub repo, then build and install the software to create our vocabulary.

Make sure you're in the root directory of this project and run:

git clone https://github.com/google/sentencepiece
cd sentencepiece

mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

Now we can use spm_train to create a vocabulary from our 4.7 million sentences.

cd ../models
spm_train --input="../data/sentences.csv" --model_prefix=wsl --vocab_size=20000

# Add the [CLS], [SEP], [UNK] and [MASK] tags, or pre-training will error out
echo -e "[CLS]\t0\n[SEP]\t0\n[UNK]\t0\n[MASK]\t0\n$(cat wsl.vocab)" > wsl.vocab

# Remove the numbers, just retain the tag vocabulary
cat wsl.vocab | cut -d$'\t' -f1 > wsl.stripped.vocab
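
As a rough cross-check of the two shell steps above, here is a minimal Python sketch of the same post-processing. It assumes the spm_train `.vocab` format of one `piece<TAB>score` pair per line; `build_bert_vocab` and the three sample lines are hypothetical. With a real 20,000-piece vocabulary, the four added tags would bring the total to 20,004 entries.

```python
# The four tags prepended by the echo command above.
SPECIAL_TOKENS = ["[CLS]", "[SEP]", "[UNK]", "[MASK]"]


def build_bert_vocab(spm_vocab_lines):
    """Drop the score column and prepend the BERT special tokens."""
    pieces = [line.split("\t")[0] for line in spm_vocab_lines if line.strip()]
    return SPECIAL_TOKENS + pieces


# Made-up sample lines in the assumed piece<TAB>score layout.
sample = ["<unk>\t0", "\u2581the\t-3.2", "s\t-3.9"]
vocab = build_bert_vocab(sample)
print(len(vocab))  # 4 special tokens + 3 sample pieces
```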

Then:

Using BERT to Pretrain a Language Model

Next we use the WordPiece vocabulary to pre-train a BERT model that we will then use, as a transfer learning strategy, to encode the text of Stack Overflow questions.

Creating a BERT conda environment

It is not possible to create a new conda environment from within this notebook in order to install tensorflow==1.14.0, which BERT needs, so you will need to run this code outside of the notebook, from the root directory of this project.

conda create -y -n bert python=3.7.4
conda init bash

Now, in a new shell, change directory to the root of the project:

cd /path/to/weakly_supervised_learning_code

Now run:

conda activate bert
pip install tensorflow-gpu==1.14.0

Running BERT Pre-Training

We need to configure BERT to use our vocabulary size, so we create a bert_config.json file in the bert/ directory. Then we run create_pretraining_data.py to generate the training examples that pre-training consumes.

# Tell BERT how many tokens to use
echo '{ "vocab_size": 20004 }' > bert/bert_config.json

# Only data-generation flags are passed here. The training flags
# (--bert_config_file, --num_train_steps, --num_warmup_steps, --learning_rate)
# are defined by run_pretraining.py, not by this script.
python bert/create_pretraining_data.py \
   --input_file=data/sentences.csv \
   --output_file=data/tf_examples.tfrecord \
   --vocab_file=models/wsl.stripped.vocab \
   --do_lower_case=False \
   --max_seq_length=128 \
   --max_predictions_per_seq=20 \
   --random_seed=1337

conda deactivate

rjurney on 17 Oct 2019

@beamind @dsindex Thank you for your responses. I did some searching and found one solution: huggingface/transformers#64 (comment)

Hello,
how did you do it, please? I didn't understand how to use it.

AsmaZbt on 4 Feb 2020

I am trying to train a language model on Urdu, but the following line does not generate correct tokens:

tokenizer.encode("میں فوراّ واپس آوں گا").tokens
The line above returns gibberish:
['<s>', 'ÙħÛĮÚº', 'ĠÙģÙĪØ±Ø§', 'Ùĳ', 'ĠÙĪØ§Ù¾Ø³', 'ĠØ¢', 'ÙĪÚº', 'ĠÚ¯Ø§', '</s>']

samreenkazi on 2 Mar 2020

I didn't see any description of the added tokens or special tokens files in the blog. Can anyone help me fix this issue?
03/02/2020 12:21:08 - INFO - transformers.tokenization_utils - Didn't find file ./urBERTo/added_tokens.json. We won't load it.
03/02/2020 12:21:08 - INFO - transformers.tokenization_utils - Didn't find file ./urBERTo/special_tokens_map.json

samreenkazi on 2 Mar 2020

@samreenkazi

I am trying to train a language model on Urdu, but the following line does not generate correct tokens:

tokenizer.encode("میں فوراّ واپس آوں گا").tokens
The line above returns gibberish:
['<s>', 'ÙħÛĮÚº', 'ĠÙģÙĪØ±Ø§', 'Ùĳ', 'ĠÙĪØ§Ù¾Ø³', 'ĠØ¢', 'ÙĪÚº', 'ĠÚ¯Ø§', '</s>']

I do not know if BERT for the Urdu language exists, but you can check out more here.

anjani-dhrangadhariya on 19 Mar 2020

@samreenkazi Have you found a solution? I'm facing similar issues.

zephyrous on 8 Jun 2020

No, not yet, but I'm working on it.

samreenkazi on 8 Jun 2020

It could be that you're missing a font. You can use external fonts in Google Colab. If you execute the following, can you see the correct characters in the plot? If not, you should probably import and unzip external fonts.

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

plt.plot(range(50), range(50), 'r')
plt.title('کے لیے')
plt.ylabel('دو')
plt.xlabel('ف')
plt.show()
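
A separate check that does not depend on fonts: the Ġ markers and the <s> and </s> tokens suggest a GPT-2-style byte-level BPE tokenizer, which displays each UTF-8 byte as a stand-in printable character. Under that assumption (not confirmed in this thread), the sketch below rebuilds the byte table and maps the reported tokens back to text. `decode_byte_level` is a hypothetical helper, not a library function. If the decoded string is the original sentence, the tokens are correct and only their display form looks odd.

```python
def bytes_to_unicode():
    """Byte -> printable-character table used by GPT-2-style byte-level BPE."""
    printable = set(range(ord("!"), ord("~") + 1))
    printable |= set(range(0xA1, 0xAC + 1))
    printable |= set(range(0xAE, 0xFF + 1))
    table, shift = {}, 0
    for b in range(256):
        if b in printable:
            table[b] = chr(b)  # printable bytes map to themselves
        else:
            table[b] = chr(256 + shift)  # the rest get stand-ins from U+0100 up
            shift += 1
    return table


CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}
SPECIAL = {"<s>", "</s>"}  # assumed special tokens, skipped when decoding


def decode_byte_level(tokens):
    """Map stand-in characters back to bytes and decode them as UTF-8."""
    raw = bytes(
        CHAR_TO_BYTE[ch] for tok in tokens if tok not in SPECIAL for ch in tok
    )
    return raw.decode("utf-8", errors="replace")


# Tokens copied from the output reported earlier in this thread.
tokens = ['<s>', 'ÙħÛĮÚº', 'ĠÙģÙĪØ±Ø§', 'Ùĳ', 'ĠÙĪØ§Ù¾Ø³', 'ĠØ¢', 'ÙĪÚº', 'ĠÚ¯Ø§', '</s>']
decoded = decode_byte_level(tokens)
print(decoded)
```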

zephyrous on 8 Jun 2020
