I'm doing an NER project and trying to use BERT. BERT uses WordPiece tokenization, which means one word may be broken into several pieces. For NER, how do we find the corresponding class label for a word that is broken into several tokens? For example, if 'London' is broken into 'Lon' and '##don', should we give the same label "location" to both 'Lon' and '##don'?
I have seen one example of using BERT for NER: https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/ but the entity labels and tokens are mismatched.
@yexing99
As mentioned in the paper, you can use the corresponding label for the first sub-token and an 'X' label for the rest.
For example:
https://github.com/dsindex/BERT-BiLSTM-CRF-NER/blob/master/bert_lstm_ner.py#L274
At inference time, you need to skip the 'X'-labeled tokens.
For example:
https://github.com/dsindex/BERT-BiLSTM-CRF-NER/blob/master/bert_lstm_ner.py#L832
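A minimal sketch of the labeling scheme described above, not the code from the linked repository. The names `align_labels`, `drop_x` and `toy_tokenize` are hypothetical, and the toy tokenizer only stands in for a real WordPiece tokenizer so that the example runs on its own.

```python
def align_labels(words, labels, tokenize):
    """Give each word's label to its first sub-token and 'X' to the rest."""
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        # Fall back to [UNK] if the tokenizer returns nothing for a word.
        pieces = tokenize(word) or ["[UNK]"]
        tokens.extend(pieces)
        token_labels.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, token_labels


def drop_x(predictions, token_labels):
    """Keep only the predictions made at first sub-token positions."""
    return [p for p, t in zip(predictions, token_labels) if t != "X"]


# Hypothetical stand-in for a WordPiece tokenizer (assumption, not BERT's vocab).
TOY_SPLITS = {"London": ["Lon", "##don"]}


def toy_tokenize(word):
    return TOY_SPLITS.get(word, [word])


words = ["I", "live", "in", "London"]
labels = ["O", "O", "O", "B-LOC"]
tokens, token_labels = align_labels(words, labels, toy_tokenize)

# Pretend model output, one prediction per sub-token.
predictions = ["O", "O", "O", "B-LOC", "X"]
word_predictions = drop_x(predictions, token_labels)
print(tokens, token_labels, word_predictions)
```

Here the positions to skip are taken from the alignment rather than from the predicted labels, so a stray 'X' predicted on a first sub-token is not silently dropped. That is a design choice of this sketch.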
In the original GitHub repository, the author explains how to deal with this kind of problem.
"If you have a pre-tokenized representation with word-level annotations, you can simply tokenize each input word independently, and deterministically maintain an original-to-tokenized alignment:" in https://github.com/google-research/bert
@beamind @dsindex Thank you for the response. I did some searching and found one solution: https://github.com/huggingface/pytorch-pretrained-BERT/issues/64#issuecomment-443703063
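The original-to-tokenized alignment quoted above from the BERT README can be sketched as follows. This is an illustration under assumptions: `toy_wordpiece` is a hypothetical greedy longest-match tokenizer with a made-up five-entry vocabulary, standing in for BERT's real tokenizer, and `build_alignment` is a name chosen for this example.

```python
TOY_VOCAB = {"lon", "##don", "is", "rain", "##y"}  # made-up vocabulary


def toy_wordpiece(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def build_alignment(orig_tokens, tokenize):
    """Tokenize each word independently and record where each word starts."""
    bert_tokens, orig_to_tok_map = ["[CLS]"], []
    for orig_token in orig_tokens:
        orig_to_tok_map.append(len(bert_tokens))
        bert_tokens.extend(tokenize(orig_token))
    bert_tokens.append("[SEP]")
    return bert_tokens, orig_to_tok_map


bert_tokens, orig_to_tok_map = build_alignment(["london", "is", "rainy"], toy_wordpiece)
print(bert_tokens)
print(orig_to_tok_map)
```

Each entry of `orig_to_tok_map` is the index of a word's first piece, so word-level labels can be read from, or projected onto, exactly those positions.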
I am using https://github.com/google/sentencepiece in the following way:
sentencepiece to Extract a WordPiece Vocabulary
BERT needs a WordPiece vocabulary file to run, so we need to decide on a number of tokens and then run sentencepiece to extract a list of valid tokens.
The sentencepiece PyPI package isn't sufficient for our needs; we need to clone the GitHub repo, then build and install the software to create our vocabulary.
Make sure you're in the root directory of this project and run:
git clone https://github.com/google/sentencepiece
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v
Now we can use spm_train to create a vocabulary from our 4.7 million sentences.
cd ../models
spm_train --input="../data/sentences.csv" --model_prefix=wsl --vocab_size=20000
# Add the [CLS], [SEP], [UNK] and [MASK] tags, or pre-training will error out
echo -e "[CLS]\t0\n[SEP]\t0\n[UNK]\t0\n[MASK]\t0\n$(cat wsl.vocab)" > wsl.vocab
# Remove the numbers, just retain the tag vocabulary
cat wsl.vocab | cut -d$'\t' -f1 > wsl.stripped.vocab
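For readers who prefer to do the vocabulary post-processing in Python, the two shell steps above (prepend the special tokens, then keep only the first tab-separated column) could be sketched like this. The function name `build_bert_vocab` and the sample lines are hypothetical. The sketch assumes the `token<TAB>score` layout that the shell commands above rely on.

```python
SPECIAL_TOKENS = ["[CLS]", "[SEP]", "[UNK]", "[MASK]"]


def build_bert_vocab(spm_vocab_lines):
    """Prepend the special tokens and drop the score column from each line."""
    tokens = list(SPECIAL_TOKENS)
    for line in spm_vocab_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        tokens.append(line.split("\t")[0])
    return tokens


# Made-up sample in the token<TAB>score layout (not real spm_train output).
sample_lines = ["<unk>\t0\n", "\u2581the\t-3.2\n", "s\t-3.9\n"]
vocab = build_bert_vocab(sample_lines)
print(len(vocab), vocab)
```

The length of the resulting list is the number that has to match the `vocab_size` given to BERT.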
Then:
Next we use the WordPiece vocabulary to pre-train a BERT model that we will then use, as a transfer learning strategy, to encode the text of Stack Overflow questions.
It is not possible to create a new conda environment from within this notebook in which to install tensorflow==1.14.0, which BERT needs, so you will need to run this code outside of the notebook, from the root directory of this project.
conda create -y -n bert python=3.7.4
conda init bash
Now, in a new shell, change directory to the root of the project:
cd /path/to/weakly_supervised_learning_code
Now run:
conda activate bert
pip install tensorflow-gpu==1.14.0
We need to configure BERT to use our vocabulary size, so we create a bert_config.json file in the bert/ directory. Then we run create_pretraining_data.py to generate the pre-training examples; the pre-training itself is a separate step performed by run_pretraining.py.
# Tell BERT how many tokens to use (20000 sentencepiece tokens + 4 special tokens)
echo '{ "vocab_size": 20004 }' > bert/bert_config.json
# Flags such as --bert_config_file, --num_train_steps, --num_warmup_steps and
# --learning_rate belong to run_pretraining.py, not to this script
python bert/create_pretraining_data.py \
--input_file=data/sentences.csv \
--output_file=data/tf_examples.tfrecord \
--vocab_file=models/wsl.stripped.vocab \
--do_lower_case=False \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--random_seed=1337
conda deactivate
@beamind @dsindex Thank you for the response. I did some searching and found one solution: huggingface/transformers#64 (comment)
Hello,
How did you do it, please? I didn't understand how to use it.
I am trying to train a language model on Urdu, but the following line is not generating the correct tokens:
tokenizer.encode("میں فوراّ واپس آوں گا").tokens
ERROR: the line above returns gibberish:
['<s>',
'ÙħÛĮÚº',
'ĠÙģÙĪØ±Ø§',
'Ùij',
'ĠÙĪØ§Ù¾Ø³',
'ĠØ¢',
'ÙĪÚº',
'Ġگا',
'</s>']
I didn't see any description of an added tokens or special tokens file in the blog. Can anyone help me fix this issue?
03/02/2020 12:21:08 - INFO - transformers.tokenization_utils - Didn't find file ./urBERTo/added_tokens.json. We won't load it.
03/02/2020 12:21:08 - INFO - transformers.tokenization_utils - Didn't find file ./urBERTo/special_tokens_map.json
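One possible reading of this output, offered as an assumption rather than a confirmed diagnosis: the `<s>`, `</s>` and `Ġ` markers look like a byte-level BPE tokenizer in the style of GPT-2 and RoBERTa. Such tokenizers display every UTF-8 byte as one printable character, so non-Latin text looks scrambled even when tokenization is working. The sketch below rebuilds that byte-to-character table in plain Python. The helper names are hypothetical and no tokenizer library is used.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2's byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Control bytes, space, etc. are moved to code points from 256 up.
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Show text the way a byte-level BPE token string would display it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token_text):
    """Invert the mapping; the input must cover whole UTF-8 characters."""
    return bytes(CHAR_TO_BYTE[c] for c in token_text).decode("utf-8")


word = "میں"  # first word of the sentence in the question
shown = to_byte_level(word)
print(shown)
print(from_byte_level(shown) == word)
```

If this reading is right, the token strings are only a display form, and decoding the ids back (for example with `tokenizer.decode(tokenizer.encode(text).ids)`) would be a quick way to check whether the original Urdu text survives the round trip.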
@samreenkazi
I do not know if BERT for the Urdu language exists, but you can check out more here.
@samreenkazi Have you found a solution? I'm facing similar issues.
No, not as yet, but I'm working on it.
It could be that you're missing a font. You can use external fonts in Google Colab. If you execute the following, can you see the correct characters in the plot? If not, you should probably import and unzip external fonts.
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
plt.plot(range(50), range(50), 'r')
plt.title('کے لیے')
plt.ylabel('دو')
plt.xlabel('ف')
plt.show()