Fairseq: KeyErrors when running multiprocessing_bpe_encoder.py

Created on 18 Nov 2020 · 10 comments · Source: pytorch/fairseq

Hi all,

I have created my own BPE vocab with the tokenizers library following the steps described here.

I am now trying to encode my corpus (made of Brazilian tweets) using the multiprocessing_bpe_encoder.py script. When doing so, the script works fine for a while and then crashes with KeyErrors:

processed 260000 lines
processed 270000 lines
processed 280000 lines
processed 290000 lines
processed 300000 lines
processed 310000 lines
processed 320000 lines
processed 330000 lines
processed 340000 lines
processed 350000 lines
processed 360000 lines
processed 370000 lines
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 117, in encode_lines
    tokens = self.encode(line)
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 101, in encode
    ids = bpe.encode(line)
  File "/scratch/mt4493/twitter_labor/code/envs/inference_env/lib/python3.7/site-packages/fairseq/data/encoders/gpt2_bpe_utils.py", line 119, in encode
    self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
  File "/scratch/mt4493/twitter_labor/code/envs/inference_env/lib/python3.7/site-packages/fairseq/data/encoders/gpt2_bpe_utils.py", line 119, in <genexpr>
    self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
KeyError: '臑'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 130, in <module>
    main()
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 78, in main
    for i, (filt, enc_lines) in enumerate(encoded_lines, start=1):
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 325, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
KeyError: '臑'

I also get another error on a different character: KeyError: '臍'.

Neither my corpus nor my encoder.json and vocab.bpe contain these characters, so I'm not sure what the problem is.
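For context, the lookup that fails in gpt2_bpe_utils.py is a plain dict indexing: every token produced by the BPE merges must exist as a key in encoder.json, and any token that doesn't immediately raises KeyError. A minimal sketch with a toy, hypothetical vocab (not the real GPT-2 encoder):

```python
# Toy stand-in for encoder.json: maps BPE tokens to integer ids.
encoder = {"he": 0, "llo": 1}

def encode_tokens(bpe_tokens, encoder):
    # Mirrors the failing line in gpt2_bpe_utils.py:
    #   self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
    # A token absent from the vocab raises KeyError instead of
    # falling back to an unknown-token id.
    return [encoder[t] for t in bpe_tokens]

print(encode_tokens(["he", "llo"], encoder))  # [0, 1]

try:
    encode_tokens(["he", "llo", "臑"], encoder)  # token missing from vocab
except KeyError as e:
    print("KeyError:", e)
```

So the error means the merges produced a symbol that the vocab file doesn't know about, i.e. the two files are out of sync with each other.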

Thanks a lot in advance for the help.

bug

All 10 comments

Are you able to share a reproducible example (original text file, encoder.json, vocab.bpe, etc), maybe via dropbox?

Hi @lematt1991 and thanks for your reply! Sure, will prepare one now and share it with you via Google Drive.

Are you able to share a reproducible example (original text file, encoder.json, vocab.bpe, etc), maybe via dropbox?

Just shared a folder containing this information (with a README) with you @lematt1991, you should have received an email. Let me know if you need anything else and thanks for your help :)

Hmm, I'm unable to reproduce this. I ran the following, where I've copied the contents of your google drive into a directory called repro, and dumped the results of build_bpe.py into repro/vocab/files:

mkdir repro/corpus
mv repro/test-aa repro/corpus/test-aa
python build_bpe.py --corpus_dir corpus
python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json repro/vocab/files/vocab.json \
    --vocab-bpe repro/vocab/files/merges.txt \
    --inputs repro/corpus/test-aa \
    --outputs repro/corpus.bpe \
    --keep-empty \
    --workers 60

This processed all the way to the end without any errors, and produced a corpus.bpe file with the same number of lines as the test-aa file. Are these the correct steps to reproduce your problem? If so, can you try upgrading your tokenizers/transformers libraries? Based on the thread you linked, it seems there was a bug that they fixed at some point.
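Since the thread linked above mentions a since-fixed tokenizers bug, one plausible cause of KeyErrors like these is a vocab.json that is missing some of the 256 byte-level base symbols that GPT-2-style byte-level BPE requires. A hypothetical sanity check (bytes_to_unicode follows the GPT-2 encoder's byte-to-unicode mapping; the toy vocab below stands in for a real vocab.json):

```python
def bytes_to_unicode():
    # GPT-2's byte-to-unicode map: printable byte ranges keep their own
    # character, the remaining bytes are shifted into 256+n codepoints,
    # giving exactly 256 distinct base symbols.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

# Toy vocab missing some base byte symbols (stand-in for a buggy vocab.json;
# with a real file you would load it via json.load instead).
vocab = {sym: i for i, sym in enumerate(list(bytes_to_unicode().values())[:200])}

missing = [s for s in bytes_to_unicode().values() if s not in vocab]
print(f"{len(missing)} base byte symbols missing from vocab")  # 56 here
```

If a real vocab.json reports any missing base symbols, encoding will raise KeyError as soon as the corpus contains a byte that maps to one of them.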

Thanks a lot for the help.

Yes, those are the correct steps. The fourth line of your bash script recreates the BPE files, but its output isn't used if you point directly at the files I gave you in repro/vocab/files. From what I understand, you were able to encode the test-aa file with my BPE files (the ones in repro/vocab/files). The only difference from my setup is that you run the script from the fairseq package with python -m examples.roberta.multiprocessing_bpe_encoder, whereas I copied the script from the repo and run python3 multiprocessing_bpe_encoder.py. Could that be the reason? I'll try python -m examples.roberta.multiprocessing_bpe_encoder and see whether I still get the errors.

If I use the vocab.json and merges.txt files that you provided in your Google Drive, I get the error you describe. It does seem to be a problem with tokenizers/transformers.

Try upgrading tokenizers. If that doesn't work, try opening an issue with them.

Got it. Thanks a lot. Out of curiosity, what are the versions of tokenizers and transformers you used?

python -c "import tokenizers; print(tokenizers.__version__)"
0.9.4

Cool. Will recreate my vocab and close this as soon as it is fixed on my side. Thanks a lot for the help :)

