Fairseq: KeyErrors when running multiprocessing_bpe_encoder.py

Created on 18 Nov 2020 · 10 comments · Source: pytorch/fairseq

Hi all,

I have created my own BPE vocab with the tokenizers library following the steps described here.

I am now trying to encode my corpus (made of Brazilian tweets) using the multiprocessing_bpe_encoder.py script. When doing so, the script works fine for a while and then crashes with KeyErrors:

processed 260000 lines
processed 270000 lines
processed 280000 lines
processed 290000 lines
processed 300000 lines
processed 310000 lines
processed 320000 lines
processed 330000 lines
processed 340000 lines
processed 350000 lines
processed 360000 lines
processed 370000 lines
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 117, in encode_lines
    tokens = self.encode(line)
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 101, in encode
    ids = bpe.encode(line)
  File "/scratch/mt4493/twitter_labor/code/envs/inference_env/lib/python3.7/site-packages/fairseq/data/encoders/gpt2_bpe_utils.py", line 119, in encode
    self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
  File "/scratch/mt4493/twitter_labor/code/envs/inference_env/lib/python3.7/site-packages/fairseq/data/encoders/gpt2_bpe_utils.py", line 119, in <genexpr>
    self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
KeyError: '臑'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 130, in <module>
    main()
  File "/scratch/mt4493/twitter_labor/code/pretraining/preprocessing/roberta/multiprocessing_bpe_encoder.py", line 78, in main
    for i, (filt, enc_lines) in enumerate(encoded_lines, start=1):
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 325, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/share/apps/anaconda3/2019.10/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
KeyError: '臑'

I also get another error on a different character: KeyError: '臍'.

Neither my corpus nor my encoder.json and vocab.bpe contain these characters, so I'm not sure what the problem is.
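For context, the lookup that fails in gpt2_bpe_utils.py is a plain dict indexing: every token produced by the BPE merges must exist as a key in encoder.json, and any token that doesn't immediately raises KeyError. A minimal sketch with a toy, hypothetical vocab (not the real GPT-2 encoder):

```python
# Toy stand-in for encoder.json: maps BPE tokens to integer ids.
encoder = {"he": 0, "llo": 1}

def encode_tokens(bpe_tokens, encoder):
    # Mirrors the failing line in gpt2_bpe_utils.py:
    #   self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
    # A token absent from the vocab raises KeyError instead of
    # falling back to an unknown-token id.
    return [encoder[t] for t in bpe_tokens]

print(encode_tokens(["he", "llo"], encoder))  # [0, 1]

try:
    encode_tokens(["he", "llo", "臑"], encoder)  # token missing from vocab
except KeyError as e:
    print("KeyError:", e)
```

So the error means the merges produced a symbol that the vocab file doesn't know about, i.e. the two files are out of sync with each other.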

Thanks a lot in advance for the help.

bug

All 10 comments

Are you able to share a reproducible example (original text file, encoder.json, vocab.bpe, etc), maybe via dropbox?

Hi @lematt1991 and thanks for your reply! Sure, will prepare one now and share it with you via Google Drive.

Are you able to share a reproducible example (original text file, encoder.json, vocab.bpe, etc), maybe via dropbox?

Just shared a folder containing this information (with a README) with you @lematt1991, you should have received an email. Let me know if you need anything else and thanks for your help :)

Hmm, I'm unable to reproduce this. I ran the following, where I've copied the contents of your google drive into a directory called repro, and dumped the results of build_bpe.py into repro/vocab/files:

mkdir repro/corpus
mv repro/test-aa repro/corpus/test-aa
python build_bpe.py --corpus_dir corpus
python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json repro/vocab/files/vocab.json \
    --vocab-bpe repro/vocab/files/merges.txt \
    --inputs repro/corpus/test-aa \
    --outputs repro/corpus.bpe \
    --keep-empty \
    --workers 60

This processed all the way to the end without any errors, and produced a corpus.bpe file with the same number of lines as the test-aa file. Are these the correct steps to reproduce your problem? If so, can you try upgrading your tokenizers/transformers libraries? Based on the thread you linked, it seems there was a bug that they fixed at some point.
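Since the thread linked above mentions a since-fixed tokenizers bug, one plausible cause of KeyErrors like these is a vocab.json that is missing some of the 256 byte-level base symbols that GPT-2-style byte-level BPE requires. A hypothetical sanity check (bytes_to_unicode follows the GPT-2 encoder's byte-to-unicode mapping; the toy vocab below stands in for a real vocab.json):

```python
def bytes_to_unicode():
    # GPT-2's byte-to-unicode map: printable byte ranges keep their own
    # character, the remaining bytes are shifted into 256+n codepoints,
    # giving exactly 256 distinct base symbols.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

# Toy vocab missing some base byte symbols (stand-in for a buggy vocab.json;
# with a real file you would load it via json.load instead).
vocab = {sym: i for i, sym in enumerate(list(bytes_to_unicode().values())[:200])}

missing = [s for s in bytes_to_unicode().values() if s not in vocab]
print(f"{len(missing)} base byte symbols missing from vocab")  # 56 here
```

If a real vocab.json reports any missing base symbols, encoding will raise KeyError as soon as the corpus contains a byte that maps to one of them.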

Thanks a lot for the help.

Yes, those are the correct steps. The fourth line of your bash script recreates the BPE files, but its output isn't used if you point directly at the files I gave you in repro/vocab/files. From what I understand, you were able to encode the test-aa file with my BPE files (the ones in repro/vocab/files). The only difference from my setup is that you run the script from the fairseq package with python -m examples.roberta.multiprocessing_bpe_encoder, whereas I copied the script from the repo and run python3 multiprocessing_bpe_encoder.py. Could that be the reason? I'll try python -m examples.roberta.multiprocessing_bpe_encoder and see whether I still get the errors.

If I use the vocab.json and merges.txt files that you provided in your Google Drive, I get the error you describe. It does seem to be a problem with tokenizers/transformers.

Try upgrading tokenizers. If that doesn't work, try opening an issue with them.

Got it. Thanks a lot. Out of curiosity, what are the versions of tokenizers and transformers you used?

python -c "import tokenizers; print(tokenizers.__version__)"
0.9.4

Cool. Will recreate my vocab and close this as soon as it is fixed on my side. Thanks a lot for the help :)

