Fairseq: Preprocessing using SentencePiece

Created on 18 Jan 2019 · 13 comments · Source: pytorch/fairseq

Hi
I am using SentencePiece to preprocess the raw text into BPE'd text and then use it to train a model in fairseq, English to Chinese, so there would be two vocabularies, one per language. How can these vocabularies be used to generate the indexed and binarized training files via preprocess.py (after applying SentencePiece to the raw text without tokenization)? When I explicitly provide the dicts (which are the SentencePiece outputs), the number of types in the dictionary gets reduced from 32k to 31939 and preprocessing fails. What's the right way to preprocess?
Also, 9.3% of the tokens are replaced by OOV.
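
For reference, a minimal sketch of the kind of SentencePiece preprocessing described above (file names, vocabulary size and model type are illustrative, not taken from the issue):

# learn a 32k BPE model per language directly on the raw, untokenized text
spm_train --input=train.raw.en --model_prefix=spm.en --vocab_size=32000 --model_type=bpe
spm_train --input=train.raw.zh --model_prefix=spm.zh --vocab_size=32000 --model_type=bpe --character_coverage=0.9995

# apply the models to produce the BPE'd text that preprocess.py will binarize
spm_encode --model=spm.en.model --output_format=piece < train.raw.en > train.bpe.en
spm_encode --model=spm.zh.model --output_format=piece < train.raw.zh > train.bpe.zh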

Most helpful comment

You should probably regenerate the dictionary to get the exact number of units, otherwise you'll have embeddings in your model that won't be trained.

If you want to reuse the sentencepiece dictionary, you can easily convert it to the fairseq format. The main difference is that fairseq uses the format <token> <frequency> (separated by a space), whereas sentencepiece uses <token>\t<negative_id> (separated by a tab). Fairseq uses the frequency column to do filtering, so you can simply create a new dictionary with a dummy count of 100 or something. You also need to remove <unk>, <s> and </s> from the sentencepiece dictionary:

cut -f1 sentencepiece.vocab | tail -n +4 | sed "s/$/ 100/g" > fairseq.vocab

I'll also be merging a commit shortly that adds a --remove-bpe=sentencepiece option to generate.py so that you can detokenize the sentencepiece output during generation.

All 13 comments

Hi @gvskalyan

I had exactly the same issue. The problem is that preprocess.py creates its vocabularies looking at only the generated sub-tokens in the training corpus. With sub-word tokenizers, however, there could be more symbols than the ones generated (e.g. some implementations add the alphabet to the symbol list, to be able to encode even unseen, very odd words).

With this pull request https://github.com/pytorch/fairseq/pull/448 I proposed a customizable preprocess.py implementation.

The solution I came up with is to create a custom TranslationTask that builds a custom implementation of Dictionary, which loads the sub-word processor's vocabulary. The limitation is that the "tokenization" is still a simple whitespace tokenizer and cannot be changed at the moment.
So basically my pipeline is:

  1. Create a custom Dictionary class that implements the sub-word policy, and a custom Task (e.g. my_custom_task) that loads it.
  2. Create the sub-word processor/dictionary independently from fairseq and sub-word split the whole training corpus (e.g. producing train.subtok.en and train.subtok.fr).
  3. Invoke preprocess.py on train.subtok.en and train.subtok.fr with the options --join-dictionary, --srcdict /path/to/previously/created/vocab and --task my_custom_task.

This way preprocess.py will use my custom implementation of Dictionary and will encode sub-tokens accordingly (see the sketch of steps 2 and 3 below).
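
For reference, a hedged sketch of steps 2 and 3 of this pipeline (paths, languages and the task name are illustrative; note that stock fairseq spells the joint-dictionary flag --joined-dictionary):

# 2. sub-word split the training corpus with the externally built processor
spm_encode --model=spm.model < train.en > train.subtok.en
spm_encode --model=spm.model < train.fr > train.subtok.fr

# 3. binarize with the previously created vocabulary and the custom task
python preprocess.py --source-lang en --target-lang fr \
  --trainpref train.subtok \
  --srcdict /path/to/previously/created/vocab --joined-dictionary \
  --task my_custom_task --destdir data-bin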

The problem is that preprocess.py creates its vocabularies looking at only the generated sub-tokens in the training corpus.

After training, only the subword units that appeared in the training set would have their embeddings trained, while the other subword units would have random embeddings. Why is it beneficial to add these missing subword units to the vocab?

@gvskalyan what is the error message you see?

It fails in a call to the .finalize method of the Dictionary in fairseq.data.
The question is how the SentencePiece-generated dictionary can be used directly (just like one generated by preprocess.py), or whether the dictionary should be regenerated to get the exact number of required SentencePiece units.

You should probably regenerate the dictionary to get the exact number of units, otherwise you'll have embeddings in your model that won't be trained.

If you want to reuse the sentencepiece dictionary, you can easily convert it to the fairseq format. The main difference is that fairseq uses the format <token> <frequency> (separated by a space), whereas sentencepiece uses <token>\t<negative_id> (separated by a tab). Fairseq uses the frequency column to do filtering, so you can simply create a new dictionary with a dummy count of 100 or something. You also need to remove <unk>, <s> and </s> from the sentencepiece dictionary:

cut -f1 sentencepiece.vocab | tail -n +4 | sed "s/$/ 100/g" > fairseq.vocab

I'll also be merging a commit shortly that adds a --remove-bpe=sentencepiece option to generate.py so that you can detokenize the sentencepiece output during generation.
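
For reference, a hedged sketch of how the converted dictionary and the new flag fit together (paths, languages and checkpoint names are illustrative):

# binarize with the converted sentencepiece vocabulary on both sides
python preprocess.py --source-lang en --target-lang zh \
  --trainpref train.bpe --validpref valid.bpe --testpref test.bpe \
  --srcdict fairseq.vocab --tgtdict fairseq.vocab --destdir data-bin

# at generation time, strip the sentencepiece segmentation from the output
python generate.py data-bin --path checkpoints/checkpoint_best.pt \
  --remove-bpe=sentencepiece --beam 5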

According to this, is there any reduction in vocabulary size from the sentencepiece vocabulary to the generated one?
Does this affect performance / the number of unknowns, since the BPE dictionary provided by SPM would not be used at all, and flags like thresholds and nwords would produce a different vocab than the intended one?
@myleott Also, please provide these SPM scripts in master: https://github.com/facebookresearch/flores/tree/master/scripts

In this case it will be the same size, since the same training set is used to learn the sentencepiece vocabulary as was used to build the fairseq dictionary, and because we don't have any threshold/nwords filters.

If you had such filters, or if the sentencepiece vocab was built on a different dataset, then it's possible that the fairseq dictionary could be smaller than the sentencepiece one. Usually if a token doesn't appear in your training set (or appears too infrequently), then if you artificially keep it in the vocab you will end up with a random embedding for that token (since it never gets trained). In that case you're better off encoding it as an unknown token.
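
For reference, a hedged sketch of what such filtering looks like when preprocess.py is allowed to regenerate the dictionary itself (the threshold and vocabulary-size values are illustrative):

# drop sub-tokens seen fewer than 5 times and cap each vocabulary at 32k entries
python preprocess.py --source-lang en --target-lang zh \
  --trainpref train.bpe --validpref valid.bpe \
  --thresholdsrc 5 --thresholdtgt 5 \
  --nwordssrc 32000 --nwordstgt 32000 \
  --destdir data-bin-filtered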

Hi everyone, I now want to customize preprocess.py to receive not only the word text file but also its tag file. I want to combine words and POS tags to improve translation performance. Are there any options for customizing it?

Hi lengockyquang, I'm facing the same problem. Were you able to find a way to build a custom preprocess module?

For my problem, I've created pretrained embeddings for my translation task. The whole of fairseq is too complicated for me to customize without instructions :(

Thank you for your reply! May I ask, after you obtained the embeddings, how did you load them with fairseq-preprocess?

You can use pretrained embeddings with the fairseq-train --encoder-embed-path option.
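
For reference, a hedged sketch of loading pretrained embeddings at training time rather than at preprocessing time (architecture, dimensions and file names are illustrative; fairseq expects the embedding file to be a text file whose first line is a header, followed by one "token value1 ... valueN" line per token):

# embeddings.en.txt / embeddings.zh.txt:
#   32000 512
#   ▁the 0.018 -0.427 ... (512 values)
fairseq-train data-bin --arch transformer \
  --encoder-embed-path embeddings.en.txt --encoder-embed-dim 512 \
  --decoder-embed-path embeddings.zh.txt --decoder-embed-dim 512 \
  --optimizer adam --lr 0.0005 --max-tokens 4096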

@davidecaroselli does using sentencepiece affect the convergence of the model, or does it take longer training time to attain the same loss? It might also be due to the difference in the dictionaries' tokens.
Also, if I want to reduce the dictionary size by reducing the number of BPE codes, less frequent tokens would be split and dropped when creating the dictionary, so that inference can be faster, e.g. going from 38000 to 32000 tokens (with shared embeddings across encoder and decoder). Is this advised?
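
For reference, a hedged sketch of the setup this question describes, i.e. learning a smaller joint BPE model and sharing one dictionary (and the embeddings) across encoder and decoder (sizes and file names are illustrative):

# learn a single, smaller BPE model over both languages
cat train.raw.en train.raw.zh > train.raw.both
spm_train --input=train.raw.both --model_prefix=spm.joint --vocab_size=32000 --model_type=bpe

# encode both sides with the same model, then binarize with a joined dictionary
spm_encode --model=spm.joint.model < train.raw.en > train.bpe.en
spm_encode --model=spm.joint.model < train.raw.zh > train.bpe.zh
python preprocess.py --source-lang en --target-lang zh \
  --trainpref train.bpe --joined-dictionary --destdir data-bin-joint

# share embeddings across encoder and decoder at training time
fairseq-train data-bin-joint --arch transformer --share-all-embeddings \
  --optimizer adam --lr 0.0005 --max-tokens 4096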

Hi @myleott, I had another question: is the "order" of the mapping from the dictionary to the embedding matrix also based on frequency? I.e. if I extract the embedding matrix, do the embeddings after the first 4 special tokens correspond to the tokens in the dictionary (which are already sorted by frequency) in the same order?
