Fairseq: Fail to load transformer.wmt19.en-de due to BPE code format issue

Created on 22 Oct 2019  ·  2 comments  ·  Source: pytorch/fairseq

When loading a translation model (e.g. 'transformer.wmt19.en-de'), I found that the BPE codes file does not match the format expected by subword_nmt, and I get the following exception.

Error: invalid line 1 in BPE codes file: e n</w> 1423551864
The line should exist of exactly two subword units, separated by whitespace

The subword_nmt library expects exactly 2 items per row, but each row here has 3.

Snippet of bpecodes

e n</w> 1423551864
e r 1300703664
e r</w> 1142368899
i n 1130674201
c h 933581741

Snippet of subword_nmt implementation

self.bpe_codes = [tuple(item.strip('\r\n ').split(' ')) for (n, item) in enumerate(codes) if (n < merges or merges == -1)]

for i, item in enumerate(self.bpe_codes):
    if len(item) != 2:
        sys.stderr.write('Error: invalid line {0} in BPE codes file: {1}\n'.format(i+offset, ' '.join(item)))
        sys.stderr.write('The line should exist of exactly two subword units, separated by whitespace\n')
        sys.exit(1)
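The failure is easy to reproduce outside fairseq. The following is an illustrative sketch (not fairseq or subword_nmt code itself) that applies the same split-and-check logic to the bpecodes lines above; because the WMT19 codes carry a third frequency column, every row fails the two-item check.

```python
# Sketch: run the same parsing that subword_nmt applies on the
# fastBPE-format codes shown above (pair + frequency = 3 fields).
codes = [
    "e n</w> 1423551864",
    "e r 1300703664",
    "e r</w> 1142368899",
]

bpe_codes = [tuple(line.strip('\r\n ').split(' ')) for line in codes]
invalid = [item for item in bpe_codes if len(item) != 2]
print(invalid[0])  # ('e', 'n</w>', '1423551864') -- three fields, so the check rejects it
```

Every line in the WMT19 bpecodes file has this three-field shape, which is why the very first line already triggers the error.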

Sample Code

import torch
en2de_ensemble = torch.hub.load(
    'pytorch/fairseq', 'transformer.wmt19.en-de',
    checkpoint_file='model1.pt',
    tokenizer='moses', bpe='subword_nmt')

Should I be using subword_nmt? Or did I install the wrong version of the library?

All 2 comments

You should be using fastbpe, check the example here: https://github.com/pytorch/fairseq/tree/master/examples/wmt19
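Concretely, that is a one-argument change to the sample code above: pass bpe='fastbpe' instead of bpe='subword_nmt' (with the fastBPE package installed). A hedged sketch of the corrected call; note that torch.hub.load downloads the model (several GB) on first use, so it is wrapped in a function here rather than run at import time.

```python
import torch

def load_en2de():
    # Downloads transformer.wmt19.en-de on first call (several GB).
    # 'fastbpe' matches the 3-column bpecodes format that ships with the model.
    return torch.hub.load(
        'pytorch/fairseq', 'transformer.wmt19.en-de',
        checkpoint_file='model1.pt',
        tokenizer='moses', bpe='fastbpe')
```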

pip install fastBPE is not working for me.
