Fairseq: alignment indices out of range for sequence of tokens

Created on 22 Mar 2020 · 6 comments · Source: pytorch/fairseq

🐛 Bug

I would like to use the predicted alignment to match tokens between the input and the output.

To Reproduce

To get alignments, I modified hub_utils.py, changing the following line:
https://github.com/pytorch/fairseq/blob/42f65d65776327598a2d3ded2e92e5818c70a125/fairseq/hub_utils.py#L133

I replaced it with the following code, which returns not only the tokens but also the alignment:

return [
    (self.decode(hypos[0]['tokens']), hypos[0]['alignment'])
    for hypos in batched_hypos
]

With this change, the following code will work:

import torch

# Requires the hub_utils modification above, so that translate()
# returns the alignment alongside the hypothesis.
en2fr = torch.hub.load('pytorch/fairseq', 'transformer.wmt14.en-fr', tokenizer='moses', bpe='subword_nmt')

en = "The sky had changed from clear, sunny cold, to driving sleet and mist. Wrapping myself in my shaggy jacket of the cloth called bearskin, I fought my way against the stubborn storm."

fr, alignment = en2fr.translate(en, beam=1, print_alignment=True)

# Moses-tokenize both sides and try to pair tokens via the alignment
en_tokens = en2fr.tokenize(en).split()
fr_tokens = en2fr.tokenize(fr).split()

for a, b in alignment:
    if a >= len(en_tokens) or b >= len(fr_tokens):
        print('out of range')
        continue
    print(en_tokens[a], fr_tokens[b])

Some of the alignments make sense. For example, it is matching:

  • clear clair
  • sunny ensoleillé

But most of the time it prints "out of range".

Expected behavior

Alignment indices that are in range for the given tokens. This doesn't necessarily mean that the alignments make perfect sense, but I would expect them to at least be valid indices for the given data.

Environment

  • fairseq Version: 0.9.0
  • PyTorch Version: 1.4.0
  • OS: Linux
  • How you installed fairseq: pip
  • Python version: 3.7.6
  • CUDA/cuDNN version: 10.1 and 7.5
  • GPU models and configuration: T2000 (mobile gpu)

All 6 comments

Here is an easier way to reproduce; execute the following in a terminal:

curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -
MODEL_DIR=wmt14.en-fr.fconv-py
fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --beam 5 --source-lang en --target-lang fr \
    --tokenizer moses \
    --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes \
    --print-alignment

As an example, you can enter "There is a bird in the tree"

This produces the alignments: 0-0 0-1 1-2 3-3 3-4 3-5 4-6 6-7 6-8
Curiously, my hacky code version produces different alignments:
(3, 0), (1, 1), (1, 2), (3, 3), (3, 4), (5, 5), (4, 6), (6, 7), (6, 8)

But in both cases the alignments contain 6-8, which is out of range for the given tokens.

It's because the target-side BPE is removed when printing. If you comment out the decode_fn line in interactive.py, then you get:

There is a bird in the tree
S-0     There is a bird in the tree
H-0     -0.17229345575966476    Il y a un ois@@ eau dans l' arbre
P-0     -0.8115 -0.1816 -0.0239 -0.0831 -0.0740 -0.0003 -0.0617 -0.1506 -0.0053 -0.3309
A-0     0-0 0-1 1-2 1-3 3-4 3-5 4-6 6-7 6-8

I'll submit a fix.
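
In the meantime, a rough workaround is to index the alignment against the BPE-level tokens on both sides, since those are what the indices refer to. A sketch using the hub example from the issue, assuming the hub interface's tokenize and apply_bpe helpers round-trip the hypothesis cleanly:

# Workaround sketch (not the official API usage): re-apply the same
# preprocessing so the indices line up with the BPE-level tokens the
# alignment was actually computed over.
en_bpe = en2fr.apply_bpe(en2fr.tokenize(en)).split()
fr_bpe = en2fr.apply_bpe(en2fr.tokenize(fr)).split()

for a, b in alignment:
    print(en_bpe[a], fr_bpe[b])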

Hi, thanks for the quick answer and help, this is awesome :)

I've added a similar workaround to my local fairseq installation, also outputting both the original and detokenized hypo.

But to be honest, it's not that useful. I would like to use the alignment for further text processing, and having the BPE in there is a problem. If the text still contains those weird @@ encoding symbols, it's just not readable.

Is it possible to get an alignment for the detokenized version?

I've read up on your BPE code, but as far as I can see it's completely independent of anything alignment-related. Is there maybe a way to trace which token ends up becoming which in the detokenized version?

Hm, after looking at some translations, it seems as if BPE only ever adds @@, potentially more than once.

In fact, the bpe object has a bpe_symbol member that seems to be constant.
So does this perform only a single step of BPE?
Can I just merge tokens that end in the bpe_symbol and adjust the alignment accordingly?
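
Something like the following sketch is what I have in mind (purely illustrative, not fairseq code):

def bpe_to_word_map(bpe_tokens, bpe_symbol='@@'):
    # Map each sub-word position to the index of the word it belongs
    # to, assuming the subword_nmt convention where a token ending in
    # "@@" continues into the next token.
    mapping, word_idx = [], 0
    for tok in bpe_tokens:
        mapping.append(word_idx)
        if not tok.endswith(bpe_symbol):
            word_idx += 1  # this token closes the current word
    return mapping

# e.g. ['ois@@', 'eau'] -> [0, 0]: both sub-words belong to word 0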

And sorry for the off-topic question, but why are you encoding with BPE in the first place?
I've debug-stepped through the encoding code; it seems to do dictionary lookups on the BPE-encoded tokens.
This is completely new to me. So far I have only seen pipelines like this:
raw text -> sequence of tokens -> sequence of dictionary indices -> sequence of embedding vectors -> model

Doesn't BPE interfere with the dictionary lookup, i.e. with the embedding?
If there is a paper that elaborates on this, I would be very interested.

Here is the original paper that introduced BPE: https://arxiv.org/abs/1508.07909. It's pretty widely used these days in machine translation because it lets you represent a much larger vocabulary without training a separate embedding for every word variant. For example, instead of having an embedding each for learning, learned, and learns, you might have one for learn@@ and then a few for common suffixes (ing, ed, s). The @@ symbol just means that the token is a non-terminal BPE token, so it can safely be removed.
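
As a tiny illustration of that removal (not fairseq's exact code, which lives in its string utilities):

# '@@ ' marks a sub-word that continues into the next token;
# deleting the marker glues the pieces back into whole words.
bpe_line = "Il y a un ois@@ eau dans l' arbre"
print(bpe_line.replace('@@ ', ''))  # Il y a un oiseau dans l' arbre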

Re: alignment, you might look into the paper Jointly Learning to Align and Translate with Transformer Models. The reference implementation is in fairseq: https://github.com/pytorch/fairseq/tree/master/examples/joint_alignment_translation. One thing I noticed from that paper is that in Section 5.1 they write:

For all setups and models used in this work, we learn a joint source and target Byte-Pair-Encoding (BPE, Sennrich et al. (2016)) with 32k merge operations. We observe that even for statistical alignment models sub-word units are beneficial. To convert the alignments from sub-word-level back to word-level, we consider each target word as being aligned to a source word if an alignment between any of the target sub-words and source sub-words exists.
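
In code, that rule is short; here is a rough sketch that reuses a sub-word-to-word index mapping like the one proposed earlier in this thread (hypothetical helper names):

def subword_to_word_alignment(alignment, src_map, tgt_map):
    # A target word is aligned to a source word if ANY of their
    # sub-words are aligned (the rule quoted above).
    return sorted({(src_map[a], tgt_map[b]) for a, b in alignment})

# e.g. the pairs 3-4 and 3-5 above (bird -> ois@@ / eau) collapse
# into the single word-level pair (3, 4): bird -> oiseau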

Ah, that makes sense.
OK, then I'll be able to adjust the alignment accordingly.

And thanks a lot for the papers!
