Fairseq: Using Levenshtein Transformer for APE

Created on 22 Apr 2020 · 5 Comments · Source: pytorch/fairseq

🚀 Feature Request

The original paper claims that the unaltered Levenshtein Transformer architecture can be used for refinement; for example, the last sentence of their abstract:

We further confirm the flexibility of our model by showing a Levenshtein Transformer trained by machine translation can straightforwardly be used for automatic post-editing.

Is this capability supported? If so, could an example be added in the appropriate README? If not, are there any plans to work on it?

Motivation

Being able to translate and post-edit with the same model would save a lot of resources and could help push the state of the art in APE.

Alternatives

Ideally the code addition would only impact inference, in contrast to traditional APE approaches where the training regime changes substantially (using (source, mt, pe) triples instead of (source, target) pairs). This is supported by section 4.2 of the original paper, which shows promising refinement results in a “zero-shot post-editing” setting.

Labels: enhancement, help wanted

All 5 comments

CC @kahne

I was about to request this as well. If there are no plans to include this capability in fairseq, could some pointers be provided on how one could modify the code to achieve this themselves? E.g. where/how to modify the loss function and the decoding layers.

Please correct me if I am wrong, but from what I understood from the paper the modification is actually much simpler (no modification of the loss or decoding layers is required). The only thing to change is the initialisation of y_0 at decoding: instead of the empty sequence, it should be initialised with the mt output to be refined. It would be amazing if someone could refute/confirm this though.
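
For illustration, a minimal sketch of what that initialisation change could look like. This assumes a fairseq version where the Levenshtein Transformer exposes `initialize_output_tokens` returning a `DecoderOut` namedtuple (the import path and field names may differ across releases), and `mt_tokens` is a hypothetical extra argument carrying the hypothesis to refine:

```python
# Sketch only: start iterative refinement from the MT hypothesis (y_0 = mt)
# instead of the usual <bos> <eos> canvas. Assumes the fairseq NAT code path
# where initialize_output_tokens() returns a DecoderOut namedtuple; in older
# releases the model lives under fairseq.models.levenshtein_transformer instead.
from fairseq.models.nat import LevenshteinTransformerModel


class LevenshteinPostEditor(LevenshteinTransformerModel):
    def initialize_output_tokens(self, encoder_out, src_tokens, mt_tokens=None):
        out = super().initialize_output_tokens(encoder_out, src_tokens)
        if mt_tokens is None:
            return out  # standard translation: empty canvas
        # zero-shot post-editing: refine the given hypothesis instead
        mt_scores = mt_tokens.new_zeros(mt_tokens.size()).type_as(out.output_scores)
        return out._replace(output_tokens=mt_tokens, output_scores=mt_scores)
```

The generator (fairseq's IterativeRefinementGenerator) would still need to pass the hypothesis through to this call; that plumbing is omitted here.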

In Algorithm 1 (section A of the appendix to the paper), it looks like y_0 (the initial sequence, i.e. the mt output) is involved in the calculation of y_ins and y_del, which are used to evaluate the loss. I'm not sure if there are other places where the existence of a non-empty y_0 makes a difference, both in terms of the theory and the implementation (maybe extra modifications are required to data preprocessing/batching?).
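
To make that training-side concern concrete, here is a plain-Python illustration (not fairseq's actual oracle code) of how a non-empty y_0 would feed the deletion targets: hypothesis tokens that cannot be aligned to the reference under a longest-common-subsequence alignment get labelled for deletion, and the insertion targets would come from the reference tokens skipped by the same alignment:

```python
# Illustrative only: derive y_del labels for a non-empty initial canvas y_0.
# 1 = delete this hypothesis token, 0 = keep it.
def deletion_labels(hyp, ref):
    n, m = len(hyp), len(ref)
    # longest-common-subsequence dynamic programme
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if hyp[i] == ref[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    labels, i, j = [], 0, 0
    while i < n:
        if j < m and hyp[i] == ref[j]:
            labels.append(0)      # aligned to the reference: keep
            i, j = i + 1, j + 1
        elif j < m and lcs[i][j] == lcs[i][j + 1]:
            j += 1                # reference token to be handled by insertion
        else:
            labels.append(1)      # not alignable: mark for deletion
            i += 1
    return labels


# deletion_labels("the cat sat".split(), "the cat sat down".split())  -> [0, 0, 0]
# deletion_labels("a cat sat on".split(), "the cat sat".split())      -> [1, 0, 0, 1]
```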

@steremma @wjm41 Thanks for bringing this up! This feature is actually on our to-do list, but unfortunately we are waiting for more bandwidth to finish refactoring our old code to the latest fairseq.

As you pointed out, the main changes required for APE are: 1) (source, mt, pe) triplets in the dataset class; 2) initialise the decoder with mt outputs instead of empty sentences. Please feel free to make a pull request for your implementation and let us know if you need any clarification of the implementation details. Thanks!
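
As a rough illustration of the first change, each training example would need to carry the hypothesis alongside the usual pair; the field names below are illustrative only, and a real implementation would wrap these as tensors in a fairseq dataset/collater rather than a plain dataclass:

```python
# Sketch only: the shape of one APE training example. A real fairseq dataset
# class would binarise these and expose the hypothesis in the collated batch
# (e.g. as an extra "mt_tokens" tensor) so the decoder can be initialised from it.
from dataclasses import dataclass
from typing import List


@dataclass
class ApeSample:
    src: List[int]  # source sentence (token ids)
    mt: List[int]   # machine-translated hypothesis to refine (initial canvas y_0)
    pe: List[int]   # human post-edited reference (training target)
```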

add @MultiPath
