Glow: [TextTranslator] Quantized text translation is broken

Created on 8 Nov 2018 · 20Comments · Source: pytorch/glow

Looks like a FunctionConverter refactor broke quantized text translation.

Before (https://github.com/pytorch/glow/commit/96377be74fde8678a50ebf43e5b6f1bc62b40a9a):

$ ./bin/text-translator -m=en2gr -load_profile=en2gr.yaml --do_not_quantize_nodes=Add
Enter a sentence in English to translate to German: My name is Jordan .
mein Name ist Jordan .

After (https://github.com/pytorch/glow/commit/2a8f7d52246fe1014d0b95e11ce2effa0968139b):

$ ./bin/text-translator -m=en2gr -load_profile=en2gr.yaml --do_not_quantize_nodes=Add
Enter a sentence in English to translate to German: My name is Jordan .
&quot; für , , der , . , .

Source

jfix71

All 20 comments

@qcolombet Do you have any initial feeling/guess as to why this broke? The commit that appears to have broken it was intended to be NFC.

jfix71 on 9 Nov 2018

No idea, I'll take a look.

qcolombet on 9 Nov 2018

Oh, the one thing that could happen is the profile needs to be regenerated.

qcolombet on 9 Nov 2018

I regenerated it and unfortunately it's still broken 😕

jfix71 on 9 Nov 2018

Thanks for double checking. Looking!

qcolombet on 9 Nov 2018

For me 96377be is already acting strange (though not as strange as 2a8f7d5):

Enter a sentence in English to translate to German: My name is Jordan .
mein Name ist der Jordan @-@ Konzern , der sich mit der Verarbeitung

Anyway, investigating.

qcolombet on 9 Nov 2018

I think I found the bug. I believe this is another instance of #1832 where essentially we decide new quantization parameters for the result of an operation and apply them, whereas the execution expects the output parameters to be the same as the input parameters. This instance of the bug is particularly nasty because usually the operators that have such constraints check it in the verifier, whereas this one was not checked :(.

The problematic node in that case is Slice.
We assign a new scale to its result, whereas the generated code doesn't perform any rescale internally, thus we should have the same scale and offset as the inputs.
The fix is trivial (will post a PR soon), but I will double check if we have more nodes in that situation and add the proper verifier checks.

qcolombet on 9 Nov 2018

👍1

@qcolombet Thanks for investigating this!

I'm now also wondering why your output using https://github.com/pytorch/glow/commit/96377be74fde8678a50ebf43e5b6f1bc62b40a9a is different than mine...

jfix71 on 9 Nov 2018

I'm now also wondering why your output using 96377be is different than mine...

Yeah, that part was weird. Also I tried to run a debug version of the text-translator and I ended up with complete garbage. It smells like uninitialized memory of some sort, but it could simply be my build being in a weird shape!

qcolombet on 9 Nov 2018

After looking closer, we have theoretically the same bug for transpose and reshape. It wouldn't manifest though because right now our quantization profiler won't use different a scale and offset for them. If you apply a profile manually baked, you can expose the bug though.

Fix is in #1999.

qcolombet on 10 Nov 2018

Hey @jfix71! I didn't mark #1999 as fixing this issue, because I still get the garbage output after the correct translation is printed, thus, consistent with what I saw with 96377be.
If you can try #1999 on your side and see if you still have an issue that would at least means that I am looking at something different :).

qcolombet on 10 Nov 2018

@qcolombet It's working for me with https://github.com/pytorch/glow/pull/1999, thanks!! Now we should investigate why we're seeing different output...

jfix71 on 10 Nov 2018

👍1

Essentially I think the eos token doesn’t go through. Why, I don’t know, but that’s work for another day :).

qcolombet on 10 Nov 2018

👍1

All right, found that was the issue for my gibberish outputs.
I was running a buggy profile that was performing a rescale with a very small scale that ended up clipping EOS on something else x).

qcolombet on 14 Nov 2018

Huh, as in you were using a profile that was dumped with a different version of the text-translator? Not something we need a bug fix for?

jfix71 on 14 Nov 2018

Huh, as in you were using a profile that was dumped with a different version of the text-translator?

A buggy text-translator version, yes :).

Not something we need a bug fix for?

We may have something to fix, but I don't know how to reproduce the problem :S.

To be honest, I don't even know how I ended up with a broken version of the profile. I know I was running a broken compiler (the quantization bug fixed with #1999), but I don't see how this version of text-translator would have produced a buggy-profile given the float version ran just fine. This doesn't make sense.

However, when narrowing the bug I saw that one of the rescale was mapping a range from ~[-0.7, 0.7] to ~[-0.01, 0.01] whereas we were just taking a tensorview out of it, and that didn't seem right :).
At this point I figured my profile was broken and I regenerated it.

The one thing that might have happened is the profile file may not have been properly flushed given I quit the program with Ctrl-C.

qcolombet on 14 Nov 2018

Bottom line is the only thing we could fix is how to quit the text-translator :).

qcolombet on 14 Nov 2018

Yeah, and that's already been fixed -- so I guess we just hope it was due to Ctrl-C and move on 😅

jfix71 on 14 Nov 2018

@qcolombet @jfix71 To avoid such profile vs NN model mismatches in the future, could we compute and store some kind of a hash (over the graph) in the profile and then, when we try to load a profile, we would compare the stored hash with a hash of the current graph to see if they match?

opti-mix on 14 Nov 2018

@opti-mix Not sure it would have caught this problem, but I like that. It could ensure that there are no issues regardless of why the graph changed -- e.g. because a different model was loaded, or because a graph optimization changed, etc.

jfix71 on 14 Nov 2018

Was this page helpful?

0 / 5 - 0 ratings