Looks like a FunctionConverter refactor broke quantized text translation.
Before (https://github.com/pytorch/glow/commit/96377be74fde8678a50ebf43e5b6f1bc62b40a9a):
$ ./bin/text-translator -m=en2gr -load_profile=en2gr.yaml --do_not_quantize_nodes=Add
Enter a sentence in English to translate to German: My name is Jordan .
mein Name ist Jordan .
After (https://github.com/pytorch/glow/commit/2a8f7d52246fe1014d0b95e11ce2effa0968139b):
$ ./bin/text-translator -m=en2gr -load_profile=en2gr.yaml --do_not_quantize_nodes=Add
Enter a sentence in English to translate to German: My name is Jordan .
" f眉r , , der , . , .
@qcolombet Do you have any initial feeling/guess as to why this broke? The commit that appears to have broken it was intended to be NFC.
No idea, I'll take a look.
Oh, the one thing that could happen is the profile needs to be regenerated.
I regenerated it and unfortunately it's still broken 馃槙
Thanks for double checking. Looking!
For me 96377be is already acting strange (though not as strange as 2a8f7d5):
Enter a sentence in English to translate to German: My name is Jordan .
mein Name ist der Jordan @-@ Konzern , der sich mit der Verarbeitung
Anyway, investigating.
I think I found the bug. I believe this is another instance of #1832 where essentially we decide new quantization parameters for the result of an operation and apply them, whereas the execution expects the output parameters to be the same as the input parameters. This instance of the bug is particularly nasty because usually the operators that have such constraints check it in the verifier, whereas this one was not checked :(.
The problematic node in that case is Slice.
We assign a new scale to its result, whereas the generated code doesn't perform any rescale internally, thus we should have the same scale and offset as the inputs.
The fix is trivial (will post a PR soon), but I will double check if we have more nodes in that situation and add the proper verifier checks.
@qcolombet Thanks for investigating this!
I'm now also wondering why your output using https://github.com/pytorch/glow/commit/96377be74fde8678a50ebf43e5b6f1bc62b40a9a is different than mine...
I'm now also wondering why your output using 96377be is different than mine...
Yeah, that part was weird. Also I tried to run a debug version of the text-translator and I ended up with complete garbage. It smells like uninitialized memory of some sort, but it could simply be my build being in a weird shape!
After looking closer, we have theoretically the same bug for transpose and reshape. It wouldn't manifest though because right now our quantization profiler won't use different a scale and offset for them. If you apply a profile manually baked, you can expose the bug though.
Fix is in #1999.
Hey @jfix71! I didn't mark #1999 as fixing this issue, because I still get the garbage output after the correct translation is printed, thus, consistent with what I saw with 96377be.
If you can try #1999 on your side and see if you still have an issue that would at least means that I am looking at something different :).
@qcolombet It's working for me with https://github.com/pytorch/glow/pull/1999, thanks!! Now we should investigate why we're seeing different output...
Essentially I think the eos token doesn鈥檛 go through. Why, I don鈥檛 know, but that鈥檚 work for another day :).
All right, found that was the issue for my gibberish outputs.
I was running a buggy profile that was performing a rescale with a very small scale that ended up clipping EOS on something else x).
Huh, as in you were using a profile that was dumped with a different version of the text-translator? Not something we need a bug fix for?
Huh, as in you were using a profile that was dumped with a different version of the text-translator?
A buggy text-translator version, yes :).
Not something we need a bug fix for?
We may have something to fix, but I don't know how to reproduce the problem :S.
To be honest, I don't even know how I ended up with a broken version of the profile. I know I was running a broken compiler (the quantization bug fixed with #1999), but I don't see how this version of text-translator would have produced a buggy-profile given the float version ran just fine. This doesn't make sense.
However, when narrowing the bug I saw that one of the rescale was mapping a range from ~[-0.7, 0.7] to ~[-0.01, 0.01] whereas we were just taking a tensorview out of it, and that didn't seem right :).
At this point I figured my profile was broken and I regenerated it.
The one thing that might have happened is the profile file may not have been properly flushed given I quit the program with Ctrl-C.
Bottom line is the only thing we could fix is how to quit the text-translator :).
Yeah, and that's already been fixed -- so I guess we just hope it was due to Ctrl-C and move on 馃槄
@qcolombet @jfix71 To avoid such profile vs NN model mismatches in the future, could we compute and store some kind of a hash (over the graph) in the profile and then, when we try to load a profile, we would compare the stored hash with a hash of the current graph to see if they match?
@opti-mix Not sure it would have caught this problem, but I like that. It could ensure that there are no issues regardless of why the graph changed -- e.g. because a different model was loaded, or because a graph optimization changed, etc.