Glow: Should lower/optimize behave differently based on the compilation mode?

Created on 23 Jul 2018 · 5Comments · Source: pytorch/glow

As of today our API takes the CompilationMode for both the lowering and optimization phase (glow::lower and glow::optimize).
Except for one thing in the lowering, which I will come back later, this argument does change the behavior of these functions and thus, I was wondering if we should pass it at all.
My way of thinking is if we don't use it, let's not pass it and if/when we need it we can add it.

Now going back to the only user of this argument in lowering.
The lowering of batch normalization is the only user of that argument, but I believe we could avoid that use and that would be a better design. Basically, we lower BN differently between inference and training, whereas I think the right thing would be to have a different representation when we differentiate BN. In essence, what I am suggesting for this use case is to introduce a new node for mean and variance normalization (i.e., for what we special case with the training mode during lowering.) In other words, differentiation for BN would produce BNGrad(BN(meanVarNorm)) and that would completely eliminate the need for a special lowering.

All in all, as of today, I don't see a reason why we pass this mode.

What do people think?

Source

qcolombet

All 5 comments

Thanks for raising this issue @qcolombet. Let's talk about BatchNormalization. BN is implemented with conceptual 3 nodes: BN-inference, BN-train and BN-gradient. If I understand your suggestion, when we construct the inference graph (when loading C2 models, or when constructing our unit tests) we'll emit the BN-inference node. Later, during differentiation we'll insert the BN-gradient node, just like we do today, but also convert BN-inference into BN-train. Also, we would change the BN optimization to work on BN-inference and not on BN-train. Is this what you are suggesting? Generally, I like this direction. I think that it makes sense.

nadavrot on 23 Jul 2018

Hi @nadavrot,

In essence, yes that's what I am suggesting with one difference.
Instead of creating a BN-train node that would replace the BN-inference during differentiation, I was thinking of replacing BN with two nodes BN + meanVarNorm during differentiation.

The effects are the same, the difference is that we wouldn't have to change the optimizations.
I.e.,
Input:
BN-inference

Differentation: