Glow: [GraphOptimizer] Remove unnecessary NHWC-NCHW transposes.

Created on 13 Sep 2018 · 7 Comments · Source: pytorch/glow

In Shufflenet we have unnecessary transposes between NCHW and NHWC. We transpose to NCHW, then do some shuffling of the C dimension (via Reshape-Transpose-Reshape), then transpose again back to NHWC. This is unnecessary -- we can remove the transposes between NCHW and NHWC, and then modify the Dims/Shuffle of the Reshape-Transpose-Reshape to shuffle the C dimension.

Such a subgraph from Shufflenet can be seen here:
[image: shufflenet_subgraph]

good first issue

jfix71

👍3

All 7 comments

@jackm321, this one would also be interesting: you'd get a sense of how high-level graph optimizations are implemented in Glow and implement one more. You'd also touch the model loader, as well as some debugging instrumentation such as how to dump graphs.

Let me know if you'd like to tackle this one.

rdzhabarov on 30 Oct 2018

👍3

shufflenetDAG.pdf
From looking at the whole graph this seems to happen a lot (16 times I think) so fixing this should remove 32 transposes

jackm321 on 8 Nov 2018

👍2 ❤1

To summarize a discussion between @jfix71, @rdzhabarov, and me: this seems a little bit difficult to solve in general. It would be nice if we could see a transpose in the graph, walk upwards through a series of single-output nodes (or even a DAG) to look for the opposite transpose (as in the first and last transposes in the picture above), and then remove both and transform the nodes in the middle appropriately. However, that final step is the hard part, because it requires additional knowledge about how to transform the specific nodes in the middle (or whether that is even possible for those nodes).
As an example, in the case here the next node after the first transpose is a reshape. We can tell which dimension is getting split up because all other dimensions match except for one of them, so we know how to modify this reshape if we were to remove the preceding transpose. However, if multiple dimensions were changed, this reasoning would not be possible. A general solution is therefore probably not straightforward, so we'll focus on the narrower problem depicted here: alternating transposes and reshapes where the external transposes together form the identity. To do this we will look for this exact pattern in the graph and replace it with the cheaper equivalent pattern. If someone sees a more general solution, please let me know and we can try that as well, but to begin with we'll just reproduce this specific problem in a unit test, fix it, and then try to generalize from there if possible.
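The "all other dimensions match except for one" check can be sketched roughly as below. This is a plain Python illustration rather than Glow code: `find_split_axis` is a hypothetical helper, and it assumes the reshape adds exactly one dimension by splitting a single axis into two adjacent ones.

```python
def find_split_axis(in_dims, out_dims):
    """Return (axis, (outer, inner)) if out_dims equals in_dims with exactly
    one axis split into two adjacent axes; return None otherwise.

    Hypothetical helper for illustration only, not part of Glow.
    """
    if len(out_dims) != len(in_dims) + 1:
        return None
    candidates = []
    for axis in range(len(in_dims)):
        # Dimensions before the candidate axis must match exactly.
        if list(in_dims[:axis]) != list(out_dims[:axis]):
            continue
        # Dimensions after it must match too, shifted by the extra axis.
        if list(in_dims[axis + 1:]) != list(out_dims[axis + 2:]):
            continue
        # The two new dimensions must multiply back to the original one.
        if out_dims[axis] * out_dims[axis + 1] == in_dims[axis]:
            candidates.append(axis)
    # Zero or several fitting axes means we cannot tell which one was split.
    if len(candidates) != 1:
        return None
    axis = candidates[0]
    return axis, (out_dims[axis], out_dims[axis + 1])


# NCHW input with C = 12 split into 3 groups of 4 channels.
split = find_split_axis([1, 12, 5, 5], [1, 3, 4, 5, 5])
# A reshape that changes several dimensions at once cannot be resolved.
unknown = find_split_axis([2, 6], [3, 2, 2])
```

Here `split` identifies axis 1 as the one being split, while `unknown` is `None`, which is the situation where the reasoning above breaks down.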

jackm321 on 9 Nov 2018

👍2

To try to understand a bit more about where this weird pattern came from, I looked at the shufflenet paper, and the pattern seen in the middle three nodes (reshape, transpose, reshape) is the Channel Shuffle Operation described on page 2 https://arxiv.org/pdf/1707.01083.pdf. In caffe2 this is one operator https://caffe2.ai/docs/operators-catalogue.html#channelshuffle but in glow it gets turned into 3 nodes (reshape, transpose, reshape) https://github.com/pytorch/glow/blob/8cb23a35632865a27c3dc2dd1194f84cef59adea/lib/Graph/Graph.cpp#L842 at model loading time (although this model is actually loaded from onnx and it is 3 operators there as well).
Because this is now 3 nodes instead of 1, and because there is no rule for sinking transposes below reshapes, the first and last transposes in the pattern from the screenshot above can't be eliminated even though they cancel each other out: they can't be moved next to each other.
The only mystery left is where these extra transposes before and after the ChannelShuffle nodes come from. They don't exist in the input graph, as seen here:
[image: screen shot 2018-11-12 at 10 20 30 am]
The answer is that they come from the fact that glow adds a NCHW2NHWC transpose before each conv node and a NHWC2NCHW transpose after each conv node https://github.com/pytorch/glow/blob/2bf2027e8268ab194371a718e741a97fa6a710d9/lib/Importer/ONNXModelLoader.cpp#L320. So the first transpose seen in this pattern is actually a NHWC2NCHW from a previous conv node that has been sunk down in the graph to this point and then gotten "stuck" here, and the last transpose is the NCHW2NHWC from the conv that follows the channel shuffle.
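To illustrate why these two transposes cancel once they are adjacent, here is a small Python sketch. The shuffle values and the convention (output dimension `i` takes input dimension `shuffle[i]`) are assumptions for illustration, and `compose` is a hypothetical helper, not a Glow function.

```python
# Assumed shuffle vectors: output dim i of the transpose is input dim shuffle[i].
NCHW2NHWC = [0, 2, 3, 1]
NHWC2NCHW = [0, 3, 1, 2]


def compose(first, second):
    """Return the single shuffle equivalent to applying `first`, then `second`."""
    # After `first`, dim j holds original dim first[j]; `second` then picks
    # dim second[i] of that result for its output dim i.
    return [first[j] for j in second]


def is_identity(shuffle):
    """True when the shuffle leaves every dimension in place."""
    return list(shuffle) == list(range(len(shuffle)))


# The pair around the channel shuffle would cancel if it were adjacent.
cancels = is_identity(compose(NHWC2NCHW, NCHW2NHWC))
# Two transposes in the same direction do not cancel.
same_direction = is_identity(compose(NCHW2NHWC, NCHW2NHWC))
```

With these definitions `cancels` is `True` and `same_direction` is `False`, so only the opposite pair can be dropped, and only once nothing layout-sensitive sits between them.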
Now that this pattern has been fully accounted for, we see that the natural solution is to "unstick" the transpose from above the channel shuffle operation so it can cancel out with the transpose below the pattern. To do this we can add a rule to GraphOptimizer that takes the (reshape, transpose, reshape) pattern as a whole unit and sinks transposes below it.
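As a sanity check on the idea, the sketch below compares the original pattern (NHWC2NCHW, shuffle on the channel axis, NCHW2NHWC) against shuffling the channel axis directly in NHWC. It is a toy model in plain Python on flat row-major lists, not Glow code; `transpose`, `channel_shuffle`, and the `groups`/`axis` parameters are hypothetical names, and the shuffle vectors are assumed as in the comments.

```python
from itertools import product

# Assumed shuffle vectors: output dim i of the transpose is input dim shuffle[i].
NCHW2NHWC = [0, 2, 3, 1]
NHWC2NCHW = [0, 3, 1, 2]


def transpose(data, dims, perm):
    """Transpose a flat row-major tensor; returns (new_data, new_dims)."""
    out_dims = [dims[p] for p in perm]
    # Row-major strides of the input tensor.
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]
    out = []
    # product() walks output indices in row-major order.
    for idx in product(*(range(d) for d in out_dims)):
        out.append(data[sum(idx[i] * strides[perm[i]] for i in range(len(perm)))])
    return out, out_dims


def channel_shuffle(data, dims, groups, axis):
    """Reshape-Transpose-Reshape shuffle of `axis`, split into `groups` groups."""
    split = dims[:axis] + [groups, dims[axis] // groups] + dims[axis + 1:]
    perm = list(range(len(split)))
    perm[axis], perm[axis + 1] = perm[axis + 1], perm[axis]
    # Both reshapes leave the flat row-major data untouched; only the
    # transpose in the middle moves elements.
    out, _ = transpose(data, split, perm)
    return out, list(dims)


dims_nhwc = [1, 2, 2, 6]
x = list(range(24))

# Original pattern: the shuffle is sandwiched between two layout transposes.
a, a_dims = transpose(x, dims_nhwc, NHWC2NCHW)
a, a_dims = channel_shuffle(a, a_dims, groups=3, axis=1)
a, a_dims = transpose(a, a_dims, NCHW2NHWC)

# Optimized pattern: no layout transposes, shuffle applied to the NHWC channel axis.
b, b_dims = channel_shuffle(x, dims_nhwc, groups=3, axis=3)

same = (a == b) and (a_dims == b_dims)
```

Both paths produce the same tensor here, which is what lets the rule drop the outer transposes and retarget the shuffle from axis 1 (C in NCHW) to axis 3 (C in NHWC).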

jackm321 on 12 Nov 2018

👍1

@jackm321 Thanks for the analysis! Overall I think it makes sense to pattern match this reshape-transpose-reshape pattern and sink the transpose below it.

However just to note, one alternative here could be to add a node for ChannelShuffle so that sinkCode() better understands what's happening -- it would see TransposeNHWC2NCHW-ChannelShuffle-TransposeNCHW2NHWC and more easily understand how to modify the ChannelShuffle's parameters. Later on we would lower ChannelShuffleNode to Reshape-Transpose-Reshape. Of course, adding more nodes adds complexity to other parts of the compiler, so it's not necessarily a better idea -- there's a tradeoff here.

jfix71 on 12 Nov 2018

@jfix71 I think this is a good point, and I thought about this approach as well (adding a ChannelShuffleNode). I feel like I don't have enough context yet to know which solution will be better overall. I think this specific GraphOptimization might be cleaner with a new node, but as you said it may add more complexity in other places. I currently have a working solution to this specific issue that doesn't involve adding a new ChannelShuffleNode, but if people think it would be better overall to have that new node, I can add it and implement the transpose sinking with that.

jackm321 on 12 Nov 2018

@jackm321 In this case I think it makes sense to not add a new node because we can somewhat easily determine what the ChannelShuffle's parameters were from the Reshape-Transpose-Reshape.
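For illustration, recovering the shuffle's parameters from a Reshape-Transpose-Reshape chain might look like the Python sketch below. It is not the actual Glow implementation: `recover_channel_shuffle` and the `(groups, axis)` result are hypothetical, and it only accepts the narrow shape of pattern discussed in this thread.

```python
def recover_channel_shuffle(in_dims, reshaped_dims, shuffle, out_dims):
    """Return (groups, axis) if the chain is a channel shuffle, else None.

    in_dims:       dims entering the first reshape
    reshaped_dims: dims produced by the first reshape
    shuffle:       shuffle of the middle transpose
    out_dims:      dims produced by the final reshape
    Hypothetical helper for illustration only.
    """
    # The final reshape must restore the original shape.
    if list(out_dims) != list(in_dims):
        return None
    if len(reshaped_dims) != len(in_dims) + 1 or len(shuffle) != len(reshaped_dims):
        return None
    # The transpose must swap exactly two adjacent dimensions.
    moved = [i for i, s in enumerate(shuffle) if s != i]
    if len(moved) != 2 or moved[1] != moved[0] + 1:
        return None
    axis = moved[0]
    if shuffle[axis] != axis + 1 or shuffle[axis + 1] != axis:
        return None
    # The first reshape must split exactly that axis and nothing else.
    groups, per_group = reshaped_dims[axis], reshaped_dims[axis + 1]
    expected = list(in_dims[:axis]) + [groups, per_group] + list(in_dims[axis + 1:])
    if list(reshaped_dims) != expected or groups * per_group != in_dims[axis]:
        return None
    return groups, axis


params = recover_channel_shuffle(
    [1, 12, 5, 5], [1, 3, 4, 5, 5], [0, 2, 1, 3, 4], [1, 12, 5, 5]
)
```

Once `groups` and `axis` are known, sinking a layout transpose below the chain only requires moving `axis` to where the channel dimension lives in the new layout and rebuilding the three nodes with the matching dims.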

jfix71 on 13 Nov 2018
