Glow: Change the default format in which FC (and MatMul) stores weights?

Created on 2 Apr 2019 · 5 Comments · Source: pytorch/glow

When we load FC from C2 models, we automatically transpose the weights, so that we have A: M x K and W: K x N.

We use those layouts in the interpreter and in the CPU backend. But almost all of our HW backends (e.g. OpenCL) require that W be provided pre-transposed, so that the inputs are A: M x K and W: N x K, because this results in a much more efficient memory access pattern. To achieve that, each of those HW backends introduces its own custom version of FC, where the only difference is the transposed W.
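
To make the two layouts concrete, here is a minimal sketch in plain Python (not Glow code; the names `fc_kxn`, `fc_nxk` and `transpose` are made up for illustration, and the bias term is left out). Both functions compute the same product and differ only in how W is indexed.

```python
def fc_kxn(a, w):
    """A: M x K, W: K x N (the layout described for C2-loaded models)."""
    m_dim, k_dim, n_dim = len(a), len(w), len(w[0])
    return [[sum(a[m][k] * w[k][n] for k in range(k_dim))
             for n in range(n_dim)] for m in range(m_dim)]


def fc_nxk(a, wt):
    """A: M x K, W pre-transposed: N x K."""
    m_dim, n_dim, k_dim = len(a), len(wt), len(wt[0])
    # Each output element is a dot product of a row of A and a row of W.
    return [[sum(a[m][k] * wt[n][k] for k in range(k_dim))
             for n in range(n_dim)] for m in range(m_dim)]


def transpose(w):
    """Turn a K x N nested list into N x K."""
    return [list(col) for col in zip(*w)]


a = [[1, 2, 3], [4, 5, 6]]      # M = 2, K = 3
w = [[1, 0], [0, 1], [2, 2]]    # K = 3, N = 2
assert fc_kxn(a, w) == fc_nxk(a, transpose(w))
```

In the N x K form, every output element reads one row of A and one row of W front to back, which is the access pattern the HW backends above are asking for.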

It also seems that this pre-transposed W layout may be more efficient even on CPUs, because of a more efficient memory access pattern, but this is not clear.

Given all this, should we change the default for Glow's FC node and use pre-transposed Ws? This would allow us to get rid of redefining HW-backend-specific FC nodes and may even make our CPU FCs faster.

Opinions?

Remark: Besides transposing W, HW may also require HW-specific padding and alignment of each row of W. That cannot be naturally expressed using Glow's Tensors yet.
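
A rough sketch of what such row padding could look like, again in plain Python rather than Glow code. The helper `pad_rows` and the choice to express the alignment in elements (real hardware may specify bytes) are assumptions for illustration only.

```python
def pad_rows(wt, alignment):
    """Pad each row of an N x K matrix with zeros up to a multiple of
    `alignment` elements. Returns the flat buffer and the row stride."""
    k_dim = len(wt[0])
    stride = -(-k_dim // alignment) * alignment  # round K up to the alignment
    flat = []
    for row in wt:
        flat.extend(row)
        flat.extend([0] * (stride - k_dim))      # zero padding at row end
    return flat, stride


wt = [[1, 2, 3], [4, 5, 6]]      # N = 2, K = 3
flat, stride = pad_rows(wt, 4)
# Element (n, k) now lives at flat[n * stride + k], with stride != K.
assert flat[1 * stride + 2] == wt[1][2]
```

The logical shape stays N x K while the physical row stride becomes larger than K, and that mismatch between shape and stride is the part a plain dense tensor does not capture.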

opti-mix

All 5 comments

@tlepley @ayermolo can you weigh in?

rdzhabarov on 2 Apr 2019

"It also seems that this pre-transposed W layout may be more efficient even on CPUs, because of a more efficient memory access pattern, but this is not clear."

Transposed matrix multiplications are not more efficient on CPUs, due to the use of vectors. See this: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
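
One common way to picture the vectorization argument (this is an illustrative sketch, not the contents of the linked gist; the function name `fc_kxn_rowwise` is made up): with W stored K x N, the innermost loop can run over N, so each step multiplies one broadcast element of A by a contiguous run of W and accumulates into a contiguous run of the output. No reduction across vector lanes is needed, unlike the per-output dot product that the N x K layout leads to.

```python
def fc_kxn_rowwise(a, w):
    """A: M x K, W: K x N, written so the inner loop walks W contiguously."""
    m_dim, k_dim, n_dim = len(a), len(w), len(w[0])
    out = [[0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for k in range(k_dim):
            scalar = a[m][k]          # one element of A, broadcast
            row = w[k]                # contiguous row of W
            for n in range(n_dim):    # maps onto vector lanes
                out[m][n] += scalar * row[n]
    return out


a = [[1, 2, 3], [4, 5, 6]]      # M = 2, K = 3
w = [[1, 0], [0, 1], [2, 2]]    # K = 3, N = 2
result = fc_kxn_rowwise(a, w)
```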

nadavrot on 2 Apr 2019

"Transposed matrix multiplications are not more efficient on CPUs, due to the use of vectors. See this: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0"

OK, since this is the case, we could introduce a custom CPUFC for CPUs, which would do what we do today. This would be the exact opposite of the current situation, where the CPU backend is fine with the default weights format, but all other backends need to introduce custom FC operations ;-)
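
A toy sketch of that idea, with hypothetical classes `FC` and `CPUFC` and a hypothetical `lower_for_cpu` step (none of these are Glow's actual API): the canonical node keeps W as N x K, and only the CPU lowering transposes the constant weights, once, at compile time.

```python
from dataclasses import dataclass


@dataclass
class FC:
    """Hypothetical canonical node: weights stored N x K."""
    weights_nxk: list


@dataclass
class CPUFC:
    """Hypothetical CPU-only node: weights stored K x N."""
    weights_kxn: list


def lower_for_cpu(node):
    """Transpose the constant weights once, when lowering for the CPU."""
    return CPUFC(weights_kxn=[list(col) for col in zip(*node.weights_nxk)])


canonical = FC(weights_nxk=[[1, 0, 2], [0, 1, 2]])   # N = 2, K = 3
cpu_node = lower_for_cpu(canonical)
```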

opti-mix on 2 Apr 2019

Yes, I agree. This design question should not change performance, just the default canonical representation of things.

nadavrot on 2 Apr 2019

👍2

I think for us either way is fine. We do something custom anyway.
Thanks @rdzhabarov