CPUs perform best when tensors are allocated at addresses that are multiples of 64 bytes. AVX-512 instructions operate on 64 bytes at a time, and memory access is more efficient when the data is aligned.
This code intends to align to a multiple of 64 bytes:
https://github.com/apache/incubator-mxnet/blob/4bb82245ee5fcbfd32da6461f7b0770ae3c2d9b6/src/storage/cpu_device_storage.h#L54-L56
However, the above code only controls the alignment of whole memory blocks, not of the allocations that the storage managers carve out of them. Commit 3ef00b8840c05c49118705f6fd9663ebb951f3a1 broke 64-byte alignment in the default storage manager. A check like this one fails on that commit:
CHECK_EQ(reinterpret_cast<intptr_t>(handle->dptr) % 64, 0);
Note that the same check on commit 2abf0b8c2b3361c73c9dfdeabdb8a88278b693d0 passes without error.
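To illustrate how this kind of failure can happen, here is a minimal sketch. It is not MXNet's storage manager: `BumpPool`, `RoundUp`, `IsAligned` and `kAlignment` are hypothetical names made up for this example. It assumes a pool that hands out consecutive slices of one 64-byte-aligned block, and shows that the slices stay aligned only if each requested size is rounded up to a multiple of 64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical constant: the alignment that 64-byte-wide loads prefer.
constexpr std::size_t kAlignment = 64;

// Round size up to the next multiple of alignment (must be a power of two).
inline std::size_t RoundUp(std::size_t size, std::size_t alignment = kAlignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (size + alignment - 1) & ~(alignment - 1);
}

// True if ptr is a multiple of alignment.
inline bool IsAligned(const void* ptr, std::size_t alignment = kAlignment) {
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Hypothetical bump allocator that slices up one aligned block.
// The block itself is 64-byte aligned, but that alone says nothing
// about the pointers handed out from the middle of it.
class BumpPool {
 public:
  // Returns the next slice, or nullptr if the pool is exhausted.
  // With round_up == false the next slice starts right after this one,
  // so it is aligned only if size happens to be a multiple of 64.
  void* Alloc(std::size_t size, bool round_up) {
    const std::size_t step = round_up ? RoundUp(size) : size;
    if (step > sizeof(buffer_) - offset_) return nullptr;
    void* ptr = buffer_ + offset_;
    offset_ += step;
    return ptr;
  }

 private:
  alignas(kAlignment) char buffer_[4096];
  std::size_t offset_ = 0;
};
```

With `round_up` set to `false`, two consecutive 100-byte requests return offsets 0 and 100, and the second pointer fails the alignment check. With `round_up` set to `true`, the offsets are 0 and 128, and both pass.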
cc @andrei5055 @pengzhao-intel
I will take a look at this
Thanks 👍
Fixed by #18885