Found by @VibhuJawa in https://github.com/dmlc/xgboost/issues/4911 :
```python
import cudf
import numpy as np

def minimal():
    X = cudf.DataFrame({'x': cudf.Series([0, 1, 2, None], dtype=np.int32)})
    print(X['x'].__cuda_array_interface__['mask'])

if __name__ == '__main__':
    minimal()
```
Here the mask object in the interface has a length of 4 bits instead of the 8 bytes required by https://arrow.apache.org/docs/format/Layout.html#null-bitmaps
cudf version: 0.9.0
Actually, even the check in XGBoost is not correct. We are checking for a multiple of 8 bits instead of a multiple of 8 bytes, since typestr `<t1` means 1 bit per field element. So the mask object should be at least 64 bits long.
I think the bug should include divisibility too, as the format prescribes that the length be divisible by 8 bytes, and the current XGBoost implementation seems to be checking for that too.
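To make the length arithmetic concrete, here is a quick sketch (an illustrative helper, not cuDF or XGBoost code) of the minimum bitmap size the Arrow layout calls for — one bit per element, padded up to a multiple of 8 bytes:

```python
def arrow_mask_nbytes(n_elements: int, pad: int = 8) -> int:
    """Minimum validity-bitmap size in bytes for n_elements values,
    padded up to a multiple of `pad` bytes (the Arrow layout prescribes
    8-byte multiples; 64 is recommended for alignment)."""
    raw = (n_elements + 7) // 8            # one validity bit per element
    return ((raw + pad - 1) // pad) * pad  # round up to the padding unit

# The 4-element column from the snippet above: only 4 bits of data,
# but the spec still calls for an 8-byte (64-bit) buffer.
print(arrow_mask_nbytes(4))     # -> 8
print(arrow_mask_nbytes(4012))  # -> 504
```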
For example, the following seems to fail too with the same exception:

```python
import cudf
import numpy as np
import xgboost

X = cudf.DataFrame({'x': cudf.Series([0, 1, 2, None] * 1_003, dtype=np.int32)})
y = cudf.Series([0] * len(X))
xgboost.DMatrix(X, y)
```
Output:

```
XGBoostError: [17:44:16] /workspace/src/data/columnar.h:145: Check failed: get<Integer>(j_shape.front()) % 8 == 0 (4 vs. 0) : Length of validity mask must be a multiple of 8 bytes.
```
And the bool mask is as follows:

```python
X['x'].__cuda_array_interface__['mask']
```

Output:

```
namespace(__cuda_array_interface__={'shape': (4012,), 'typestr': '<t1', 'data': (140344544985088, True), 'version': 1})
```
@kkraus14, would love to get your thoughts on this. Thanks!
@trivialfis and I connected offline to discuss this and ended up with about as many questions as answers... Both NumPy's `__array_interface__` and `__cuda_array_interface__` allow specifying a bit array by using the `t` typestr character; the integer following it specifies the number of bits in the field. For the mask we use `<t1` to indicate it's a single bit per value, and for the shape we give the number of relevant bits (the number of elements in the column). The problem with this is that the shape isn't representative of the underlying memory allocation (though technically the data's shape doesn't have to be representative either). `__cuda_array_interface__` is not Arrow specific, so it's not safe to assume 8-byte or 64-byte aligned allocations for all libraries using it. Additionally, say a new library comes along and wants to use the same interface as cuDF for an on-GPU conversion to DMatrix; it doesn't necessarily have to align to the Arrow spec either.
From my perspective, the safest and most compatible path forward for XGBoost would be to just round up to the nearest byte, i.e. for a shape of 13 round to 2 bytes, for 21 round to 3 bytes, etc., but this is obviously a completely non-ideal situation. The other option is to continue using it as is but have an implicit agreement that cuDF memory allocations will be 8-byte aligned, so for a shape of 13 or 21 round to 8 bytes, and for a shape of 65 round to 16 bytes. This presents other issues in GPU memory sharing with things like CuPy / Numba / PyTorch / etc., which aren't necessarily 8-byte aligned allocations.
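The two rounding rules described above can be sketched as follows (hypothetical helper names, just to make the arithmetic explicit):

```python
def mask_bytes_nearest(n_bits: int) -> int:
    """Option 1: round the bit count up to the nearest whole byte."""
    return (n_bits + 7) // 8

def mask_bytes_8b_aligned(n_bits: int) -> int:
    """Option 2: assume the producer pads the mask to 8-byte units."""
    return ((n_bits + 63) // 64) * 8

# The examples from the comment above:
print(mask_bytes_nearest(13), mask_bytes_nearest(21))        # -> 2 3
print(mask_bytes_8b_aligned(13), mask_bytes_8b_aligned(21))  # -> 8 8
print(mask_bytes_8b_aligned(65))                             # -> 16
```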
@shwina @RAMitchell @harrism any thoughts here?
cc @dantegd @jakirkham as well, since they have good thoughts / opinions about `__cuda_array_interface__`.
To me, the best fix would be asking the Arrow team to provide an explicit API for obtaining the underlying memory layout. Right now we have two workarounds:

- In XGBoost, since we use CUDA instead of AVX instructions, we can safely ignore the alignment requirement and parse whatever is specified by the interface. This would be a dirty fix and not good for the long term, since Arrow has growing GPU support and we might obtain a GPU dataframe from Arrow instead of directly from cuDF in the future.

More thoughts are welcome.
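Parsing the mask while ignoring alignment boils down to per-byte bit indexing, which only assumes 1-byte granularity. A rough Python sketch (the real XGBoost code is CUDA; the bit order shown is Arrow's LSB-first convention):

```python
import numpy as np

def is_valid(mask: np.ndarray, i: int) -> bool:
    """Read bit i of an LSB-first validity bitmap (the bit order Arrow
    and cuDF use). Only requires 1-byte alignment of the buffer."""
    return bool((mask[i >> 3] >> (i & 7)) & 1)

# Mask for [0, 1, 2, None]: bits 0-2 set, bit 3 clear -> 0b0111.
mask = np.array([0b00000111], dtype=np.uint8)
print([is_valid(mask, i) for i in range(4)])  # -> [True, True, True, False]
```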
This has an added benefit of potentially supporting non-Arrow-formatted objects in the future, while also supporting Arrow-formatted objects. Is the only difference handling the allocation alignment, where Arrow guarantees at least an 8-byte alignment?
It might be worth looking at how things like NumPy and CuPy's packbits and unpackbits handle this sort of thing.
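For what it's worth, NumPy's `packbits`/`unpackbits` only ever round to whole bytes, and `bitorder="little"` matches Arrow's LSB-first bit order:

```python
import numpy as np

# Validity flags for [0, 1, 2, None]; packbits pads to a whole byte.
valid = np.array([1, 1, 1, 0], dtype=np.uint8)
packed = np.packbits(valid, bitorder="little")
print(packed)         # -> [7]  (0b00000111: one byte holds the 4 bits)
print(packed.nbytes)  # -> 1   (rounded up to the nearest byte only)

# Unpacking yields a multiple of 8 flags; slice back to the logical length.
print(np.unpackbits(packed, bitorder="little")[:4])  # -> [1 1 1 0]
```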
This is just a question of padding, right? As the library consuming the bit mask, we don't know whether we can read the mask using 64-bit words or have to read it one byte at a time.
From XGBoost's perspective, assuming 1-byte alignment seems safe and would also work when reading inputs that are 8-byte aligned. Performance concerns are negligible.
From cuDF's perspective, it would perhaps be safest to allocate the memory 8-byte aligned, but you don't necessarily have to make this guarantee or advertise it.
I believe we already allocate the memory 64-byte aligned for masks, and I believe it will be 256-byte aligned with the `cudf::column` redesign. The problem is there's no way for us to advertise it using this interface. From XGBoost's perspective, assuming 1-byte alignment would be ideal, as it will open the door for other GPU libraries in the future that aren't necessarily 8/64/256-byte aligned, e.g. CuPy or PyTorch.
To state it more strongly: ALL device allocations created by `cudaMalloc*` are at least 256-byte aligned. However, suballocators must take care to maintain allocation alignment. If you use RMM, its suballocator (currently cnmem) aligns all pointers it returns at 512-byte granularity.
Building off what @harrism said, bitmask allocations in libcudf are _always_ 256B _aligned_, and the allocation is _padded_ to a multiple of 64B.
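As a sketch of the arithmetic described above (an illustrative helper, not actual libcudf code), the allocation size for a mask would be:

```python
def libcudf_mask_alloc_bytes(n_elements: int) -> int:
    """Bitmask allocation size per the comment above: one bit per
    element, padded to a multiple of 64 bytes. (The base pointer is
    separately guaranteed to be at least 256-byte aligned.)"""
    raw = (n_elements + 7) // 8      # one validity bit per element
    return ((raw + 63) // 64) * 64   # pad to a 64-byte multiple

print(libcudf_mask_alloc_bytes(4))     # -> 64
print(libcudf_mask_alloc_bytes(4012))  # -> 512
```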
> always at least 256B aligned
> The problem is there's no way for us to advertise it using this interface.

Yup. Is there any other way to advertise it?
> From xgboost's perspective assuming 1 byte alignment would be ideal

Just did.
For the current XGBoost solution, see: https://github.com/dmlc/xgboost/pull/4928/files