I'm currently implementing some algorithms on top of the IntegerArray class with numba. For that, I need to pass in the two separate backing NumPy arrays instead of the pandas class. Using series.to_numpy() isn't helpful in this case, as it returns an object-typed array. For now, I'll keep using series.values._data and series.values._mask, but I'm aware that using private variables is not a long-term solution.
Given that series.values._data may be undefined where the mask is True, I know that returning the raw data may be a bit controversial, but it is still needed. The accessor for it should therefore be named carefully.
Yeah, but we have two optimizations planned that make this a bit tricky:
1. Making ._mask optional. If there are no missing values then we can avoid allocating the array to store it.
2. Making ._mask a bitmask rather than a bytemask.
series.values._mask
Just an FYI, series.array.isna() will give you _mask through a public API. Doesn't help you with ._data though :)
Actually... can you use a combination of these two?
data = array.to_numpy(na_value=0)
mask = array.isna()
I think that's 0-copy access to both components...
Edit: no it's definitely not zero-copy, since we insert na_value for the NAs.
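For reference, the combination above can be sketched as follows. The explicit dtype="int64" is an addition here (so that the result is not object-typed), and the sample values are made up. As noted, this copies the data rather than exposing the backing buffer.

```python
import numpy as np
import pandas as pd

# Made-up sample data with one missing value.
array = pd.array([1, None, 3], dtype="Int64")

# Public API only: fill missing slots with a placeholder and fetch the mask.
# Passing dtype explicitly is an assumption added here to get an integer array.
data = array.to_numpy(dtype="int64", na_value=0)
mask = array.isna()  # True where the value is missing
```

The placeholder 0 carries no meaning; any consumer still has to consult mask before reading data.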
Thank you for the detailed answer!
I think array.isna() is a pretty decent interface for me, as it will stay stable with regard to the future performance optimization 2. When 2. is implemented, having a public interface to retrieve the bitmask would also be nice, but at least this way my code won't break on implementation changes (it will just get a bit slower).
One more question @xhochy: do you want zero-copy access to the NumPy arrays, or would a pyarrow Array suffice? IntegerArray & BooleanArray do implement __arrow_array__, so you can get zero-copy access to it in a way that won't change if / when we update our implementation.
cc @jorisvandenbossche if you have thoughts.
Note that array.isna() also gives a copy of the mask, so not necessarily optimal.
The __arrow_array__ conversion is indeed zero-copy for the data, but not for the mask, since that gets converted into a bitmask. If you need it as a bitmask, that might be fine, but if you actually need a numpy boolean mask for numba, that would give a double conversion bytemask->bitmask->bytemask.
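To make the double conversion concrete, here is a minimal NumPy-only sketch of the round trip. It assumes Arrow's validity-bitmap convention (least-significant-bit ordering, a set bit means "valid"), which is the inverse of the pandas mask, where True means "missing". The sample mask is made up, and this is not what pandas or pyarrow run internally.

```python
import numpy as np

# Made-up pandas-style bytemask: True marks a missing value.
mask = np.array([False, True, False, False, True,
                 False, False, False, True, False])

# bytemask -> bitmask: invert (Arrow stores validity) and pack 8 flags per byte.
validity_bits = np.packbits(~mask, bitorder="little")

# bitmask -> bytemask: unpack, drop the padding bits, invert back.
roundtrip = ~np.unpackbits(
    validity_bits, count=mask.size, bitorder="little"
).astype(bool)
```

Both directions allocate a new array, which is the extra cost being described.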
But, long term it would indeed be good to have some more official way to access this.
We could add a return_mask=True/False option to to_numpy()?
Note that array.isna() also gives a copy of the mask,
Actually, we don't:
but that seems like a bug to me. As a user, mutating that BooleanArray afterwards should never update the original array.
For a bit of context: contrary to most things I post here on the issue tracker, Arrow isn't involved in this, only pandas + numba. For me, using the __arrow_array__ interface would be on the same level as just accessing the private members of the array. I'm interested in converting some existing computations, which so far were implemented with numba on NumPy float + int arrays, to also support the new integer extension type.
I would like to avoid copies if possible, so I would adapt these computations to use the mask as pandas implements it. As long as it is a bytemask, I'd take that, and when the internal implementation switches, I'd take the bitmask.
Taking a step back, an alternative for this use case would also be to provide an interface in pandas Series.apply_nonnull(numba_func_that_works_on_scalars) like it's done in Rolling.apply. This would solve my problem of having access to the raw arrays while removing the need for the user to check for nulls on byte- or bitmask. Especially as in the case of bitmask, I have not yet found a simple and performant numba access pattern.
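A rough sketch of what such an interface could do, written as a free function. apply_nonnull is hypothetical (pandas has no such method), the plain Python loop stands in for a numba-compiled one, and the sketch assumes an Int64 Series and a function that returns integers.

```python
import numpy as np
import pandas as pd

def apply_nonnull(series, func):
    """Hypothetical helper: apply a scalar func to non-missing entries only."""
    arr = series.array
    mask = arr.isna().copy()                         # True where missing
    out = arr.to_numpy(dtype="int64", na_value=0)    # fresh copy, safe to overwrite
    for i in range(len(out)):
        if not mask[i]:
            out[i] = func(out[i])
    # Rebuild a masked integer array so missing entries stay missing.
    return pd.Series(pd.arrays.IntegerArray(out, mask), index=series.index)

s = pd.Series([1, None, 3], dtype="Int64")
result = apply_nonnull(s, lambda x: x * 10)
```

The caller never touches the mask, so the same user function would keep working whether the mask is stored as a bytemask or a bitmask.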
We've just upgraded cuDF to Pandas 1.0+ and we'd really love an API to get the buffers underneath IntegerArray / BooleanArray / etc. classes zero copy regardless of whether the mask is a bitmask or a bytemask.
For cuDF, if we're given a bytemask, we'd want to condense it down to a bitmask, but we'd want to do that on the GPU as opposed to the CPU.
cc @brandon-b-miller
I think the best we can do for now is say "use ._mask and ._data with the caveats that"
- ._mask is not None (to protect against it being optional in the future)
- _mask.size / itemsize is what you want (to protect against it being a bitmask in the future)
And we'll agree among friends that pandas won't change the names?
We could offer a public .mask and .data but I'm not sure what value those would have over just using the private versions, since we know that the implementation is going to change anyway.
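Those caveats could be wrapped in a small guard like the sketch below. masked_buffers is a hypothetical helper name, and it relies on the private ._data and ._mask attributes, so it may break with any pandas release. The None and non-bytemask branches describe possible future layouts, not current behaviour.

```python
import numpy as np
import pandas as pd

def masked_buffers(arr):
    """Hypothetical helper: return (data, bytemask) from a masked pandas array."""
    data = arr._data   # private attribute, not a stable API
    mask = arr._mask   # private attribute, not a stable API
    if mask is None:
        # Possible future layout: no mask allocated when nothing is missing.
        mask = np.zeros(len(data), dtype=bool)
    elif mask.dtype != np.bool_ or mask.size != data.size:
        # Possible future layout: a packed bitmask instead of a bytemask.
        raise NotImplementedError("mask is not a one-byte-per-value boolean array")
    return data, mask

arr = pd.array([1, None, 3], dtype="Int64")
data, mask = masked_buffers(arr)
```

Values in data at positions where mask is True are undefined and should not be read.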
Thanks @TomAugspurger! We'll use ._data and ._mask for now.
FWIW on the cuDF side we have similar needs for this (to hand off to things like numba kernels / cupy functions / custom user shenanigans), and with bitmasks it becomes tricky to efficiently handle typical zero-copy operations like slicing. What we've done is have .data / .mask nullify the offset by doing the pointer arithmetic / copy construction (if needed) respectively, but also provide .base_data and .base_mask for when the underlying allocation needs to be accessed zero-copy without handling the offset.