Need to update the following locations:

- `bufarr.data`
- `chunk.data`

Our build just started failing.
cc @pandas-dev/pandas-core @pandas-dev/pandas-triage if anyone has insights.
Cython alpha was released today: https://pypi.org/project/Cython/#history
Locally I get the below error when building the extensions:
    buf = <float64_t *>arr.data
    bufarr = np.empty(win, dtype=float)
    oldbuf = <float64_t *>bufarr.data
    for i in range((win - offset), (N - offset)):
        buf = buf + 1
        bufarr.data = <char *>buf
                  ^
------------------------------------------------------------
pandas/_libs/window/aggregations.pyx:1385:18: Assignment to a read-only property
cc @scoder
My guess is this is the result of https://github.com/cython/cython/pull/3571
So IIUC assigning the data pointer has been deprecated since 2013. A quick fix might be to not use the Cython alpha in CI, but in any case we probably need to fix this.
Not sure there is a true replacement in the more modern API that allows for assigning the data pointer, at least according to the following mailing list thread:
https://mail.python.org/pipermail/numpy-discussion/2013-November/068172.html
ref https://github.com/cython/cython/issues/2498 and ping @rgommers as well
That looks like the one evil hack that smells a bit when you implement it but otherwise works perfectly, and that drops on your feet as soon as you've forgotten about it. You already CC-ed NumPy, they are the right people to ask here. CC-ing @mattip as well, who came up with the property wrapper for these fields for Cython.
I'm happy that you didn't ask for an upstream revert of the change. :)
The `dummy_buf` / `chunk.data` pointer swap in the reduction code is also very fishy. It seems to be at the root of some very strange behavior seen when groupby-apply is used in statsmodels/statsmodels#6603: the wrong data was ending up in the OLS regression because the result appears to hold the wrong buffer somehow. This does not seem to be completely safe in the case of a deferred operation.
Maybe rewrite that to use `bufarr = PyArray_SimpleNewFromData(1, &win, NPY_FLOAT64, buf)` instead of `bufarr.data = ...`? As for the line pointed to by @bashtage, that also needs a rewrite to be safe.
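To make that concrete, here is a rough, untested sketch of what the loop could look like with that approach (names like `arr`, `win`, `offset`, and `N` are taken from the snippet above; the buffer is float64, hence `NPY_FLOAT64`):

cimport numpy as cnp
from numpy cimport float64_t

cnp.import_array()  # must run before any NumPy C-API call

cdef cnp.npy_intp win_shape = win  # the dims argument must be npy_intp
cdef float64_t *buf = <float64_t *>cnp.PyArray_DATA(arr)

for i in range((win - offset), (N - offset)):
    buf = buf + 1
    # Wrap the advancing pointer in a fresh zero-copy ndarray instead of
    # mutating bufarr.data in place. The new array does not own its data,
    # so `arr` must stay alive for as long as `bufarr` is in use.
    bufarr = cnp.PyArray_SimpleNewFromData(1, &win_shape,
                                           cnp.NPY_FLOAT64, <void *>buf)

One trade-off: this allocates a new ndarray object on every iteration, whereas the old hack reused a single object, which is presumably part of the performance tension discussed below.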
The Reducer class was added in a commit in response to gh-309 in 2011. Perhaps it could be rewritten to use `np.nditer` from Python, or the Array Iterator from C if the former gives performance problems.
Of course you could just write `PyArray_BYTES(chunk) = arr.data` (using the C macro), although that would not solve any possible bugs or race conditions due to reusing chunk inappropriately.
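For reference, a toy illustration of the Python-level `np.nditer` API suggested above (not a Reducer rewrite; with `external_loop` the iterator hands back contiguous 1-d chunks as zero-copy views instead of scalars):

import numpy as np

arr = np.arange(12, dtype=np.float64).reshape(4, 3)

# external_loop yields 1-d inner-loop chunks (possibly the whole array at
# once for contiguous data) rather than one element at a time; each chunk
# is a view into arr's buffer, so nothing is copied.
with np.nditer(arr, flags=['external_loop'], order='C') as it:
    for chunk in it:
        print(chunk.sum())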
ok I think we need to pin the Cython version again. @alimcmaster1 would you mind doing a PR for this?
We should certainly address these older code paths. IIRC @jbrockmendel was almost able to remove this entire module, but has some perf issues.
If we don't fix these for 1.1.0 (or 1.0.4), we might consider putting an upper bound on Cython in our pyproject.toml: https://github.com/pandas-dev/pandas/blob/master/pyproject.toml. I think that's the only situation where this would affect users: people installing pandas from source after Cython 3.0 is released.
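Concretely, that would mean something like the following in the build-system table of pyproject.toml (illustrative version numbers, not a decided pin):

[build-system]
requires = [
    "setuptools",
    "wheel",
    # hypothetical upper bound: keep source builds on Cython 0.29.x
    # until the code paths in this issue are fixed for 3.0
    "Cython>=0.29.16,<3",
]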
> was almost able to remove this entire module, but has some perf issues.
Yah, we need to decide how big a perf penalty we're willing to accept in exchange for losing the maintenance burdens in this module.
In the case of `dummy_buf`, the question is whether the code path is correct for any functions that defer evaluation outside of the apply, which can result in incorrect results. This seems like a bigger issue.
Do you have a code sample to reproduce that issue? I can see if #34080 fixes it once it's green.
import statsmodels.api as sm
import pandas as pd
import numpy as np

np.random.seed(42)
Ys = pd.DataFrame(np.random.randn(4, 10))
regressors = np.random.randn(10, 2)

def regress1(s):
    return sm.OLS(s.values, regressors).fit()

results = pd.DataFrame(Ys).apply(regress1, axis='columns')
print([res.rsquared for res in results])

correct = [sm.OLS(Ys.T[col], regressors).fit().rsquared for col in Ys.T]
print(correct)
This returns:
[-inf, -inf, -inf, -inf]
[0.4362196642459433, 0.011922654209731376, 0.31130804443466864, 0.16468214939479164]
c:\git\statsmodels\statsmodels\regression\linear_model.py:1702: RuntimeWarning: divide by zero encountered in double_scalars
return 1 - self.ssr/self.uncentered_tss
The first set of results may vary, since the data seems to come from an uninitialized array (np.empty) that holds whatever garbage values were already in memory.
Not that simple, but it was how this issue was discovered.
What's the status here? We have a few places (can we compile a list in the original post? I'll start it) that we need to update before we can build pandas with Cython 3.x.
Is there anything that has to be done here for pandas 1.1.0, other than perhaps setting an upper bound for Cython in our pyproject.toml?
I think we need to invest some time to get things to work with Cython 3. Specifically this comment: https://github.com/pandas-dev/pandas/issues/34014#issuecomment-624401636. I started down that path in #34080 but never saw it through.
> Yah, we need to decide how big a perf penalty we're willing to accept in exchange for losing the maintenance burdens in this module.
Re-upping this. Just ripping out libreduction entirely, the affected ASVs are:
before after ratio
[e6e08890] [d03335bc]
<master> <cy30>
+ 4.98±0.03ms 42.5±0.5ms 8.54 groupby.Apply.time_scalar_function_single_col
+ 8.59±0.2ms 59.4±0.4ms 6.92 frame_methods.Apply.time_apply_ref_by_name
+ 17.9±0.1ms 113±0.7ms 6.32 groupby.Apply.time_scalar_function_multi_col
+ 144±1ms 466±20ms 3.23 groupby.MultiColumn.time_lambda_sum
+ 77.4±1ms 239±10ms 3.09 groupby.MultiColumn.time_col_select_lambda_sum
+ 276±2μs 534±4μs 1.93 groupby.GroupByMethods.time_dtype_as_group('int', 'std', 'transformation')
+ 277±1μs 535±6μs 1.93 groupby.GroupByMethods.time_dtype_as_group('int', 'std', 'direct')
+ 49.7±0.6ms 92.1±1ms 1.85 frame_methods.Iteration.time_iteritems_indexing
+ 270±2μs 495±1μs 1.83 groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'direct')
+ 277±2μs 507±2μs 1.83 groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'direct')
+ 279±0.7μs 507±2μs 1.82 groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'transformation')
+ 271±2μs 494±2μs 1.82 groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'transformation')
+ 267±2μs 421±2μs 1.58 groupby.GroupByMethods.time_dtype_as_field('float', 'std', 'transformation')
+ 267±0.7μs 419±4μs 1.57 groupby.GroupByMethods.time_dtype_as_field('float', 'std', 'direct')
+ 315±1ms 437±2ms 1.39 frame_methods.Nunique.time_frame_nunique
+ 129±0.4ms 175±4ms 1.36 frame_methods.Apply.time_apply_user_func
+ 749±6μs 983±3μs 1.31 groupby.GroupByMethods.time_dtype_as_group('int', 'sem', 'transformation')
+ 731±3μs 952±6μs 1.30 groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'transformation')
+ 731±2μs 947±30μs 1.30 groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'direct')
+ 964±20μs 1.24±0.01ms 1.28 arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('ne')
+ 751±2μs 960±6μs 1.28 groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'direct')
+ 10.00±0.05ms 12.7±0.1ms 1.27 frame_methods.Apply.time_apply_lambda_mean
+ 752±2μs 956±4μs 1.27 groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'transformation')
+ 11.2±0.08ms 13.9±0.2ms 1.24 frame_methods.Apply.time_apply_np_mean
+ 736±5μs 880±6μs 1.19 groupby.GroupByMethods.time_dtype_as_field('float', 'sem', 'direct')
+ 732±7μs 868±3μs 1.19 groupby.GroupByMethods.time_dtype_as_field('float', 'sem', 'transformation')
+ 809±40ns 938±100ns 1.16 index_cached_properties.IndexCache.time_is_monotonic('RangeIndex')
+ 2.17±0.2μs 2.46±0.2μs 1.13 index_cached_properties.IndexCache.time_is_all_dates('IntervalIndex')
+ 74.1±0.5ms 83.6±0.4ms 1.13 groupby.Groups.time_series_groups('int64_large')
+ 10.8±0.02ms 12.2±0.07ms 1.12 groupby.MultiColumn.time_col_select_numpy_sum
+ 812±20ns 903±30ns 1.11 index_cached_properties.IndexCache.time_is_monotonic('Int64Index')
+ 1.30±0.03μs 1.44±0.2μs 1.11 index_cached_properties.IndexCache.time_values('DatetimeIndex')
+ 1.93±0.05μs 2.14±0.1μs 1.11 index_cached_properties.IndexCache.time_inferred_type('MultiIndex')
+ 457±10ns 507±20ns 1.11 index_cached_properties.IndexCache.time_is_all_dates('RangeIndex')
+ 445±20ns 493±50ns 1.11 index_cached_properties.IndexCache.time_is_unique('RangeIndex')
+ 1.91±0.01ms 2.11±0.5ms 1.11 rolling.VariableWindowMethods.time_rolling('DataFrame', '1d', 'int', 'count')
+ 1.38±0.04μs 1.53±0.08μs 1.10 index_cached_properties.IndexCache.time_is_all_dates('MultiIndex')
+ 2.24±0.01ms 2.47±0.4ms 1.10 rolling.VariableWindowMethods.time_rolling('DataFrame', '1d', 'int', 'sum')
+ 2.26±0.01ms 2.48±0.4ms 1.10 rolling.VariableWindowMethods.time_rolling('DataFrame', '1h', 'int', 'sum')
+ 3.27±0.01ms 3.60±0.5ms 1.10 rolling.VariableWindowMethods.time_rolling('DataFrame', '1d', 'int', 'std')
- 45.9±0.5ms 41.3±0.3ms 0.90 io.csv.ReadCSVCategorical.time_convert_post
- 552±2μs 495±2μs 0.90 groupby.GroupByMethods.time_dtype_as_group('object', 'unique', 'direct')
- 554±1μs 493±2μs 0.89 groupby.GroupByMethods.time_dtype_as_group('object', 'unique', 'transformation')
- 14.8±0.08ms 13.1±0.3ms 0.89 groupby.Nth.time_frame_nth('float64')
- 6.07±0.04ms 5.33±0.05ms 0.88 stat_ops.SeriesMultiIndexOps.time_op(0, 'skew')
- 29.7±2ms 25.7±0.3ms 0.87 stat_ops.FrameMultiIndexOps.time_op(0, 'mad')
- 30.6±0.3ms 26.3±0.08ms 0.86 frame_methods.Dropna.time_dropna_axis_mixed_dtypes('any', 1)
- 2.17±0.3μs 1.86±0.1μs 0.86 index_cached_properties.IndexCache.time_inferred_type('CategoricalIndex')
- 5.39±0.03ms 4.59±0.03ms 0.85 index_object.SetOperations.time_operation('int', 'symmetric_difference')
- 516±20μs 434±9μs 0.84 arithmetic.NumericInferOps.time_subtract(<class 'numpy.int32'>)
- 6.74±0.2ms 5.53±0.2ms 0.82 stat_ops.FrameMultiIndexOps.time_op([0, 1], 'std')
- 20.7±0.5ms 16.6±0.3ms 0.80 stat_ops.FrameMultiIndexOps.time_op(0, 'skew')
- 7.75±0.06ms 6.15±0.04ms 0.79 stat_ops.FrameMultiIndexOps.time_op(0, 'sem')
- 1.28±0.01s 1.01±0.02s 0.79 groupby.Apply.time_copy_function_multi_col
- 518±3ms 403±3ms 0.78 groupby.Apply.time_copy_overhead_single_col
- 4.85±0.04ms 3.33±0.01ms 0.69 stat_ops.FrameMultiIndexOps.time_op(1, 'std')
- 4.90±0.02ms 3.32±0.01ms 0.68 stat_ops.FrameMultiIndexOps.time_op(0, 'std')
- 519±1ms 347±0.8ms 0.67 groupby.Transform.time_transform_lambda_max
- 188±2ms 104±2ms 0.55 groupby.TransformEngine.time_dataframe_cython
Update: the ASVs shown are for removing all of libreduction, instead of just Reducer.
Can't you use PyArray_New to create a zero-copy array from an existing array without reassigning the data pointer?
@mroeschke any idea for ways to avoid the problem-causing pattern in aggregations.pyx?
Not immediately. It looks like it affects `rolling.apply`, but I can try to play around with it.
AFAICT this code

cdef move(self, int start, int end):
    """
    For slicing
    """
    self.buf.data = self.values.data + self.stride * start
    self.buf.shape[0] = end - start

should be equivalent (except for the creation of a new ndarray object) to

cdef move(self, int start, int end):
    self.buf = self.values[start:end]

but when I try this I get a bunch of test failures and a segfault that I can't reproduce in isolation. Am I misunderstanding what move is doing?
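One guess at the difference: the pointer hack mutates the existing `buf` object in place, so any other code holding a reference to `buf` (e.g. the cached objects mentioned further down) sees the new window, whereas slicing rebinds `self.buf` to a brand-new ndarray and leaves stale references pointing at the old one. A pure-numpy illustration of the rebinding half:

import numpy as np

values = np.arange(10.0)
buf = values[0:3]
alias = buf          # simulates another object caching a reference to buf

buf = values[2:5]    # rebinding creates a new ndarray object
print(alias)         # [0. 1. 2.] -- alias still sees the old window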
Is this a blocker for 1.1?
I don’t think it needs to block a release as it’s a development dependency, though obviously one we want to keep current
I think if we don't update this for 1.1, then we'll want to set an upper bound on Cython in our pyproject.toml so that users can compile pandas 1.1 from source after Cython 3.0 is released.
To advocate for fixing sooner rather than later, in addition to being a technical issue for future Cython, it is also a bug when the function being applied has any deferred operations.
@bashtage, attempts so far have not gone great (xref #34997); extra eyeballs may help out.
Here is one example that does not set the data pointer in the move method.
cdef class Slider:
    cdef:
        ndarray values, buf
        Py_ssize_t stride
        char *orig_data

    def __init__(self, ndarray values, ndarray buf):
        assert values.ndim == 1
        assert values.dtype == buf.dtype
        if not values.flags.contiguous:
            values = values.copy()
        self.values = values.copy()  # the original values would be modified if not copied
        self.buf = buf
        self.stride = values.strides[0]
        self.orig_data = self.buf.data
        self.buf.data = self.values.data
        self.buf.strides[0] = self.stride

    cdef move(self, int start, int end):
        # self.buf.data = self.values.data + self.stride * start
        self.buf.shape[0] = end - start  # length must be set first
        self.buf[:] = self.values[start:end]

    cdef reset(self):
        self.buf.data = self.orig_data
        self.buf.shape[0] = 0
The two pointer-setting statements in `__init__` and `reset` cannot be removed; otherwise there are segfaults and errors from freeing the wrong memory. Somehow, we have to keep the original `buf`. It won't work if we do something like `self.buf = buf.copy()` or `self.buf = self.values[start:end]`.
Unfortunately I don't know enough about Cython and the other parts of the code to figure this out. We might need to look at how the Slider's `buf` is used to find another way to do this.
I spent some more time looking at what this is doing today and came up with the following notes (I apologize if I am saying really obvious things)
The Slider and BlockSlider classes implement views into a numpy array by re-pointing the data pointer (and adjusting the shape and strides) of a pre-allocated "buffer" array at sub-sections of the original array's memory. The mutated array is then used to update "cached objects" in their calling class by updating the pandas-side block manager details. I suspect that this is the source of the statsmodels issues mentioned above, as the code is aggressively changing things underneath the eventual user-exposed objects.
The change that has broken things is that Cython now disallows replacing the guts of a numpy array (which seems fair!). My guess is that the performance gains come both from not thrashing memory and from not falling back to the Python layer. The Cython docs say that when you do `[]` on a numpy array it falls back to Python (I assume because the inputs are too variable), which is probably the source of the major performance regressions.
I am not super clear on how the numpy `nditer` interface works, but it looks like it is focused on getting an iterator over single elements, or at least fixed steps through the array, whereas for this code we need iteration over variable-size windows.
It looks like the way to do this is with memoryviews (https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html#efficient-indexing-with-memoryviews), but those seem to require knowing what the type is up front.
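For illustration, a minimal sketch of the typed-memoryview style (assuming float64 data, which is exactly the up-front type commitment noted above; `window_sums` is a made-up example, not pandas code):

import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def window_sums(double[::1] values, Py_ssize_t win):
    # Indexing a typed memoryview stays at the C level (no Python-layer
    # fallback), but the element type (double) must be declared up front.
    cdef Py_ssize_t i, j, n = values.shape[0]
    cdef double total
    out = np.empty(n - win + 1, dtype=np.float64)
    cdef double[::1] out_view = out
    for i in range(n - win + 1):
        total = 0.0
        for j in range(i, i + win):
            total += values[j]
        out_view[i] = total
    return out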
My suspicion is that the right solution here is to do something like what @mattip suggested above: in `move`, use the pointers we have to the underlying data and fabricate new numpy arrays of just the sub-section that is needed.
These classes appear to be used only internally to the reduction module, so I do not think there are any back-compatibility concerns with completely re-writing them.
attn @scoder for guidance on which of these methods (or one I do not see) is the best path.
also, I think this should be milestoned to an actual release.
> also, I think this should be milestoned to an actual release.
Without a dedicated resource to do this, it's impossible to actually say it will be in a particular release.
> also, I think this should be milestoned to an actual release.
This would certainly be nice.
Is there any indication that Cython will be releasing 3.0 anytime soon? The mailing list has been quiet; I haven't been following the issue tracker recently.
I still think #34997 was a reasonable approach, but I never got it working. More eyeballs may be helpful.
IIRC one of the hangups was that the non-Cython path behaves slightly differently from the Cython path when it comes to unwrapping results at each step in the loop. If that were resolved, we would at least have the option of ripping out the Cython version altogether.