Need to update the following locations:

- `bufarr.data`
- `chunk.data`

Our build just started failing.
cc @pandas-dev/pandas-core @pandas-dev/pandas-triage if anyone has insights.
Cython alpha was released today: https://pypi.org/project/Cython/#history
Locally I get the below error when building the extensions:
    buf = <float64_t *>arr.data
    bufarr = np.empty(win, dtype=float)
    oldbuf = <float64_t *>bufarr.data
    for i in range((win - offset), (N - offset)):
        buf = buf + 1
        bufarr.data = <char *>buf
                  ^
------------------------------------------------------------
pandas/_libs/window/aggregations.pyx:1385:18: Assignment to a read-only property
cc @scoder
My guess is this is the result of https://github.com/cython/cython/pull/3571
So IIUC assigning the data pointer has been deprecated since 2013. A quick fix might be to not use the Cython alpha in CI, but in any case we probably need to fix this.
Not sure there is a true replacement in the more modern API that allows for assigning the data pointer, at least according to the following mailing list thread:
https://mail.python.org/pipermail/numpy-discussion/2013-November/068172.html
ref https://github.com/cython/cython/issues/2498 and ping @rgommers as well
That looks like the one evil hack that smells a bit when you implement it but otherwise works perfectly, and that drops on your feet as soon as you've forgotten about it. You already CC-ed NumPy, they are the right people to ask here. CC-ing @mattip as well, who came up with the property wrapper for these fields for Cython.
I'm happy that you didn't ask for an upstream revert of the change. :)
The `dummy_buf` / `chunk.data` pointer swap in the reduction code is also very fishy. It seems to be at the root of some very strange behavior seen when groupby-apply is used in statsmodels/statsmodels#6603: the wrong data was ending up in the OLS regression because the result appears to hold the wrong buffer somehow. This does not seem to be completely safe in the case of a deferred operation.
Maybe rewrite that to use `bufarr = PyArray_SimpleNewFromData(1, &win, NPY_FLOAT64, buf)` instead of `bufarr.data = ...`? As for the line pointed to by @bashtage, that also needs a rewrite to be safe.
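To make that concrete, here is a rough, untested sketch of what the loop could look like with that approach (names like `arr`, `win`, `offset`, and `N` are taken from the snippet above; the buffer is float64, hence `NPY_FLOAT64`):

cimport numpy as cnp
from numpy cimport float64_t

cnp.import_array()  # must run before any NumPy C-API call

cdef cnp.npy_intp win_shape = win  # the dims argument must be npy_intp
cdef float64_t *buf = <float64_t *>cnp.PyArray_DATA(arr)

for i in range((win - offset), (N - offset)):
    buf = buf + 1
    # Wrap the advancing pointer in a fresh zero-copy ndarray instead of
    # mutating bufarr.data in place. The new array does not own its data,
    # so `arr` must stay alive for as long as `bufarr` is in use.
    bufarr = cnp.PyArray_SimpleNewFromData(1, &win_shape,
                                           cnp.NPY_FLOAT64, <void *>buf)

One trade-off: this allocates a new ndarray object on every iteration, whereas the old hack reused a single object, which is presumably part of the performance tension discussed below.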
The Reducer class was added in a commit in response to gh-309 in 2011. Perhaps it could be rewritten to use `np.nditer` from Python, or the Array Iterator from C if the former gives performance problems.
Of course you could just write `PyArray_BYTES(chunk) = arr.data` (using the C macro), although that would not solve any possible bugs or race conditions due to reusing chunk inappropriately.
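For reference, a toy illustration of the Python-level `np.nditer` API suggested above (not a Reducer rewrite; with `external_loop` the iterator hands back contiguous 1-d chunks as zero-copy views instead of scalars):

import numpy as np

arr = np.arange(12, dtype=np.float64).reshape(4, 3)

# external_loop yields 1-d inner-loop chunks (possibly the whole array at
# once for contiguous data) rather than one element at a time; each chunk
# is a view into arr's buffer, so nothing is copied.
with np.nditer(arr, flags=['external_loop'], order='C') as it:
    for chunk in it:
        print(chunk.sum())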
ok I think we need to pin the Cython version again. @alimcmaster1 would you mind doing a PR for this?
We should certainly address these older code paths. IIRC @jbrockmendel was almost able to remove this entire module, but has some perf issues.
If we don't fix these for 1.1.0 (or 1.0.4), we might consider putting an upper bound on Cython in our pyproject.toml: https://github.com/pandas-dev/pandas/blob/master/pyproject.toml. I think that's the only situation where this would affect users: people installing pandas from source after Cython 3.0 is released.
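Concretely, that would mean something like the following in the build-system table of pyproject.toml (illustrative version numbers, not a decided pin):

[build-system]
requires = [
    "setuptools",
    "wheel",
    # hypothetical upper bound: keep source builds on Cython 0.29.x
    # until the code paths in this issue are fixed for 3.0
    "Cython>=0.29.16,<3",
]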
> was almost able to remove this entire module, but has some perf issues.
Yah, we need to decide how big a perf penalty we're willing to accept in exchange for losing the maintenance burdens in this module.
In the case of `dummy_buf`, the question is whether the code path is correct for any functions that defer evaluation outside of the apply, which can result in incorrect results. This seems like a bigger issue.
Do you have a code sample to reproduce that issue? I can see if #34080 fixes it once it's green.
import statsmodels.api as sm
import pandas as pd
import numpy as np

np.random.seed(42)
Ys = pd.DataFrame(np.random.randn(4, 10))
regressors = np.random.randn(10, 2)

def regress1(s):
    return sm.OLS(s.values, regressors).fit()

results = pd.DataFrame(Ys).apply(regress1, axis='columns')
print([res.rsquared for res in results])

correct = [sm.OLS(Ys.T[col], regressors).fit().rsquared for col in Ys.T]
print(correct)
This returns:
[-inf, -inf, -inf, -inf]
[0.4362196642459433, 0.011922654209731376, 0.31130804443466864, 0.16468214939479164]
c:\git\statsmodels\statsmodels\regression\linear_model.py:1702: RuntimeWarning: divide by zero encountered in double_scalars
return 1 - self.ssr/self.uncentered_tss
The first set of results may vary, since the data seems to come from an uninitialized array (np.empty) that holds whatever garbage values were already in memory.
Not that simple, but it was how this issue was discovered.
What's the status here? We have a few places (can we compile a list in the original post? I'll start it) that we need to update before we can build pandas with Cython 3.x.
Is there anything that has to be done here for pandas 1.1.0, other than perhaps setting an upper bound for Cython in our pyproject.toml?
I think we need to invest some time to get things to work with Cython 3. Specifically this comment: https://github.com/pandas-dev/pandas/issues/34014#issuecomment-624401636. I started down that path in #34080 but never saw it through.
> Yah, we need to decide how big a perf penalty we're willing to accept in exchange for losing the maintenance burdens in this module.
Re-upping this. Just ripping out libreduction entirely, the affected ASVs are:
before after ratio
[e6e08890] [d03335bc]
<master> <cy30>
+ 4.98±0.03ms 42.5±0.5ms 8.54 groupby.Apply.time_scalar_function_single_col
+ 8.59±0.2ms 59.4±0.4ms 6.92 frame_methods.Apply.time_apply_ref_by_name
+ 17.9±0.1ms 113±0.7ms 6.32 groupby.Apply.time_scalar_function_multi_col
+ 144±1ms 466±20ms 3.23 groupby.MultiColumn.time_lambda_sum
+ 77.4±1ms 239±10ms 3.09 groupby.MultiColumn.time_col_select_lambda_sum
+ 276±2μs 534±4μs 1.93 groupby.GroupByMethods.time_dtype_as_group('int', 'std', 'transformation')
+ 277±1μs 535±6μs 1.93 groupby.GroupByMethods.time_dtype_as_group('int', 'std', 'direct')
+ 49.7±0.6ms 92.1±1ms 1.85 frame_methods.Iteration.time_iteritems_indexing
+ 270±2μs 495±1μs 1.83 groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'direct')
+ 277±2μs 507±2μs 1.83 groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'direct')
+ 279±0.7μs 507±2μs 1.82 groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'transformation')
+ 271±2μs 494±2μs 1.82 groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'transformation')
+ 267±2μs 421±2μs 1.58 groupby.GroupByMethods.time_dtype_as_field('float', 'std', 'transformation')
+ 267±0.7μs 419±4μs 1.57 groupby.GroupByMethods.time_dtype_as_field('float', 'std', 'direct')
+ 315±1ms 437±2ms 1.39 frame_methods.Nunique.time_frame_nunique
+ 129±0.4ms 175±4ms 1.36 frame_methods.Apply.time_apply_user_func
+ 749±6μs 983±3μs 1.31 groupby.GroupByMethods.time_dtype_as_group('int', 'sem', 'transformation')
+ 731±3μs 952±6μs 1.30 groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'transformation')
+ 731±2μs 947±30μs 1.30 groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'direct')
+ 964±20μs 1.24±0.01ms 1.28 arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('ne')
+ 751±2μs 960±6μs 1.28 groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'direct')
+ 10.00±0.05ms 12.7±0.1ms 1.27 frame_methods.Apply.time_apply_lambda_mean
+ 752±2μs 956±4μs 1.27 groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'transformation')
+ 11.2±0.08ms 13.9±0.2ms 1.24 frame_methods.Apply.time_apply_np_mean
+ 736±5μs 880±6μs 1.19 groupby.GroupByMethods.time_dtype_as_field('float', 'sem', 'direct')
+ 732±7μs 868±3μs 1.19 groupby.GroupByMethods.time_dtype_as_field('float', 'sem', 'transformation')
+ 809±40ns 938±100ns 1.16 index_cached_properties.IndexCache.time_is_monotonic('RangeIndex')
+ 2.17±0.2μs 2.46±0.2μs 1.13 index_cached_properties.IndexCache.time_is_all_dates('IntervalIndex')
+ 74.1±0.5ms 83.6±0.4ms 1.13 groupby.Groups.time_series_groups('int64_large')
+ 10.8±0.02ms 12.2±0.07ms 1.12 groupby.MultiColumn.time_col_select_numpy_sum
+ 812±20ns 903±30ns 1.11 index_cached_properties.IndexCache.time_is_monotonic('Int64Index')
+ 1.30±0.03μs 1.44±0.2μs 1.11 index_cached_properties.IndexCache.time_values('DatetimeIndex')
+ 1.93±0.05μs 2.14±0.1μs 1.11 index_cached_properties.IndexCache.time_inferred_type('MultiIndex')
+ 457±10ns 507±20ns 1.11 index_cached_properties.IndexCache.time_is_all_dates('RangeIndex')
+ 445±20ns 493±50ns 1.11 index_cached_properties.IndexCache.time_is_unique('RangeIndex')
+ 1.91±0.01ms 2.11±0.5ms 1.11 rolling.VariableWindowMethods.time_rolling('DataFrame', '1d', 'int', 'count')
+ 1.38±0.04μs 1.53±0.08μs 1.10 index_cached_properties.IndexCache.time_is_all_dates('MultiIndex')
+ 2.24±0.01ms 2.47±0.4ms 1.10 rolling.VariableWindowMethods.time_rolling('DataFrame', '1d', 'int', 'sum')
+ 2.26±0.01ms 2.48±0.4ms 1.10 rolling.VariableWindowMethods.time_rolling('DataFrame', '1h', 'int', 'sum')
+ 3.27±0.01ms 3.60±0.5ms 1.10 rolling.VariableWindowMethods.time_rolling('DataFrame', '1d', 'int', 'std')
- 45.9±0.5ms 41.3±0.3ms 0.90 io.csv.ReadCSVCategorical.time_convert_post
- 552±2μs 495±2μs 0.90 groupby.GroupByMethods.time_dtype_as_group('object', 'unique', 'direct')
- 554±1μs 493±2μs 0.89 groupby.GroupByMethods.time_dtype_as_group('object', 'unique', 'transformation')
- 14.8±0.08ms 13.1±0.3ms 0.89 groupby.Nth.time_frame_nth('float64')
- 6.07±0.04ms 5.33±0.05ms 0.88 stat_ops.SeriesMultiIndexOps.time_op(0, 'skew')
- 29.7±2ms 25.7±0.3ms 0.87 stat_ops.FrameMultiIndexOps.time_op(0, 'mad')
- 30.6±0.3ms 26.3±0.08ms 0.86 frame_methods.Dropna.time_dropna_axis_mixed_dtypes('any', 1)
- 2.17±0.3μs 1.86±0.1μs 0.86 index_cached_properties.IndexCache.time_inferred_type('CategoricalIndex')
- 5.39±0.03ms 4.59±0.03ms 0.85 index_object.SetOperations.time_operation('int', 'symmetric_difference')
- 516±20μs 434±9μs 0.84 arithmetic.NumericInferOps.time_subtract(<class 'numpy.int32'>)
- 6.74±0.2ms 5.53±0.2ms 0.82 stat_ops.FrameMultiIndexOps.time_op([0, 1], 'std')
- 20.7±0.5ms 16.6±0.3ms 0.80 stat_ops.FrameMultiIndexOps.time_op(0, 'skew')
- 7.75±0.06ms 6.15±0.04ms 0.79 stat_ops.FrameMultiIndexOps.time_op(0, 'sem')
- 1.28±0.01s 1.01±0.02s 0.79 groupby.Apply.time_copy_function_multi_col
- 518±3ms 403±3ms 0.78 groupby.Apply.time_copy_overhead_single_col
- 4.85±0.04ms 3.33±0.01ms 0.69 stat_ops.FrameMultiIndexOps.time_op(1, 'std')
- 4.90±0.02ms 3.32±0.01ms 0.68 stat_ops.FrameMultiIndexOps.time_op(0, 'std')
- 519±1ms 347±0.8ms 0.67 groupby.Transform.time_transform_lambda_max
- 188±2ms 104±2ms 0.55 groupby.TransformEngine.time_dataframe_cython
Update: the ASVs shown are for removing all of libreduction, instead of just Reducer.
Can't you use PyArray_New to create a zero-copy array from an existing array without reassigning the data pointer?
@mroeschke any idea for ways to avoid the problem-causing pattern in aggregations.pyx?
Not immediately. It looks like it affects `rolling.apply`, but I can try to play around with it.
AFAICT this code

cdef move(self, int start, int end):
    """
    For slicing
    """
    self.buf.data = self.values.data + self.stride * start
    self.buf.shape[0] = end - start

should be equivalent (except for the creation of a new ndarray object) to

cdef move(self, int start, int end):
    self.buf = self.values[start:end]

but when I try this I get a bunch of test failures and a segfault that I can't reproduce in isolation. Am I misunderstanding what move is doing?
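One guess at the difference: the pointer hack mutates the existing `buf` object in place, so any other code holding a reference to `buf` (e.g. the cached objects mentioned further down) sees the new window, whereas slicing rebinds `self.buf` to a brand-new ndarray and leaves stale references pointing at the old one. A pure-numpy illustration of the rebinding half:

import numpy as np

values = np.arange(10.0)
buf = values[0:3]
alias = buf          # simulates another object caching a reference to buf

buf = values[2:5]    # rebinding creates a new ndarray object
print(alias)         # [0. 1. 2.] -- alias still sees the old window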
Is this a blocker for 1.1?
I don’t think it needs to block a release as it’s a development dependency, though obviously one we want to keep current
I think if we don't update this for 1.1, then we'll want to set an upper bound on Cython in our pyproject.toml so that users can compile pandas 1.1 from source after Cython 3.0 is released.
To advocate for fixing sooner rather than later, in addition to being a technical issue for future Cython, it is also a bug when the function being applied has any deferred operations.
@bashtage, attempts so far have not gone great (xref #34997); extra eyeballs may help out.
Here is one example that does not set the data pointer in the move method.
cdef class Slider:
    cdef:
        ndarray values, buf
        Py_ssize_t stride
        char *orig_data

    def __init__(self, ndarray values, ndarray buf):
        assert values.ndim == 1
        assert values.dtype == buf.dtype
        if not values.flags.contiguous:
            values = values.copy()
        self.values = values.copy()  # the original values would be modified if not copied
        self.buf = buf
        self.stride = values.strides[0]
        self.orig_data = self.buf.data
        self.buf.data = self.values.data
        self.buf.strides[0] = self.stride

    cdef move(self, int start, int end):
        # self.buf.data = self.values.data + self.stride * start
        self.buf.shape[0] = end - start  # length must be set first
        self.buf[:] = self.values[start:end]

    cdef reset(self):
        self.buf.data = self.orig_data
        self.buf.shape[0] = 0
The two pointer-setting statements in `__init__` and `reset` cannot be removed; otherwise there are segfaults and errors from freeing the wrong memory. Somehow, we have to keep the original `buf`. It won't work if we do something like `self.buf = buf.copy()` or `self.buf = self.values[start:end]`.
Unfortunately I don't know enough about Cython and the other parts of the code to figure this out. We might need to look at how the Slider's `buf` is used to find another way to do this.
I spent some more time looking at what this is doing today and came up with the following notes (I apologize if I am saying really obvious things)
The Slider and BlockSlider classes implement views into a numpy array by re-pointing the data pointer (and adjusting the shape and strides) of a pre-allocated "buffer" array at sub-sections of the original array's memory. The mutated array is then used to update "cached objects" in their calling class by updating the pandas-side block manager details. I suspect that this is the source of the statsmodels issues mentioned above, as the code is aggressively changing things underneath the eventual user-exposed objects.
The change that has broken things is that Cython now disallows replacing the guts of a numpy array (which seems fair!). My guess is that the performance gains come both from not thrashing memory and from not falling back to the Python layer. The Cython docs say that when you do `[]` on a numpy array it falls back to Python (I assume because the inputs are too variable), which is probably the source of the major performance regressions.
I am not super clear on how the numpy `nditer` interface works, but it looks like it is focused on getting an iterator over single elements, or at least fixed steps through the array, whereas for this code we need iteration over variable-size windows.
It looks like the way to do this is with memoryviews (https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html#efficient-indexing-with-memoryviews), but those seem to require knowing what the type is up front.
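For illustration, a minimal sketch of the typed-memoryview style (assuming float64 data, which is exactly the up-front type commitment noted above; `window_sums` is a made-up example, not pandas code):

import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def window_sums(double[::1] values, Py_ssize_t win):
    # Indexing a typed memoryview stays at the C level (no Python-layer
    # fallback), but the element type (double) must be declared up front.
    cdef Py_ssize_t i, j, n = values.shape[0]
    cdef double total
    out = np.empty(n - win + 1, dtype=np.float64)
    cdef double[::1] out_view = out
    for i in range(n - win + 1):
        total = 0.0
        for j in range(i, i + win):
            total += values[j]
        out_view[i] = total
    return out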
My suspicion is that the right solution here is to do something like what @mattip suggested above: in `move`, use the pointers we have to the underlying data and fabricate new numpy arrays of just the sub-section that is needed.
These classes appear to be used only internally to the reduction module, so I do not think there are any back-compatibility concerns with completely re-writing them.
attn @scoder for guidance on which of these methods (or one I do not see) is the best path.
also, I think this should be milestoned to an actual release.
> also, I think this should be milestoned to an actual release.
Without a dedicated resource to do this, it's impossible to actually say it will be in a particular release.
> also, I think this should be milestoned to an actual release.
This would certainly be nice.
Is there any indication that Cython will be releasing 3.0 anytime soon? The mailing list has been quiet; I haven't been following the issue tracker recently.
I still think #34997 was a reasonable approach, but I never got it working. More eyeballs may be helpful.
IIRC one of the hangups was that the non-Cython path behaves slightly differently from the Cython path when it comes to unwrapping results at each step in the loop. If that were resolved, we would at least have the option of ripping out the Cython version altogether.