Cudf: [FEA] Assignment by index

Created on 20 Apr 2019 · 22 Comments · Source: rapidsai/cudf

series[i] = val

TypeError: 'Series' object does not support item assignment

cuDF version 0.6.1+1.g9ca9325

How do I assign values to series and dataframes by index (and column)?

Edit (Mark Harris, May 27 2019):

Setting ranges of values is also needed, e.g.: series[x:y] = value

So, on the C++ side, I propose a new API

gdf_column cudf::set_range(gdf_column const &input, gdf_scalar const& value, gdf_index_type start, gdf_index_type end);
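
To make the intended semantics concrete, here is a minimal Python sketch of an out-of-place set_range on a plain list. It only models the proposed signature and is not libcudf code. Treating `end` as exclusive (as in Python slices) is an assumption, since the proposal does not say.

```python
def set_range(column, value, start, end):
    """Return a copy of `column` with `value` written to [start, end).

    Hypothetical model of the proposed cudf::set_range; the half-open
    range is an assumption, not something the proposal specifies.
    """
    if not 0 <= start <= end <= len(column):
        raise IndexError("range out of bounds")
    result = list(column)                        # out-of-place: copy the input
    result[start:end] = [value] * (end - start)  # overwrite the requested range
    return result


original = [1, 2, 3, 4, 5]
updated = set_range(original, 0, 1, 3)
print(updated)   # [1, 0, 0, 4, 5]
print(original)  # [1, 2, 3, 4, 5]; the input is untouched
```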

cuDF (Python) cuStreamz feature request

All 22 comments

@pyotr777 currently this isn't supported via the cuDF API. You could hack around the lack of support today by writing a Numba kernel against the underlying device array data of the Series (series.data.mem), but we will support this in the future.

I see. Thank you for your answer.

Setting ranges of values is also needed:
series[x:y] = value

Would it be useful to make the API more generic to allow assigning an arbitrary number of ranges?

gdf_column cudf::set_range(gdf_column const &input, gdf_column const& values,
                           std::vector<std::pair<gdf_index_type, gdf_index_type>> const& ranges);

where values[i] will be filled into [ ranges[i].first, ranges[i].second ].
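
A rough Python model of this generalized variant might look like the following. The name `set_ranges` is hypothetical, and reading each range as inclusive on both ends is an assumption based on the bracket notation above.

```python
def set_ranges(column, values, ranges):
    """Return a copy of `column` where values[i] fills ranges[i].

    Hypothetical model of the generalized API; each range is treated as
    inclusive [first, last], which is an assumption.
    """
    if len(values) != len(ranges):
        raise ValueError("need exactly one value per range")
    result = list(column)  # out-of-place: work on a copy
    for value, (first, last) in zip(values, ranges):
        if not 0 <= first <= last < len(result):
            raise IndexError("range out of bounds")
        for i in range(first, last + 1):
            result[i] = value
    return result


print(set_ranges([0] * 8, [7, 9], [(0, 2), (5, 6)]))
# [7, 7, 7, 0, 0, 9, 9, 0]
```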

@jrhemstad I looked a bit, but didn't come across people trying to set values on lists of ranges in Pandas.

But that would be nice to have as an option if it isn't much additional work.

Sounds like I'm trying to solve a problem that doesn't exist then.

Let's just stick with Mark's original API:

gdf_column cudf::set_range(gdf_column const &input, gdf_scalar const& value, gdf_index_type start, gdf_index_type end);

Hmmm, so the request on the Python side is for in-place assignment. Is that what is needed? Should we do the same on the C++ side, or do an out-of-place algorithm as I suggested above?

@kkraus14 @jrhemstad feel free to comment.

I've started on the C++ implementation of this, since it is blocking an important project.

Everything in libcudf thus far is out of place. It does make some things less efficient, but it's much easier to reason about design when this holds true. We should think long term about whether we're going to require everything to be out of place or have a mix of both.

I think you wrote gather to have the option of operating in-place? https://github.com/rapidsai/cudf/blob/a7068797b44353c7e2e65590a540b4f8219fa7ad/cpp/include/copying.hpp#L161

True, but that's ultimately because I needed an in-place gather for internal usage. I also wasn't thinking as much about library design then.

So you are arguing that in-place gather should be in cudf::detail namespace and the API should be changed to return a new column?

Yeah, precisely.

The problem is that trivially in-place operations can incur significant overhead when performed out-of-place. fill is an example: filling a single index of a column that has 1B ints results in allocation and copy overhead of ~4GB. Since cuDF is aimed at big data, we can't afford that kind of memory inefficiency. I imagine that you chose to expose in-place gather to avoid that inefficiency. I don't think we can afford a rule like "all operations are out-of-place".
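
The ~4GB figure follows from simple arithmetic, sketched below in Python. The 4-byte element size is an assumption (32-bit ints), chosen because it matches the number quoted above.

```python
ELEMENT_BYTES = 4              # assumption: 32-bit ints
num_elements = 1_000_000_000   # 1B elements

# Out-of-place: the whole column is allocated and copied before the write.
copy_bytes = num_elements * ELEMENT_BYTES

# In-place: only the single target element is written.
write_bytes = 1 * ELEMENT_BYTES

print(copy_bytes / 1e9)  # 4.0 (GB)
print(write_bytes)       # 4 (bytes)
```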

I discussed with @kkraus14 and @randerzander and they felt that an operation like series[i] = val would almost always be used in-place on the Python side. So for now I will expose it in-place. Users can perform out-of-place by explicitly calling cudf::copy before calling cudf::fill().
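
As a rough illustration of that division of labor, the Python sketch below uses the standard-library `array` type as a stand-in for a fixed-width column. The `fill` helper here is hypothetical and only mimics the idea of an in-place fill; it is not the cudf::fill signature.

```python
from array import array


def fill(column, value, start, end):
    """Overwrite elements start..end-1 of `column` in place (hypothetical helper)."""
    if not 0 <= start <= end <= len(column):
        raise IndexError("range out of bounds")
    for i in range(start, end):
        column[i] = value


series = array("i", [1, 2, 3, 4])

fill(series, 9, 2, 3)          # in-place, like series[2] = 9

snapshot = array("i", series)  # explicit copy first ...
fill(snapshot, 0, 0, 2)        # ... then fill the copy: out-of-place overall

print(series.tolist())    # [1, 2, 9, 4]
print(snapshot.tolist())  # [0, 0, 9, 4]
```

The caller who wants to keep the original pays for the copy explicitly; everyone else avoids it.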

Just be aware we're going to run into situations where operations on some column element types can't be done in place. As @davidwendt mentioned, a fill on a String column has to be out of place. The same will be true of any variable-width element type, or a dictionary type. So we'll have some APIs that need to check the types to be sure they can be done in place, and the user will need to know whether to call the in-place or the out-of-place version depending on the column type.
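
To illustrate why a variable-width fill cannot simply overwrite memory, here is a small Python sketch. It assumes a string column stored as an offsets array plus a packed character buffer; that layout and all function names are assumptions made for illustration, not a description of libcudf internals.

```python
def build_string_column(strings):
    """Pack strings into (offsets, chars), a common variable-width layout."""
    offsets, chars = [0], bytearray()
    for s in strings:
        chars += s.encode("utf-8")
        offsets.append(len(chars))
    return offsets, bytes(chars)


def read_string_column(offsets, chars):
    """Unpack (offsets, chars) back into a list of Python strings."""
    return [chars[a:b].decode("utf-8") for a, b in zip(offsets, offsets[1:])]


def fill_strings(offsets, chars, value, start, end):
    """Out-of-place fill: a replacement of a different length shifts every
    later offset, so this sketch rebuilds the whole column."""
    strings = read_string_column(offsets, chars)
    strings[start:end] = [value] * (end - start)
    return build_string_column(strings)


offsets, chars = build_string_column(["a", "bb", "ccc"])
new_offsets, new_chars = fill_strings(offsets, chars, "zzzz", 1, 2)
print(offsets)      # [0, 1, 3, 6]
print(new_offsets)  # [0, 1, 5, 8]
```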

Another (perhaps common?) use-case that needs to be done out-of-place:

In [9]: a = pd.Series([1, 2, 3])

In [10]: a
Out[10]:
0    1
1    2
2    3
dtype: int64

In [11]: a.iloc[2] = 3.5

In [12]: a
Out[12]:
0    1.0
1    2.0
2    3.5
dtype: float64

@shwina I don't see what is out of place there. What does iloc[2] do? It seems to have modified a in place.

iloc[2] is basically asking for the row at integer location 2. It's essentially a gather, but can return by view instead of by copy. This could technically run out of place by replacing the underlying array of the column as opposed to actually modifying the values in place.

@harrism Well, if the datatype of a was int32 before and then changed to float64, this operation would need to be out of place. The element width has changed, so there needs to be another memory allocation.

Although, personally I'd implement this with an out-of-place cast op and then an in-place set_range. I was just clarifying what @shwina meant.
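
That two-step approach can be sketched in Python with the standard-library `array` type standing in for a typed column. Both helper names are hypothetical, and the half-open range is an assumption.

```python
from array import array


def cast_to_float64(column):
    """Out-of-place cast: allocates a new buffer of 8-byte floats."""
    return array("d", column)


def set_range(column, value, start, end):
    """In-place write of `value` into elements start..end-1."""
    for i in range(start, end):
        column[i] = value


a = array("i", [1, 2, 3])  # integer column
a = cast_to_float64(a)     # step 1: out-of-place cast to a wider type
set_range(a, 3.5, 2, 3)    # step 2: in-place assignment on the new buffer
print(a.tolist())          # [1.0, 2.0, 3.5]
```

Only the cast allocates; the assignment itself stays in place.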

Closed by #1908

Re-opening since this is missing Python bindings and there's no separate issue open for them.

Thanks, forgot about that!
