Cudf: [FEA] Assignment by index

Created on 20 Apr 2019 · 22 Comments · Source: rapidsai/cudf

series[i] = val

TypeError: 'Series' object does not support item assignment

cuDF version 0.6.1+1.g9ca9325

How do I assign values to series and dataframes by index (and column)?

Edit (Mark Harris, May 27 2019):

Setting ranges of values is also needed, e.g.: series[x:y] = value

So, on the C++ side, I propose a new API

gdf_column cudf::set_range(gdf_column const &input, gdf_scalar const& value, gdf_index_type start, gdf_index_type end);

Labels: cuDF (Python), cuStreamz, feature request

pyotr777

All 22 comments

@pyotr777 currently this isn't supported via the cuDF API. You could work around the lack of support today by writing a Numba kernel against the Series's underlying device array, series.data.mem, but we will support this in the future.

kkraus14 on 21 Apr 2019

I see. Thank you for your answer.

pyotr777 on 22 Apr 2019

Setting ranges of values is also needed:
series[x:y] = value

randerzander on 25 May 2019

Would it be useful to make the API more generic to allow assigning an arbitrary number of ranges?

gdf_column cudf::set_range(gdf_column const& input, gdf_column const& values,
                           std::vector<std::pair<gdf_index_type, gdf_index_type>> const& ranges);

where values[i] will be filled into [ ranges[i].first, ranges[i].second ].

jrhemstad on 27 May 2019
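
As a rough illustration of the multi-range semantics proposed above, here is a minimal Python sketch that uses a plain list as a stand-in for a column. The function name set_ranges is hypothetical and is not part of cuDF or libcudf. It assumes half-open ranges [start, end) so that each range behaves like series[start:end] = value; the proposal itself does not say whether the end index is inclusive.

```python
from typing import List, Sequence, Tuple


def set_ranges(column: Sequence[int],
               values: Sequence[int],
               ranges: Sequence[Tuple[int, int]]) -> List[int]:
    """Return a copy of `column` with values[i] written into ranges[i].

    Hypothetical helper for illustration only; not a cuDF API.
    """
    if len(values) != len(ranges):
        raise ValueError("need exactly one value per range")
    out = list(column)  # out-of-place: the input column is never modified
    for value, (start, end) in zip(values, ranges):
        if not 0 <= start <= end <= len(out):
            raise IndexError(f"range ({start}, {end}) is out of bounds")
        # Assumption: half-open [start, end), like series[start:end] = value
        out[start:end] = [value] * (end - start)
    return out


col = [0, 0, 0, 0, 0, 0]
result = set_ranges(col, values=[7, 9], ranges=[(0, 2), (4, 6)])
print(result)  # [7, 7, 0, 0, 9, 9]
print(col)     # [0, 0, 0, 0, 0, 0]
```

The single-range, scalar form of the API is the special case with one value and one range.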

@jrhemstad I looked a bit, but didn't come across people trying to set values on lists of ranges in Pandas.

But that would be nice to have as an option if it isn't much additional work.

randerzander on 28 May 2019

Sounds like I'm trying to solve a problem that doesn't exist then.

Let's just stick with Mark's original API:

gdf_column cudf::set_range(gdf_column const &input, gdf_scalar const& value, gdf_index_type start, gdf_index_type end);

jrhemstad on 28 May 2019

Hmmm, so the request on the Python side is for in-place assignment. Is that what is needed? Should we do the same on the C++ side, or do an out-of-place algorithm as I suggested above?

@kkraus14 @jrhemstad feel free to comment.

harrism on 31 May 2019

👍1

I've started on the C++ implementation of this, since it is blocking an important project.

harrism on 31 May 2019

> Hmmm, so the request on the Python side is for in-place assignment. Is that what is needed? Should we do the same on the C++ side, or do an out-of-place algorithm as I suggested above?
>
> @kkraus14 @jrhemstad feel free to comment.

Everything in libcudf thus far is out of place. It does make some things less efficient, but it's much easier to reason about design when this holds true. We should think about this long term if we're going to require everything to be out of place or have a mix of both.

jrhemstad on 31 May 2019

> Everything in libcudf thus far is out of place. It does make some things less efficient, but it's much easier to reason about design when this holds true. We should think about this long term if we're going to require everything to be out of place or have a mix of both.

I think you wrote gather to have the option of operating in-place? https://github.com/rapidsai/cudf/blob/a7068797b44353c7e2e65590a540b4f8219fa7ad/cpp/include/copying.hpp#L161

harrism on 3 Jun 2019

> > Everything in libcudf thus far is out of place. It does make some things less efficient, but it's much easier to reason about design when this holds true. We should think about this long term if we're going to require everything to be out of place or have a mix of both.
>
> I think you wrote gather to have the option of operating in-place?

True, but that's ultimately because I needed an in-place gather for internal usage. I also wasn't thinking as much about library design then.

jrhemstad on 3 Jun 2019

So you are arguing that in-place gather should be in cudf::detail namespace and the API should be changed to return a new column?

harrism on 4 Jun 2019

> So you are arguing that in-place gather should be in cudf::detail namespace and the API should be changed to return a new column?

Yeah, precisely.

jrhemstad on 4 Jun 2019

The problem is that trivially in-place operations can incur significant overhead when performed out-of-place. fill is one example: filling a single index of a column that holds 1B ints incurs allocation and copy overhead of ~4GB. Since cuDF is aimed at big data, we can't afford that kind of memory inefficiency. I imagine that you chose to expose in-place gather to avoid that inefficiency. I don't think we can afford a rule like "all operations are out-of-place".

I discussed with @kkraus14 and @randerzander and they felt that an operation like series[i] = val would almost always be used in-place on the Python side. So for now I will expose it in-place. Users who need out-of-place behavior can explicitly call cudf::copy before calling cudf::fill().

harrism on 4 Jun 2019
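
To make the trade-off above concrete, here is a small Python sketch using the standard-library array module as a stand-in for a fixed-width column. The helper names fill_in_place and fill_out_of_place are hypothetical; they only mimic the "copy, then fill" pattern described in the comment and are not the cudf::copy or cudf::fill signatures. The arithmetic at the end assumes 4-byte integers, which is what the ~4GB figure implies.

```python
from array import array


def fill_in_place(column: array, value: int, start: int, end: int) -> None:
    """Overwrite column[start:end] with `value`, reusing the existing buffer."""
    for i in range(start, end):
        column[i] = value


def fill_out_of_place(column: array, value: int, start: int, end: int) -> array:
    """Copy the whole column first, then fill the copy."""
    out = array(column.typecode, column)  # full copy of every element
    fill_in_place(out, value, start, end)
    return out


col = array("i", [1, 2, 3, 4])
copied = fill_out_of_place(col, 0, 1, 2)  # col is untouched
fill_in_place(col, 9, 0, 1)               # col itself changes

# Back-of-the-envelope cost of the copy for a 1B-row column,
# assuming 4 bytes per element.
num_rows = 1_000_000_000
bytes_per_element = 4
copy_bytes = num_rows * bytes_per_element
print(copy_bytes)  # 4000000000, i.e. about 4 GB just to change one element
```

The out-of-place version pays for the full copy regardless of how small the filled range is, which is the inefficiency the comment is concerned about.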

Just be aware we're going to run into situations where operations on some column element types can't be done in place. As @davidwendt mentioned, a fill on a String column has to be out of place. The same will be true of any variable-width element type, or a dictionary type. So we'll have some APIs that need to check the types to be sure they can be done in-place, and the user will need to know whether to call the in-place or the out-of-place version depending on the column type.

jrhemstad on 4 Jun 2019
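
A short sketch of why a variable-width column resists in-place updates. It assumes an Arrow-style layout (an offsets array plus one contiguous character buffer); the thread does not describe how string columns are actually stored, so the layout and both helper names are assumptions made for illustration.

```python
from typing import List, Tuple


def build_string_column(strings: List[str]) -> Tuple[List[int], bytes]:
    """Pack strings into an offsets list plus one contiguous byte buffer."""
    offsets = [0]
    chars = b""
    for s in strings:
        chars += s.encode("utf-8")
        offsets.append(len(chars))
    return offsets, chars


def set_string(offsets: List[int], chars: bytes,
               index: int, value: str) -> Tuple[List[int], bytes]:
    """Replace one element; returns new buffers because sizes can change."""
    encoded = value.encode("utf-8")
    start, end = offsets[index], offsets[index + 1]
    # The character buffer grows or shrinks, so it has to be rebuilt.
    new_chars = chars[:start] + encoded + chars[end:]
    # Every offset after the edited element shifts by the size difference.
    shift = len(encoded) - (end - start)
    new_offsets = offsets[: index + 1] + [o + shift for o in offsets[index + 1:]]
    return new_offsets, new_chars


offsets, chars = build_string_column(["a", "bb", "ccc"])
new_offsets, new_chars = set_string(offsets, chars, 1, "dddd")
print(offsets, chars)          # [0, 1, 3, 6] b'abbccc'
print(new_offsets, new_chars)  # [0, 1, 5, 8] b'addddccc'
```

Under this layout, changing one element to a value of a different length touches the character buffer and every later offset, which is why a fill cannot simply overwrite a fixed-size slot.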

Another (perhaps common?) use-case that needs to be done out-of-place:

In [9]: a = pd.Series([1, 2, 3])

In [10]: a
Out[10]:
0    1
1    2
2    3
dtype: int64

In [11]: a.iloc[2] = 3.5

In [12]: a
Out[12]:
0    1.0
1    2.0
2    3.5
dtype: float64

shwina on 4 Jun 2019

@shwina I don't see what is out of place there. What does iloc[2] do? It seems to have modified a in place.

harrism on 5 Jun 2019

> @shwina I don't see what is out of place there. What does iloc[2] do? It seems to have modified a in place.

iloc[2] is basically asking for the row at integer location 2. It's essentially a gather, but can return by view instead of by copy. This could technically run out of place by replacing the underlying array of the column as opposed to actually modifying the values in place.

kkraus14 on 10 Jun 2019

> @shwina I don't see what is out of place there. What does iloc[2] do? It seems to have modified a in place.

@harrism Well, if the datatype of a was int32 before, and then it changed it to float64, then this operation would need to be out of place. The element width has changed, so there needs to be another memory allocation.

Although, personally I'd implement this with an out-of-place cast op and then an in-place set_range. I was just clarifying what @shwina meant.

devavret on 10 Jun 2019
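
The "out-of-place cast, then in-place set" idea above can be sketched with the standard-library array module standing in for a typed column. The helper names cast_to_float64 and set_element are hypothetical, and the promotion rule (any float written into an integer column triggers a cast) is a simplification assumed for this example rather than the exact pandas or cuDF behaviour.

```python
from array import array


def cast_to_float64(column: array) -> array:
    """Out-of-place cast: allocate a new double-precision column."""
    return array("d", [float(x) for x in column])


def set_element(column: array, index: int, value) -> array:
    """Set one element, casting first only if the value does not fit the type."""
    if isinstance(value, float) and column.typecode != "d":
        # Element width/type changes, so a new allocation is unavoidable.
        column = cast_to_float64(column)
    column[index] = value  # in-place write into whichever buffer we now hold
    return column


a = array("i", [1, 2, 3])
b = set_element(a, 2, 3.5)  # needs a cast, so b is a new column
c = set_element(b, 0, 7.0)  # already float64, so the write is in place
print(list(a))  # [1, 2, 3]
print(list(b))  # [7.0, 2.0, 3.5]
print(c is b)   # True
```

Only the assignment that changes the element type pays for a new allocation; later writes of the same type reuse the existing buffer.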

👍1

Closed by #1908

harrism on 22 Jun 2019

Re-opening since this is missing Python bindings and there's no separate issue open for them.

randerzander on 24 Jun 2019

👍3

Thanks, forgot about that!

harrism on 24 Jun 2019
