This bug affects calling `__setitem__` on the last column as well.
Code that causes the bug:
```
df = rdf.DataFrame({"col1": lrange(3), "col2": lrange(3)})
del df["col2"]
```
Output:
```
Remote function ray.dataframe.utils._deploy_func failed with:
Traceback (most recent call last):
File "/home/peter/workspace/ray/python/ray/dataframe/utils.py", line 132, in _deploy_func
return func(dataframe, *args)
File "/home/peter/workspace/ray/python/ray/dataframe/dataframe.py", line 4728, in del_helper
cols = df.columns[to_delete] # either int or an array of ints
File "/home/peter/workspace/ray_env/lib/python3.6/site-packages/pandas/core/indexes/range.py", line 544, in __getitem__
return super_getitem(key)
File "/home/peter/workspace/ray_env/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 1754, in __getitem__
result = getitem(key)
IndexError: index 0 is out of bounds for axis 1 with size 0
You can inspect errors by running
ray.error_info()
If this driver is hanging, start a new one with
ray.init(redis_address="192.168.0.114:49682")
```
The issue is that we had redundant calls. Inside `__delitem__`, we first called:
```
self._row_partitions = _map_partitions(
    del_helper, self._row_partitions, to_delete)
```
and then we called:
```
self._col_partitions[i] = _deploy_func.remote(
    del_helper, self._col_partitions[i], to_delete_in_partition)
```
Because the first call already deleted the data, the subsequent column deletion cannot succeed.
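To see why the second call fails, here is a minimal pandas-only sketch; the `del_helper` below is an illustrative stand-in for the real ray.dataframe helper, not the actual code:
```
import pandas as pd

def del_helper(df, to_delete):
    # Positional lookup, mirroring `df.columns[to_delete]` from the traceback.
    cols = df.columns[to_delete]
    return df.drop(columns=cols)

part = pd.DataFrame({"col1": range(3), "col2": range(3)})
to_delete = 1  # positional index of "col2"

part = del_helper(part, to_delete)  # first pass (row partitions): succeeds
part = del_helper(part, to_delete)  # second pass (col partitions): IndexError
```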
The implementation note in the docstring states:
```
Notes: This operation happen on row and column partition
simultaneously. No rebuild.
```
but setting `row_partitions` and `col_partitions` automatically "rebuilds" the block partitions via `_create_block_partitions`.
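For context, a hedged sketch of why assignment triggers a rebuild, assuming the partitions are exposed as properties whose setters call `_create_block_partitions` (names follow the discussion above; the real implementation may differ):
```
# Illustrative sketch only (stubbed helper; real names may differ):
def _create_block_partitions(partitions):
    ...  # splits the 1D partition list back into the 2D block grid

class Frame:
    @property
    def _row_partitions(self):
        return self.__row_partitions

    @_row_partitions.setter
    def _row_partitions(self, partitions):
        self.__row_partitions = partitions
        # Every assignment regenerates the block grid -- effectively a
        # rebuild, contradicting the "No rebuild" note above.
        self._block_partitions = _create_block_partitions(partitions)
```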
The fix is to simply perform `__delitem__` in only one of the row/col partition schemes. I would go with removing the `row_partitions` scheme because:
- `row_partitions` creates an intermediate "copy" of each row.
- `col_partitions`, with some tweaks, needs to touch only the columns being deleted, without the memory overhead.

@devin-petersohn @pschafhalter comments?
@simon-mo Using `col_partitions` makes sense to me in terms of correctness. I think we might be able to improve performance even more if `__delitem__` modifies `block_partitions` directly, using `col_metadata` to find the blocks that we need to modify.
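A rough sketch of that block-partition approach; the `get_location` lookup on `_col_metadata` and the exact grid shape are assumptions for illustration, not the actual API:
```
def delete_column(self, key):
    # Which block column holds `key`, and at what offset inside it? (assumed)
    block_col, idx_in_block = self._col_metadata.get_location(key)
    # Only the blocks in that one grid column need to be touched.
    for row in range(self._block_partitions.shape[0]):
        self._block_partitions[row, block_col] = _deploy_func.remote(
            del_helper, self._block_partitions[row, block_col], idx_in_block)
    self._col_metadata.drop(key)  # assumed metadata update
```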
@pschafhalter Good point. I'll go ahead and refactor `__delitem__` to use block partitions.
Thanks for looking into this @simon-mo! Modifying `_block_partitions` is the most efficient approach. Additionally, is it possible to remove empty `_block_partitions`? It may not be worth it: since we don't expose the partitioning to the user and manage everything ourselves, we can get away without some of this cleanup.
@devin-petersohn
Yes, we can do that. The logic will simply be: if removing this column would leave a block empty, just pop that block from the `_block_partitions` array and modify `_col_metadata` accordingly.
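A hedged sketch of that logic, assuming `_block_partitions` is a 2D NumPy array of ObjectIDs and that the remaining-column count comes from `_col_metadata` bookkeeping (both assumptions):
```
import numpy as np

# Inside __delitem__, after removing the column from its block:
if cols_remaining_in_block == 0:  # assumed bookkeeping from _col_metadata
    self._block_partitions = np.delete(
        self._block_partitions, block_col, axis=1)
```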
One caveat is that I might need to make a copy of our block partition ObjectID array, given that deleting an item from the original shouldn't affect a "view":
```
In [4]: df_view = df.iloc[1:,:]

In [5]: df
Out[5]:
   a  b
0  1  2
1  1  2
2  1  2
3  1  2

In [6]: del df['b']

In [7]: df_view
Out[7]:
   a  b
1  1  2
2  1  2
3  1  2

In [8]: df
Out[8]:
   a
0  1
1  1
2  1
3  1
```
Basically, I need to think about how to decouple the view from the original in this situation, but it's doable.
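One possible way to decouple them, sketched with a plain NumPy array standing in for the ObjectID grid: copy the array before mutating it in place, so existing views keep referencing the old blocks.
```
import numpy as np

blocks = np.array([[10, 20], [30, 40]])  # stand-in for the ObjectID grid
view_blocks = blocks                      # what a df view would reference

blocks = blocks.copy()  # copy first, so the view keeps the old grid
blocks[:, 1] = -1       # mutate the copy in place (e.g. replaced blocks)

print(view_blocks)  # unchanged: [[10 20] [30 40]]
print(blocks)       # [[10 -1] [30 -1]]
```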
For this pass, let's just focus on fixing the bug. There are other methods that will be affected by this, and we should probably do this with all of them at the same time.
Resolved with #2080
I'm encountering the same issue on `__getitem__`.
@kunalgosar can you post a code snippet that reproduces your issue? I can't reproduce it right now.