This bug affects calling `__setitem__` on the last column as well.
Code that causes the bug:
```
df = rdf.DataFrame({"col1": lrange(3), "col2": lrange(3)})
del df["col2"]
```
Output:
```
Remote function ray.dataframe.utils._deploy_func failed with:
Traceback (most recent call last):
File "/home/peter/workspace/ray/python/ray/dataframe/utils.py", line 132, in _deploy_func
return func(dataframe, *args)
File "/home/peter/workspace/ray/python/ray/dataframe/dataframe.py", line 4728, in del_helper
cols = df.columns[to_delete] # either int or an array of ints
File "/home/peter/workspace/ray_env/lib/python3.6/site-packages/pandas/core/indexes/range.py", line 544, in __getitem__
return super_getitem(key)
File "/home/peter/workspace/ray_env/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 1754, in __getitem__
result = getitem(key)
IndexError: index 0 is out of bounds for axis 1 with size 0
You can inspect errors by running
ray.error_info()
If this driver is hanging, start a new one with
ray.init(redis_address="192.168.0.114:49682")
```
The issue is that we had redundant calls. Inside `__delitem__`, we first called:
```
self._row_partitions = _map_partitions(
    del_helper, self._row_partitions, to_delete)
```
and then we called:
```
self._col_partitions[i] = _deploy_func.remote(
    del_helper, self._col_partitions[i], to_delete_in_partition)
```
Because the first call already deleted the data, the subsequent column deletion cannot succeed.
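To see why the second call fails, here is a minimal pandas-only sketch; the `del_helper` below is an illustrative stand-in for the real ray.dataframe helper, not the actual code:
```
import pandas as pd

def del_helper(df, to_delete):
    # Positional lookup, mirroring `df.columns[to_delete]` from the traceback.
    cols = df.columns[to_delete]
    return df.drop(columns=cols)

part = pd.DataFrame({"col1": range(3), "col2": range(3)})
to_delete = 1  # positional index of "col2"

part = del_helper(part, to_delete)  # first pass (row partitions): succeeds
part = del_helper(part, to_delete)  # second pass (col partitions): IndexError
```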
The implementation note in the docstring states:
```
Notes: This operation happen on row and column partition
simultaneously. No rebuild.
```
but setting `row_partitions` and `col_partitions` automatically "rebuilds" the block partitions via `_create_block_partitions`.
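For context, a hedged sketch of why assignment triggers a rebuild, assuming the partitions are exposed as properties whose setters call `_create_block_partitions` (names follow the discussion above; the real implementation may differ):
```
# Illustrative sketch only (stubbed helper; real names may differ):
def _create_block_partitions(partitions):
    ...  # splits the 1D partition list back into the 2D block grid

class Frame:
    @property
    def _row_partitions(self):
        return self.__row_partitions

    @_row_partitions.setter
    def _row_partitions(self, partitions):
        self.__row_partitions = partitions
        # Every assignment regenerates the block grid -- effectively a
        # rebuild, contradicting the "No rebuild" note above.
        self._block_partitions = _create_block_partitions(partitions)
```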
The fix is to simply perform `__delitem__` in only one of the row/col partition schemes. I would go with removing the `row_partitions` scheme because:
- `row_partitions` creates an intermediate "copy" of each row.
- `col_partitions`, with some tweaks, needs to touch only the columns being deleted, without the memory overhead.

@devin-petersohn @pschafhalter comments?
@simon-mo Using `col_partitions` makes sense to me in terms of correctness. I think we might be able to improve performance even more if `__delitem__` modifies `block_partitions` directly, using `col_metadata` to find the blocks that we need to modify.
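A rough sketch of that block-partition approach; the `get_location` lookup on `_col_metadata` and the exact grid shape are assumptions for illustration, not the actual API:
```
def delete_column(self, key):
    # Which block column holds `key`, and at what offset inside it? (assumed)
    block_col, idx_in_block = self._col_metadata.get_location(key)
    # Only the blocks in that one grid column need to be touched.
    for row in range(self._block_partitions.shape[0]):
        self._block_partitions[row, block_col] = _deploy_func.remote(
            del_helper, self._block_partitions[row, block_col], idx_in_block)
    self._col_metadata.drop(key)  # assumed metadata update
```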
@pschafhalter Good point. I'll go ahead and refactor `__delitem__` to use block partitions.
Thanks for looking into this @simon-mo! Modifying `_block_partitions` is the most efficient approach. Additionally, is it possible to remove empty `_block_partitions`? It may not be worth it: since we don't expose the partitioning to the user and manage everything ourselves, we can get away without some of this cleanup.
@devin-petersohn
Yes, we can do that. The logic will simply be: if removing this column would leave a block empty, just pop that block from the `_block_partitions` array and modify `_col_metadata` accordingly.
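A hedged sketch of that logic, assuming `_block_partitions` is a 2D NumPy array of ObjectIDs and that the remaining-column count comes from `_col_metadata` bookkeeping (both assumptions):
```
import numpy as np

# Inside __delitem__, after removing the column from its block:
if cols_remaining_in_block == 0:  # assumed bookkeeping from _col_metadata
    self._block_partitions = np.delete(
        self._block_partitions, block_col, axis=1)
```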
One caveat is that I might need to make a copy of our block partition ObjectID array, given that deleting an item from the original shouldn't affect a "view":
```
In [4]: df_view = df.iloc[1:,:]

In [5]: df
Out[5]:
   a  b
0  1  2
1  1  2
2  1  2
3  1  2

In [6]: del df['b']

In [7]: df_view
Out[7]:
   a  b
1  1  2
2  1  2
3  1  2

In [8]: df
Out[8]:
   a
0  1
1  1
2  1
3  1
```
Basically, I need to think about how to decouple the view from the original in this situation, but it's doable.
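One possible way to decouple them, sketched with a plain NumPy array standing in for the ObjectID grid: copy the array before mutating it in place, so existing views keep referencing the old blocks.
```
import numpy as np

blocks = np.array([[10, 20], [30, 40]])  # stand-in for the ObjectID grid
view_blocks = blocks                      # what a df view would reference

blocks = blocks.copy()  # copy first, so the view keeps the old grid
blocks[:, 1] = -1       # mutate the copy in place (e.g. replaced blocks)

print(view_blocks)  # unchanged: [[10 20] [30 40]]
print(blocks)       # [[10 -1] [30 -1]]
```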
For this pass, let's just focus on fixing the bug. There are other methods that will be affected by this, and we should probably do this with all of them at the same time.
Resolved with #2080
I'm encountering the same issue on `__getitem__`.
@kunalgosar can you post a code snippet that reproduces your issue? I can't reproduce it right now.