Describe the bug
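When writing a dataframe that contains a string column with partition_cols set (to columns other than the string column), the string values in the output file are incorrect: every row reads back as the first string, "Alpha".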
Steps/Code to reproduce bug
import cudf
import numpy as np
import pandas as pd
df = cudf.DataFrame()
df["Integer"] = np.array([2345, 11987, 9027, 9027])
df["Integer2"] = np.arange(4)
df["String"] = np.array(["Alpha", "Beta", "Gamma", "Delta"])
df["Boolean"] = np.array([True, False, True, False])
print(df)
Integer Integer2 String Boolean
0 2345 0 Alpha True
1 11987 1 Beta False
2 9027 2 Gamma True
3 9027 3 Delta False
df.to_parquet("cudf_partitioned.parquet", index=False, partition_cols=["Integer", "Boolean"])
print(pd.read_parquet("cudf_partitioned.parquet"))
Integer2 String Integer Boolean
0 1 Alpha 11987 False
1 0 Alpha 2345 True
2 3 Alpha 9027 False
3 2 Alpha 9027 True
print(cudf.read_parquet("cudf_partitioned.parquet/Integer=9027/Boolean=True/*.parquet")) # should be 2, Gamma
Integer2 String
0 2 Alpha
Expected behavior
The read output should match the original dataframe
Thanks for raising @ayushdg - I should be able to investigate this soon.
It seems that the bug is not related to partitioning, but to slicing the string column:
df = pd.DataFrame()
df["String"] = np.array(["Alpha", "Beta", "Gamma", "Delta"])
df = cudf.from_pandas(df)
print(df)
String
0 Alpha
1 Beta
2 Gamma
3 Delta
df_select = df.iloc[1:2]
print(df_select)
String
1 Beta
df_select.to_parquet("string_selection.parquet", index=False)
print(pd.read_parquet("string_selection.parquet"))
String
0 b'Alpha'
This sounds a bit familiar - Is this a known bug @galipremsagar ?
cc @rgsl888prabhu ( in case you recognize the problem quickly :) )
I think the slice example you provided is giving correct results, no?
>>> import cudf
>>> s = cudf.Series(["Alpha", "Beta", "Gamma", "Delta"])
>>> s[1:2]
1 Beta
dtype: object
>>> s.to_pandas()[1:2]
1 Beta
dtype: object
>>> s.to_pandas().iloc[1:2]
1 Beta
dtype: object
>>> s.iloc[1:2]
1 Beta
dtype: object
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df["String"] = np.array(["Alpha", "Beta", "Gamma", "Delta"])
>>> df = cudf.from_pandas(df)
>>> df
String
0 Alpha
1 Beta
2 Gamma
3 Delta
>>> df.iloc[1:2]
String
1 Beta
>>> df.to_pandas().iloc[1:2]
String
1 Beta
I think the slice example you provided is giving correct results, no?
That's right. The slice gives the correct result, but to_parquet seems to ignore the new offset from the selection (and writes the 0th element instead of the 1st).
I guess this is a bit different from the earlier slicing issues then.
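To make the suspected failure mode concrete, here is a minimal pure-Python sketch of a sliced string column. It is only a model, not cudf code: `StringColumnView`, `make_column`, `take_slice` and `read_rows` are hypothetical names, and the layout (one concatenated chars buffer plus an offsets array) is an assumption about how the column is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringColumnView:
    """Hypothetical model of a string column view (not a cudf class)."""
    chars: bytes    # all strings concatenated
    offsets: tuple  # offsets[i]..offsets[i + 1] delimits string i in chars
    offset: int     # first logical row covered by this view
    size: int       # number of logical rows in this view

    def take_slice(self, start, stop):
        # Slicing only moves the window; the buffers are shared unchanged.
        return StringColumnView(self.chars, self.offsets,
                                self.offset + start, stop - start)


def make_column(strings):
    offsets = [0]
    for s in strings:
        offsets.append(offsets[-1] + len(s.encode()))
    return StringColumnView("".join(strings).encode(), tuple(offsets),
                            0, len(strings))


def read_rows(view, honor_offset):
    # A reader that ignores view.offset always starts at row 0.
    base = view.offset if honor_offset else 0
    return [
        view.chars[view.offsets[base + i]:view.offsets[base + i + 1]].decode()
        for i in range(view.size)
    ]


col = make_column(["Alpha", "Beta", "Gamma", "Delta"])
sliced = col.take_slice(1, 2)
print(read_rows(sliced, honor_offset=True))   # ['Beta']
print(read_rows(sliced, honor_offset=False))  # ['Alpha']
```

In this model, a writer that ignores the view offset reproduces the symptom seen here: the row count is right, but the content comes from row 0.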
I see, got you.
It looks like this happens only when we use the cudf engine to write the parquet file.
>>> x = df.iloc[1:2]
>>> x
String
1 Beta
>>> x.to_parquet('s.p', engine="cudf")
>>> pd.read_parquet('s.p')
String
0 Alpha
>>> cudf.read_parquet('s.p')
String
0 Alpha
>>> x.to_parquet('s', engine="")
>>> cudf.read_parquet('s', engine="")
/nvme/0/pgali/envs/cudfdev/lib/python3.7/site-packages/cudf/io/parquet.py:196: UserWarning: Using CPU via PyArrow to read Parquet dataset.
warnings.warn("Using CPU via PyArrow to read Parquet dataset.")
String
1 Beta
cc @devavret @vuule for visibility
I was looking at the Cython layer: round-tripping a sliced table to a view, and then creating a Table and a DataFrame from that view, seems to work fine too.
All my workspaces are busy right now, but can someone test this theory? I think applying cudf::slice to a string column does not apply the corresponding offset to its child columns, offsets and chars, and the parquet writer appears to assume that it has been applied: https://github.com/rapidsai/cudf/blob/25a6e312cdd3498b87320cdf32690eeccdd1850a/cpp/src/io/parquet/writer_impl.cu#L209-L215
See, it uses the child.data<>() method assuming the offset has been applied on the child too.
As far as I know, view.offsets().data<size_type>() and view.chars().data<char>() take care of the offset, but I will try to write a gtest.
Edit:
@devavret Your theory was correct: it seems the slice offset is not applied when offsets and chars are accessed.
I got a chance to experiment with this again and with the following change:
diff --git a/cpp/src/io/parquet/writer_impl.cu b/cpp/src/io/parquet/writer_impl.cu
index f7c508488..9e0fd0711 100644
--- a/cpp/src/io/parquet/writer_impl.cu
+++ b/cpp/src/io/parquet/writer_impl.cu
@@ -208,7 +208,7 @@ class parquet_column_view {
_indexes = rmm::device_buffer(_data_count * sizeof(gpu::nvstrdesc_s), stream);
stringdata_to_nvstrdesc<<<((_data_count - 1) >> 8) + 1, 256, 0, stream>>>(
reinterpret_cast<gpu::nvstrdesc_s *>(_indexes.data()),
- view.offsets().data<size_type>(),
+ view.offsets().data<size_type>() + view.offset(),
view.chars().data<char>(),
_nulls,
_data_count);
I get this:
In [1]: import cudf
...: import pandas as pd
...: import numpy as np
...:
...: df = pd.DataFrame()
...: df["String"] = np.array(["Alpha", "Beta", "Gamma", "Delta"])
...: df = cudf.from_pandas(df)
...: print(df)
String
0 Alpha
1 Beta
2 Gamma
3 Delta
In [2]: df_select = df.iloc[1:2]
...: print(df_select)
String
1 Beta
In [3]: df_select.to_parquet("string_selection.parquet", index=False)
...: print(pd.read_parquet("string_selection.parquet"))
String
0 b'Beta'
That binary type (b'') issue was recently fixed. Do you have the latest changes from branch-0.15?
I tried this with the latest, but importing cudf hits an inexplicable segfault. So I tried with an older commit that I knew was working on my machine.
Not pointing to that, I was showing that it works with the change to offset. So either we take care of it here or cudf::slice applies an offset to the offsets child column and that should fix this.
So either we take care of it here or cudf::slice applies an offset to the offsets child column and that should fix this.
All of the current strings code expects its child column_views to have no offset value. I would not recommend changing cudf::slice() for this.
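For illustration, the writer-side option (children keep no offset, and the writer adds the parent view's offset when it reads the offsets child) can be sketched in plain Python. This is a rough model under stated assumptions, not the libcudf implementation: `build_descriptors` is a hypothetical helper, the (start, length) pairs only loosely stand in for the writer's string descriptors, and the offsets are assumed to be absolute positions into the unshifted chars buffer, which is consistent with the diff above leaving the chars pointer unchanged.

```python
def build_descriptors(offsets, view_offset, row_count):
    """Return (start, length) pairs into the unshifted chars buffer.

    Hypothetical helper: shifting the start of the offsets array by
    view_offset mirrors `view.offsets().data<size_type>() + view.offset()`.
    """
    window = offsets[view_offset:view_offset + row_count + 1]
    return [(window[i], window[i + 1] - window[i]) for i in range(row_count)]


chars = b"AlphaBetaGammaDelta"
offsets = [0, 5, 9, 14, 19]  # child buffers of the unsliced column

# A view equivalent to rows 1:3 of the original column.
descs = build_descriptors(offsets, view_offset=1, row_count=2)
rows = [chars[start:start + length].decode() for start, length in descs]
print(rows)  # ['Beta', 'Gamma']
```

Only the offsets lookup is shifted in this sketch; the chars base stays at the beginning of the buffer because the stored offsets already point at the right bytes.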