Cudf: [BUG] to_parquet writes incorrect string values when writing a partitioned dataset

Created on 1 Jul 2020 · 14 Comments · Source: rapidsai/cudf

Describe the bug
When writing a dataframe that contains a string column with partition_cols set (to columns other than the string column), the string values in the output files are incorrect.
Steps/Code to reproduce bug

import cudf
import numpy as np
import pandas as pd

df = cudf.DataFrame()
df["Integer"] = np.array([2345, 11987, 9027, 9027])
df["Integer2"] = np.arange(4)
df["String"] = np.array(["Alpha", "Beta", "Gamma", "Delta"])
df["Boolean"] = np.array([True, False, True, False])

print(df)
   Integer  Integer2 String  Boolean
0     2345         0  Alpha     True
1    11987         1   Beta    False
2     9027         2  Gamma     True
3     9027         3  Delta    False

df.to_parquet("cudf_partitioned.parquet", index=False, partition_cols=["Integer", "Boolean"])

print(pd.read_parquet("cudf_partitioned.parquet"))

   Integer2 String Integer Boolean
0         1  Alpha   11987   False
1         0  Alpha    2345    True
2         3  Alpha    9027   False
3         2  Alpha    9027    True

print(cudf.read_parquet("cudf_partitioned.parquet/Integer=9027/Boolean=True/*.parquet")) # should be 2, Gamma

   Integer2 String
0         2  Alpha

Expected behavior
The data read back should match the original dataframe.
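Because reading a partitioned dataset can return rows and columns in a different order, a check of this expectation has to ignore ordering. The following is a minimal sketch in plain Python, using the rows printed in this report as literal data; `row_key` and `mismatched_rows` are hypothetical helper names, not part of cudf or pandas.

```python
# Rows of the original dataframe, copied from the printed output above.
original = [
    {"Integer": 2345, "Integer2": 0, "String": "Alpha", "Boolean": True},
    {"Integer": 11987, "Integer2": 1, "String": "Beta", "Boolean": False},
    {"Integer": 9027, "Integer2": 2, "String": "Gamma", "Boolean": True},
    {"Integer": 9027, "Integer2": 3, "String": "Delta", "Boolean": False},
]

# Rows as read back from the partitioned dataset in this report.
read_back = [
    {"Integer2": 1, "String": "Alpha", "Integer": 11987, "Boolean": False},
    {"Integer2": 0, "String": "Alpha", "Integer": 2345, "Boolean": True},
    {"Integer2": 3, "String": "Alpha", "Integer": 9027, "Boolean": False},
    {"Integer2": 2, "String": "Alpha", "Integer": 9027, "Boolean": True},
]


def row_key(row):
    """Order-independent key for a row: (column, value) pairs sorted by column name."""
    return tuple(sorted(row.items()))


def mismatched_rows(expected, actual):
    """Return the expected rows that do not appear anywhere in the actual rows."""
    actual_keys = {row_key(r) for r in actual}
    return [r for r in expected if row_key(r) not in actual_keys]


missing = mismatched_rows(original, read_back)
for row in missing:
    print("not found after round trip:", row)
```

Only the first row survives the round trip, since its string happens to be the 0th element of the column.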

Environment overview (please complete the following information)

Method of cuDF install: Conda (Nightly from Jul 1)

Labels: bug, cuIO

ayushdg

All 14 comments

Thanks for raising @ayushdg - I should be able to investigate this soon.

rjzamora on 1 Jul 2020

It seems that the bug is not related to partitioning, but to slicing the string column:

import cudf
import numpy as np
import pandas as pd

df = pd.DataFrame()
df["String"] = np.array(["Alpha", "Beta", "Gamma", "Delta"])
df = cudf.from_pandas(df)
print(df)

  String
0  Alpha
1   Beta
2  Gamma
3  Delta

df_select = df.iloc[1:2]
print(df_select)

  String
1   Beta

df_select.to_parquet("string_selection.parquet", index=False)
print(pd.read_parquet("string_selection.parquet"))

     String
0  b'Alpha'

This sounds a bit familiar - Is this a known bug @galipremsagar ?

rjzamora on 1 Jul 2020

cc @rgsl888prabhu ( in case you recognize the problem quickly :) )

rjzamora on 1 Jul 2020

I think the slice example you provided is giving correct results, no?

>>> import cudf
>>> s = cudf.Series(["Alpha", "Beta", "Gamma", "Delta"])
>>> s[1:2]
1    Beta
dtype: object
>>> s.to_pandas()[1:2]
1    Beta
dtype: object
>>> s.to_pandas().iloc[1:2]
1    Beta
dtype: object
>>> s.iloc[1:2]
1    Beta
dtype: object

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df["String"] = np.array(["Alpha", "Beta", "Gamma", "Delta"])
>>> df = cudf.from_pandas(df)
>>> df
  String
0  Alpha
1   Beta
2  Gamma
3  Delta
>>> df.iloc[1:2]
  String
1   Beta
>>> df.to_pandas().iloc[1:2]
  String
1   Beta

galipremsagar on 1 Jul 2020

> I think the slice example you provided is giving correct results, no?

That's right. The slice gives the correct result, but to_parquet then seems to ignore the new offset from the selection and writes the 0th element instead of the 1st.

I guess this is a bit different from the earlier slicing issues then.

rjzamora on 1 Jul 2020

I see, got you.

It looks like this happens only when the cudf engine is used to write the parquet file.

>>> x = df.iloc[1:2]
>>> x
  String
1   Beta
>>> x.to_parquet('s.p', engine="cudf")
>>> pd.read_parquet('s.p')
  String
0  Alpha
>>> cudf.read_parquet('s.p')
  String
0  Alpha

>>> x.to_parquet('s', engine="")
>>> cudf.read_parquet('s', engine="")
/nvme/0/pgali/envs/cudfdev/lib/python3.7/site-packages/cudf/io/parquet.py:196: UserWarning: Using CPU via PyArrow to read Parquet dataset.
  warnings.warn("Using CPU via PyArrow to read Parquet dataset.")
  String
1   Beta

galipremsagar on 1 Jul 2020

cc @devavret @vuule for visibility

kkraus14 on 1 Jul 2020

I was looking at the Cython layer: round-tripping a sliced table to a view, and then creating a Table and a DataFrame from that view, seems to work fine too.

galipremsagar on 1 Jul 2020

All my workspaces are busy right now, but can someone test this theory? I think applying cudf::slice to a string column does not apply the relevant offset to its child columns, offsets and chars. The parquet writer appears to assume that it does: https://github.com/rapidsai/cudf/blob/25a6e312cdd3498b87320cdf32690eeccdd1850a/cpp/src/io/parquet/writer_impl.cu#L209-L215

See, it uses the child.data<>() method, assuming the offset has been applied to the child too.
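To make the theory concrete, here is a toy model in plain Python, not libcudf code. It assumes a strings column is stored as an offsets child plus a chars child, and that slicing only moves a row window over those shared children. `StringColumnView`, `make_column`, and `read_rows` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class StringColumnView:
    """Toy strings column view: shared offsets/chars children plus a row window."""
    offsets: list  # child offsets, length = total rows + 1
    chars: bytes   # child chars, all strings concatenated
    offset: int    # first row of the view (moved by slicing)
    size: int      # number of rows in the view

    def slice(self, start, stop):
        # Slicing only moves the window; the children are shared and untouched.
        return StringColumnView(self.offsets, self.chars,
                                self.offset + start, stop - start)


def make_column(strings):
    """Build the offsets and chars children for a list of strings."""
    offsets = [0]
    for s in strings:
        offsets.append(offsets[-1] + len(s.encode()))
    return StringColumnView(offsets, "".join(strings).encode(), 0, len(strings))


def read_rows(view, honor_view_offset):
    """Decode the rows of a view, optionally ignoring the view's row offset."""
    base = view.offset if honor_view_offset else 0
    rows = []
    for i in range(view.size):
        begin = view.offsets[base + i]
        end = view.offsets[base + i + 1]
        rows.append(view.chars[begin:end].decode())
    return rows


col = make_column(["Alpha", "Beta", "Gamma", "Delta"])
sliced = col.slice(1, 2)  # same selection as df.iloc[1:2]

buggy = read_rows(sliced, honor_view_offset=False)  # reads offsets from row 0
fixed = read_rows(sliced, honor_view_offset=True)   # starts at the view's row offset
print(buggy)  # ['Alpha']
print(fixed)  # ['Beta']
```

In this model, a reader that takes the raw child pointers without adding the view's row offset always decodes from row 0, which matches the 'Alpha' seen in the sliced example.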

devavret on 1 Jul 2020

As far as I know, view.offsets().data<size_type>() and view.chars().data<char>() take the offset into account, but I will try to write a gtest.

Edit:
@devavret Your approach was correct; it seems the slice offset is not applied when offsets and chars are accessed.

rgsl888prabhu on 1 Jul 2020

I got a chance to experiment with this again and with the following change:

diff --git a/cpp/src/io/parquet/writer_impl.cu b/cpp/src/io/parquet/writer_impl.cu
index f7c508488..9e0fd0711 100644
--- a/cpp/src/io/parquet/writer_impl.cu
+++ b/cpp/src/io/parquet/writer_impl.cu
@@ -208,7 +208,7 @@ class parquet_column_view {
       _indexes = rmm::device_buffer(_data_count * sizeof(gpu::nvstrdesc_s), stream);
       stringdata_to_nvstrdesc<<<((_data_count - 1) >> 8) + 1, 256, 0, stream>>>(
         reinterpret_cast<gpu::nvstrdesc_s *>(_indexes.data()),
-        view.offsets().data<size_type>(),
+        view.offsets().data<size_type>() + view.offset(),
         view.chars().data<char>(),
         _nulls,
         _data_count);

I get this:

In [1]: import cudf 
   ...: import pandas as pd 
   ...: import numpy as np 
   ...:  
   ...: df = pd.DataFrame() 
   ...: df["String"] = np.array(["Alpha", "Beta", "Gamma", "Delta"]) 
   ...: df = cudf.from_pandas(df) 
   ...: print(df)
  String
0  Alpha
1   Beta
2  Gamma
3  Delta

In [2]: df_select = df.iloc[1:2] 
   ...: print(df_select) 
  String
1   Beta

In [3]: df_select.to_parquet("string_selection.parquet", index=False) 
   ...: print(pd.read_parquet("string_selection.parquet")) 
    String
0  b'Beta'

devavret on 2 Jul 2020

That binary type (b'') issue was recently fixed. Do you have the latest changes from branch-0.15?

galipremsagar on 2 Jul 2020

I tried this with the latest, but there's some inexplicable segfault when trying to import cudf. So I tried with an older commit that I knew was working on my machine.

I wasn't pointing at that; I was showing that it works with the change to the offset. So either we take care of it here in the writer, or cudf::slice applies an offset to the offsets child column, which should also fix this.
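The two options can be compared in a small plain-Python sketch. This is an illustration under assumptions, not libcudf code: the lists stand in for the offsets and chars children, and `decode` and `slice_offsets_child` are hypothetical helpers. Note that in both variants only the offsets window moves; the chars buffer stays whole because the offset values index into it absolutely.

```python
# Children of the full 4-row strings column.
offsets = [0, 5, 9, 14, 19]
chars = b"AlphaBetaGammaDelta"

# The selection df.iloc[1:2]: one row, starting at row 1.
view_offset, view_size = 1, 1


def decode(offsets_window, chars, n):
    """Decode n strings using consecutive entries of an offsets window."""
    return [chars[offsets_window[i]:offsets_window[i + 1]].decode()
            for i in range(n)]


# Option 1: the writer shifts the offsets pointer by the view's row offset,
# analogous to `view.offsets().data<size_type>() + view.offset()` in the diff.
writer_side = decode(offsets[view_offset:], chars, view_size)


# Option 2 (hypothetical model): slicing hands out an offsets child that is
# already windowed, so the writer can use it unchanged.
def slice_offsets_child(offsets, start, stop):
    return offsets[start:stop + 1]


slice_side = decode(slice_offsets_child(offsets, 1, 2), chars, view_size)

print(writer_side)  # ['Beta']
print(slice_side)   # ['Beta']
```

Both variants produce the same rows in this model; the difference is only where the responsibility for the offset lives.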

devavret on 2 Jul 2020
