Describe the bug
Reading a Parquet file results in an error.
Steps/Code to reproduce bug
import cudf
cudf.io.read_parquet('/path/to/file.snappy.parquet')
Output:
RuntimeError Traceback (most recent call last)
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
17 df = cpp_read_parquet(
18 path,
---> 19 columns
20 )
21 else:
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
RuntimeError: Invalid gdf_dtype in type_dispatcher
Also, when trying to read a non-existent file:
import cudf
cudf.io.read_parquet('/path/to/nonexistent/file')
Output:
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
17 df = cpp_read_parquet(
18 path,
---> 19 columns
20 )
21 else:
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
NameError: name 'errno' is not defined
Expected behavior
The file is read if it is valid. If the filename is invalid, a proper error is propagated.
Environment details (please complete the following information):
@ayushdg can you create a more specific reproducible example?
Perhaps include creating an example Parquet file with dummy data.
We typically get this error with unsupported or missing Parquet-to-cuDF type mappings.
Do you know what dtypes are in the file? Also, please include a small repro file as suggested.
@j-ieong The data I have, when read by pandas, consists of strings, datetime64[ns], and bools. I'm working on reproducing it with dummy data and will share the example soon.
I was able to narrow down the issue to datetime/timestamp types being written by Spark to Parquet.
Here is a reproducible example:
import pandas as pd
import numpy as np
import cudf
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
spark = SQLContext(sc)
df = pd.DataFrame()
df['c'] = [np.datetime64('2019-03-28 18:56:27.086')]*10000
# df['c'].dtype is datetime64[ns]
sdf = spark.createDataFrame(df)
print(sdf)
# Output: DataFrame[c: timestamp]
# Spark writes the output to a directory called temp.snappy.parquet containing
# many parquet files, each holding a split of the data
sdf.write.parquet('temp.snappy.parquet')
# Get filenames for the files written by spark
files = [fn for fn in os.listdir('temp.snappy.parquet') if fn.endswith('snappy.parquet')]
# Try reading in one of the files
cudf.io.read_parquet('temp.snappy.parquet/'+files[0])
Output:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-17-cfe64f9fa978> in <module>
----> 1 cudf.io.read_parquet('temp.snappy.parquet/'+files[0])
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
17 df = cpp_read_parquet(
18 path,
---> 19 columns
20 )
21 else:
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
RuntimeError: Invalid gdf_dtype in type_dispatcher
For reference, pandas reads the same file without error:
df = pd.read_parquet('temp.snappy.parquet/'+files[0])
df.dtypes
# Output: datetime64[ns]
And here is the non-existent-file case again:
>>> cudf.io.read_parquet('temp.snappy.parquet/abcde')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-20-bc12f221d5e2> in <module>
----> 1 cudf.io.read_parquet('temp.snappy.parquet/abcde')
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
17 df = cpp_read_parquet(
18 path,
---> 19 columns
20 )
21 else:
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
NameError: name 'errno' is not defined
Ah, I see. Parquet support currently only handles timestamp[ms], not timestamp[ns].
I don't remember there being a nanosecond option for the Parquet timestamp type, so I would need to check what actual type is being used.
Could this be using the INT96 type? (It really would help to get the actual file.)
@j-ieong I'm not sure that timestamp[ms] vs. [ns] is the core issue (though it would be great to support [ns] at some point in the future).
>>> df = pd.DataFrame()
>>> df['B'] = [np.datetime64('2019-03-28 18:00:00')]*10000
>>> df.to_parquet('temp.snappy.parquet')
>>> df.dtypes
B datetime64[ns]
>>> df2 = cudf.io.read_parquet('temp.snappy.parquet')
>>> df2.dtypes
B datetime64[ms]
This works just fine. cuDF can still read datetime columns; it just reads them at ms precision instead of ns, unlike pandas.
The issue is that when Spark writes the same data to Parquet, it probably tags the column in the metadata with a type called timestamp. Pandas recognizes this as datetime and handles it; cuDF, on the other hand, does not seem to understand it and raises the invalid gdf_dtype error.
@OlivierNV The example snippets I shared should be sufficient to reproduce the issue. I'm not sure whether Spark uses INT96 or something of that sort, but it should be reproducible with any timestamp column going from Spark -> Parquet -> cuDF.
I have attached one of the sample Parquet files written by Spark in the example above, in case that helps.
part-00000-3b44213e-1dbf-4bf1-93c8-7a18a50ac431-c000.snappy.parquet.zip
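As a side note, a quick way to check whether Spark wrote the column as the INT96 physical type is to inspect the file's low-level schema with pyarrow. This is just a minimal sketch, reusing the temp.snappy.parquet directory and the files list from the repro above:

import pyarrow.parquet as pq

# Open one of the Spark-written part files from the repro above
pf = pq.ParquetFile('temp.snappy.parquet/' + files[0])

# The Parquet-level (physical) schema; a Spark-written timestamp column
# typically shows up here as INT96 with no logical type annotation
print(pf.schema)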
I can confirm it's using the [deprecated] INT96 type. I'll work on getting a translation from the 96-bit Spark timestamp to the DATE64 or TIMESTAMP GDF type.
@OlivierNV Interesting! Thank you for the update
Yeah. PyArrow also had to add a special option, use_deprecated_int96_timestamps, to write Spark-compatible Parquet files; it's enabled automatically when writing with the spark flavor.
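For reference, here is a minimal sketch of writing INT96 timestamps from pyarrow itself; the output file names are just illustrative:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'c': [np.datetime64('2019-03-28 18:56:27.086')] * 10})
table = pa.Table.from_pandas(df)

# Explicitly request the deprecated INT96 encoding for timestamp columns
pq.write_table(table, 'int96_example.parquet', use_deprecated_int96_timestamps=True)

# Or use the 'spark' flavor, which enables it along with other compatibility settings
pq.write_table(table, 'spark_flavor_example.parquet', flavor='spark')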
I think in the Parquet file it is stored simply as the INT96 physical type, with no logical type. Since cuDF doesn't have an INT96 or larger dtype, we can either always translate it to a timestamp, or only do so when the Spark/pandas key-value metadata in the file indicates that the column type is timestamp.
I vote for always translating INT96 to a 64-bit timestamp, since the Parquet docs clearly document that this type is deprecated and was used exclusively for timestamps. Despite what the name might suggest, it doesn't really have more precision or anything; it's just an inefficient encoding of date/time using three separate int32 values.
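For context, here is a rough sketch of the arithmetic such a translation involves, assuming the commonly documented Impala/Spark INT96 layout (a little-endian int64 count of nanoseconds within the day followed by an int32 Julian day number); this is only an illustration, not the actual code in the PR:

import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
MS_PER_DAY = 86400 * 1000

def int96_to_unix_ms(raw12):
    """Decode one 12-byte INT96 timestamp into milliseconds since the Unix epoch."""
    # Assumed layout: little-endian int64 nanoseconds-of-day, then int32 Julian day
    nanos_of_day, julian_day = struct.unpack('<qi', raw12)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * MS_PER_DAY + nanos_of_day // 1000000

# Sanity check: midnight on 1970-01-01 encodes as Julian day 2440588 with 0 nanoseconds
assert int96_to_unix_ms(struct.pack('<qi', 0, JULIAN_DAY_OF_UNIX_EPOCH)) == 0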
@ayushdg Can I use the file you attached as part of automated tests?
@OlivierNV Sure!
Ok, PR #1532 should resolve this. The INT96 timestamp column is converted to the 1 ms DATE64 type, to match pandas behavior.
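Assuming the fix behaves as described, the Spark-written file from the repro above should then read like the pandas-written case, e.g.:

df = cudf.io.read_parquet('temp.snappy.parquet/' + files[0])
df.dtypes
# Expected: c    datetime64[ms]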