Describe the bug
Reading a Parquet file results in an error.
Steps/Code to reproduce bug
import cudf
cudf.io.read_parquet('/path/to/file.snappy.parquet')
Output:
RuntimeError Traceback (most recent call last)
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
17 df = cpp_read_parquet(
18 path,
---> 19 columns
20 )
21 else:
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
RuntimeError: Invalid gdf_dtype in type_dispatcher
Also, when trying to read a non-existent file:
import cudf
cudf.io.read_parquet('/path/to/nonexistent/file')
Output:
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
17 df = cpp_read_parquet(
18 path,
---> 19 columns
20 )
21 else:
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
NameError: name 'errno' is not defined
Expected behavior
The file is read if it is valid. If the filename is invalid, a proper error is propagated.
Environment details (please complete the following information):
@ayushdg can you create a more specific reproducible example?
Perhaps include creating an example Parquet file with dummy data.
We typically get this error with unsupported or missing Parquet-to-cuDF type mappings.
Do you know what dtypes are in the file? Also, please include a small repro file as suggested.
@j-ieong The data I have, when read by pandas, consists of strings, datetime64[ns], and bools. I'm working on reproducing it with dummy data and will share the example soon.
I was able to narrow down the issue to datetime/timestamp types being written by Spark to Parquet.
Here is a reproducible example:
import pandas as pd
import numpy as np
import cudf
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
spark = SQLContext(sc)
df = pd.DataFrame()
df['c'] = [np.datetime64('2019-03-28 18:56:27.086')]*10000
# df['c'].dtype is datetime64[ns]
sdf = spark.createDataFrame(df)
print(sdf)
# Output: DataFrame[c: timestamp]
# Spark writes the output to a directory called temp.snappy.parquet containing
# many parquet files, each holding a split of the data
sdf.write.parquet('temp.snappy.parquet')
# Get filenames for the files written by spark
files = [fn for fn in os.listdir('temp.snappy.parquet') if fn.endswith('snappy.parquet')]
# Try reading in one of the files
cudf.io.read_parquet('temp.snappy.parquet/'+files[0])
Output:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-17-cfe64f9fa978> in <module>
----> 1 cudf.io.read_parquet('temp.snappy.parquet/'+files[0])
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
17 df = cpp_read_parquet(
18 path,
---> 19 columns
20 )
21 else:
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
RuntimeError: Invalid gdf_dtype in type_dispatcher
For reference, pandas reads the same file without error:
df = pd.read_parquet('temp.snappy.parquet/'+files[0])
df.dtypes
# Output: datetime64[ns]
And here is the non-existent-file case again:
>>> cudf.io.read_parquet('temp.snappy.parquet/abcde')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-20-bc12f221d5e2> in <module>
----> 1 cudf.io.read_parquet('temp.snappy.parquet/abcde')
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
17 df = cpp_read_parquet(
18 path,
---> 19 columns
20 )
21 else:
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
NameError: name 'errno' is not defined
Ah, I see. Parquet support currently only handles timestamp[ms], not timestamp[ns].
I don't remember there being a nanosecond option for the Parquet timestamp type, so I would need to check what actual type is being used.
Could this be using the INT96 type? (It really would help to get the actual file.)
@j-ieong I'm not sure that timestamp[ms] vs. [ns] is the core issue (though it would be great to support [ns] at some point in the future).
>>> df = pd.DataFrame()
>>> df['B'] = [np.datetime64('2019-03-28 18:00:00')]*10000
>>> df.to_parquet('temp.snappy.parquet')
>>> df.dtypes
B datetime64[ns]
>>> df2 = cudf.io.read_parquet('temp.snappy.parquet')
>>> df2.dtypes
B datetime64[ms]
This works just fine. cuDF can still read datetime columns; it just reads them at ms precision instead of ns, unlike pandas.
The issue is that when Spark writes the same data to Parquet, it probably tags the column in the metadata with a type called timestamp. Pandas recognizes this as datetime and handles it; cuDF, on the other hand, does not seem to understand it and raises the invalid gdf_dtype error.
@OlivierNV The example snippets I shared should be sufficient to reproduce the issue. I'm not sure whether Spark uses INT96 or something of that sort, but it should be reproducible with any timestamp column going from Spark -> Parquet -> cuDF.
I have attached one of the sample Parquet files written by Spark in the example above, in case that helps.
part-00000-3b44213e-1dbf-4bf1-93c8-7a18a50ac431-c000.snappy.parquet.zip
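As a side note, a quick way to check whether Spark wrote the column as the INT96 physical type is to inspect the file's low-level schema with pyarrow. This is just a minimal sketch, reusing the temp.snappy.parquet directory and the files list from the repro above:

import pyarrow.parquet as pq

# Open one of the Spark-written part files from the repro above
pf = pq.ParquetFile('temp.snappy.parquet/' + files[0])

# The Parquet-level (physical) schema; a Spark-written timestamp column
# typically shows up here as INT96 with no logical type annotation
print(pf.schema)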
I can confirm it's using the [deprecated] INT96 type. I'll work on getting a translation from the 96-bit Spark timestamp to the DATE64 or TIMESTAMP GDF type.
@OlivierNV Interesting! Thank you for the update
Yeah. PyArrow also had to add a special option, use_deprecated_int96_timestamps, to write Spark-compatible Parquet files; it's enabled automatically when writing with the spark flavor.
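For reference, here is a minimal sketch of writing INT96 timestamps from pyarrow itself; the output file names are just illustrative:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'c': [np.datetime64('2019-03-28 18:56:27.086')] * 10})
table = pa.Table.from_pandas(df)

# Explicitly request the deprecated INT96 encoding for timestamp columns
pq.write_table(table, 'int96_example.parquet', use_deprecated_int96_timestamps=True)

# Or use the 'spark' flavor, which enables it along with other compatibility settings
pq.write_table(table, 'spark_flavor_example.parquet', flavor='spark')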
I think in the Parquet file it is stored simply as the INT96 physical type, with no logical type. Since cuDF doesn't have an INT96 or larger dtype, we can either always translate it to a timestamp, or only do so when the Spark/pandas key-value metadata in the file indicates that the column type is timestamp.
I vote for always translating INT96 to a 64-bit timestamp, since the Parquet docs clearly document that this type is deprecated and was used exclusively for timestamps. Despite what the name might suggest, it doesn't really have more precision or anything; it's just an inefficient encoding of date/time using three separate int32 values.
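For context, here is a rough sketch of the arithmetic such a translation involves, assuming the commonly documented Impala/Spark INT96 layout (a little-endian int64 count of nanoseconds within the day followed by an int32 Julian day number); this is only an illustration, not the actual code in the PR:

import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
MS_PER_DAY = 86400 * 1000

def int96_to_unix_ms(raw12):
    """Decode one 12-byte INT96 timestamp into milliseconds since the Unix epoch."""
    # Assumed layout: little-endian int64 nanoseconds-of-day, then int32 Julian day
    nanos_of_day, julian_day = struct.unpack('<qi', raw12)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * MS_PER_DAY + nanos_of_day // 1000000

# Sanity check: midnight on 1970-01-01 encodes as Julian day 2440588 with 0 nanoseconds
assert int96_to_unix_ms(struct.pack('<qi', 0, JULIAN_DAY_OF_UNIX_EPOCH)) == 0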
@ayushdg Can I use the file you attached as part of automated tests?
@OlivierNV Sure!
Ok, PR #1532 should resolve this. The INT96 timestamp column is converted to the 1 ms DATE64 type, to match pandas behavior.
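Assuming the fix behaves as described, the Spark-written file from the repro above should then read like the pandas-written case, e.g.:

df = cudf.io.read_parquet('temp.snappy.parquet/' + files[0])
df.dtypes
# Expected: c    datetime64[ms]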