Cudf: [BUG] cudf.read_orc reads incorrect data for one row

Created on 10 Jun 2020 · 4 Comments · Source: rapidsai/cudf

Describe the bug
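After a value in the upc_nbr column is set and the DataFrame is written with to_orc, reading the file back with cudf.read_orc returns a different value for that one row.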

Steps/Code to reproduce bug

>>> import cudf
>>> df = cudf.read_orc('to_orc_bug.orc')
>>> df.upc_nbr[(df.visit_nbr == 14600028) & (df.store_nbr == 47)] = 681131184420
>>> df.to_orc('to_orc_bug.orc', compression='snappy')
>>> df2 = cudf.read_orc('to_orc_bug.orc')
>>> df2.upc_nbr[(df2.visit_nbr == 14600028) & (df2.store_nbr == 47)]
999786    2526351652

Expected behavior
Returned value should be 681131184420, not 2526351652.
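One observation that may help narrow this down: the incorrect value is exactly the expected value with everything above the low 32 bits dropped. The snippet below only checks that arithmetic on the two numbers from this report. Whether a 32-bit truncation is the actual cause is an assumption, not something confirmed in this thread.

```python
expected = 681131184420   # value written before to_orc
observed = 2526351652     # value returned by cudf.read_orc

# Keep only the low 32 bits of the expected value.
low_32_bits = expected & 0xFFFFFFFF

print(low_32_bits)              # 2526351652
print(low_32_bits == observed)  # True
```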

Environment overview

  • Method of cuDF install: Conda
  • cuDF version: 0.14 nightly from around May 29

to_orc_bug.orc.zip

Labels: bug, cuIO

All 4 comments

cc @devavret in case #5324 is related

I tried reading the written file with pyarrow, and the value comes back correct.

import cudf
import pyarrow.orc as orc

df = cudf.read_orc("to_orc_bug.orc")
df.upc_nbr[(df.visit_nbr == 14600028) & (df.store_nbr == 47)] = 681131184420
df.to_orc("to_orc_bug2.orc", compression="snappy")

pdf = orc.ORCFile("to_orc_bug2.orc").read().to_pandas()
print(pdf.upc_nbr[(pdf.visit_nbr == 14600028) & (pdf.store_nbr == 47)])
999786    6.811312e+11
Name: upc_nbr, dtype: float64

The file written by cuDF is correct, so this seems to be a reader issue rather than a writer issue.

Relabeled the issue as a reader bug.

This was an easy fix, but I'm still trying to figure out how to properly add tests for this, or for anything in cuIO in general.
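On the testing question, one possible shape for a regression check is a round trip over integers that sit around the 32-bit boundary, plus the value from this report. This is only a sketch: `boundary_values`, `mismatches`, and `truncating_roundtrip` are hypothetical names, not part of cuDF, and the stand-in reader below merely imitates the symptom seen here so the check can run without a GPU.

```python
# Hypothetical helpers for illustration; none of these names come from cuDF.

def boundary_values(bits=32):
    """Integers around the 2**bits boundary, plus the value from this report."""
    edge = 1 << bits
    return [0, edge - 1, edge, edge + 1, 681131184420]


def mismatches(values, roundtrip):
    """Return (index, written, read_back) for every value that changed."""
    read_back = roundtrip(values)
    return [(i, w, r) for i, (w, r) in enumerate(zip(values, read_back)) if w != r]


def truncating_roundtrip(values):
    """Stand-in for a faulty reader that keeps only the low 32 bits."""
    return [v & 0xFFFFFFFF for v in values]


# The faulty stand-in changes every value at or above 2**32.
bad = mismatches(boundary_values(), truncating_roundtrip)
# An identity round trip changes nothing.
good = mismatches(boundary_values(), lambda vs: list(vs))

print(bad)
print(good)
```

In a real test, `roundtrip` would write the values with `to_orc` and read them back with `cudf.read_orc`, then the test would assert that `mismatches` returns an empty list.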
