Describe the bug
The example code in the read_orc_metadata docstring is incorrect: it runs without error but silently returns unexpected results.
https://github.com/rapidsai/cudf/blob/598a14d820d47b7c3bfcb2bb3341b97a85317646/python/cudf/cudf/utils/ioutils.py#L259
Steps/Code to reproduce bug
import cudf
cudf.DataFrame({'a':[1,2,3]*10000000}).to_orc('test.orc')
path = 'test.orc'
num_rows, stripes, names = cudf.io.read_orc_metadata(path)
stripes
6
cudf.read_orc(path, stripe=1)
a
0 1
1 2
2 3
3 1
4 2
... ...
29999995 2
29999996 3
29999997 1
29999998 2
29999999 3
30000000 rows × 1 columns
df = [cudf.read_orc(path, stripe=i) for i in range(stripes)]
df = cudf.concat(df)
df
a
0 1
1 2
2 3
3 1
4 2
... ...
29999995 2
29999996 3
29999997 1
29999998 2
29999999 3
180000000 rows × 1 columns
Expected behavior
df = [cudf.read_orc(path, stripe=i) for i in range(stripes)]
df = cudf.concat(df)
df
a
0 1
1 2
2 3
3 1
4 2
... ...
29999995 2
29999996 3
29999997 1
29999998 2
29999999 3
30000000 rows × 1 columns
Hi @MikeChenfu, it looks like you misspelled the parameter name: it's actually stripes. Please retry with the correct name.
Hi @vuule, I checked the IO documentation in RAPIDS. Here is the original code. Let me know if I missed something. Thanks.
df = [cudf.read_orc(fname, stripe=i) for i in range(stripes)]
@MikeChenfu can you link to the doc you're referring to? The API docs indicate that the kwarg is stripes as well: https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.io.orc.read_orc
@kkraus14 Thanks for the link.
The above code is from the example of cudf.io.orc.read_orc_metadata. https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.io.read_orc_metadata
Thanks! I'm going to update this issue to fix the example in that docstring.
Thanks @kkraus14 @vuule. stripes is working.
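For anyone landing here later, the behavior in this thread can be reproduced without a GPU. The sketch below is a minimal illustration, not cudf's implementation: read_toy and STRIPES are hypothetical names, and the assumption is that the reader accepts extra keyword arguments and ignores the ones it does not recognize. Under that assumption, the misspelled stripe=i is dropped, every call returns the whole file, and concatenating the results multiplies the row count by the number of stripes (30000000 × 6 = 180000000 in the report above).

```python
# Hypothetical stand-in for a stripe-aware reader; this is NOT cudf's code.
STRIPES = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # a toy "file" with three stripes


def read_toy(stripes=None, **kwargs):
    """Return rows from the selected stripes, or every row if none are selected.

    Unknown keyword arguments land in **kwargs and are ignored, which is the
    assumed reason the misspelled name fails silently.
    """
    selected = range(len(STRIPES)) if stripes is None else stripes
    return [row for i in selected for row in STRIPES[i]]


n_stripes = len(STRIPES)

# Misspelled keyword: each call returns the whole file, so rows repeat n_stripes times.
wrong = [row for i in range(n_stripes) for row in read_toy(stripe=i)]

# Correct keyword: each call returns one stripe, so the result equals the file.
right = [row for i in range(n_stripes) for row in read_toy(stripes=[i])]

print(len(wrong), len(right))  # 27 9
```

Passing a list to stripes in the toy function is an illustrative choice; check the read_orc API docs linked above for the type the real parameter expects.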