Cudf: [FEA] cudf.read_json behavior differs from Pandas with missing data

Created on 11 Mar 2020 · 13 Comments · Source: rapidsai/cudf

For the following test.json file:

{"a":0,"b":2}
{"a":1,"b":3,"c":4}

Reading with Pandas successfully includes all fields:

>>> pd.read_json('test.json', lines=True)
   a  b    c
0  0  2  NaN
1  1  3  4.0

But cudf seems to read the JSON schema only from the first record:

>>> cudf.read_json('test.json', lines=True)
   a  b
0  0  2
1  1  3

cuIO feature request

randerzander

Most helpful comment

@vuule is starting to look at this I believe

kkraus14 on 25 Jun 2020

All 13 comments

As an attempted workaround, I added a dummy first record containing every possible field name, which exposes another problem:

test.json

{"a":null,"b":null,"c":null,"d":null}
{"a":0,"b":2}
{"a":1,"b":3,"c":4}
{"d":0}

>>> cudf.read_json('test.json', lines=True)
      a     b     c     d
0  null  null  null  null
1   0.0   2.0  null  null
2   1.0   3.0   4.0  null
3   0.0  null  null  null

Above, cudf reads the last record's value of d into the a column.
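
One way to picture this result is a toy, CPU-only model in which the parser ignores field names and fills columns strictly by position. The sketch below is not cudf's code; `parse_positional` is a hypothetical helper written only to show that positional assignment produces the same misplacement for the last record.

```python
import json

def parse_positional(lines, columns):
    """Toy model (hypothetical, not cudf's parser): ignore field names
    and assign each record's values to columns by position."""
    rows = []
    for line in lines:
        # Keep only the values, in the order they appear in the record.
        values = list(json.loads(line).values())
        # Pad short records with None so every row has one slot per column.
        padded = values + [None] * (len(columns) - len(values))
        rows.append(dict(zip(columns, padded)))
    return rows

records = [
    '{"a":null,"b":null,"c":null,"d":null}',
    '{"a":0,"b":2}',
    '{"a":1,"b":3,"c":4}',
    '{"d":0}',
]

rows = parse_positional(records, ["a", "b", "c", "d"])
print(rows[3])  # {'a': 0, 'b': None, 'c': None, 'd': None}
```

The last record only has a `d` field, but its single value is the first value in the record, so it lands in the first column, `a`.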

randerzander on 11 Mar 2020

@randerzander, @kkraus14 Is there an ETA for when this can be worked on? And how difficult would it be to implement or correct this?

chinmaychandak on 18 Mar 2020

@chinmaychandak this likely won't happen until 0.15 at the earliest.

kkraus14 on 18 Mar 2020

The JSON parser pretty much ignores the field names and assigns the parsed values in order. There are significant changes required to support the desired behavior:

- Collect information about all fields present in the file: most likely another GPU pass and some CPU processing;
- Assign field values based on the field name: load a field name -> column index dictionary on the GPU; add field name parsing to the parsing kernel.

These changes would have a significant performance impact.
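
The two steps above can be sketched on the CPU in plain Python. This is only an illustration of the idea, not the GPU implementation being proposed; `read_json_lines_by_name` is a hypothetical name, and the standard-library `json` module stands in for the parsing kernel.

```python
import json

def read_json_lines_by_name(lines):
    """Hypothetical CPU sketch of name-based column assignment."""
    parsed = [json.loads(line) for line in lines]

    # Pass 1: collect every field name in the file and build a
    # field name -> column index dictionary (first-seen order).
    column_index = {}
    for record in parsed:
        for name in record:
            column_index.setdefault(name, len(column_index))

    # Pass 2: allocate null-filled columns, then place each value
    # by looking up its field name rather than its position.
    columns = [[None] * len(parsed) for _ in column_index]
    for row, record in enumerate(parsed):
        for name, value in record.items():
            columns[column_index[name]][row] = value

    return {name: columns[idx] for name, idx in column_index.items()}

table = read_json_lines_by_name([
    '{"a":0,"b":2}',
    '{"a":1,"b":3,"c":4}',
    '{"d":0}',
])
print(table["c"])  # [None, 4, None]
print(table["d"])  # [None, None, 0]
```

The extra pass over the data and the per-field dictionary lookup are where the additional cost mentioned above would come from.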

As for the ETA, I think this is about a two-week effort.

@harrism , if this is a P0 for 0.14, someone (preferably me) should get started on this ASAP.

vuule on 8 Apr 2020

@vuule, @kkraus14 This would really unblock a lot of custreamz jobs, without having to do a lateral view explode like @randerzander pointed out.

If possible, could you guys prioritize this? Maybe we can use it off a nightly build, or a feature branch if 0.14 is just going to be a cleanup and documentation release.

@satishvarmadandu Please correct me if I am wrong.

chinmaychandak on 8 Apr 2020

Moving this to be a feature request since it will take significant effort. Let's look at it for 0.15.

harrism on 9 Apr 2020

@harrism, is this still on track for 0.15? This is a major blocker for us currently in custreamz, so I'm hoping you guys can do some magic here.

chinmaychandak on 5 Jun 2020

It is, I will start working on this next week.

vuule on 5 Jun 2020

@vuule do you have an idea of what level of effort this will be? Trying to understand the priority of this versus handling a vector of files in I/O readers.

kkraus14 on 5 Jun 2020

@kkraus14 I think it would take me two weeks to implement this.
The directory handling feature seems simpler, but I may be missing something.

vuule on 6 Jun 2020