Cudf: [FEA] cudf.read_json behavior differs from Pandas with missing data

Created on 11 Mar 2020 · 13 Comments · Source: rapidsai/cudf

For the following test.json file:

{"a":0,"b":2}
{"a":1,"b":3,"c":4}

Reading with Pandas successfully includes all fields:

>>> pd.read_json('test.json', lines=True)
   a  b    c
0  0  2  NaN
1  1  3  4.0

But cudf seems to read the JSON schema only from the first record:

>>> cudf.read_json('test.json', lines=True)
   a  b
0  0  2
1  1  3
Labels: cuIO, feature request
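
To check whether a given JSON Lines file would be affected, a small standard-library sketch like the one below can list the fields that appear in later records but not in the first one. The helper name `fields_missing_from_first_record` is made up for illustration and is not part of cudf or Pandas.

```python
import json

def fields_missing_from_first_record(json_lines_text):
    """Return field names that appear in later records but not in the first one."""
    # Hypothetical helper: parses every non-empty line as one JSON object.
    records = [json.loads(line) for line in json_lines_text.splitlines() if line.strip()]
    if not records:
        return []
    first_keys = set(records[0])
    missing = []
    for record in records[1:]:
        for key in record:
            # Keep first-seen order and avoid duplicates.
            if key not in first_keys and key not in missing:
                missing.append(key)
    return missing

sample = '{"a":0,"b":2}\n{"a":1,"b":3,"c":4}\n'
print(fields_missing_from_first_record(sample))  # ['c']
```

A non-empty result means a reader that infers the schema from the first record alone would drop those fields.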

All 13 comments

I attempted a workaround: adding a dummy first record that contains all possible field names. That exposes another problem:

test.json

{"a":null,"b":null,"c":null,"d":null}
{"a":0,"b":2}
{"a":1,"b":3,"c":4}
{"d":0}
>>> cudf.read_json('test.json', lines=True)
      a     b     c     d
0  null  null  null  null
1   0.0   2.0  null  null
2   1.0   3.0   4.0  null
3   0.0  null  null  null

Above, cudf reads the last record's value of d into the a column.
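
The output above suggests values are being placed by position rather than by field name. If that is what is happening, one possible stopgap is to pad every record on the CPU so that each line carries every field, in the same order, before handing the data to cudf. The sketch below uses only the standard library; `pad_records` is a hypothetical helper, and the field order has to be supplied by the caller.

```python
import json

def pad_records(json_lines_text, field_order):
    """Rewrite each JSON line so it contains every field in field_order, in that order.

    Hypothetical helper. Fields that are not listed in field_order are dropped.
    """
    padded = []
    for line in json_lines_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Missing fields become None, which json.dumps writes as null.
        full = {name: record.get(name) for name in field_order}
        padded.append(json.dumps(full, separators=(",", ":")))
    return "".join(line + "\n" for line in padded)

raw = '{"a":0,"b":2}\n{"a":1,"b":3,"c":4}\n{"d":0}\n'
print(pad_records(raw, ["a", "b", "c", "d"]), end="")
# {"a":0,"b":2,"c":null,"d":null}
# {"a":1,"b":3,"c":4,"d":null}
# {"a":null,"b":null,"c":null,"d":0}
```

The padded text could then be written to a file and read with `cudf.read_json(path, lines=True)`. I have not verified this against any cudf version, and the extra CPU pass gives up some of the benefit of GPU parsing.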

@randerzander, @kkraus14 Is there an ETA as to when this can be worked on? And how difficult it would be to implement/correct this?

@chinmaychandak this likely won't happen until 0.15 at the earliest.

The JSON parser pretty much ignores the field names and assigns the parsed values in order. There are significant changes required to support the desired behavior:

  • Collect information about all fields present in the file: most likely another GPU pass and some CPU processing;
  • Assign field values based on the field name: load a field name -> column index dictionary on the GPU and add field name parsing to the parsing kernel.

These changes would have a significant performance impact.
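
For illustration only, here is a CPU-side Python sketch of the difference between the two strategies described above. It is not the cuIO implementation: `assign_by_position` mimics the "assign the parsed values in order" behavior, and `assign_by_name` does the two passes from the list (collect all field names into a name -> column index dictionary, then place values by name). All function names are hypothetical.

```python
import json

def parse(text):
    """Parse JSON Lines text into a list of dicts (key order is preserved)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def assign_by_position(records, columns):
    """Ignore field names: the i-th value of a record goes to the i-th column."""
    rows = []
    for record in records:
        row = [None] * len(columns)
        for i, value in enumerate(record.values()):
            if i < len(columns):
                row[i] = value
        rows.append(row)
    return rows

def assign_by_name(records):
    """Two passes: build a name -> column index dictionary, then place values by name."""
    # Pass 1: collect every field name present anywhere in the input.
    column_index = {}
    for record in records:
        for name in record:
            column_index.setdefault(name, len(column_index))
    # Pass 2: write each value into the column its name maps to.
    rows = []
    for record in records:
        row = [None] * len(column_index)
        for name, value in record.items():
            row[column_index[name]] = value
        rows.append(row)
    return list(column_index), rows

records = parse('{"a":0,"b":2}\n{"a":1,"b":3,"c":4}\n{"d":0}\n')
print(assign_by_position(records, ["a", "b", "c", "d"])[-1])  # [0, None, None, None]
print(assign_by_name(records)[1][-1])                         # [None, None, None, 0]
```

The positional version puts the value of `d` from the last record into column `a`, which matches the output reported earlier in this thread. The name-based version needs the extra pass and a dictionary lookup per field, which is where the performance cost would come from.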

As for the ETA, I think this is about a two-week effort.

@harrism, if this is a P0 for 0.14, someone (preferably me) should get started on this ASAP.

@vuule, @kkraus14 This would really unblock a lot of custreamz jobs, without having to do a lateral view explode like @randerzander pointed out.

If possible, could you guys prioritize this? Maybe we can use it off a nightly build, or a feature branch if 0.14 is just going to be a cleanup and documentation release.

@satishvarmadandu Please correct me if I am wrong.

Moving this to be a feature request since it will take significant effort. Let's look at it for 0.15.

@harrism, is this still on track for 0.15? This is a major blocker for us currently in custreamz, so I'm hoping you guys can do some magic here.

It is, I will start working on this next week.

@vuule do you have an idea on what level of effort this will be? Trying to understand the priority of this versus handling a vector of files in I/O readers.

@kkraus14 I think it would take me two weeks to implement this.
The directory handling feature seems simpler, but I may be missing something.

I think we should prioritize the directory handling / vector of files and/or buffers feature first before tackling this one.

@kkraus14 Please do keep me updated on this one.

@vuule is starting to look at this I believe
