For the following test.json file:
{"a":0,"b":2}
{"a":1,"b":3,"c":4}
Reading with Pandas successfully includes all fields:
>>> pd.read_json('test.json', lines=True)
   a  b    c
0  0  2  NaN
1  1  3  4.0
But cudf seems to read the JSON schema only from the first record:
>>> cudf.read_json('test.json', lines=True)
   a  b
0  0  2
1  1  3
As an attempted workaround, I added a dummy first record containing every possible field name, which exposes another problem:
test.json:
{"a":null,"b":null,"c":null,"d":null}
{"a":0,"b":2}
{"a":1,"b":3,"c":4}
{"d":0}
>>> cudf.read_json('test.json', lines=True)
      a     b     c     d
0  null  null  null  null
1   0.0   2.0  null  null
2   1.0   3.0   4.0  null
3   0.0  null  null  null
Above, cudf reads the last record's value of d into the a column.
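Until the reader matches values by field name, one possible stopgap is to normalize the file before handing it to cudf, so that every record carries the same keys in the same order. The sketch below uses only the Python standard library; `normalize_json_lines` is a hypothetical helper name, not part of cudf or Pandas, and whether the normalized output sidesteps the problem on a given cudf version is an assumption that would need checking.

```python
import json


def normalize_json_lines(lines):
    """Return JSON Lines in which every record has the same keys, in the same order.

    Hypothetical helper: fields missing from a record are written as null.
    """
    records = [json.loads(line) for line in lines if line.strip()]

    # Build the union of all field names, in order of first appearance.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    # Re-serialize every record with the full, consistently ordered key set.
    return [
        json.dumps({col: record.get(col) for col in columns}, separators=(",", ":"))
        for record in records
    ]


raw = ['{"a":0,"b":2}', '{"a":1,"b":3,"c":4}', '{"d":0}']
normalized = normalize_json_lines(raw)
for line in normalized:
    print(line)
```

This costs an extra pass over the data on the CPU, so it is only a temporary measure for files small enough to preprocess.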
@randerzander, @kkraus14 Is there an ETA as to when this can be worked on? And how difficult it would be to implement/correct this?
@chinmaychandak this likely won't happen until 0.15 at the earliest.
The JSON parser pretty much ignores the field names and assigns the parsed values in order. There are significant changes required to support the desired behavior:
As for the ETA, I think this is about a two-week effort.
@harrism, if this is a P0 for 0.14, someone (preferably me) should get started on this ASAP.
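To illustrate the difference between assigning values in order and assigning them by field name, here is a toy model in plain Python. It is not cudf's actual implementation; `assign_by_position` and `assign_by_name` are hypothetical names, and the column list is assumed to come from the dummy first record in the example above.

```python
import json


def assign_by_position(records, columns):
    """Toy model: ignore field names and fill columns left to right."""
    rows = []
    for record in records:
        values = list(record.values())
        # Pad short records with None so every row has one value per column.
        values += [None] * (len(columns) - len(values))
        rows.append(dict(zip(columns, values)))
    return rows


def assign_by_name(records, columns):
    """Toy model: look each column up by its field name."""
    return [{col: record.get(col) for col in columns} for record in records]


lines = [
    '{"a":null,"b":null,"c":null,"d":null}',
    '{"a":0,"b":2}',
    '{"a":1,"b":3,"c":4}',
    '{"d":0}',
]
records = [json.loads(line) for line in lines]
columns = ["a", "b", "c", "d"]

by_position = assign_by_position(records, columns)
by_name = assign_by_name(records, columns)

print(by_position[-1])  # the value of "d" lands in column "a"
print(by_name[-1])      # the value of "d" stays in column "d"
```

In this model the positional version reproduces the misplaced value from the last record, while the name-based version keeps it in the `d` column.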
@vuule, @kkraus14 This would really unblock a lot of custreamz jobs, without having to do a lateral view explode like @randerzander pointed out.
If possible, could you guys prioritize this? Maybe we can use it off a nightly build, or a feature branch, if 0.14 is just going to be a cleanup and documentation release.
@satishvarmadandu Please correct me if I am wrong.
Moving this to be a feature request since it will take significant effort. Let's look at it for 0.15.
@harrism, is this still on track for 0.15? This is a major blocker for us currently in custreamz, so I'm hoping you guys can do some magic here.
It is, I will start working on this next week.
@vuule do you have an idea on what level of effort this will be? Trying to understand the priority of this versus handling a vector of files in I/O readers.
@kkraus14 I think it would take me two weeks to implement this.
The directory handling feature seems simpler, but I may be missing something.
I think we should prioritize the directory handling / vector of files and/or buffers feature first before tackling this one.
@kkraus14 Please do keep me updated on this one.
@vuule is starting to look at this, I believe.