Turicreate: SFrame.read_json() treats boolean values as strings

Created on 12 Dec 2018 · 6 Comments · Source: apple/turicreate

It appears that SFrame.read_json() is really just a thin wrapper around SFrame.read_csv() with some extra parsing logic that cannot handle all cases. In particular, it does not parse JSON boolean values correctly and instead treats them as strings. For example:

In [1]: cat test.jl
{"a": true}
{"a": false}
{"a": null}
{"a": false}
In [2]: import turicreate
In [3]: turicreate.SFrame.read_json('test.jl', orient='lines')
Out[3]: 
Columns:
    a   str
Rows: 4
Data:
+-------+
|   a   |
+-------+
|  true |
| false |
|  null |
| false |
+-------+
[4 rows x 1 columns]

It's the same for records orientation:

In [4]: cat test.json
[{"a": true},
{"a": false},
{"a": null},
{"a": false}]
In [5]: turicreate.SFrame.read_json('test.json', orient='records')
Out[5]: 
Columns:
    a   str
Rows: 4
Data:
+-------+
|   a   |
+-------+
|  true |
| false |
|  null |
| false |
+-------+
[4 rows x 1 columns]

Expected result is this:

In [6]: turicreate.SFrame({'a': [True, False, None, False]})
Out[6]: 
Columns:
    a   int
Rows: 4
Data:
+------+
|  a   |
+------+
|  1   |
|  0   |
| None |
|  0   |
+------+
[4 rows x 1 columns]

Tested on Turicreate v5.1, Python v3.6.6, macOS v10.14.2

Labels: bug, engine

All 6 comments

I think one easy way to treat JSON Lines files correctly is this:

import json, turicreate
turicreate.SFrame.read_csv('test.jl', header=False, column_type_hints=str)['X1'] \
    .apply(json.loads) \
    .unpack(column_name_prefix=None)

I'll use this workaround for the time being.
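
For anyone who wants to see what the workaround leans on: json.loads already maps JSON true/false/null to Python True/False/None, so the type conversion comes for free once each line is parsed as JSON rather than as CSV. Below is a minimal standard-library sketch of the same idea. The helper name parse_json_lines is made up for illustration and is not part of Turi Create.

```python
import json


def parse_json_lines(lines):
    """Parse an iterable of JSON Lines strings into a dict of columns.

    Hypothetical helper for illustration only; not a Turi Create API.
    """
    # Skip blank lines, parse every other line as one JSON object.
    records = [json.loads(line) for line in lines if line.strip()]

    # Collect the union of keys, preserving first-seen order.
    columns = {}
    for record in records:
        for key in record:
            columns.setdefault(key, [])

    # Fill each column, using None where a record lacks the key.
    for record in records:
        for key, values in columns.items():
            values.append(record.get(key))
    return columns


sample = ['{"a": true}', '{"a": false}', '{"a": null}', '{"a": false}']
columns = parse_json_lines(sample)
print(columns)  # {'a': [True, False, None, False]}
```

The resulting dict has the same shape as the one passed to turicreate.SFrame(...) in the "expected result" part of the report above.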

Thanks for the issue @MarkusShepherd.

Turi Create doesn't even have a boolean column type, so that's probably the more fundamental issue here. Boolean column types are a feature I've wanted for a very long time.

@TobyRoseman Should true/false be parsed as 1/0 (ints) rather than "true"/"false" (strings)? If so, then I'd call this a bug rather than a feature request (with Boolean values in SFrame as a separate feature request).

@znation - I agree: in the absence of a boolean type, JSON true/false values should be parsed as 1/0. Looks like we're already tracking the boolean feature request in #1069, so let's limit this issue to correctly parsing JSON true/false values.
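
To make the agreed mapping concrete, here is a rough sketch of it in plain Python: true becomes 1, false becomes 0, null stays None, and the string "true" is left alone. The function name bools_to_ints is hypothetical, and this is only an illustration of the intended behaviour, not how the Turi Create parser implements it.

```python
import json


def bools_to_ints(value):
    """Recursively replace Python bools with ints in parsed JSON data.

    Hypothetical helper for illustration only; not a Turi Create API.
    """
    # Check bool first: bool is a subclass of int, so order matters.
    if isinstance(value, bool):
        return int(value)
    if isinstance(value, dict):
        return {key: bools_to_ints(item) for key, item in value.items()}
    if isinstance(value, list):
        return [bools_to_ints(item) for item in value]
    # Numbers, strings and None pass through unchanged.
    return value


record = json.loads('{"a": true, "b": [false, null], "c": "true"}')
converted = bools_to_ints(record)
print(converted)  # {'a': 1, 'b': [0, None], 'c': 'true'}
```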

Since bool is a subclass of int in Python, I'd say the lack of explicit boolean columns in TC might not be so bad, unless there is some performance to be gained.
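
A quick illustration of that point, using plain Python only. The helper is_strict_int is a made-up name, included to show the one place where the subclass relationship can bite: a type check for int also accepts bools.

```python
# bool really is a subclass of int, so bools behave like 0/1 in arithmetic.
assert issubclass(bool, int)
assert True == 1 and False == 0

flags = [True, False, True]
total = sum(flags)  # counts the True values


def is_strict_int(value):
    """Return True for ints that are not bools (hypothetical helper)."""
    return isinstance(value, int) and not isinstance(value, bool)


print(total)                # 2
print(is_strict_int(1))     # True
print(is_strict_int(True))  # False
```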

Also, I've realised SArray reads in files line by line, so my above workaround can be shortened:

import json, turicreate
turicreate.SArray('test.jl') \
    .apply(json.loads) \
    .unpack(column_name_prefix=None)

I have added some parser features in #1266 that should help fix this.
