Hey all,
I'm a developer on the dat project (git for data), and we are building a Python library to interact with the data store.
Everything in dat is streaming, and we use newline-delimited JSON as the official transfer format between processes.
Take a look at the specification for newline-delimited JSON here.
Does pandas support this yet, and if not, would you consider adding a to_ndjson function to the existing output formats?
For example, the following table:
> df
       a      b                        key  version
0   True  False  ci4diwru70000x6xmmn19nba1        1
1  False   True  ci4diww6j0001x6xmbyp5o2f0        1
Would be converted to
> df.to_ndjson()
'{ "a": True, "b": False, "key": ci4diwru70000x6xmmn19nba1, "version": 1}\n{ "a": False, "b": True, "key": ci4diww6j0001x6xmbyp5o2f0, "version": 1}'
For general streaming use cases, it might also be nice to consider other ways of supporting this format, like a generator function that yields ndjson-able objects.
Something like this, perhaps:
def iterndjson(df):
    # Yield one dict per row; each dict serializes to one line of ndjson.
    for _, row in df.iterrows():
        yield row.to_dict()
> df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [3, 4, 5, 6]})
> list(iterndjson(df))
[{'one': 1, 'two': 3},
 {'one': 2, 'two': 4},
 {'one': 3, 'two': 5},
 {'one': 4, 'two': 6}]
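With a generator like that, writing a frame out as ndjson is only a couple of lines; a minimal sketch (the output path is made up):

import json

# Depending on dtypes, json.dumps may need a default= handler for NumPy scalars.
with open('out.ndjson', 'w') as fp:
    for record in iterndjson(df):
        fp.write(json.dumps(record) + '\n')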
Have a look here: http://pandas.pydata.org/pandas-docs/stable/io.html#json
This would also be great for things like BigQuery, which outputs JSON files as newline-delimited JSON. The issue is that the file as a whole isn't valid JSON (since it ends up as multiple top-level objects).
@karissa maybe you could hack around this by reading the file row by row, using ujson/json to parse each row into a Python dictionary, and then passing the whole thing to the DataFrame constructor?
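A minimal sketch of that workaround (the file name is made up):

import json
import pandas as pd

# Parse one JSON object per line, then hand the list to the constructor.
with open('records.ndjson') as fp:
    records = [json.loads(line) for line in fp]
df = pd.DataFrame(records)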
@mrocklin and I looked into this. The simplest solution we came up with was loading the file into a buffer, adding the appropriate commas and brackets, and then passing it back to read_json.
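Roughly like this (a sketch of the idea, not the eventual patch; the file name is made up):

import pandas as pd

# Join the lines with commas and wrap in brackets so the buffer becomes
# one valid JSON array, then hand the string to read_json.
with open('records.ndjson') as fp:
    buf = '[' + ','.join(line.strip() for line in fp if line.strip()) + ']'
df = pd.read_json(buf)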
Below are a few timings of this approach; it seems the current implementation of read_json is a bit slower than ujson anyway, so we felt the simplicity of this approach didn't slow anything down too much.
In [3]: def parse():
   ...:     with open("reviews_Clothing_Shoes_and_Jewelry_5.json") as fp:
   ...:         list(map(json.loads, fp))
   ...:

In [4]: %timeit parse()
1 loop, best of 3: 4.05 s per loop
In [18]: import ujson as json

In [19]: def parse():
   ....:     with open("reviews_Clothing_Shoes_and_Jewelry_5.json") as fp:
   ....:         list(map(json.loads, fp))
   ....:

In [20]: %timeit parse()
1 loop, best of 3: 1.43 s per loop
In [22]: %time _ = pd.read_json('reviews_Clothing_Shoes_and_Jewelry_5_comma.json', convert_dates=False)
CPU times: user 4.75 s, sys: 520 ms, total: 5.27 s
Wall time: 5.49 s
I'll try to get a patch together unless someone thinks there is a better solution. The notion would be to add a lines=True flag to the reader.
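Usage would then look something like:

pd.read_json('records.ndjson', lines=True)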
@karissa is there any difference between ndjson and jsonlines (http://jsonlines.org/)? I've never heard of ndjson, but it seems to be the same thing.
Looks like jsonlines includes a few more rules, including a specification of UTF-8 encoding: http://jsonlines.org/ vs http://ndjson.org/
Hi, is there a way to efficiently json_normalize newline-delimited JSON read with read_json?
For example, I have a big dataset of nested newline-delimited JSON, and I'd like to read it and flatten it at the same time, so the cells contain actual values instead of dicts.
Right now I just json.loads line by line and then normalize the whole dict, but it takes a lot of time, and I assume it is not elegant in any way.
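Concretely, the current approach looks roughly like this (the file name is made up):

import json
import pandas as pd
from pandas.io.json import json_normalize

# Parse line by line, then flatten the nested records in one pass.
with open('nested.ndjson') as fp:
    records = [json.loads(line) for line in fp]
flat = json_normalize(records)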
Thanks!
you would have to show an example
Thanks for a quick reply!
Here is an example with toy data, where I'd like to efficiently apply json_normalize somehow to the read_json data.
Can you cat toy.csv here?
Here: toy.csv.zip
This won't be that efficient, but it's reasonably idiomatic.
In [18]: pd.read_json('toy.csv', lines=True).stack().apply(Series)['string'].unstack()
Out[18]:
    age                  description price
0  None              Very Importrant   500
1     5                          NaN   150
2    15  Very Importrant Random Text  None
I don't think there is an easy way to do this ATM. In pandas2 this will be more built-in.
cc @wesm
It would be much more sustainable to do this kind of parsing with RapidJSON in C++; I think we should be able to deliver this en route to pandas2
Just checked on my production dataset:
"This won't be that efficient, but it's reasonably idiomatic."

First, fortunately, it actually gets the job done better than using json_normalize (no annoying *.string or *.nULL in the column names), so thanks a lot.
Second, unfortunately, it is definitely less efficient than using json_normalize as I did previously. When I increased the amount of data to 10000 entries (which is still just a fraction of the data I need to process), it took ≈ 20x more time.
@IFantastic yes, unfortunately our handling of newline-delimited JSON is not great right now; json_normalize and the solution I gave are all in Python (the actual JSON parser is in C), so these are not very efficient.
Stay tuned to https://github.com/pandas-dev/pandas2/issues/71
@jreback thanks for your time, very helpful!
@jreback I have one more question: is it possible to achieve formatting like yours

In [18]: pd.read_json('toy.csv', lines=True).stack().apply(Series)['string'].unstack()
Out[18]:
    age                  description price
0  None              Very Importrant   500
1     5                          NaN   150
2    15  Very Importrant Random Text  None

but using json_normalize? I mean, to get nicely formatted column names like _age, description, price_ instead of _age.string, description.nULL, description.string, price.string_ in the columns.
Thanks!
@IFantastic you can just post-process the columns
In [12]: i = pd.Index(['price.string', 'description.nuLL', 'age.string'])
In [13]: i.str.extract('(.*)\\.', expand=False)
Out[13]: Index(['price', 'description', 'age'], dtype='object')
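And to apply the same extraction directly to a DataFrame's columns (assuming every column name has a dotted suffix to strip):

df.columns = df.columns.str.extract('(.*)\\.', expand=False)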
@jreback Thanks!
For ordinary ndjson files (like those output by BigQuery), I use the DataFrame.from_records constructor together with ujson:
import ujson as json
import pandas as pd

# Lazily parse one JSON object per line, then build the frame from the records.
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
In Python 2, you'd probably want to use imap instead of the built-in map.
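That is, something like:

from itertools import imap  # Python 2's lazy, iterator-based map

records = imap(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)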