Pandas: Support ndjson -- newline delimited json -- for streaming data.

Created on 2 Jan 2015 · 21 comments · Source: pandas-dev/pandas

Hey all,

I'm a developer on the dat project (git for data), and we are building a Python library to interact with the data store.

Everything in dat is streaming, and we use newline delimited json as the official transfer format between processes.

Take a look at the specification for newline delimited JSON at http://ndjson.org/

Does pandas support this yet, and if not, would you consider adding a to_ndjson function to the existing output formats?

For example, the following table:

> df
       a      b                        key  version
0   True  False  ci4diwru70000x6xmmn19nba1        1
1  False   True  ci4diww6j0001x6xmbyp5o2f0        1

Would be converted to

> df.to_ndjson()
'{ "a": True, "b": False, "key": ci4diwru70000x6xmmn19nba1, "version": 1}\n{ "a": False, "b": True, "key": ci4diww6j0001x6xmbyp5o2f0, "version": 1}'

For general streaming use cases, it might be nice to also consider other ways of supporting this format, such as a generator function that yields ndjson-able objects.
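
For readers arriving later: pandas eventually grew this capability on the writer side via to_json with orient='records' and lines=True (roughly 0.19+). A minimal sketch:

import pandas as pd

df = pd.DataFrame({
    "a": [True, False],
    "b": [False, True],
    "key": ["ci4diwru70000x6xmmn19nba1", "ci4diww6j0001x6xmbyp5o2f0"],
    "version": [1, 1],
})

# One JSON object per row, rows separated by newlines (ndjson).
print(df.to_json(orient="records", lines=True))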

Labels: Enhancement, IO JSON

All 21 comments

something like this, perhaps:

def iterndjson(df):
    # Collect each row as a plain dict -- one dict per eventual ndjson line.
    return [row.to_dict() for _, row in df.iterrows()]

> df = pd.DataFrame({'one': [1,2,3,4], 'two': [3,4,5,6]})
> iterndjson(df)
[{'one': 1, 'two': 3},
 {'one': 2, 'two': 4},
 {'one': 3, 'two': 5},
 {'one': 4, 'two': 6}]
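
A lazier variant that yields serialized ndjson lines one row at a time, which fits the streaming use case better, might look like this (ndjson_lines is an illustrative name; it leans on pandas' own per-row serializer):

def ndjson_lines(df):
    # Lazily yield one serialized JSON object per DataFrame row;
    # Series.to_json emits the row as a single {column: value} object.
    for _, row in df.iterrows():
        yield row.to_json()

# '\n'.join(ndjson_lines(df)) then produces the full ndjson payload.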

this would also be great for things like BigQuery, which outputs JSON files as newline delimited JSON. The issue is that the file as a whole isn't valid JSON, since it contains multiple top-level objects.

@karissa maybe you could hack around this by reading the file row by row, using ujson/json to parse each row into a Python dictionary, and then passing the whole thing to the DataFrame constructor?
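
A minimal version of that hack, using the stdlib json module ('records.ndjson' is a placeholder path; swap in ujson for speed):

import json
import pandas as pd

# Parse each line into a dict, then hand the list to the constructor.
with open('records.ndjson') as fp:
    df = pd.DataFrame([json.loads(line) for line in fp if line.strip()])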

@mrocklin and I looked into this. The simplest solution we came up with was loading the file into a buffer, adding the appropriate commas and brackets, and then passing the result to read_json.
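
A sketch of that buffer approach (read_ndjson is an illustrative name, not a pandas API):

import pandas as pd

def read_ndjson(path):
    # Join the newline delimited objects with commas and wrap the whole
    # buffer in brackets, turning it into one valid JSON array.
    with open(path) as fp:
        buf = '[' + ','.join(line for line in fp if line.strip()) + ']'
    return pd.read_json(buf)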

Below are a few timings of this approach. The current implementation of read_json is a bit slower than ujson, so we felt the simplicity of this approach didn't make anything too slow.

In [2]: import json

In [3]: def parse():
   ...:     with open("reviews_Clothing_Shoes_and_Jewelry_5.json") as fp:
   ...:         list(map(json.loads, fp))
   ...:

In [4]: %timeit parse()
1 loop, best of 3: 4.05 s per loop

In [18]: import ujson as json

In [19]: def parse():
   ....:     with open("reviews_Clothing_Shoes_and_Jewelry_5.json") as fp:
   ....:         list(map(json.loads, fp))
   ....:

In [20]: %timeit parse()
1 loop, best of 3: 1.43 s per loop

In [22]: %time _ = pd.read_json('reviews_Clothing_Shoes_and_Jewelry_5_comma.json', convert_dates=False)
CPU times: user 4.75 s, sys: 520 ms, total: 5.27 s
Wall time: 5.49 s

I'll try to get a patch together unless someone thinks there is a better solution. The notion would be to add a lines=True flag to the reader.
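
That is essentially the API that later shipped; with a recent enough pandas (roughly 0.19+), the reader takes the flag directly:

import pandas as pd

# Read newline delimited JSON without any preprocessing;
# 'records.ndjson' is a placeholder path.
df = pd.read_json('records.ndjson', lines=True)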

@karissa is there any difference between ndjson and jsonlines (http://jsonlines.org/)? I've never heard of ndjson, but it seems to be the same thing.

Looks like jsonlines includes a few more rules, including a specification of UTF-8 encoding: http://jsonlines.org/ vs http://ndjson.org/

Hi, is there a way to efficiently json_normalize newline delimited JSON read with read_json?
For example, I have a big dataset of nested newline delimited JSON, and I'd like to read it and flatten it at the same time, so the cells hold actual values instead of dicts.

Right now I just json.loads line by line and then normalize the whole dict, but it takes a lot of time, and I assume it is not elegant in any way.
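
For what it's worth, a sketch of that line-by-line pattern, parsing once and flattening once (json_normalize lived in pandas.io.json at the time; it is pd.json_normalize in pandas 1.0+, and 'records.ndjson' is a placeholder path):

import json
from pandas.io.json import json_normalize

with open('records.ndjson') as fp:
    records = [json.loads(line) for line in fp if line.strip()]

# Flatten nested dicts into dotted column names in one pass.
df = json_normalize(records)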

Thanks!

you would have to show an example

Thanks for a quick reply!
Here is an example with some toy data:
[screenshot of a nested ndjson toy dataset]

I'd like to efficiently apply json_normalize somehow to the read_json output.

can you cat toy.csv here?

Here.
toy.csv.zip

This won't be that efficient, but it's reasonably idiomatic.

In [18]: pd.read_json('toy.csv', lines=True).stack().apply(pd.Series)['string'].unstack()
Out[18]: 
    age                  description price
0  None              Very Importrant   500
1     5                          NaN   150
2    15  Very Importrant Random Text  None

I don't think there is an easy way to do this ATM. In pandas2 this will be built in.

cc @wesm

It would be much more sustainable to do this kind of parsing with RapidJSON in C++; I think we should be able to deliver this en route to pandas2

Just checked on my production dataset:

"This won't be that efficient, but it's reasonably idiomatic."

First, fortunately, it actually gets the job done better than using json_normalize (no annoying *.string or *.nULL in the column names). So thanks a lot.

Second, unfortunately, it is definitely less efficient than using json_normalize as I did previously. When I increased the amount of data to 10000 entries (still just a fraction of the data I need to process), it took ≈ 20x more time.

@IFantastic yes, unfortunately our handling of newline delimited JSON is not great right now; json_normalize and the solution I gave are both in Python (only the core JSON parser is in C), so they are not very efficient.

stay tuned to https://github.com/pandas-dev/pandas2/issues/71

@jreback thanks for your time, very helpful!

@jreback I have 1 more question: Is it possible to achieve formatting like yours

In [18]: pd.read_json('toy.csv', lines=True).stack().apply(pd.Series)['string'].unstack()
Out[18]:
    age                  description price
0  None              Very Importrant   500
1     5                          NaN   150
2    15  Very Importrant Random Text  None

but using json_normalize?

I mean, to get nicely formatted column names like _age, description, price_ instead of _age.string, description.nULL, description.string, price.string_.

Thanks!

@IFantastic you can just post-process the columns

In [12]: i = pd.Index(['price.string', 'description.nuLL', 'age.string'])

In [13]: i.str.extract('(.*)\\.', expand=False)
Out[13]: Index(['price', 'description', 'age'], dtype='object')
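
Applied to a normalized frame, that post-processing becomes (a sketch, assuming df came from json_normalize):

# Strip the trailing '.string' / '.nULL' suffix from every column name.
df.columns = df.columns.str.extract('(.*)\\.', expand=False)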

@jreback Thanks!

For ordinary ndjson files (like those output by BigQuery), I use the DataFrame.from_records constructor together with ujson:

import ujson as json
import pandas as pd

# Parse one JSON object per line, then build the frame from the dicts.
with open('/path/to/records.ndjson') as fp:
    df = pd.DataFrame.from_records(map(json.loads, fp))

In Python 2, you'd probably want itertools.imap instead of the built-in map.
