Hey all,
I'm a developer on the dat project (git for data), and we are building a Python library to interact with the data store.
Everything in dat is streaming, and we use newline-delimited JSON as the official transfer format between processes.
Take a look at the specification for newline-delimited JSON here.
Does pandas support this yet, and if not, would you consider adding a to_ndjson function to the existing output formats?
For example, the following table:
> df
       a      b                        key  version
0   True  False  ci4diwru70000x6xmmn19nba1        1
1  False   True  ci4diww6j0001x6xmbyp5o2f0        1
Would be converted to
> df.to_ndjson()
'{ "a": True, "b": False, "key": ci4diwru70000x6xmmn19nba1, "version": 1}\n{ "a": False, "b": True, "key": ci4diww6j0001x6xmbyp5o2f0, "version": 1}'
For general streaming use cases, it might also be nice to consider other ways of supporting this format, like a generator function that yields ndjson-able objects.
Something like this, perhaps:
def iterndjson(df):
    # Yield one dict per row; each dict serializes to one line of ndjson.
    for _, row in df.iterrows():
        yield row.to_dict()
> df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [3, 4, 5, 6]})
> list(iterndjson(df))
[{'one': 1, 'two': 3},
 {'one': 2, 'two': 4},
 {'one': 3, 'two': 5},
 {'one': 4, 'two': 6}]
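With a generator like that, writing a frame out as ndjson is only a couple of lines; a minimal sketch (the output path is made up):

import json

# Depending on dtypes, json.dumps may need a default= handler for NumPy scalars.
with open('out.ndjson', 'w') as fp:
    for record in iterndjson(df):
        fp.write(json.dumps(record) + '\n')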
Have a look here: http://pandas.pydata.org/pandas-docs/stable/io.html#json
This would also be great for things like BigQuery, which outputs JSON files as newline-delimited JSON. The issue is that the file as a whole isn't valid JSON (since it ends up as multiple top-level objects).
@karissa maybe you could hack around this by reading the file row by row, using ujson/json to parse each row into a Python dictionary, and then passing the whole thing to the DataFrame constructor?
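A minimal sketch of that workaround (the file name is made up):

import json
import pandas as pd

# Parse one JSON object per line, then hand the list to the constructor.
with open('records.ndjson') as fp:
    records = [json.loads(line) for line in fp]
df = pd.DataFrame(records)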
@mrocklin and I looked into this. The simplest solution we came up with was loading the file into a buffer, adding the appropriate commas and brackets, and then passing it back to read_json.
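Roughly like this (a sketch of the idea, not the eventual patch; the file name is made up):

import pandas as pd

# Join the lines with commas and wrap in brackets so the buffer becomes
# one valid JSON array, then hand the string to read_json.
with open('records.ndjson') as fp:
    buf = '[' + ','.join(line.strip() for line in fp if line.strip()) + ']'
df = pd.read_json(buf)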
Below are a few timings of this approach; it seems the current implementation of read_json is a bit slower than ujson anyway, so we felt the simplicity of this approach didn't slow anything down too much.
In [3]: def parse():
   ...:     with open("reviews_Clothing_Shoes_and_Jewelry_5.json") as fp:
   ...:         list(map(json.loads, fp))
   ...:

In [4]: %timeit parse()
1 loop, best of 3: 4.05 s per loop
In [18]: import ujson as json

In [19]: def parse():
   ....:     with open("reviews_Clothing_Shoes_and_Jewelry_5.json") as fp:
   ....:         list(map(json.loads, fp))
   ....:

In [20]: %timeit parse()
1 loop, best of 3: 1.43 s per loop
In [22]: %time _ = pd.read_json('reviews_Clothing_Shoes_and_Jewelry_5_comma.json', convert_dates=False)
CPU times: user 4.75 s, sys: 520 ms, total: 5.27 s
Wall time: 5.49 s
I'll try to get a patch together unless someone thinks there is a better solution. The notion would be to add a lines=True flag to the reader.
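Usage would then look something like:

pd.read_json('records.ndjson', lines=True)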
@karissa is there any difference between ndjson and jsonlines (http://jsonlines.org/)? I've never heard of ndjson, but it seems to be the same thing.
Looks like jsonlines includes a few more rules, including a specification of UTF-8 encoding: http://jsonlines.org/ vs http://ndjson.org/
Hi, is there a way to efficiently json_normalize newline-delimited JSON read with read_json?
For example, I have a big dataset of nested newline-delimited JSON, and I'd like to read it and flatten it at the same time, so the cells contain actual values instead of dicts.
Right now I just json.loads line by line and then normalize the whole dict, but it takes a lot of time, and I assume it is not elegant in any way.
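Concretely, the current approach looks roughly like this (the file name is made up):

import json
import pandas as pd
from pandas.io.json import json_normalize

# Parse line by line, then flatten the nested records in one pass.
with open('nested.ndjson') as fp:
    records = [json.loads(line) for line in fp]
flat = json_normalize(records)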
Thanks!
you would have to show an example
Thanks for a quick reply!
Here is an example with toy data, where I'd like to efficiently apply json_normalize somehow to the read_json data.
Can you cat toy.csv here?
Here: toy.csv.zip
This won't be that efficient, but it's reasonably idiomatic.
In [18]: pd.read_json('toy.csv', lines=True).stack().apply(Series)['string'].unstack()
Out[18]:
    age                  description price
0  None              Very Importrant   500
1     5                          NaN   150
2    15  Very Importrant Random Text  None
I don't think there is an easy way to do this ATM. In pandas2 this will be more built-in.
cc @wesm
It would be much more sustainable to do this kind of parsing with RapidJSON in C++; I think we should be able to deliver this en route to pandas2
Just checked on my production dataset:
"This won't be that efficient, but it's reasonably idiomatic."

First, fortunately, it actually gets the job done better than using json_normalize (no annoying *.string or *.nULL in the column names), so thanks a lot.
Second, unfortunately, it is definitely less efficient than using json_normalize as I did previously. When I increased the amount of data to 10000 entries (which is still just a fraction of the data I need to process), it took ≈ 20x more time.
@IFantastic yes, unfortunately our handling of newline-delimited JSON is not great right now; json_normalize and the solution I gave are all in Python (the actual JSON parser is in C), so these are not very efficient.
Stay tuned to https://github.com/pandas-dev/pandas2/issues/71
@jreback thanks for your time, very helpful!
@jreback I have one more question: is it possible to achieve formatting like yours

In [18]: pd.read_json('toy.csv', lines=True).stack().apply(Series)['string'].unstack()
Out[18]:
    age                  description price
0  None              Very Importrant   500
1     5                          NaN   150
2    15  Very Importrant Random Text  None

but using json_normalize? I mean, to get nicely formatted column names like _age, description, price_ instead of _age.string, description.nULL, description.string, price.string_ in the columns.
Thanks!
@IFantastic you can just post-process the columns
In [12]: i = pd.Index(['price.string', 'description.nuLL', 'age.string'])
In [13]: i.str.extract('(.*)\\.', expand=False)
Out[13]: Index(['price', 'description', 'age'], dtype='object')
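And to apply the same extraction directly to a DataFrame's columns (assuming every column name has a dotted suffix to strip):

df.columns = df.columns.str.extract('(.*)\\.', expand=False)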
@jreback Thanks!
For ordinary ndjson files (like those output by BigQuery), I use the DataFrame.from_records constructor together with ujson:
import ujson as json
import pandas as pd

# Lazily parse one JSON object per line, then build the frame from the records.
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
In Python 2, you'd probably want to use imap instead of the built-in map.
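That is, something like:

from itertools import imap  # Python 2's lazy, iterator-based map

records = imap(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)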