Pandas: PERF: json_normalize

Created on 8 Mar 2017 · 4 comments · Source: pandas-dev/pandas

I haven't looked much at the implementation, but I'm guessing simpler cases like this one could be optimized.

In [63]: data = [
    ...:     {'name': 'Name',
    ...:      'value': 1.0,
    ...:      'value2': 2.0,
    ...:      'nested': {'a': 'aa', 'b': 'bb'}}] * 1000000

In [64]: %timeit pd.DataFrame(data)
1 loop, best of 3: 847 ms per loop

In [65]: %timeit pd.io.json.json_normalize(data)
1 loop, best of 3: 20 s per loop
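
For context, the two calls being timed don't produce the same result: pd.DataFrame keeps nested as a single object column holding the raw dicts, while json_normalize additionally expands it into dot-separated columns, which accounts for some (but clearly not all) of the gap. A quick sketch on a single record:

import pandas as pd

record = [{'name': 'Name', 'value': 1.0, 'value2': 2.0,
           'nested': {'a': 'aa', 'b': 'bb'}}]

sorted(pd.DataFrame(record).columns)
# ['name', 'nested', 'value', 'value2']  -- 'nested' holds the raw dict

sorted(pd.io.json.json_normalize(record).columns)
# ['name', 'nested.a', 'nested.b', 'value', 'value2']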

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: None
html5lib: 0.999999999
httplib2: 0.9.2
apiclient: 1.5.3
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: 0.2.1

Labels: IO JSON, Performance

All 4 comments

Yeah, this is all in Python code :<

IIRC @wesm has a plan for this in pandas2, so maybe it would be possible to make use of some of that.

Making the conversion of lists of dictionaries faster in json_normalize seems perfectly reasonable. I intend to use RapidJSON (https://github.com/miloyip/nativejson-benchmark) to create a faster native JSON->DataFrame reader, since that way we can circumvent Python objects altogether. This can happen well before pandas2 ships by using Arrow tables as an intermediary en route to pandas.

Not sure if this is still on anyone's radar, but I've been dealing with a performance issue at least partly caused by json_normalize. From some profiling, it seems the biggest problem in my case is the use of deepcopy. For common, relatively simple cases of dictionaries/lists of string and numeric literals, deepcopy adds a lot of unnecessary overhead. Even if it's needed for some use cases, calling it recursively (when deepcopy already does its own recursive copy) is surely not optimal.
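
To illustrate that point: for simple inputs like the one in this issue, a plain recursive flattener with no copying at all is enough. This is only a sketch (flatten_record is a hypothetical helper, not pandas API), assuming plain dicts with scalar leaves:

import pandas as pd

def flatten_record(d, parent_key='', sep='.'):
    # Flatten one nested dict into a flat dict with dotted keys,
    # without copying leaf values (no deepcopy involved).
    out = {}
    for key, value in d.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            out.update(flatten_record(value, new_key, sep))
        else:
            out[new_key] = value
    return out

data = [{'name': 'Name', 'value': 1.0, 'value2': 2.0,
         'nested': {'a': 'aa', 'b': 'bb'}}] * 1000
df = pd.DataFrame([flatten_record(r) for r in data])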

Any updates on fixing this, or suggestions for workarounds (maybe some other library that flattens the dictionary)?


I found the flatten-dict library (https://pypi.org/project/flatten-dict/), which seems to make things a bit faster than pd.io.json.json_normalize:

import flatten_dict
import pandas as pd

def json_normalize(arr):
    # Join nested keys with '.'; k1 is None for top-level keys.
    reducer = lambda k1, k2: k2 if k1 is None else k1 + '.' + k2
    flat_arr = [flatten_dict.flatten(d, reducer=reducer) for d in arr]
    return pd.DataFrame(flat_arr)
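
Continuing from the helper above, on data shaped like the example in the original report (the data variable here is mine, not part of the comment):

data = [{'name': 'Name', 'value': 1.0, 'value2': 2.0,
         'nested': {'a': 'aa', 'b': 'bb'}}] * 1000
df = json_normalize(data)
sorted(df.columns)
# ['name', 'nested.a', 'nested.b', 'value', 'value2']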