Pandas: PERF: json_normalize

Created on 8 Mar 2017 · 4 comments · Source: pandas-dev/pandas

I haven't looked much at the implementation, but I'm guessing simpler cases like this one could be optimized.

In [63]: data = [
    ...:     {'name': 'Name',
    ...:      'value': 1.0,
    ...:      'value2': 2.0,
    ...:      'nested': {'a': 'aa', 'b': 'bb'}}] * 1000000

In [64]: %timeit pd.DataFrame(data)
1 loop, best of 3: 847 ms per loop

In [65]: %timeit pd.io.json.json_normalize(data)
1 loop, best of 3: 20 s per loop
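
For context, the two calls being timed don't produce the same result: pd.DataFrame keeps nested as a single object column holding the raw dicts, while json_normalize additionally expands it into dot-separated columns, which accounts for some (but clearly not all) of the gap. A quick sketch on a single record:

import pandas as pd

record = [{'name': 'Name', 'value': 1.0, 'value2': 2.0,
           'nested': {'a': 'aa', 'b': 'bb'}}]

sorted(pd.DataFrame(record).columns)
# ['name', 'nested', 'value', 'value2']  -- 'nested' holds the raw dict

sorted(pd.io.json.json_normalize(record).columns)
# ['name', 'nested.a', 'nested.b', 'value', 'value2']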

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: None
html5lib: 0.999999999
httplib2: 0.9.2
apiclient: 1.5.3
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: 0.2.1

Labels: IO JSON, Performance

All 4 comments

Yeah, this is all in Python code :<

IIRC @wesm has a plan for this in pandas2, so maybe it would be possible to make use of some of that.

Making the conversion of lists of dictionaries faster in json_normalize seems perfectly reasonable. I intend to use RapidJSON (https://github.com/miloyip/nativejson-benchmark) to create a faster native JSON->DataFrame reader, since that way we can circumvent Python objects altogether. This can happen well before pandas2 ships by using Arrow tables as an intermediary en route to pandas.

Not sure if this is still on anyone's radar, but I've been dealing with a performance issue at least partly caused by json_normalize. From some profiling, it seems the biggest problem in my case is the use of deepcopy. For common, relatively simple cases of dictionaries/lists of string and numeric literals, deepcopy adds a lot of unnecessary overhead. Even if it's needed for some use cases, calling it recursively (when deepcopy already does its own recursive copy) is surely not optimal.
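
To illustrate that point: for simple inputs like the one in this issue, a plain recursive flattener with no copying at all is enough. This is only a sketch (flatten_record is a hypothetical helper, not pandas API), assuming plain dicts with scalar leaves:

import pandas as pd

def flatten_record(d, parent_key='', sep='.'):
    # Flatten one nested dict into a flat dict with dotted keys,
    # without copying leaf values (no deepcopy involved).
    out = {}
    for key, value in d.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            out.update(flatten_record(value, new_key, sep))
        else:
            out[new_key] = value
    return out

data = [{'name': 'Name', 'value': 1.0, 'value2': 2.0,
         'nested': {'a': 'aa', 'b': 'bb'}}] * 1000
df = pd.DataFrame([flatten_record(r) for r in data])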

Any updates on fixing this, or suggestions for workarounds (maybe some other library that flattens the dictionary)?


I found the flatten-dict library (https://pypi.org/project/flatten-dict/), which seems to make things a bit faster than pd.io.json.json_normalize:

import flatten_dict
import pandas as pd

def json_normalize(arr):
    # Join nested keys with '.'; k1 is None for top-level keys.
    reducer = lambda k1, k2: k2 if k1 is None else k1 + '.' + k2
    flat_arr = [flatten_dict.flatten(d, reducer=reducer) for d in arr]
    return pd.DataFrame(flat_arr)
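
Continuing from the helper above, on data shaped like the example in the original report (the data variable here is mine, not part of the comment):

data = [{'name': 'Name', 'value': 1.0, 'value2': 2.0,
         'nested': {'a': 'aa', 'b': 'bb'}}] * 1000
df = json_normalize(data)
sorted(df.columns)
# ['name', 'nested.a', 'nested.b', 'value', 'value2']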