Pandas: Memory leak in `df.to_json`

Created on 23 Jan 2019 · 10 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np


df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

while True:
    body = df.T.to_json()
    print("HI")


Problem description

If we repeatedly call to_json() on a dataframe, memory usage grows continuously:

[screenshot: process memory usage climbing steadily while the loop runs]
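
A minimal sketch of one way to observe the growth without external tooling (assuming Linux, where resource.ru_maxrss is reported in kilobytes; this snippet is an illustration, not part of the original report):

import resource

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

for i in range(500_000):
    df.to_json()
    if i % 50_000 == 0:
        # Peak resident set size of the process; it keeps climbing if to_json() leaks.
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss, "KB")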

Expected Output

I would expect memory usage to stay constant

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: 0.25.2
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: IO JSON, Performance

All 10 comments

FWIW this seems to take a ton of iterations and doesn't really leak much memory, but investigations are welcome.

It's also worth isolating the to_json part from the df.T part.


@TomAugspurger sorry, that was left over from some other original code. The leak happens with or without the transpose.

Does it also happen with other dtypes? How about an empty DataFrame?

Memory use is stable for an empty dataframe

In simple cases there's an easy workaround: call df.to_dict() and pass the result to the standard library's json.dumps. You'll have to make some manual changes, such as serializing pandas timestamps yourself.
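
A rough sketch of what that workaround can look like (the helper name df_to_json and the default= handler are illustrative assumptions, not from the comment):

import json

import numpy as np
import pandas as pd


def df_to_json(df):
    # Stringify the index first so dict keys (e.g. Timestamps) are JSON-safe,
    # then serialize with the stdlib instead of pandas' C serializer.
    safe = df.copy()
    safe.index = safe.index.map(str)

    def default(obj):
        # Fallback for values json.dumps cannot handle natively.
        if isinstance(obj, pd.Timestamp):
            return obj.isoformat()
        if isinstance(obj, (np.integer, np.floating)):
            return obj.item()
        raise TypeError(f"{type(obj).__name__} is not JSON serializable")

    return json.dumps(safe.to_dict(), default=default)

This sidesteps pandas' C JSON code entirely, at the cost of the slower pure-Python json module.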

Looked at this in more detail as I was reviewing the JSON integration. As mentioned above this issue does not affect empty dataframes, nor does it affect non-numeric frames.

This leaks:

pd.DataFrame([[1]]).to_json()

But this doesn't:

pd.DataFrame([['a']]).to_json()

I believe I've isolated the issue to this line of code:

https://github.com/pandas-dev/pandas/blob/65466f04025474c3ee1bd2f4e49a1c5e24b76f7b/pandas/_libs/src/ujson/python/objToJSON.c#L716

Removing that prevented the example from the OP from leaking. Unfortunately it caused segfaults elsewhere in the code base, so I have to debug that further, but I will push a PR if I can figure it out.

@jbrockmendel you might have some thoughts here as well. You called this condition out as looking wonky before, and it does seem to cause the leak here; I just need to figure out what the actual intent is.

Hi, I am facing the same memory leak issue with df.to_json().

When I use df.to_dict() and pass the result to Python's json.dumps, memory use is stable:

[screenshot: to-json-workaround, memory use stays flat]

But when I use df.to_json(), memory use keeps growing:

[screenshot: using_to_json, memory use keeps growing]

Code Sample

import json

import pandas as pd


def list_to_df_json(data):
    point_classified = {}
    for i in data:
        if i['point_id'] not in point_classified:
            point_classified[i['point_id']] = {}
        point_classified[i['point_id']][i['timestamp']] = i['point_value']
    return point_classified


def boo(a):
    data = list_to_df_json(a)
    for point_id, point_value_of_that_id in data.items():
        # logging.info(f"pushing data from point_id: {point_id}")
        df = pd.DataFrame.from_dict(point_value_of_that_id, orient='index', columns=[point_id])

        # workaround (memory use stays stable):
        # dict_df = df.to_dict(orient='index')
        # json_df = json.dumps(dict_df)

        # memory leak:
        json_df = df.to_json(orient='index')
    return json_df


while True:
    a = [{'point_id': 'a',
          'point_value': 346.9,
          'timestamp': '2019-12-01 08:15:00'},
         {'point_id': 'a',
          'point_value': 247.2,
          'timestamp': '2019-12-01 08:30:00'},
         {'point_id': 'a',
          'point_value': 237.9,
          'timestamp': '2019-12-01 08:45:00'},
         {'point_id': 'a',
          'point_value': 215.2,
          'timestamp': '2019-12-01 09:00:00'},
         {'point_id': 'b',
          'point_value': 276.8,
          'timestamp': '2019-12-01 09:15:00'},
         {'point_id': 'b',
          'point_value': 296.1,
          'timestamp': '2019-12-01 09:30:00'},
         {'point_id': 'b',
          'point_value': 328.0,
          'timestamp': '2019-12-01 09:45:00'}]

    print(boo(a))
    # pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.10.final.0
python-bits: 64
OS: Darwin
OS-release: 19.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: 5.4.3
pip: 19.3.1
setuptools: 44.0.0.post20200106
Cython: None
numpy: 1.19.1
scipy: 1.5.2
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2019.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.3.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.5.0
bs4: 4.8.2
html5lib: None
sqlalchemy: 1.3.18
pymysql: 0.9.3
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

I'm still seeing this memory leak on to_json() as of 1.1.4.

A slightly modified version of the original reproduction script:

import pandas as pd
import numpy as np
import memory_utils

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

while True:
    # leak
    body = df.to_json()

    # no leak
    # body = df.to_dict()

    memory_utils.print_memory('to_json leak test')
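
memory_utils is not shown above; it appears to be a local helper of the reporter. A minimal psutil-based stand-in for print_memory could look like this (an illustrative assumption, not the reporter's actual code):

import os

import psutil


def print_memory(label):
    # Current resident set size of this process, in MiB.
    rss_mib = psutil.Process(os.getpid()).memory_info().rss / 2 ** 20
    print(f"{label}: RSS = {rss_mib:.1f} MiB")
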
INSTALLED VERSIONS
------------------
commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.8.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.9.8-arch1-1
Version          : #1 SMP PREEMPT Tue, 10 Nov 2020 22:44:11 +0000
machine          : x86_64
processor        :
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.4
numpy            : 1.19.4
pytz             : 2020.4
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 50.3.2
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None