import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

while True:
    body = df.T.to_json()
    print("HI")
If we repeatedly call to_json() on a DataFrame, memory usage grows continuously.

I would expect memory usage to stay constant.
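One way to watch the growth directly (my own measurement harness, not part of the original report; resource.getrusage is POSIX-only, and ru_maxrss is reported in kilobytes on Linux):

import resource
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

for i in range(1_000_000):
    body = df.to_json()
    if i % 100_000 == 0:
        # on affected versions the max RSS keeps climbing instead of plateauing
        print(f"iteration {i}: max RSS {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss} kB")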
Output of pd.show_versions():
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: 0.25.2
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Source is at https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/ujson/python/objToJSON.c if you're interested in debugging further.
FWIW this seems to take a ton of iterations and doesn't really leak much memory, but investigations are welcome.
It's also worth isolating the to_json part from the df.T part.
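For example (my own sketch of that suggestion), hoisting the transpose out of the loop serializes the same frame without re-transposing it each iteration:

t = df.T
while True:
    body = t.to_json()  # transpose done once; any remaining growth comes from to_json itself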
@TomAugspurger sorry, that was left over from some other original code. The leak happens with or without the transpose.
Does it also happen with other dtypes? How about an empty DataFrame?
Memory use is stable for an empty DataFrame.
In simple cases, there's an easy workaround: use df.to_dict() and pass the result to Python's json.dumps. You'll have to make some manual changes, such as serializing pandas Timestamps yourself.
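A minimal sketch of that workaround (the encode hook below is my own illustration; pandas Timestamps are not JSON-serializable out of the box):

import json
import pandas as pd

df = pd.DataFrame({'ts': pd.to_datetime(['2019-12-01 08:15:00']), 'val': [346.9]})

def encode(obj):
    # json.dumps calls this for anything it can't serialize natively
    if isinstance(obj, pd.Timestamp):
        return obj.isoformat()
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")

body = json.dumps(df.to_dict(), default=encode)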
Looked at this in more detail as I was reviewing the JSON integration. As mentioned above, this issue does not affect empty DataFrames, nor does it affect non-numeric frames.

This leaks:

pd.DataFrame([[1]]).to_json()

But this doesn't:

pd.DataFrame([['a']]).to_json()
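A small harness to contrast the two cases (my own sketch; note that tracemalloc only sees allocations made through Python's allocator, so if the leak sits in raw C malloc calls you would need to watch process RSS instead):

import tracemalloc
import pandas as pd

def held_after(df, n=100_000):
    # how much traced memory is still held after n serializations
    tracemalloc.start()
    for _ in range(n):
        df.to_json()
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current

print('numeric frame:', held_after(pd.DataFrame([[1]])))    # grows with n on affected versions
print('object frame: ', held_after(pd.DataFrame([['a']])))  # stays roughly flat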
I believe I've isolated the issue to a single line of code in objToJSON.c; removing that line prevented the example from the OP from leaking. Unfortunately it caused segfaults in the rest of the code base, so I have to debug that further, but I will push a PR if I can figure it out.
@jbrockmendel you might have some thoughts here as well. You called this condition out as looking wonky before, and it does seem to cause the leak here; I just need to figure out what the actual intent is.
Hi, I am facing the same memory leak in df.to_json(). When I use df.to_dict() and pass the result to Python's json.dumps, memory use is stable, but when I use df.to_json(), memory grows.
Code Sample
import json
import pandas as pd

def list_to_df_json(data):
    point_classified = {}
    for i in data:
        if i['point_id'] not in point_classified:
            point_classified[i['point_id']] = {}
        point_classified[i['point_id']][i['timestamp']] = i['point_value']
    return point_classified

def boo(a):
    data = list_to_df_json(a)
    for point_id, point_value_of_that_id in data.items():
        # logging.info(f"pushing data from pointid : {point_id} ")
        df = pd.DataFrame.from_dict(point_value_of_that_id, orient='index', columns=[point_id])
        # dict_df = df.to_dict(orient='index')
        # workaround
        # json_df = json.dumps(dict_df)
        # memory leak
        json_df = df.to_json(orient='index')
    return json_df

while True:
    a = [{'point_id': 'a', 'point_value': 346.9, 'timestamp': '2019-12-01 08:15:00'},
         {'point_id': 'a', 'point_value': 247.2, 'timestamp': '2019-12-01 08:30:00'},
         {'point_id': 'a', 'point_value': 237.9, 'timestamp': '2019-12-01 08:45:00'},
         {'point_id': 'a', 'point_value': 215.2, 'timestamp': '2019-12-01 09:00:00'},
         {'point_id': 'b', 'point_value': 276.8, 'timestamp': '2019-12-01 09:15:00'},
         {'point_id': 'b', 'point_value': 296.1, 'timestamp': '2019-12-01 09:30:00'},
         {'point_id': 'b', 'point_value': 328.0, 'timestamp': '2019-12-01 09:45:00'}]
    print(boo(a))

# pd.show_versions()
commit: None
python: 3.6.10.final.0
python-bits: 64
OS: Darwin
OS-release: 19.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_GB.UTF-8
pandas: 0.24.2
pytest: 5.4.3
pip: 19.3.1
setuptools: 44.0.0.post20200106
Cython: None
numpy: 1.19.1
scipy: 1.5.2
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2019.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.3.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.5.0
bs4: 4.8.2
html5lib: None
sqlalchemy: 1.3.18
pymysql: 0.9.3
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
I'm still seeing this memory leak on to_json() as of 1.1.4.
A slightly modified version of the original reproduction script:
import pandas as pd
import numpy as np
import memory_utils

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

while True:
    # leak
    body = df.to_json()
    # no leak
    # body = df.to_dict()
    memory_utils.print_memory('to_json leak test')
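memory_utils itself isn't shown in the comment; a minimal stand-in (my own hypothetical sketch, built on the stdlib resource module) would be:

# memory_utils.py -- hypothetical stand-in for the helper used above
import resource

_last_rss = [0]

def print_memory(tag):
    # ru_maxrss is kilobytes on Linux; print only when it grows
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if rss > _last_rss[0]:
        print(f'[{tag}] max RSS: {rss} kB')
        _last_rss[0] = rss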
INSTALLED VERSIONS
------------------
commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.9.8-arch1-1
Version : #1 SMP PREEMPT Tue, 10 Nov 2020 22:44:11 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None