Pandas's memory management

Created on 24 Mar 2015 · 17 Comments · Source: pandas-dev/pandas

Hi,

Lately I have started to use pandas heavily with relatively big files.
When the RAM becomes full I delete the big objects (DataFrames) with del
and gc, though this doesn't clear the memory.

Can someone please advise how to clear variables from RAM, similar to R's rm(some_object); gc()?
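(For context, a minimal sketch of the closest Python equivalent of R's rm(some_object); gc(), with df standing in for any large frame:)

import gc
import pandas as pd

df = pd.DataFrame({"a": range(10_000_000)})  # stand-in for a big frame

# roughly rm(df); gc() in R:
del df        # unbind the name (frees the frame only if this was the last reference)
gc.collect()  # force a collection pass to break any reference cycles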

Additionally, I did some experiments about this; here are the results:

Started with ~600MB RAM:

  1. Ran two queries that jumped the memory to ~3900MB (query -> DataFrame)
  2. Removed these DataFrames with one of the following:

    %xdel df
    %reset_selective -f df
    def malloc_trim():
        import ctypes
        ctypes.CDLL('libc.so.6').malloc_trim(0)

    followed by gc.collect(); all three methods reduced the RAM similarly, to ~3000MB

  3. Afterwards, running %reset -f cleaned most of the RAM
  4. The script does some data "Jiu Jitsu" and then stores the result to disk. Restarting IPython and loading this file barely uses RAM!

What would you recommend?

Regards,
Boris

Performance

Most helpful comment

Hi,
I have a solution for this one!

merged_df.to_csv("merged_df.csv", index=False)
del df_1, df_2, merged_df
gc.collect()
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
merged_df = pd.DataFrame()

All 17 comments

Can you provide a small reproducible example?

import pandas as pd
import gc

# Started with ~600MB RAM

# get data 1
df_1 = pd.read_sql(query, connection_1)

# get data 2
df_2 = pd.read_sql(query, connection_2)

merged_df = pd.merge(df_1, df_2, on='id', how='left')

# Memory jumped to ~3900MB

***** Option 1 *****

del df_1
del df_2
del merged_df

gc.collect()

Memory reduced to ~3000MB

***** Option 2 *****

%reset_selective -f df_1
%reset_selective -f df_2
%reset_selective -f merged_df
gc.collect()

Memory reduced to ~3000MB

***** Option 3 *****

def malloc_trim():
    import ctypes
    ctypes.CDLL('libc.so.6').malloc_trim(0)

del df_1
del df_2
del merged_df
malloc_trim()

gc.collect()

Memory reduced to ~3000MB

***** Option 4 *****

%reset -f

Memory reduced to ~600MB!

***** Option 5 *****

merged_df.to_csv("merged_df.csv", index=False)

%reset -f

Memory: ~600MB RAM

merged_df = pd.read_csv("merged_df.csv")

Memory: ~800MB RAM

***** Conclusion *****

The only option that freed most of the memory was %reset -f, which is not a real solution.

  • profile using something like this; IPython tends to hold onto cached references (e.g. the _, __ items), so looking at memory directly is not that useful (see the sketch after this list)
  • you almost certainly have an open SQL DB-API connection. This will keep a reference to the memory even if the Python references are subsequently removed
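A minimal sketch of clearing those cached references, assuming an interactive IPython session (df is a placeholder name):

# in IPython, displayed results are also kept alive by the output
# cache (_, __, ___ and Out[n]), not only by your own variable names
del df          # removes your binding, but Out[n] may still hold the frame
%xdel df        # additionally clears the object from IPython's caches
%reset -f out   # or wipe the whole output cache at once
import gc
gc.collect()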

OK, how can I remove all those memory references?

Try closing the connections or freeing the objects from SQL as a first step. I typically use objgraph to figure out what is holding onto the objects.
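For instance, a sketch along those lines, assuming the connection objects from the example above and the objgraph package installed:

import gc
import objgraph

# close the DB-API connections first so the driver can drop its buffers
connection_1.close()
connection_2.close()

# before deleting, ask objgraph what still references the frame;
# the rendered PNG shows the back-reference graph
objgraph.show_backrefs([merged_df], max_depth=3, filename='merged_df_refs.png')

del df_1, df_2, merged_df
gc.collect()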

Did what you suggested; please find the graph below:

[objgraph back-reference graph image]

Any suggestions on what could be holding the DataFrame?
All the connections were closed.

If you directly create the DataFrame, and then delete it, do you get the same issue? Or does this only occur when you read SQL?

Note the reference cycle on the right side of your graph (type DataFrame <-> tuple). You can probably get that type object from objgraph and then figure out what the tuple exactly is.

"If you directly create the DataFrame, and then delete it, do you get the same issue? Or does this only occur when you read SQL?"
IF I read the same data from HD it barely increases the memory so deleting this DataFrame barely reduces memory usage.

"Note the reference cycle on the right side of your graph (type DataFrame <-> tuple). You can probably get the that type object from objgraph and then figure out what the tuple exactly is."

Do you mean the one with "tuple 5 items" caption ? why did you find especially this one interesting ?

Basically it is a reference cycle, so the memory cannot be freed easily: the garbage collector does not know in what order to delete the objects. There is documentation out there on the subject.

So the question might be: what is creating all that memory? Is it the underlying SQL objects or something else? I would imagine that the memory footprints when reading from SQL and from disk would be similar.
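One way to check that directly is pandas' own accounting; a sketch, with df_sql and df_csv as assumed names for the two loads:

import pandas as pd

df_sql = pd.read_sql(query, connection_1)   # as in the example above
df_csv = pd.read_csv("merged_df.csv")

# deep=True counts the Python objects inside object-dtype columns,
# which is where SQL and CSV loads usually diverge
print(df_sql.memory_usage(deep=True).sum() // 2**20, "MB")
print(df_csv.memory_usage(deep=True).sum() // 2**20, "MB")
print(df_sql.dtypes)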

"You can probably get the that type object from objgraph and then figure out what the tuple exactly is."
How can I extract this object from the objgraph ? how to find out this cycle ?
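A sketch of one way to do that with objgraph (picking the right candidate from by_type is an assumption; adjust to what the graph shows):

import objgraph

# list live objects of the suspicious type seen in the graph
candidates = objgraph.by_type('tuple')
suspect = candidates[0]  # assumption: pick the tuple from the graph

# walk back-references until a module-level owner is reached;
# the resulting chain usually shows who keeps the cycle alive
chain = objgraph.find_backref_chain(suspect, objgraph.is_proper_module)
objgraph.show_chain(chain, filename='chain.png')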

Did anyone figure this out? Why does pd.read_sql() seem to consume 10 times more memory than loading the same table from CSV?!

This is just what Python does: it doesn't release memory to the OS.

@jreback care to elaborate? I would like to dig more into this

this is an old issue and not reproducible. certainly have a look if you want.
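A small experiment that illustrates the point, assuming psutil is installed; exact numbers vary by platform and allocator:

import gc
import os

import pandas as pd
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    # resident set size, i.e. what the OS sees, in MB
    return proc.memory_info().rss // 2**20

print("baseline:", rss_mb())
df = pd.DataFrame({"a": range(50_000_000)})
print("after alloc:", rss_mb())

del df
gc.collect()
# RSS typically stays well above baseline: CPython and the C-level
# allocator keep freed pages pooled instead of returning them to the OS
print("after del + gc:", rss_mb())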

Hi,
I have a solution for this one!

merged_df.to_csv("merged_df.csv", index=False)
del df_1, df_2, merged_df
gc.collect()
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
merged_df = pd.DataFrame()
