Pandas's memory management

Created on 24 Mar 2015 · 17 Comments · Source: pandas-dev/pandas

Hi,

Lately I have started to use pandas heavily with relatively big files.
When the RAM becomes full I delete the big objects (DataFrames) with del
and gc, though this doesn't clear the memory.

Can someone please advise how to clear variables from RAM, similar to R's rm(some_object); gc()?
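(For context, a minimal sketch of the closest Python equivalent of R's rm(some_object); gc(), with df standing in for any large frame:)

import gc
import pandas as pd

df = pd.DataFrame({"a": range(10_000_000)})  # stand-in for a big frame

# roughly rm(df); gc() in R:
del df        # unbind the name (frees the frame only if this was the last reference)
gc.collect()  # force a collection pass to break any reference cycles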

Additionally, I did some experiments about this; here are the results:

Started with ~600MB RAM:

  1. Ran two queries that jumped the memory to ~3900MB (query -> DataFrame)
  2. Removed these DataFrames with one of the following:

    %xdel df
    %reset_selective -f df
    def malloc_trim():
        import ctypes
        ctypes.CDLL('libc.so.6').malloc_trim(0)

    followed by gc.collect(); all three methods reduced the RAM similarly, to ~3000MB

  3. Afterwards, running %reset -f cleaned most of the RAM
  4. The script does some data "Jiu Jitsu" and then stores the result to disk. Restarting IPython and loading this file barely uses RAM!

What would you recommend?

Regards,
Boris

Performance

Most helpful comment

Hi,
I have a solution for this one!

merged_df.to_csv("merged_df.csv", index=False)
del df_1, df_2, merged_df
gc.collect()
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
merged_df = pd.DataFrame()

All 17 comments

Can you provide a small reproducible example?

import pandas as pd
import gc

# Started with ~600MB RAM

# get data 1
df_1 = pd.read_sql(query, connection_1)

# get data 2
df_2 = pd.read_sql(query, connection_2)

merged_df = pd.merge(df_1, df_2, on='id', how='left')

# Memory jumped to ~3900MB

***** Option 1 *****

del df_1
del df_2
del merged_df

gc.collect()

Memory reduced to ~3000MB

***** Option 2 *****

%reset_selective -f df_1
%reset_selective -f df_2
%reset_selective -f merged_df
gc.collect()

Memory reduced to ~3000MB

***** Option 3 *****

def malloc_trim():
    import ctypes
    ctypes.CDLL('libc.so.6').malloc_trim(0)

del df_1
del df_2
del merged_df
malloc_trim()

gc.collect()

Memory reduced to ~3000MB

***** Option 4 *****

%reset -f

Memory reduced to ~600MB!

***** Option 5 *****

merged_df.to_csv("merged_df.csv", index=False)

%reset -f

Memory: ~600MB RAM

merged_df = pd.read_csv("merged_df.csv")

Memory: ~800MB RAM

***** Conclusion *****

The only option that freed most of the memory was %reset -f, which is not a real solution.

  • profile using something like this; IPython tends to hold onto cached references (e.g. the _, __ items), so looking at memory directly is not that useful (see the sketch after this list)
  • you almost certainly have an open SQL DB-API connection. This will keep a reference to the memory even if the Python references are subsequently removed
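A minimal sketch of clearing those cached references, assuming an interactive IPython session (df is a placeholder name):

# in IPython, displayed results are also kept alive by the output
# cache (_, __, ___ and Out[n]), not only by your own variable names
del df          # removes your binding, but Out[n] may still hold the frame
%xdel df        # additionally clears the object from IPython's caches
%reset -f out   # or wipe the whole output cache at once
import gc
gc.collect()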

OK, how can I remove all those memory references?

Try closing the connections or freeing the objects from SQL as a first step. I typically use objgraph to figure out what is holding onto the objects.
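For instance, a sketch along those lines, assuming the connection objects from the example above and the objgraph package installed:

import gc
import objgraph

# close the DB-API connections first so the driver can drop its buffers
connection_1.close()
connection_2.close()

# before deleting, ask objgraph what still references the frame;
# the rendered PNG shows the back-reference graph
objgraph.show_backrefs([merged_df], max_depth=3, filename='merged_df_refs.png')

del df_1, df_2, merged_df
gc.collect()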

Did what you suggested; please find the graph below:

[objgraph back-reference graph image]

Any suggestions on what could be holding the DataFrame?
All the connections were closed.

If you directly create the DataFrame, and then delete it, do you get the same issue? Or does this only occur when you read SQL?

Note the reference cycle on the right side of your graph (type DataFrame <-> tuple). You can probably get that type object from objgraph and then figure out what the tuple exactly is.

"If you directly create the DataFrame, and then delete it, do you get the same issue? Or does this only occur when you read SQL?"
IF I read the same data from HD it barely increases the memory so deleting this DataFrame barely reduces memory usage.

"Note the reference cycle on the right side of your graph (type DataFrame <-> tuple). You can probably get the that type object from objgraph and then figure out what the tuple exactly is."

Do you mean the one with "tuple 5 items" caption ? why did you find especially this one interesting ?

Basically it is a reference cycle, so the memory cannot be freed easily: the garbage collector does not know in what order to delete the objects. There is documentation out there on the subject.

So the question might be: what is creating all that memory? Is it the underlying SQL objects or something else? I would imagine that the memory footprints when reading from SQL and from disk would be similar.
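One way to check that directly is pandas' own accounting; a sketch, with df_sql and df_csv as assumed names for the two loads:

import pandas as pd

df_sql = pd.read_sql(query, connection_1)   # as in the example above
df_csv = pd.read_csv("merged_df.csv")

# deep=True counts the Python objects inside object-dtype columns,
# which is where SQL and CSV loads usually diverge
print(df_sql.memory_usage(deep=True).sum() // 2**20, "MB")
print(df_csv.memory_usage(deep=True).sum() // 2**20, "MB")
print(df_sql.dtypes)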

"You can probably get the that type object from objgraph and then figure out what the tuple exactly is."
How can I extract this object from the objgraph ? how to find out this cycle ?
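A sketch of one way to do that with objgraph (picking the right candidate from by_type is an assumption; adjust to what the graph shows):

import objgraph

# list live objects of the suspicious type seen in the graph
candidates = objgraph.by_type('tuple')
suspect = candidates[0]  # assumption: pick the tuple from the graph

# walk back-references until a module-level owner is reached;
# the resulting chain usually shows who keeps the cycle alive
chain = objgraph.find_backref_chain(suspect, objgraph.is_proper_module)
objgraph.show_chain(chain, filename='chain.png')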

Did anyone figure this out? Why does pd.read_sql() seem to consume 10 times more memory than loading the same table from CSV?!

This is just what Python does: it doesn't release memory to the OS.

@jreback care to elaborate? I would like to dig more into this

this is an old issue and not reproducible. certainly have a look if you want.
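A small experiment that illustrates the point, assuming psutil is installed; exact numbers vary by platform and allocator:

import gc
import os

import pandas as pd
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    # resident set size, i.e. what the OS sees, in MB
    return proc.memory_info().rss // 2**20

print("baseline:", rss_mb())
df = pd.DataFrame({"a": range(50_000_000)})
print("after alloc:", rss_mb())

del df
gc.collect()
# RSS typically stays well above baseline: CPython and the C-level
# allocator keep freed pages pooled instead of returning them to the OS
print("after del + gc:", rss_mb())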

Hi,
I have a solution for this one!

merged_df.to_csv("merged_df.csv", index=False)
del df_1, df_2, merged_df
gc.collect()
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
merged_df = pd.DataFrame()
