Hi,
Lately I have started to use pandas heavily with relatively big files.
When the RAM becomes full I delete the big objects (DataFrames) with del and gc, but this doesn't clear the memory.
Can someone please advise how to clear variables from RAM, similar to R's rm(some_object); gc()?
I additionally ran some experiments on this; here are the results:
Started with ~600MB RAM; after loading the data and then calling gc.collect(), RAM only came back down to ~3000MB (similarly with all three methods below).
What would you recommend?
Regards,
Boris
Can you provide a small reproducible example?
import pandas as pd
import gc
df_1 = pd.read_sql(query, connection_1)
df_2 = pd.read_sql(query, connection_2)
merged_df = pd.merge(df_1, df_2, on='id', how='left')
# Method 1: plain del plus an explicit garbage collection
del df_1
del df_2
del merged_df
gc.collect()
# Method 2: IPython's %reset_selective magic
%reset_selective -f df_1
%reset_selective -f df_2
%reset_selective -f merged_df
gc.collect()
# Method 3: del followed by glibc's malloc_trim to hand freed arenas back to the OS (Linux only)
def malloc_trim():
    import ctypes
    ctypes.CDLL('libc.so.6').malloc_trim(0)

del df_1
del df_2
del merged_df
malloc_trim()
gc.collect()
# Method 4: wipe the whole IPython namespace
%reset -f

# Keeping the merged result across a full reset requires a round-trip through disk:
merged_df.to_csv("merged_df.csv", index=False)
%reset -f
merged_df = pd.read_csv("merged_df.csv")
The only option that freed most of the memory was "%reset -f", which is not a real solution.
IPython tends to hold onto cached references (e.g. the _, __ items), so looking at memory directly is not that useful.

OK, how can I remove all those memory references?
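For what it's worth, a minimal sketch of dropping those cached output references (displayhook.flush() and the %reset magic are real IPython APIs; whether they account for the memory seen here is an assumption):

from IPython import get_ipython

# Drop IPython's output cache: the Out dict and the _, __, ___ shortcuts
ip = get_ipython()
ip.displayhook.flush()

# Equivalent as a magic inside a session:
# %reset -f out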
Try closing the connections or freeing the SQL objects as a first step. I typically use objgraph to figure out what is holding onto the objects.
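A minimal sketch of that objgraph workflow (show_backrefs is a real objgraph function; it needs graphviz installed, and the variable name is taken from the snippet above):

import objgraph

# Render everything that still holds a reference to the merged frame
objgraph.show_backrefs([merged_df], max_depth=3, filename='merged_df_backrefs.png')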
Did what you suggested; please find the graph below:

[objgraph back-reference graph was attached here]

Any suggestions on what could be holding the DataFrame?
And all the connections were closed.
If you directly create the DataFrame, and then delete it, do you get the same issue? Or does this only occur when you read SQL?
Note the reference cycle on the right side of your graph (type DataFrame <-> tuple). You can probably get that type's objects from objgraph and then figure out what the tuple exactly is.
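A sketch of how one might pull those objects out (by_type and show_backrefs are real objgraph functions; which instance sits in the cycle is an assumption):

import objgraph

dfs = objgraph.by_type('DataFrame')   # every live DataFrame instance
print(len(dfs))

# Trace what keeps the first one alive, tuples included
objgraph.show_backrefs(dfs[:1], max_depth=4, filename='df_backrefs.png')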
"If you directly create the DataFrame, and then delete it, do you get the same issue? Or does this only occur when you read SQL?"
If I read the same data from disk it barely increases the memory, so deleting that DataFrame barely reduces memory usage.
"Note the reference cycle on the right side of your graph (type DataFrame <-> tuple). You can probably get the that type object from objgraph and then figure out what the tuple exactly is."
Do you mean the one with "tuple 5 items" caption ? why did you find especially this one interesting ?
Basically it is a reference cycle, and that memory cannot be freed easily because the garbage collector does not know in what order to delete the objects. There is documentation out there on the subject.
So the question might be: what is creating all that memory? Is it the underlying SQL objects or something else? I would imagine that the memory footprint of reading from SQL and from disk should be similar.
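For illustration, a minimal self-contained example of such a cycle (the Node class is made up for the demo; gc.collect() is what reclaims the pair):

import gc

class Node:
    """Stand-in for an object caught in a cycle, like the DataFrame above."""
    pass

a, b = Node(), Node()
a.partner, b.partner = b, a   # a <-> b reference cycle

del a, b             # reference counts never reach zero on their own
print(gc.collect())  # the cycle detector finds and frees the pair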
"You can probably get the that type object from objgraph and then figure out what the tuple exactly is."
How can I extract this object from the objgraph ? how to find out this cycle ?
Did anyone figure this out? Why does pd.read_sql() seem to consume 10 times more memory than loading the same table from CSV?!
this is just what python does. it doesn't release memory back to the os.
@jreback care to elaborate? I would like to dig more into this
this is an old issue and not reproducible. certainly have a look if you want.
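For anyone who does dig in, a small helper like this makes it easy to watch the process RSS between each load/del/gc step (psutil is assumed to be installed):

import os
import psutil

def rss_mb():
    """Resident set size of the current process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 2**20

print(f"RSS now: {rss_mb():.0f} MB")   # call between each step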
Hi,
I have a solution for this one!
merged_df.to_csv("merged_df.csv", index=False)   # persist the result first
del [[df_1, df_2, merged_df]]                    # delete all three names
gc.collect()                                     # collect anything now unreachable
df_1 = pd.DataFrame()                            # recreate the names as empty frames
df_2 = pd.DataFrame()
merged_df = pd.DataFrame()
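For what it's worth, the bracketed target list is legal but unusual Python; it unpacks to a plain multi-name del, so this sketch has the identical effect:

del df_1, df_2, merged_df   # same as del [[df_1, df_2, merged_df]]
gc.collect()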