def foo(table, exceptions):
    """
    Modifies the columns of the dataframe in place to be categories, largely to save space.
    :type table: pandas.DataFrame
    :type exceptions: set
    :param exceptions: columns not to modify.
    :rtype: pandas.DataFrame
    """
    for c in table:
        if c in exceptions:
            continue
        x = table[c]
        if str(x.dtype) != 'category':
            x.fillna('null', inplace=True)
            table[c] = x.astype('category', copy=False)
    return table

dataframe = pandas.read_csv(inputFolder + dataFile, chunksize=1000000, na_values='null',
                            usecols=fieldsToKeep, low_memory=False, header=0, sep='\t')
tables = map(lambda table: TimeMe(foo)(table, categoryExceptions), dataframe)
I have a 34 GB tsv file and I've been reading it using pandas' read_csv function with chunksize specified as 1000000. The command above works fine with an 8 GB file, but pandas crashes for my 34 GB file, subsequently crashing my IPython notebook.
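For reference, a minimal sketch of the same pipeline written as an explicit loop over the chunked reader, so each raw chunk can be released once it has been processed (the file name is a placeholder; foo and categoryExceptions are as defined above):

import pandas as pd

# Sketch only: read_csv with chunksize returns a TextFileReader that yields
# one DataFrame per chunk; process each chunk as it arrives instead of
# materialising the whole file at once.
reader = pd.read_csv('big_file.tsv', sep='\t', chunksize=1000000, na_values='null')

tables = []
for chunk in reader:
    tables.append(foo(chunk, categoryExceptions))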
Please show the output of pd.show_versions().
If the above is all you are doing, then it should work. Exactly where does it run out of memory?
I'm running it in a Jupyter notebook, and it crashes (the kernel dies) after processing 124 chunks of this data.
There is no error in the output; the notebook crashes before that.
Not to be pedantic, but are you sure your file is tab-separated? I've had an issue where I passed the wrong separator, and pandas tried to construct a single giant string which blew up memory.
Yes, I verified that too; it's tab-separated :)
The same function worked for an 8 GB version of the file.
@gk13 you would have to show more code. It is certainly possible that the reading part is fine, but your chunk processing blows up memory.
I've updated the code above. It blows up after processing foo 124 times.
This keeps the reference around; you can gc.collect(), or better yet, don't use inplace at all.
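A minimal sketch of that suggestion, rewriting foo without inplace so the intermediate Series is not kept alive by an extra reference (only the 'null' fill value and the column loop are taken from the original function):

def foo(table, exceptions):
    # Convert non-excluded columns to category dtype without inplace operations.
    for c in table:
        if c in exceptions:
            continue
        if str(table[c].dtype) != 'category':
            # fillna returns a new Series; assign it straight back instead of
            # mutating a separate reference with inplace=True
            table[c] = table[c].fillna('null').astype('category')
    return table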
I tried gc.collect() before returning from foo; it didn't help.
Any other suggestions?
@gk13 : I'm in agreement with @TomAugspurger that your file could be malformed, as you have not been able to prove that you were able to read this otherwise (then again, what better way is there to do it than with pandas 😄).
Why don't you do this:
Instead of reading the entire file into memory, pass in iterator=True with a specified chunksize. Using the returned iterator, call .read() multiple times and see what you get with each chunk. Then we can confirm whether your file is in fact formatted correctly.
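A hedged sketch of that check (the file name is a placeholder; get_chunk() is used here as the per-chunk read method on the TextFileReader that read_csv returns when iterator=True is passed):

import pandas as pd

# Sketch only: pull a few chunks and inspect their shape to verify the file
# parses as expected before processing the whole thing.
reader = pd.read_csv('big_file.tsv', sep='\t', iterator=True, chunksize=1000000)

for i in range(5):
    chunk = reader.get_chunk()   # next chunksize rows as a DataFrame
    print(i, chunk.shape)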
I've solved the memory error problem using chunks AND low_memory=False
chunksize = 100000
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize, low_memory=False):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
I've solved the memory error problem using smaller chunks (size 1). It was like 3x slower, but it didn't error out. low_memory=False didn't work
chunksize = 1
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
what does axis=0 do?
axis=0 - append new rows (stack the chunks vertically)
axis=1 - append new columns (place them side by side)
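A small toy example (made-up frames, not from the thread) showing the difference:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

rows = pd.concat([a, b], axis=0)   # stacks vertically: shape (4, 1)
cols = pd.concat([a, b], axis=1)   # places side by side: shape (2, 2)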
Seems like the debugging efforts in the original question stalled, while others have had success using concat with chunks and low_memory=False. Will tag as a usage question and close.