def foo(table, exceptions):
    """
    Modifies the columns of the dataframe in place to be categories, largely to save space.
    :type table: pandas.DataFrame
    :type exceptions: set
    :param exceptions: columns not to modify.
    :rtype: pandas.DataFrame
    """
    for c in table:
        if c in exceptions:
            continue
        x = table[c]
        if str(x.dtype) != 'category':
            x.fillna('null', inplace=True)
            table[c] = x.astype('category', copy=False)
    return table

dataframe = pandas.read_csv(inputFolder + dataFile, chunksize=1000000, na_values='null',
                            usecols=fieldsToKeep, low_memory=False, header=0, sep='\t')
tables = map(lambda table: TimeMe(foo)(table, categoryExceptions), dataframe)
I have a 34 GB tsv file and I've been reading it using pandas' read_csv function with chunksize specified as 1000000. The command above works fine with an 8 GB file, but pandas crashes for my 34 GB file, subsequently crashing my IPython notebook.
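For reference, a minimal sketch of the same pipeline written as an explicit loop over the chunked reader, so each raw chunk can be released once it has been processed (the file name is a placeholder; foo and categoryExceptions are as defined above):

import pandas as pd

# Sketch only: read_csv with chunksize returns a TextFileReader that yields
# one DataFrame per chunk; process each chunk as it arrives instead of
# materialising the whole file at once.
reader = pd.read_csv('big_file.tsv', sep='\t', chunksize=1000000, na_values='null')

tables = []
for chunk in reader:
    tables.append(foo(chunk, categoryExceptions))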
Please show the output of pd.show_versions().
If the above is all you are doing, then it should work. Exactly where does it run out of memory?
I'm running it in a Jupyter notebook, and it crashes (the kernel dies) after processing 124 chunks of this data.
There is no error in the output; the notebook crashes before that.
Not to be pedantic, but are you sure your file is tab-separated? I've had an issue where I passed the wrong separator, and pandas tried to construct a single giant string which blew up memory.
Yes, I verified that too; it's tab-separated :)
The same function worked for an 8 GB version of the file.
@gk13 you would have to show more code. It is certainly possible that the reading part is fine, but your chunk processing blows up memory.
I've updated the code above. It blows up after processing foo 124 times.
This keeps the reference around; you can gc.collect(), or better yet, don't use inplace at all.
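A minimal sketch of that suggestion, rewriting foo without inplace so the intermediate Series is not kept alive by an extra reference (only the 'null' fill value and the column loop are taken from the original function):

def foo(table, exceptions):
    # Convert non-excluded columns to category dtype without inplace operations.
    for c in table:
        if c in exceptions:
            continue
        if str(table[c].dtype) != 'category':
            # fillna returns a new Series; assign it straight back instead of
            # mutating a separate reference with inplace=True
            table[c] = table[c].fillna('null').astype('category')
    return table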
I tried gc.collect() before returning from foo; it didn't help.
Any other suggestions?
@gk13 : I'm in agreement with @TomAugspurger that your file could be malformed, as you have not been able to prove that you were able to read this otherwise (then again, what better way is there to do it than with pandas 😄).
Why don't you do this:
Instead of reading the entire file into memory, pass in iterator=True with a specified chunksize. Using the returned iterator, call .read() multiple times and see what you get with each chunk. Then we can confirm whether your file is in fact formatted correctly.
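A hedged sketch of that check (the file name is a placeholder; get_chunk() is used here as the per-chunk read method on the TextFileReader that read_csv returns when iterator=True is passed):

import pandas as pd

# Sketch only: pull a few chunks and inspect their shape to verify the file
# parses as expected before processing the whole thing.
reader = pd.read_csv('big_file.tsv', sep='\t', iterator=True, chunksize=1000000)

for i in range(5):
    chunk = reader.get_chunk()   # next chunksize rows as a DataFrame
    print(i, chunk.shape)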
I've solved the memory error problem using chunks AND low_memory=False
chunksize = 100000
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize, low_memory=False):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
I've solved the memory error problem using smaller chunks (size 1). It was like 3x slower, but it didn't error out. low_memory=False didn't work
chunksize = 1
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
what does axis=0 do?
axis=0 - append new rows (stack the chunks vertically)
axis=1 - append new columns (place them side by side)
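A small toy example (made-up frames, not from the thread) showing the difference:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

rows = pd.concat([a, b], axis=0)   # stacks vertically: shape (4, 1)
cols = pd.concat([a, b], axis=1)   # places side by side: shape (2, 2)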
Seems like the debugging efforts in the original question stalled, while others have had success using concat with chunks and low_memory=False. Will tag as a usage question and close.