Pandas: Pandas read_csv out of memory even after adding chunksize

Created on 30 May 2017 · 17 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

dataframe = pandas.read_csv(inputFolder + dataFile, chunksize=1000000, na_values='null', usecols=fieldsToKeep, low_memory=False, header=0, sep='\t')
tables = map(lambda table: TimeMe(foo)(table, categoryExceptions), dataframe)

def foo(table, exceptions):
    """
    Modifies the columns of the dataframe in place to be categories, largely to save space.
    :type table: pandas.DataFrame
    :type exceptions: set of column names not to modify.
    :rtype: pandas.DataFrame
    """
    for c in table:
        if c in exceptions:
            continue

        x = table[c]
        if str(x.dtype) != 'category':
            x.fillna('null', inplace=True)
            table[c] = x.astype('category', copy=False)
    return table

Problem description

I have a 34 GB tsv file and I've been reading it using pandas' read_csv function with chunksize specified as 1000000. The command above works fine with an 8 GB file, but pandas crashes for my 34 GB file, subsequently crashing my IPython notebook.

Labels: IO CSV, Usage Question

Most helpful comment

I've solved the memory error problem using chunks AND low_memory=False

chunksize = 100000
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize, low_memory=False):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)

All 17 comments

Please show pd.show_versions().

If the above is all you are doing, then it should work. Exactly where does it run out of memory?

I'm running it in a Jupyter notebook, and it crashes (the kernel dies) after processing 124 chunks of this data.

There is no error in the output; the notebook crashes before that.
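One rough way to narrow that down (a sketch, not from the thread; the file name and chunk size are placeholders) is to print how much memory the chunks retain as they are read:

import pandas as pd

# Hypothetical instrumentation: cumulative deep memory usage of the chunks,
# to see roughly where the kernel would run out if they were all kept alive.
total_bytes = 0
for i, chunk in enumerate(pd.read_csv('data.tsv', sep='\t', chunksize=1000000)):
    total_bytes += chunk.memory_usage(deep=True).sum()
    print(f"chunk {i}: ~{total_bytes / 1e9:.1f} GB retained so far")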

Not to be pedantic, but are you sure your file is tab-separated? I've had an issue where I passed the wrong separator, and pandas tried to construct a single giant string, which blew up memory.

Yes, I verified that too; it's tab-separated :)

The same function worked for an 8 GB version of the file.

@gk13 you would have to show more code. It is certainly possible that the reading part is fine, but your chunk processing blows up memory.

I've updated the code above. It blows up after foo has processed 124 chunks.

That keeps the reference around; you can gc.collect(), or better yet, don't use inplace at all.
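For reference, a minimal sketch of foo without the inplace fillna (same logic and the same exceptions set as in the original code), so no extra references are kept alive:

def foo(table, exceptions):
    """Convert non-excepted columns to category dtype, without inplace operations."""
    for c in table.columns:
        if c in exceptions:
            continue
        if str(table[c].dtype) != 'category':
            # fillna returns a new Series; reassigning replaces the column outright
            table[c] = table[c].fillna('null').astype('category')
    return table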

I tried gc.collect() before returning from foo, didn't help.

Any other suggestions?

@gk13 : I'm in agreement with @TomAugspurger that your file could be malformed, as you haven't been able to show that you can read this file any other way (then again, what better way is there than with pandas 😄).

Why don't you do this:

Instead of reading the entire file into memory, pass in iterator=True with a specified chunksize. Using the returned iterator, call .read() multiple times and see what you get with each chunk. Then we can confirm whether your file is in fact formatted correctly.
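A minimal sketch of that suggestion (the file name, separator, and chunk size are placeholders):

import pandas as pd

reader = pd.read_csv('data.tsv', sep='\t', iterator=True)
first = reader.read(1000000)      # read the first million rows
print(first.shape, first.dtypes)  # sanity-check the parsed columns
second = reader.read(1000000)     # read the next million rows and compare
print(second.shape, second.dtypes)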

I've solved the memory error problem using smaller chunks (size 1). It was about 3x slower, but it didn't error out; low_memory=False didn't work for me.

chunksize = 1
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)

what does axis=0 do?

axis=0 appends new rows (stacks the chunks vertically)
axis=1 appends new columns (places them side by side)
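A toy illustration (made-up frames, not from the issue):

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

print(pd.concat([a, b], axis=0))  # 4 rows, 1 column: chunks stacked vertically
print(pd.concat([a, b], axis=1))  # 2 rows, 2 columns: frames placed side by side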

Seems like the debugging effort in the original question stalled, while others have had success using pd.concat with chunks and low_memory=False. Will tag as a usage question and close.

