Pandas 1.0.1 - .rolling().min() and .rolling().max() create memory leak at <__array_function__ internals>:6

Created on 26 Feb 2020 · 7 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import tracemalloc, linecache
import sys, os
import pandas as pd

def display_top_mem(snapshot, key_type='lineno', limit=10):
    """function for displaying lines of code taking most memory"""
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)

    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)

    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))


def main():
    tracemalloc.start()
    periods = 745
    df_init = pd.read_csv('./mem_debug_data.csv', index_col=0)

    for i in range(100):
        df = df_init.copy()

        df['l:c:B'] = df['c:B'].rolling(periods).min()
        df['h:c:B'] = df['c:B'].rolling(periods).max()

        #df['l:c:B'] = df['c:B'].rolling(periods).mean()
        #df['h:c:B'] = df['c:B'].rolling(periods).median()

        snapshot = tracemalloc.take_snapshot()
        display_top_mem(snapshot, limit=3)
        print(f'df size {sys.getsizeof(df)/1024} KiB')
        print(f'{i} ##################')


if __name__ == '__main__':
    main()

Problem description

Pandas rolling().min() and rolling().max() create a memory leak. I've run line-based memory profiling with tracemalloc, and <__array_function__ internals>:6 grows in size on every loop iteration of the script above when both of these functions are present. For 1000 iterations it consumes around 650 MB of RAM, whereas if rolling().min() and rolling().max() are changed to rolling().mean() and rolling().median() and run for 1000 iterations, RAM consumption stays constant at around 4 MB. Therefore rolling().min() and rolling().max() appear to be the problem.

The output of this script running for 100 iterations with <__array_function__ internals>:6 constantly increasing in size can be found here: https://pastebin.com/nvGKgmPq

CSV file mem_debug_data.csv used in the script can be found here: http://www.sharecsv.com/s/ad8485d8a0a24a5e12c62957de9b13bd/mem_debug_data.csv
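
For readers who prefer not to download the CSV, a minimal sketch of the same loop on synthetic data should show the same pattern (the column name, random data, and iteration count here are placeholders, not the original dataset): the retained memory reported by tracemalloc keeps growing for min()/max() but stays roughly flat for mean().

import tracemalloc
import numpy as np
import pandas as pd

def measure(agg, n_iter=200, periods=745):
    # Synthetic stand-in for mem_debug_data.csv; any long float column works.
    df_init = pd.DataFrame({'c:B': np.random.rand(20_000)})
    tracemalloc.start()
    for _ in range(n_iter):
        df = df_init.copy()
        df['out'] = getattr(df['c:B'].rolling(periods), agg)()
    retained, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return retained / 1024  # KiB still allocated after the loop

for agg in ('min', 'max', 'mean'):
    print(f'{agg}: {measure(agg):.1f} KiB retained')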

Expected Output

Running rolling().min() and rolling().max() repeatedly should not grow RAM consumption over time.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-88-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Labels: Duplicate, Performance, Window

Most helpful comment

Fixed by #33693 in 1.0.4, I think.

All 7 comments

A workaround is to use NumPy with the following stride-based functions. pandas' .rolling().apply() with a lambda can also be used on top of rolling, but it is very slow.

import numpy as np

def rolling_window_nan_filled(a_org, window):
    """Return a strided 2D view of a_org where each row is one rolling window,
    front-padded with window-1 NaNs so the output length matches the input."""
    a = np.concatenate((np.full(window - 1, np.nan), a_org))
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def numpy_rolling_min(values, periods):
    return np.min(rolling_window_nan_filled(values, periods), axis=1)

def numpy_rolling_max(values, periods):
    return np.max(rolling_window_nan_filled(values, periods), axis=1)

numpy_rolling_min() and numpy_rolling_max() expect a NumPy array rather than a pandas Series; the underlying array can be obtained via df[column].values.
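
A usage sketch, following the CSV, column name, and window from the original script. Note that np.min/np.max propagate NaNs, whereas pandas' rolling min/max skip them, so the results only match pandas on NaN-free data.

import pandas as pd

# Drop-in replacement for the two leaking lines in the original script,
# using the stride-based helpers defined above.
df = pd.read_csv('./mem_debug_data.csv', index_col=0)
periods = 745

df['l:c:B'] = numpy_rolling_min(df['c:B'].values, periods)
df['h:c:B'] = numpy_rolling_max(df['c:B'].values, periods)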

@regmeg can you check if you see the same problem with pandas 0.25, or whether it is new in 1.0?

Hi @jorisvandenbossche, thanks for your reply. I've just rerun the script with 0.25; the memory does not seem to accumulate, so there is no memory leak there.

The script I've submitted is copy-pastable and should be easy to replicate with 1.0.1; you just need to download the dataset and run the script. The leak occurs both on my local Linux machine and in Docker Linux/Python-based images on instances.

This is a pretty severe bug in my eyes, so I think it should get higher priority.

It's happening with the latest version of pandas, too (1.0.3 at the time of writing, and on current master), if that helps any.
Can also confirm that it doesn't happen with 0.25.3.

Doing some investigation:

Running git bisect between v1.0.0 and v0.25.3, testing the code segment above at each step, I got the following output:

6e5d14834072e7856987eb31e574b2a05db9f0b9 is the first bad commit
commit 6e5d14834072e7856987eb31e574b2a05db9f0b9
Author: Matthew Roeschke <[email protected]>
Date:   Thu Nov 21 04:59:30 2019 -0800

    REF: Separate window bounds calculation from aggregation functions (#29428)

:040000 040000 163ddd42a163c6da0f81a44827efb37bf195cefd 642367c538696b935b83329ba4656c47e838d5fc M      pandas
:100755 100755 545765ecb114d20248f81d1bdaacf6bfd3b53050 0915b6aba113a1af9976db69d791c72997feea95 M      setup.py

The last good commit seems to be a46806c3995e2ddc0948f5c8c34f157c92164e42, while the one introducing the problem is 6e5d14834072e7856987eb31e574b2a05db9f0b9.

Now I fail to see why it would work for mean but not for min/max... but I hope this helps someone with more knowledge of the pandas code find the problem quickly.
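
For anyone who wants to repeat the bisect, a script along these lines could be passed to git bisect run (the synthetic data, iteration count, and 10 MB retention threshold are arbitrary choices, and pandas' C extensions have to be rebuilt at each bisect step for the check to test the checked-out commit):

# check_leak.py -- exit 0 ("good") if memory is released, 1 ("bad") otherwise.
# Usage: git bisect run python check_leak.py
import sys
import tracemalloc
import numpy as np
import pandas as pd

def main():
    df_init = pd.DataFrame({'c:B': np.random.rand(20_000)})
    tracemalloc.start()
    for _ in range(200):
        df = df_init.copy()
        df['lo'] = df['c:B'].rolling(745).min()
        df['hi'] = df['c:B'].rolling(745).max()
    retained, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    sys.exit(1 if retained > 10 * 1024 * 1024 else 0)

if __name__ == '__main__':
    main()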

Additional info: #33693 will fix this issue.

Fixed by #33693 in 1.0.4, I think.

Confirmed fixed in 1.0.4

