Aiohttp: Memory leak when doing get() and scope closes

Created on 17 Jun 2020 · 11 comments · Source: aio-libs/aiohttp

🐞 Describe the bug

When downloading content from a website, memory usage increases roughly proportionally to the size of the content downloaded, with some overhead for the library and other objects in memory.

When the scope in which that content was downloaded is exited and the content is no longer reachable, I expect the garbage collector to de-allocate the downloaded objects, but memory usage remains near the peak reached right after the downloads finish.

💡 To Reproduce

  1. Set up a default ClientSession
  2. Open a connection to any site
  3. Download content from that site
  4. Leave the scope in which that content was downloaded
  5. Examine memory usage during and after the download, observing that it does not significantly decrease afterward
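Independent of memory_profiler, the same pattern can be double-checked with the stdlib `tracemalloc` module. This is only a sketch: a local `bytes` allocation stands in for the downloaded response body, and the helper names and sizes are my own choices for illustration:

```python
import gc
import tracemalloc


def snapshot_kib() -> float:
    """Return currently traced memory in KiB."""
    current, _peak = tracemalloc.get_traced_memory()
    return current / 1024


tracemalloc.start()
baseline = snapshot_kib()


def download_stand_in() -> int:
    # Stand-in for grab_data(): allocate ~10 MiB as the "response body".
    data = b"x" * (10 * 1024 * 1024)
    return len(data)  # `data` becomes unreachable when this returns


size = download_stand_in()
gc.collect()  # give the collector a chance before measuring
after = snapshot_kib()

# If nothing is retained, `after` should be back near `baseline`.
print(f"baseline={baseline:.0f} KiB, after={after:.0f} KiB")
tracemalloc.stop()
```

With a plain local allocation, traced memory drops back to roughly the baseline once the function returns; the question in this report is why the aiohttp version does not appear to behave the same way.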

💡 Expected behavior

Once the downloaded objects are no longer in scope, they should be de-allocated by the garbage collector and memory usage should fall back to a similar level as when the script initialized.

📋 Logs/tracebacks

I have no specific logs/tracebacks (no exception occurred), but I have profiled the relevant functions in two scripts with memory_profiler to demonstrate memory usage.

The first script uses aiohttp and demonstrates the bug:

aiohttp_profile.txt

```
Filename: ./aiohttp_example.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    22     26.8 MiB     26.8 MiB           1   @profile
    23                                         async def quicktest():
    24     26.8 MiB      0.0 MiB           1       url = "https://www.hq.nasa.gov/alsj/a17/A17_FlightPlan.pdf"
    25     26.8 MiB      0.0 MiB           1       session = aiohttp.ClientSession()
    26     26.8 MiB      0.0 MiB           8       jobs = [grab_data(url, session) for _ in range(5)]
    27    260.7 MiB    233.9 MiB           2       jobs_result = await asyncio.gather(*jobs)
    28    260.7 MiB      0.0 MiB           1       if not all(jobs_result):  # sanity check
    29                                                 _log.info("Something didn't give back data")
    30    260.7 MiB      0.0 MiB           1       await session.close()
```


```
Filename: ./aiohttp_example.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    33     26.7 MiB     26.7 MiB           1   @profile
    34                                         def main():
    35     26.7 MiB      0.0 MiB           1       logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
    36     26.8 MiB      0.1 MiB           1       loop = asyncio.get_event_loop()
    37     26.8 MiB      0.0 MiB           1       try:
    38    201.4 MiB    174.6 MiB           1           loop.run_until_complete(quicktest())
    39                                                 # memory usage should be back to normal
    40    201.4 MiB      0.0 MiB           1           _log.info("Completed example")
    41                                             except KeyboardInterrupt:
    42                                                 _log.info("shutting down...")
    43                                                 loop.stop()
```

The bug is visible here: on line 38 I expect memory usage to reach its peak, and by the next line it should drop back to init levels. Instead, on line 40 memory usage is still at the same level, even though the scope in which the memory was allocated is gone and inaccessible.
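One way to narrow this down is to check whether the retained memory is still reachable from live Python objects (a true reference leak) or whether the objects were in fact collected and the allocator simply has not returned the pages to the OS. A rough, aiohttp-agnostic sketch using the stdlib `gc` module (the helper names are my own, and the garbage-making function is a stand-in for `quicktest()`):

```python
import gc


def tracked_object_growth(fn) -> int:
    """Run fn() and report how many gc-tracked objects it leaves behind."""
    gc.collect()
    before = len(gc.get_objects())
    fn()
    gc.collect()
    return len(gc.get_objects()) - before


def make_and_drop_garbage():
    # Build and discard a large structure; nothing should survive the call.
    blobs = [bytes(1024) for _ in range(1000)]


growth = tracked_object_growth(make_and_drop_garbage)
print(f"gc-tracked objects left behind: {growth}")
```

If this reports a number near zero when wrapped around the aiohttp `quicktest()`, the downloaded bytes were collected and the sustained RSS would point at allocator behavior rather than objects being held; a large positive number would point at references still being held somewhere.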

The second script uses requests in a nearly identical way, and demonstrates the behavior I expect from aiohttp:

requests_profile.txt

```
Filename: ./requests_example.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    19     23.9 MiB     23.9 MiB           1   @profile
    20                                         def quicktest():
    21     23.9 MiB      0.0 MiB           1       url = "https://www.hq.nasa.gov/alsj/a17/A17_FlightPlan.pdf"
    22     23.9 MiB      0.0 MiB           1       session = requests.Session()
    23    126.2 MiB    102.2 MiB           8       jobs_result = [grab_data(url, session) for _ in range(5)]
    24    126.2 MiB      0.0 MiB           1       if not all(jobs_result):  # sanity check
    25                                                 _log.info("Something didn't give back data")
    26    126.2 MiB      0.0 MiB           1       session.close()
```


```
Filename: ./requests_example.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    29     23.9 MiB     23.9 MiB           1   @profile
    30                                         def main():
    31     23.9 MiB      0.0 MiB           1       logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
    32     23.9 MiB      0.0 MiB           1       try:
    33     27.4 MiB      3.5 MiB           1           quicktest()
    34                                                 # memory usage should be back to normal
    35     27.4 MiB      0.0 MiB           1           _log.info("Completed example")
    36                                             except KeyboardInterrupt:
    37                                                 _log.info("shutting down...")
```

Here the memory usage reported by memory_profiler is considerably different: usage drops as quicktest() on line 33 returns, so the scope of main() never sees the same peak.
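Separately, if the peak during asyncio.gather() is itself a concern, the body can be consumed in chunks instead of buffered whole with response.read(), using aiohttp's documented `iter_chunked()` API. A sketch against a throwaway local aiohttp server (the port, payload size, and chunk size are arbitrary choices for illustration):

```python
import asyncio

import aiohttp
from aiohttp import web

BODY = b"x" * (1024 * 1024)  # 1 MiB payload served by the throwaway server


async def handler(request):
    return web.Response(body=BODY)


async def fetch_streamed(url):
    """Read the response in chunks so only one chunk is buffered at a time."""
    total = 0
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            async for chunk in resp.content.iter_chunked(64 * 1024):
                total += len(chunk)  # process/discard each chunk here
    return total


async def main():
    app = web.Application()
    app.router.add_get("/", handler)
    runner = web.AppRunner(app)
    await runner.setup()
    site = web.TCPSite(runner, "127.0.0.1", 8091)  # arbitrary local port
    await site.start()
    try:
        return await fetch_streamed("http://127.0.0.1:8091/")
    finally:
        await runner.cleanup()


total = asyncio.run(main())
print(f"streamed {total} bytes")
```

With streaming, each download holds at most one chunk in memory at a time, so five concurrent downloads no longer need roughly five times the file size at peak; whether that also addresses the post-download residue reported above is a separate question.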

📋 Your version of Python

```console
$ python --version
Python 3.8.3
```

📋 Your version of the aiohttp/yarl/multidict distributions

```console
$ python -m pip show aiohttp
Name: aiohttp
Version: 3.6.2
Summary: Async http client/server framework (asyncio)
Home-page: https://github.com/aio-libs/aiohttp
Author: Nikolay Kim
Author-email: [email protected]
License: Apache 2
Location: /home/gregdan3/.local/share/virtualenvs/memory_leak_aiohttp-hbLDHJl6/lib/python3.8/site-packages
Requires: attrs, chardet, multidict, async-timeout, yarl
Required-by:
```

```console
$ python -m pip show multidict
Name: multidict
Version: 4.7.6
Summary: multidict implementation
Home-page: https://github.com/aio-libs/multidict
Author: Andrew Svetlov
Author-email: andrew.[email protected]
License: Apache 2
Location: /home/gregdan3/.local/share/virtualenvs/memory_leak_aiohttp-hbLDHJl6/lib/python3.8/site-packages
Requires:
Required-by: yarl, aiohttp
```

```console
$ python -m pip show yarl
Name: yarl
Version: 1.4.2
Summary: Yet another URL library
Home-page: https://github.com/aio-libs/yarl/
Author: Andrew Svetlov
Author-email: [email protected]
License: Apache 2
Location: /home/gregdan3/.local/share/virtualenvs/memory_leak_aiohttp-hbLDHJl6/lib/python3.8/site-packages
Requires: multidict, idna
Required-by: aiohttp
```

📋 Additional context

The relevant scripts and my Pipfile* are here:
scripts.zip

The aiohttp script:

```python
#!/usr/bin/env python
import asyncio
import logging

import aiohttp
from memory_profiler import profile

_log = logging.getLogger(__name__)

LOG_FORMAT = (
    "[%(asctime)s] [%(filename)22s:%(lineno)-4s] [%(levelname)8s]    %(message)s"
)


async def grab_data(url, session):
    async with session.get(url) as response:
        data = await response.read()
    return data


@profile
async def quicktest():
    url = "https://www.hq.nasa.gov/alsj/a17/A17_FlightPlan.pdf"
    session = aiohttp.ClientSession()
    jobs = [grab_data(url, session) for _ in range(5)]
    jobs_result = await asyncio.gather(*jobs)
    if not all(jobs_result):  # sanity check
        _log.info("Something didn't give back data")
    await session.close()


@profile
def main():
    logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(quicktest())
        # memory usage should be back to normal
        _log.info("Completed example")
    except KeyboardInterrupt:
        _log.info("shutting down...")
        loop.stop()


if __name__ == "__main__":
    main()
```
And the requests script:

```python
#!/usr/bin/env python
import logging

import requests
from memory_profiler import profile

_log = logging.getLogger(__name__)

LOG_FORMAT = (
    "[%(asctime)s] [%(filename)22s:%(lineno)-4s] [%(levelname)8s]    %(message)s"
)


def grab_data(url, session):
    with session.get(url) as response:
        data = response.content
    return data


@profile
def quicktest():
    url = "https://www.hq.nasa.gov/alsj/a17/A17_FlightPlan.pdf"
    session = requests.Session()
    jobs_result = [grab_data(url, session) for _ in range(5)]
    if not all(jobs_result):  # sanity check
        _log.info("Something didn't give back data")
    session.close()


@profile
def main():
    logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
    try:
        quicktest()
        # memory usage should be back to normal
        _log.info("Completed example")
    except KeyboardInterrupt:
        _log.info("shutting down...")


if __name__ == "__main__":
    main()
```

And my Pipfile:

```toml
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]

[packages]
memory_profiler = {git = "ssh://[email protected]/pythonprofilers/memory_profiler.git"}
aiohttp = "*"
requests = "*"
```

I have not embedded my Pipfile.lock due to its length, but it is included in the scripts.zip above and pins the same dependency versions shown earlier.