Aiohttp: There is no way to set a size limit for response.read()

Created on 1 Jan 2018 · 10 comments · Source: aio-libs/aiohttp

Long story short

Sometimes remote websites do not send a correct Content-Length header, so there is no trivial way to check how big a remote page is.

For example, http://www.annonces-net.com/petites_annonces.php?page=principe is a really big page due to some PHP errors. When I call response.read(), the process is terminated by the OOM killer.

aiohttp has response.content.read(chunk_size), but it is a really low-level option.
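
Roughly, the low-level workaround looks like this today (a sketch; the 10 MB cap and the 64 KiB chunk size are just example values, not aiohttp defaults):

MAX_SIZE = 10 * 1024 * 1024  # example cap, not an aiohttp default


async def fetch_capped(session, url):
    async with session.get(url) as response:
        body = b''
        while True:
            chunk = await response.content.read(64 * 1024)  # up to 64 KiB per call
            if not chunk:  # EOF: the whole body fits under the cap
                return body
            body += chunk
            if len(body) > MAX_SIZE:
                raise RuntimeError('response body exceeds MAX_SIZE')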

What about adding an optional parameter, response.read(max_size=XXX), with some default limit like 10 MB?

It might be slightly more expected behaviour for end users, and max_size=None could remove the limit.

Expected behaviour

A ClientResponseError is raised when response._content becomes too large.

Actual behaviour

The process is killed by the OOM killer.

Steps to reproduce

import aiohttp
import asyncio


async def main(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            body = await response.read()
            print(body)

url = 'http://www.annonces-net.com/petites_annonces.php?page=principe'
loop = asyncio.get_event_loop()
loop.run_until_complete(main(url))

Your environment

client


All 10 comments

The issue is not actually about adding a parameter (content.read() can do that today without code modification) but about changing the default to 10 MB.
I'm -0 for the change:

  1. It's not backward compatible; people will never expect such a limitation.
  2. requests has no such default; it downloads the whole body.
  3. Browsers download the whole response content too.

I understand. What about a new optional parameter which does not break backward compatibility?

response.read(max_size=1024 * 1024), which passes max_size on to content.read(max_size=max_size).

The default could be None, meaning no limit, which is how it works right now.
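
Usage could look like this (a sketch of the proposed parameter; max_size does not exist in aiohttp today):

body = await response.read(max_size=1024 * 1024)  # proposed: cap the body at 1 MiB
body = await response.read(max_size=None)         # proposed default: no limit (current behaviour)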

I like the second proposal (if you implement it in a PR).
The only question is the default: should it be None or -1? content.read() uses -1 for the whole content.
The signature is async def read(self, n=-1): ...; let's support the same.

Please keep in mind that response.read() caches content in memory.
Two consecutive calls await resp.read(max_size=100) and await resp.read(max_size=500) should be consistent.

Maybe we can raise an exception if max_size is reached? The idea was to give an option not to download pages that are too big; reading until a size limit is reached is what content.read can already do today. Any concerns regarding an exception?

My idea is:

  1. Assume the previous call requested 100 bytes, so resp._content is 100 bytes long.
  2. The next resp.read(50) should return the first 50 bytes from resp._content without reading from the stream body (actually return memoryview(resp._content)[:50] to avoid copying memory).
  3. The next resp.read(150) should read 50 more bytes from the stream, append them to resp._content, and return the whole 150-byte content.

With this scheme no exception is needed. Objections?
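
A minimal sketch of that scheme (a hypothetical wrapper; stream stands in for response.content, whose read(n) returns at most n bytes per call):

class _CachedRead:
    """Sketch of the caching read() semantics from steps 1-3 above."""

    def __init__(self, stream):
        self._stream = stream   # e.g. response.content
        self._content = b''     # bytes already read and cached

    async def read(self, n=-1):
        if n == -1:
            # Read whatever is left and cache the full body.
            self._content += await self._stream.read()
            return self._content
        # Top up the cache until it holds n bytes or the stream ends.
        while len(self._content) < n:
            chunk = await self._stream.read(n - len(self._content))
            if not chunk:  # EOF: the body is shorter than n
                break
            self._content += chunk
        # Serve the prefix from the cache without copying.
        return memoryview(self._content)[:n]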

@asvetlov I don't think so. When you call .read(150) and it returns 150 bytes, you have no idea whether the response is exactly 150 bytes or was truncated by the limit. You can distinguish the two by calling .read(151) and checking the length, but that is far from obvious.
Moreover, this parameter should be kept distinct from content.read() (which is a streaming API by definition), since it is only a safety check. We are guarding against an unexpectedly large response, so I think raising an exception is preferable. It should be rare that a partial result is acceptable and can be processed the same way as the whole result. In most cases the processing should fail immediately and the whole request be dropped (e.g. bot scripts collecting data from websites may simply drop pages that are too large).
The exception could carry the partial result as an attribute, so after catching the exception the partial result can still be retrieved. And the partial result may be cached in your way, so further read() calls do not change their behavior after reordering.
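
For example, it could look like this (BodyTooLargeError and its partial attribute are hypothetical names, not existing aiohttp API):

class BodyTooLargeError(Exception):
    """Hypothetical exception carrying the bytes read before the limit was hit."""

    def __init__(self, partial):
        super().__init__('response body exceeds the size limit')
        self.partial = partial  # partial result, retrievable after catching


async def read_limited(response, max_size):
    # Stream the body and abort once more than max_size bytes arrive.
    chunks, total = [], 0
    async for chunk in response.content.iter_chunked(64 * 1024):
        chunks.append(chunk)
        total += len(chunk)
        if total > max_size:
            raise BodyTooLargeError(b''.join(chunks))
    return b''.join(chunks)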

The proposal should fit into existing API seamlessly.
I think we need a Pull Request as a subject for discussion.
Without concrete code the talk doesn't make much sense.

@asvetlov Bumped into a similar issue (see #3926). How about making the following change to read in aiohttp/client_reqrep.py? Will this work reliably?

#!/usr/bin/env python3

import aiohttp
import asyncio
from aiohttp import ClientConnectionError


async def read(self, n: int = -1) -> bytes:
    """Read up to 'n' bytes of the response payload.

    If 'n' is -1 (default), read the entire payload.
    """
    if self._body is None:
        try:
            if n == -1:
                self._body = await self.content.read()
            else:
                # Collect chunks until n bytes are gathered or EOF is hit.
                chunks = []
                i = 0
                while i < n:
                    chunk = await self.content.read(n=n - i)
                    if not chunk:  # EOF before n bytes arrived
                        break
                    chunks.append(chunk)
                    i += len(chunk)

                self._body = b''.join(chunks)

            for trace in self._traces:
                await trace.send_response_chunk_received(self._body)

        except BaseException:
            self.close()
            raise
    elif self._released:
        raise ClientConnectionError('Connection closed')

    return self._body


async def f(url, n=-1):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            content = await read(response, n=n)
            print(len(content))


URL = 'https://upload.wikimedia.org/wikipedia/commons/thumb/7/71/2010-kodiak-bear-1.jpg/320px-2010-kodiak-bear-1.jpg'

asyncio.run(f(URL, 10_000_000))  # 10 MB
asyncio.run(f(URL, 100))
asyncio.run(f(URL))

It looks like it works, but I'm afraid we would have to drop the self._body caching in this case.
Otherwise two subsequent calls await resp.read(10) and await resp.read(100) return only the first 10 bytes even if the content is longer.
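
For example (a sketch using the module-level read() helper from the snippet above):

head = await read(response, n=10)   # caches the first 10 bytes in _body
full = await read(response, n=100)  # served from the cache: still 10 bytes
assert len(full) == 10              # not 100, even if the body is longer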

Now I'm inclined to think that response body caching was a bad and confusing idea.
Disabling the cache is a breaking change, but:
a) we can allow it in aiohttp 4.0;
b) I don't recall aiohttp-based code that relies on the caching behavior. The change may break something, but surely not much.
