Sometimes remote websites do not send a correct Content-Length header, so there is no trivial way to check how big a remote page is.
For example http://www.annonces-net.com/petites_annonces.php?page=principe
is a really big page due to some PHP errors; when I call response.read(), the process is terminated by the OOM killer.
aiohttp has response.content.read(chunk_size), but that is a rather low-level option.
What about adding an optional parameter, response.read(max_size=XXX), with some default limit like 10 MB?
That would be closer to what end users expect, and max_size=None could remove the limit.
Expected behaviour: a ClientResponseError is raised when response._content becomes too large.
Actual behaviour: the process is killed by the OOM killer.
```python
import aiohttp
import asyncio

async def main(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            body = await response.read()
            print(body)

url = 'http://www.annonces-net.com/petites_annonces.php?page=principe'
loop = asyncio.get_event_loop()
loop.run_until_complete(main(url))
```
The issue is not actually about adding a parameter (content.read() can do this now without code modification) but about changing the default to 10 MB.
I'm -0 for the change:
requests has no such default; it downloads the whole body.

I understand. What about a new optional parameter which does not break backward compatibility?
response.read(max_size=1024 * 1024), which passes max_size to content.read(max_size=max_size).
The default can be None, which means no limit, i.e. how it works right now.
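For illustration, here is a rough user-level sketch of what that could look like today on top of the streaming API (read_limited and its max_size parameter are hypothetical helpers, not part of aiohttp):

```python
import aiohttp

async def read_limited(response: aiohttp.ClientResponse, max_size=None) -> bytes:
    if max_size is None:
        return await response.read()  # current behaviour: whole body, no limit
    chunks = []
    total = 0
    # content.read(n) may return fewer than n bytes, so loop until
    # max_size bytes are collected or the stream ends.
    while total < max_size:
        chunk = await response.content.read(max_size - total)
        if not chunk:
            break
        chunks.append(chunk)
        total += len(chunk)
    return b''.join(chunks)
```

Note that this silently truncates at max_size, which is exactly the ambiguity discussed later in the thread.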
I like the second proposal (if you implement it as a PR).
The only question is the default: should it be None or -1? content.read() uses -1 for the whole content.
The signature is async def read(self, n=-1): ...; let's support the same.
Please keep in mind that response.read() caches content in memory.
Two consecutive calls, await resp.read(max_size=100) and await resp.read(max_size=500), should be consistent.
Maybe we could raise an exception when max_size is reached? The idea was to give users an option not to download pages that are too big; reading until a size limit is reached is already possible with content.read. Any concerns regarding an exception?
My idea is:
Suppose resp._content is 100 bytes long.

- resp.read(50) should return the first 50 bytes of resp._content without reading from the stream body (actually return memoryview(resp._content)[:50] to avoid copying memory).
- resp.read(150) should read 50 more bytes from the stream, append them to resp._content, and return the whole 150-byte content.

With this scheme no exception is needed. Objections?
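A toy mock of that scheme might look like this; _CachedReader and its internals are illustrative stand-ins, not aiohttp code:

```python
import asyncio

class _CachedReader:
    """Toy model of the proposed read-with-cache semantics."""

    def __init__(self, stream: bytes):
        self._stream = stream   # stands in for the network stream
        self._pos = 0           # how far we have read from the "stream"
        self._content = b''     # cached body so far

    async def read(self, n: int = -1):
        if n == -1:
            n = len(self._stream)
        if n > len(self._content):
            # Need more bytes: pull the difference from the stream and cache it.
            extra = n - len(self._content)
            self._content += self._stream[self._pos:self._pos + extra]
            self._pos += extra
        # Return a view of the cached prefix without copying.
        return memoryview(self._content)[:n]

async def demo():
    r = _CachedReader(b'x' * 200)
    assert len(await r.read(50)) == 50    # reads 50 bytes, caches them
    assert len(await r.read(150)) == 150  # reads 100 more, returns all 150

asyncio.run(demo())
```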
@asvetlov I don't think so. When you call .read(150) and it returns 150 bytes, you have no idea whether the response is exactly 150 bytes or was truncated by the limit. You can distinguish the two by calling .read(151) and checking the length, but that is not obvious.
Moreover, this parameter should be distinguished from content.read() (which is a streaming API by definition), since it is only a security check. We are preventing an unexpectedly large response to a request, so I think raising an exception is preferable. It should be rare that a partial result is acceptable and can be processed the same way as the whole result. In most cases, the processing should fail immediately and the whole request be dropped (e.g. bot scripts collecting data from websites may drop pages that are too large).
The exception could have an attribute holding the partial result, so after catching the exception the partial result can still be retrieved. And the partial result may be cached your way, so further read() calls do not change their behaviour after reordering.
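A minimal sketch of that idea, assuming a hypothetical BodyTooLarge exception (not part of aiohttp's API):

```python
import aiohttp

class BodyTooLarge(Exception):
    """Hypothetical exception; 'partial' holds what was read before the limit."""

    def __init__(self, max_size: int, partial: bytes):
        super().__init__(f'response body exceeds {max_size} bytes')
        self.partial = partial

async def read_checked(response: aiohttp.ClientResponse, max_size: int) -> bytes:
    body = bytearray()
    while True:
        chunk = await response.content.read(65536)
        if not chunk:                  # end of stream: body fits the limit
            return bytes(body)
        body.extend(chunk)
        if len(body) > max_size:
            raise BodyTooLarge(max_size, bytes(body))
```

A caller could then catch BodyTooLarge and still inspect exc.partial when a truncated body happens to be useful.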
The proposal should fit into existing API seamlessly.
I think we need a Pull Request as a subject for discussion.
Without concrete code the talk doesn't make much sense.
@asvetlov Bumped into a similar issue (see #3926). How about making the following change to read() in aiohttp/client_reqrep.py? Will this work reliably?
```python
#!/usr/bin/env python3
import asyncio

import aiohttp
from aiohttp import ClientConnectionError


async def read(self, n: int = -1) -> bytes:
    """Read up to 'n' bytes of the response payload.

    If 'n' is -1 (default), read the entire payload.
    """
    if self._body is None:
        try:
            if n == -1:
                self._body = await self.content.read()
            else:
                # content.read() may return fewer bytes than requested,
                # so collect chunks until 'n' bytes or end of stream.
                chunks = []
                i = 0
                while i < n:
                    chunk = await self.content.read(n=n - i)
                    if not chunk:
                        break
                    chunks.append(chunk)
                    i += len(chunk)
                self._body = b''.join(chunks)
            for trace in self._traces:
                await trace.send_response_chunk_received(self._body)
        except BaseException:
            self.close()
            raise
    elif self._released:
        raise ClientConnectionError('Connection closed')
    return self._body


async def f(url, n=-1):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            content = await read(response, n=n)
            print(len(content))


URL = 'https://upload.wikimedia.org/wikipedia/commons/thumb/7/71/2010-kodiak-bear-1.jpg/320px-2010-kodiak-bear-1.jpg'

asyncio.run(f(URL, 10_000_000))  # up to 10 MB
asyncio.run(f(URL, 100))
asyncio.run(f(URL))
```
Looks like it works, but I'm afraid we would have to drop self._body caching in this case.
Otherwise two subsequent calls, await resp.read(10) and await resp.read(100), would return only the first 10 bytes even if the content is longer.
Now I'm inclined to think that response body caching was a bad and confusing idea.
Disabling the cache is a breaking change, but:
a) we can allow it in aiohttp 4.0;
b) I don't recall aiohttp-based code that relies on the caching behaviour. The change may break something, but surely not too much.