Requests: Uncertain about content/text vs iter_content(decode_unicode=True/False)

Created on 22 Jun 2016  路  15Comments  路  Source: psf/requests

When requesting an application/json document, I'm seeing next(r.iter_content(16*1024, decode_unicode=True)) returning bytes, whereas r.text returns unicode. My understanding was that both should return a unicode object. In essence, I thought "iter_content" was equivalent to "iter_text" when decode_unicode was True. Have I misunderstood something? I can provide an example if needed.

For reference, I'm using python 3.5.1 and requests 2.10.0.

Thanks!

Bug

Most helpful comment

There's at least one key difference: decode_unicode=True doesn't fall back to apparent_encoding, which means it'll never autodetect the encoding. This means if response.encoding is None it is a no-op: in fact, it's a no-op that yields bytes.

That behaviour seems genuinely bad to me, so I think we should consider it a bug. I'd rather we had the same logic as in text for this.

All 15 comments

what does (your response object).encoding return?

There's at least one key difference: decode_unicode=True doesn't fall back to apparent_encoding, which means it'll never autodetect the encoding. This means if response.encoding is None it is a no-op: in fact, it's a no-op that yields bytes.

That behaviour seems genuinely bad to me, so I think we should consider it a bug. I'd rather we had the same logic as in text for this.

r.encoding returns None.

On a related note, iter_text might be clearer/more consistent than iter_content(decode_unicode=True) if there's room for change in the APIs future (and iter_content_lines and iter_text_lines I guess), assuming you don't see that as bloat.

@mikepelley The API is presently frozen so I don't think we'll be adding those three methods. Besides, iter_text likely wouldn't provide much extra value outside of calling iter_content(decode_unicode=True).

I added a pull request #3362 that adds the same check used in text() to set the encoding type if there is none. It seems to solve the raw bytes issue. Let me know if there's any needed tweaking.

Glad we fixed this, this was definitely an oversight.

I was reading through this and the related issues and PRs and I'd love to help out. However, I'm not sure what direction to head with it (or if it's something more suitable for the next version of requests).

Just so I understand correctly, we want to get it so iter_content and text both return Unicode strings. The original PR solved this by grabbing the apparent_encoding, which caused a problem when using streaming responses (chardet.detect consumes the entire stream).

I'm assuming detecting the encoding incrementally will still lead to problems, as part of the stream will be consumed. Is this something that may be best solved via documentation, suggesting encoding is set before using iter_content? It seems like we'd need some sort of knowledge about the response, which may be hard to do without consuming it (or part of it).

Could we find some middle ground? For streaming responses, documentation calls out that encoding should be set. If it isn't, return bytes and let the users decode to their heart's desire (seemed like it worked well as a workaround). While nonstreaming responses can use the apparent_encoding, as originally suggested.

I know its a bit cold, but thoughts @Lukasa, @sigmavirus24?

I'm inclined to suggest that we just use encoding and, if that is not set, that we throw an exception that requires that users set it, rather than subtly returning an incompatible type.

I've been playing around with this since we reverted and I don't think that there's a nice way to avoid the exception. I'm +1 on that but I'm also thinking that an encoding param for iter_content that @Lukasa suggested in #3481 would be useful. It allows the user to avoid having to check and set Response.encoding on every Response that might be streamed.

I know @sigmavirus24 brought up a concern about raising an exception if encoding isn't specified for the current major version. Would the suggested fix go against the 3.0 branch? @Lukasa, when you say encoding, I'm assuming you're referring to the response's encoding rather than the additional parameter on iter_content?

Any change to the response's encoding logic (e.g. throwing exceptions if it's not present but decode_unicode is) is definitely breaking, and so would have to go against the 3.0.0 branch. I was not planning to add an extra parameter to iter_content.

I think @shellhead's work in #3574 should have this addressed in 3.0.0.

I'm pulling this comment out of #3675 and putting it here to avoid derailing the other issue.

3675 suggests to me again that there's a use case for a user supplied encoding fallback. Right now, you get unexpected bytes with decode_unicode=True. Even with the 3.0.0 fix, anytime decode_unicode is used, it must be wrapped in a try/except or conditional block to guarantee the request will be useful. This isn't the end of the world, but the fact that iter_* will fail on (semi?) common APIs like Amazon S3, Pinterest, and Stripe seems like a gap worth addressing. Why not provide the interface instead of forcing the users to program around it every time?

Users are encouraged to set response.encoding: the docs point out that you can and should do it. I think that means we already provide the interface.

Users are encouraged to set response.encoding: the docs point out that you can and should do it.

I don't know if I would go so far as to assert the docs say you "should do it", in fact I think they give a false sense of assurance about decoding.

The encoding section only discusses how Response.text will either use the header encoding/chardet to decode the body, or you can get raw bytes from Response.content. If you want to change the default encoding you can supply a new one. They don't really touch on that default being a no-op, or in the future halting entirely.

If if r.encoding is None: r.encoding = 'my-chosen-fallback-1' is a requirement to reliably use the streaming interface because decode_unicode=True isn't sufficient, then that should definitely be detailed. I guess I'm beating a dead horse on integrating this into iter_* though.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

jake491 picture jake491  路  3Comments

avinassh picture avinassh  路  4Comments

Matt3o12 picture Matt3o12  路  3Comments

remram44 picture remram44  路  4Comments

eromoe picture eromoe  路  3Comments