Requests: Uncertain about content/text vs iter_content(decode_unicode=True/False)

Created on 22 Jun 2016 · 15Comments · Source: psf/requests

When requesting an application/json document, I'm seeing next(r.iter_content(16*1024, decode_unicode=True)) returning bytes, whereas r.text returns unicode. My understanding was that both should return a unicode object. In essence, I thought "iter_content" was equivalent to "iter_text" when decode_unicode was True. Have I misunderstood something? I can provide an example if needed.

For reference, I'm using python 3.5.1 and requests 2.10.0.

Thanks!

Bug

Source

mikepelley

Most helpful comment

There's at least one key difference: decode_unicode=True doesn't fall back to apparent_encoding, which means it'll never autodetect the encoding. This means if response.encoding is None it is a no-op: in fact, it's a no-op that yields bytes.

That behaviour seems genuinely bad to me, so I think we should consider it a bug. I'd rather we had the same logic as in text for this.

Lukasa on 22 Jun 2016

👍3

All 15 comments

what does (your response object).encoding return?

birm on 22 Jun 2016

That behaviour seems genuinely bad to me, so I think we should consider it a bug. I'd rather we had the same logic as in text for this.

Lukasa on 22 Jun 2016

👍3

r.encoding returns None.

On a related note, iter_text might be clearer/more consistent than iter_content(decode_unicode=True) if there's room for change in the APIs future (and iter_content_lines and iter_text_lines I guess), assuming you don't see that as bloat.

mikepelley on 22 Jun 2016

@mikepelley The API is presently frozen so I don't think we'll be adding those three methods. Besides, iter_text likely wouldn't provide much extra value outside of calling iter_content(decode_unicode=True).

sigmavirus24 on 22 Jun 2016

I added a pull request #3362 that adds the same check used in text() to set the encoding type if there is none. It seems to solve the raw bytes issue. Let me know if there's any needed tweaking.

nateprewitt on 24 Jun 2016

Glad we fixed this, this was definitely an oversight.

kennethreitz42 on 29 Jun 2016

I was reading through this and the related issues and PRs and I'd love to help out. However, I'm not sure what direction to head with it (or if it's something more suitable for the next version of requests).

Just so I understand correctly, we want to get it so iter_content and text both return Unicode strings. The original PR solved this by grabbing the apparent_encoding, which caused a problem when using streaming responses (chardet.detect consumes the entire stream).

I'm assuming detecting the encoding incrementally will still lead to problems, as part of the stream will be consumed. Is this something that may be best solved via documentation, suggesting encoding is set before using iter_content? It seems like we'd need some sort of knowledge about the response, which may be hard to do without consuming it (or part of it).

Could we find some middle ground? For streaming responses, documentation calls out that encoding should be set. If it isn't, return bytes and let the users decode to their heart's desire (seemed like it worked well as a workaround). While nonstreaming responses can use the apparent_encoding, as originally suggested.

I know its a bit cold, but thoughts @Lukasa, @sigmavirus24?

sentientcucumber on 1 Sep 2016

I'm inclined to suggest that we just use encoding and, if that is not set, that we throw an exception that requires that users set it, rather than subtly returning an incompatible type.

Lukasa on 1 Sep 2016

I've been playing around with this since we reverted and I don't think that there's a nice way to avoid the exception. I'm +1 on that but I'm also thinking that an encoding param for iter_content that @Lukasa suggested in #3481 would be useful. It allows the user to avoid having to check and set Response.encoding on every Response that might be streamed.

nateprewitt on 1 Sep 2016

I know @sigmavirus24 brought up a concern about raising an exception if encoding isn't specified for the current major version. Would the suggested fix go against the 3.0 branch? @Lukasa, when you say encoding, I'm assuming you're referring to the response's encoding rather than the additional parameter on iter_content?

sentientcucumber on 11 Sep 2016

Any change to the response's encoding logic (e.g. throwing exceptions if it's not present but decode_unicode is) is definitely breaking, and so would have to go against the 3.0.0 branch. I was not planning to add an extra parameter to iter_content.

Lukasa on 11 Sep 2016

I think @shellhead's work in #3574 should have this addressed in 3.0.0.

nateprewitt on 23 Sep 2016

I'm pulling this comment out of #3675 and putting it here to avoid derailing the other issue.

3675 suggests to me again that there's a use case for a user supplied encoding fallback. Right now, you get unexpected bytes with `decode_unicode=True`. Even with the 3.0.0 fix, anytime `decode_unicode` is used, it must be wrapped in a try/except or conditional block to guarantee the request will be useful. This isn't the end of the world, but the fact that `iter_*` will fail on (semi?) common APIs like Amazon S3, Pinterest, and Stripe seems like a gap worth addressing. Why not provide the interface instead of forcing the users to program around it every time?

nateprewitt on 13 Nov 2016

Users are encouraged to set response.encoding: the docs point out that you can and should do it. I think that means we already provide the interface.

Lukasa on 14 Nov 2016

Users are encouraged to set response.encoding: the docs point out that you can and should do it.

I don't know if I would go so far as to assert the docs say you "should do it", in fact I think they give a false sense of assurance about decoding.

The encoding section only discusses how Response.text will either use the header encoding/chardet to decode the body, or you can get raw bytes from Response.content. If you want to change the default encoding you can supply a new one. They don't really touch on that default being a no-op, or in the future halting entirely.

If if r.encoding is None: r.encoding = 'my-chosen-fallback-1' is a requirement to reliably use the streaming interface because decode_unicode=True isn't sufficient, then that should definitely be detailed. I guess I'm beating a dead horse on integrating this into iter_* though.

nateprewitt on 14 Nov 2016

Was this page helpful?

0 / 5 - 0 ratings