Swagger-codegen: [python] Binary string response encoding with Python 3

Created on 3 Mar 2016 · 12Comments · Source: swagger-api/swagger-codegen

When using binary string responses under Python 3, the encoding defaults to UTF8 which is incompatible with binary data. This looks to stem from Python 3 having str default to UTF8 and a new type called bytes which represents binary data. This is a significant departure from functionality in Python 2 and you can read more about it.

This issue does not exist in Python 2.

Swagger Revision: 18262c1b6132a28438e2c3d9426458c413f011f8 (master as of right now)

Response schema:

schema:
  type: string
  format: binary

Example Error:

Traceback (most recent call last):
  File "./dr.py", line 13, in <module>
    "document_type": "pdf",                                         # pdf or xls or xlsx
  File "/usr/local/lib/python3.5/site-packages/docraptor/apis/doc_api.py", line 203, in create_doc
    callback=params.get('callback'))
  File "/usr/local/lib/python3.5/site-packages/docraptor/api_client.py", line 322, in call_api
    response_type, auth_settings, callback)
  File "/usr/local/lib/python3.5/site-packages/docraptor/api_client.py", line 149, in __call_api
    post_params=post_params, body=body)
  File "/usr/local/lib/python3.5/site-packages/docraptor/api_client.py", line 358, in request
    body=body)
  File "/usr/local/lib/python3.5/site-packages/docraptor/rest.py", line 208, in POST
    body=body)
  File "/usr/local/lib/python3.5/site-packages/docraptor/rest.py", line 171, in request
    r.data = r.data.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

I'm not sure what a good general fix would look like but I have a work-around for our use by delaying byte decoding until right before JSON deserialization and short-circuiting str. You can see my fix directly on generated code.

Related Issues and PRs:

Python Bug help wanted

Source

jqr

👍5

Most helpful comment

For my case, I'm returning a geojson file. This means that it _is_ able to be decoded to utf8, but this will fail with the __deserialize_file method since that opens the file for binary writing. My solution was to remove the decode from "rest.moustache" (as was suggested), and move it without modification to after the file write conditionally happens in "api_client.moustache" (which was reasonable for my use case -- it's either a file or json/text.

    def deserialize(self, response, response_type):
        """Deserializes response into an object.

        :param response: RESTResponse object to be deserialized.
        :param response_type: class literal for
            deserialized object, or string of class name.

        :return: deserialized object.
        """
        # handle file downloading
        # save response body into a tmp file and return the instance
        if response_type == "file":
            return self.__deserialize_file(response)

        ########### Moved from rest.mustache
        if six.PY3:
            response.data = response.data.decode('utf8')
        logger.debug("response body: %s", response.data)
        ###########

        # fetch data from response object
        try:
            data = json.loads(response.data)
        except ValueError:
            data = response.data

        return self.__deserialize(data, response_type)

rddesmond on 2 Mar 2019

👍2

All 12 comments

@jqr thanks for the details. May I know if you've cycle to contribute the fix with a test case using this endpoint (the method name is automatically generated by codegen since it's empty)?

wing328 on 4 Mar 2016

I'm not entirely sure the fix is generic enough to be used everywhere, as I only tested our use cases.

If someone else wants to run with this, they should as I won't have the time to address this in a generic fashion for a few weeks at least.

jqr on 8 Mar 2016

@jqr please take your time. We've not heard anyone else requesting this feature in Python yet.

wing328 on 9 Mar 2016

Binary file responses are also being parsed as utf-8, causing errors. Does anyone know a temporary workaround for this?

schema:
    type: file
    format: binary

JHibbard on 23 Jun 2017

👍1

@JHibbard What about simply model the response as just type: file without format: binary?

wing328 on 27 Jun 2017

The server sends bytes and client wants to decode it with utf8.
I've resolved it with removing the line: r.data = r.data.decode('utf8') in the rest.py file.

But then in the api_client.py file, in function __deserialize_file(self, response) it wants to write the response.data in an text file.
So I've edited this also, so the function opens the file as "wb":

with open(path, "wb") as f:
    f.write(response.data)

But I don't know that what will now not work, but at least the file download works for me.

microo8 on 15 Aug 2017

@wing328 I see the same happening in python 2.7, and I think is a more generic problem that binary files are not written to disk correctly because they may be corrupted by the "w" mode instead of "wb"
as pointed out by @microo8 .

I think it would be formally more correct to have wb instead since both text and binary file are expected to be deserialized in here.

g-bon on 10 Nov 2017

@g-bon I wonder if you can submit a PR with the suggested fix so that we can review more easily (remember to update the Python Petstore sample so that CIs can run the integration tests)

wing328 on 10 Nov 2017

@wing328, switching to python 3 I realized that there are actually two separate issues.
The first one is what I fixed with the PR 6936 and it affected both 2 and 3

The second one to my understanding is that in rest.py if PY3 is used and _preload_content is True, the response contents is always decoded as utf8. Which doesn't make sense for non-text files.

if six.PY3:
    r.data = r.data.decode('utf8')

https://github.com/swagger-api/swagger-codegen/blob/62444a7aaf480d0fc12c048ce7c86f8978a16a89/modules/swagger-codegen/src/main/resources/python/rest.mustache#L213

A simple approach could be to catch the UnicodeDecodeError and manage / swallow it. Not great because it would also affect eventual real UnicodeDecodeErrors.
Another way could be to add a check on the content_type and make a whitelist/blacklist of content types that should be decoded to utf8.
The third and hardest approach would be to infer the content type from the actual data but feels a fragile and overkill approach to me.

Happy to work on this when I have some confirmation the problem is here / feedback on the right approach.

g-bon on 24 Jan 2018

Up
cc @kenjones-cisco , @wing328

g-bon on 9 Mar 2018

    def deserialize(self, response, response_type):
        """Deserializes response into an object.

        :param response: RESTResponse object to be deserialized.
        :param response_type: class literal for
            deserialized object, or string of class name.

        :return: deserialized object.
        """
        # handle file downloading
        # save response body into a tmp file and return the instance
        if response_type == "file":
            return self.__deserialize_file(response)

        ########### Moved from rest.mustache
        if six.PY3:
            response.data = response.data.decode('utf8')
        logger.debug("response body: %s", response.data)
        ###########

        # fetch data from response object
        try:
            data = json.loads(response.data)
        except ValueError:
            data = response.data

        return self.__deserialize(data, response_type)

rddesmond on 2 Mar 2019

👍2

@rddesmond thanks for the great workaround!

From what I see the issue is not in skipping some particular Content-Type like in #9980 (which will fail for example for application/pdf, images etc), or having some hardcoded 'file' type which will trigger the different response processing logic.

The real issue is that swagger-codegen drops the format: binary from the model for type: string. If you'll check form parameters processing with debugOperations, you'll see the x-is-binary in vendorExtensions, which can be used in the template. But the response parameter doesn't have it, making it impossible to differentiate between strings and binary data.