Pydantic: Support Base64 Type

Created on 26 Jul 2019 · 25 comments · Source: samuelcolvin/pydantic

Feature Request

pydantic does not have a Base64 type, but Base64 is a standard data type:
OpenAPI has a base64 format, and JSON Schema defines it via the contentEncoding attribute.
I would like to use a base64 type for tokens and for binary data such as images (JPEG, PNG).

I think the Base64 type should encode/decode data.

from pydantic import BaseModel, Base64


class User(BaseModel):
    name: str
    token: Base64

# give a base64-encoded string
user = User(name='user1', token='MTIzNDU2Nzg5MC09YXNkZmdoamtsOyc=')
## or encode raw data from str or bytes
user = User(name='user1', token=Base64.encode('1234567890-=asdfghjkl;\''))

print(user.token)
# Base64('MTIzNDU2Nzg5MC09YXNkZmdoamtsOyc=')
print(user.token.decode())
# '1234567890-=asdfghjkl;\''

print(User.schema())
# {'title': 'User',
#  'type': 'object',
#  'properties': {'name':  {'title': 'Name', 'type': 'string'},
#                 'token':  {'title': 'Token', 'type': 'string',
#                            'contentEncoding': 'base64', 'contentMediaType': 'image/png'}
#                 },
#  'required': ['name', 'token']
#  }
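For reference, the stdlib base64 module already provides the primitives such a type would wrap; the token value from the example above round-trips as expected:

```python
import base64

# The example token above is the base64 encoding of this raw value:
raw = b"1234567890-=asdfghjkl;'"
encoded = base64.b64encode(raw)
assert encoded == b'MTIzNDU2Nzg5MC09YXNkZmdoamtsOyc='
assert base64.b64decode(encoded, validate=True) == raw
```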

Feedback Wanted feature request help wanted


All 25 comments

sounds good to me. AFAIK there's no Base64 type in the standard library, so we should implement it.

where does contentMediaType come from?

Also just to clarify, what properties/methods does a Base64 object have:

  • decode() which returns the raw bytes
  • base64() or original() which is the base64-encoded string/byte string

?

where does contentMediaType come from?

Sorry, the example is invalid. I should not have written contentMediaType in that case.

However, JSON Schema also allows contentMediaType to be defined. I'm now thinking about how we could pass contentMediaType to the Base64 type.
Here is my idea for contentMediaType: I think the pydantic model would dump contentMediaType in the JSON schema.
What do you think?

class User(BaseModel):
    token: Base64 = Schema(contentMediaType='image/png')

Also just to clarify, what properties/methods does a Base64 object have:

I imagine a class like this:

class Base64:
    def __init__(self, encoded_data: Union[str, bytes]) -> None:  # encoded_data is encoded
        pass

    @classmethod
    def encode(cls, data: Union[str, bytes]) -> 'Base64':  # data is raw
        pass

    def decode(self) -> bytes:  # raw bytes
        pass

    @property
    def base64(self) -> bytes:  # encoded data
        pass

I don't think this class design is the best.
Also, I want to add str-returning variants of decode and base64 for the encoded data.
However, that may not be the right way; it could get complicated.

I really like this idea, and would actually make use of it today if available.

Some thoughts:

  • I think it would be substantially more useful if you could parse non-bytes/str objects by converting to bytes and encoding the result.

    • I think parsing should succeed for any object that can be converted to bytes, excluding subclasses of str and int, but including anything implementing __bytes__.

      • In particular, support for buffers. (This would make it easy to serialize numpy arrays, for example.)

    • During default parsing, I think subclasses of str should always be assumed to be encoded, and should just fail parsing if they aren't valid base64. (You could always encode the str to bytes before parsing if that was your goal.)

    • I think int should always fail to parse.

Here is an alternative implementation that is more similar to UrlStr than to Color (which is how I imagined it going if building on @koxudaxi's stubs). I'm not sure whether subclassing bytes might introduce unexpected issues:

from typing import Any
import base64
import binascii

from pydantic.utils import change_exception
from pydantic import PydanticTypeError, BaseModel, ValidationError

class Base64Error(PydanticTypeError):
    msg_template = 'value is not valid base64'

class Base64Bytes(bytes):
    @classmethod
    def encode(cls, data: bytes) -> 'Base64Bytes':
        return Base64Bytes(base64.b64encode(data))

    @classmethod
    def __get_validators__(cls) -> 'CallableGenerator':
        yield cls.validate

    @classmethod
    def validate(cls, value: Any) -> 'Base64Bytes':
        if isinstance(value, (bytes, str, bytearray, memoryview)):
            with change_exception(Base64Error, binascii.Error):
                base64.b64decode(value, validate=True)
            return Base64Bytes(value)
        if isinstance(value, int):
            raise Base64Error
        with change_exception(Base64Error, TypeError):
            encoded = base64.b64encode(bytes(value))
            return Base64Bytes(encoded)

# ##### Basic tests #####

class B64Model(BaseModel):
    encoded: Base64Bytes

encoded = Base64Bytes.encode(b'hello')
print(B64Model(encoded=encoded))
# B64Model encoded=b'aGVsbG8='

import numpy as np
print(B64Model(encoded=np.array([1])))
# B64Model encoded=b'AQAAAAAAAAA='

try:
    B64Model(encoded=b'hello')
except ValidationError as e:
    print(str(e))
"""
1 validation error
encoded
  value is not valid base64 (type=type_error.base64)
"""

@samuelcolvin @koxudaxi thoughts?

@dmontagu
Thank you for your implementation.
I hadn't thought of subclassing bytes.
It seems very good and useful to me.

I have run your implementation.
I found that the validate method must encode str to bytes:

    @classmethod
    def validate(cls, value: Any) -> 'Base64Bytes':
        if isinstance(value, (bytes, str, bytearray, memoryview)):
            with change_exception(Base64Error, binascii.Error):
                base64.b64decode(value, validate=True)
                if isinstance(value, str):
                    value = value.encode()
            return Base64Bytes(value)

I'm still not sure about this. I think the implementation in #698 is relatively confusing.

I would expect the raw value of the attribute where the field was annotated with Base64Type to be the raw bytes resulting from base64.b64decode(...).

I think this is more like Color or JSON.

The user might want any of the following values from a Base64Type field (using your example from tests):

  • the resulting bytes b'hello world'
  • the resulting str 'hello world' (in some cases this could raise a unicode decode error)
  • the raw value as a str 'aGVsbG8gd29ybGQ='
  • or, the raw value as bytes b'aGVsbG8gd29ybGQ='

I think therefore we should have:

  • Base64Bytes which sets the attribute as b'hello world'
  • Base64Str which sets the attribute as 'hello world'

What do you think?

If users really just want the raw value b'aGVsbG8gd29ybGQ=' or 'aGVsbG8gd29ybGQ=', validated as a valid base64 encoding, they can implement that themselves using a simple validator.
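As a sketch of that simple-validator option (hypothetical model and field names, assuming pydantic's v1-era `@validator` decorator): the validator checks that the value decodes cleanly, then discards the decoded result and keeps the raw encoded string.

```python
import base64
import binascii

from pydantic import BaseModel, validator


class Token(BaseModel):  # hypothetical model for illustration
    value: str

    @validator('value')
    def check_base64(cls, v):
        # Validate only; discard the decoded bytes and keep the raw value.
        try:
            base64.b64decode(v, validate=True)
        except binascii.Error:
            raise ValueError('value is not valid base64')
        return v


assert Token(value='aGVsbG8gd29ybGQ=').value == 'aGVsbG8gd29ybGQ='
```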

@samuelcolvin that makes sense to me, either way would be useful for me. When I get the chance, I'll redo the PR refactored to be more similar to Color.

@samuelcolvin
I think both ways are great.

However, I'm worried about performance when the class handles large data.

They can implement that themselves using a simple validator.

I agree.
In some cases, base64 data is large: photos, movies, collected big data.
If the validator does not set the raw data on an attribute, the user will have to re-decode the encoded data to use it, which wastes CPU. But if the validator assigns the raw data to an attribute, we lose a lot of memory keeping both forms of the data.
I know it's a trade-off; I'm thinking about the best balance.

What do you think?

PR refactored to be more similar to Color.

Sorry I wasn't that clear, it's actually not like Color. The type is just doing the parsing and is a very simple subtype of str or bytes.

I agree about memory/CPU, that's why we have distinct approaches for the 2 common cases (so we don't have to decode more than once or store two values):

  • Base64Bytes just decodes to bytes
  • Base64Str decodes to bytes, then decodes to str

If people want to check a string is a valid base64 encoding but keep the raw value, they can use a validator and discard the result of base64.b64decode; I imagine this case will be rare.

@samuelcolvin

I thought more about this, and have a few points I wanted to bounce off of you before proceeding further:

  1. I think there are three salient features of the type: 1) what it expects as parsing input, 2) the form of the raw value on the model, and 3) how it "serializes" when .json() (and ideally also .dict()) is called on a model with the field.

    1. (Feature 1) It sounds like you and @koxudaxi expect base64-encoded bytes as parsing input. This seems right, as long as there is also a constructor that accepts raw bytes.
    2. (Feature 3) I think this type has significant potential value for encoding arbitrary binary data into json response fields.

      1. This is the specific use case for which I want this type (and for which I am currently using a custom workaround).

      2. I know serialization isn't pydantic's top priority, but given the direct relationship between base64 encoding and serialization, it seems like it's worth optimizing for here.

  2. Based on the use cases proposed for this type (namely, the one provided by @koxudaxi above, and my own use cases), it seems like whether the raw value is encoded or not, it usually needs to change encoding state precisely once. I think this is fine to do manually:

    1. When encoding, typically some application-specific logic is necessary to produce the raw bytes, and adding a one-line encoding step seems fine.
    2. When decoding, application-specific logic is usually necessary to consume the raw bytes, so a one line decoding step at that point also seems reasonable.
    3. (To be clear, I think the PR I submitted should be modified so that you aren't forced to validate the data during parsing, for performance reasons.)

My conclusions from the points above are that 1) a Base64Type would be most useful if it "serializes" to/from encoded bytes (or str), and 2) the form of the "raw value" isn't too important as long as there is a method to obtain either the encoded or decoded value.

Some additional concerns that I think should influence the design:

  1. It seems like it would be easiest to accomplish the "serialization" goal if the "raw value" is actually just the encoded bytes.
  2. I find it more mentally burdensome to work with fields that are not idempotent (e.g., the Json type; I think there were some github issues where people found this confusing). So I would prefer that either:

    1. The parsing input should be in the same encoded/decoded state as the "raw value", or

    2. The parsing input can be of different encoded/decoded state than the "raw value", but the parser is able to detect when it receives a previously-parsed Base64Type (to prevent double-decoding/encoding). In this case, for serialization-handling (and confusion-prevention) purposes, I think the "raw value" should probably just be a stand-alone type, rather than a subclass of bytes or str.


Having put in some effort to elucidate my thoughts here, I'm more strongly convinced than before of the approach of having Base64Bytes be a subclass of bytes taking the value of the encoded bytes, with a decode method and an encode constructor. But I'm still open to counter arguments.

If you remain unconvinced though, rather than subclassing a primitive and having the "raw value" be the decoded result (potentially causing double-decode/encode issues), I would argue for the implementation to just be a standalone class (like Color).


I could see an argument made against this approach on the grounds that the resulting type is too simple, but I think it is valuable for the following reasons:

  1. It clearly annotates the expected data format in a way that a plain bytes or str would not.
  2. It provides readily-accessed convenience methods for encoding/decoding that might otherwise be missed/replicated in a large codebase.
  3. It is a natural place to add validation that the provided value is valid base64-encoded data.

    1. (As I said above, I do still see the value of a way of turning off eager validation for performance reasons, e.g. via a class property that could be changed, similar to how urlstr and UrlStr work.)

I agree about the idempotent argument.

However I strongly disagree with lazy validation/parsing: for my sanity (as well as the sanity of those using pydantic) parsing/validation should happen once when everything else is parsed.

I therefore think a good compromise would be a Base64Type type similar to Color, something like:

class Base64Type:
    def __init__(self, decoded_bytes: bytes):
        self._decoded_bytes: bytes = decoded_bytes

    def encode(self) -> bytes:
        return base64.b64encode(self._decoded_bytes)

    def encode_str(self) -> str:
        return self.encode().decode()

    def decode(self) -> bytes:
        return self._decoded_bytes

    def decode_str(self) -> str:
        return self._decoded_bytes.decode()

    @classmethod
    def __get_validators__(cls) -> 'CallableGenerator':
        yield cls.validate

    @classmethod
    def validate(cls, value) -> 'Base64Type':
        if isinstance(value, Base64Type):
            return value
        if isinstance(value, str):
            value = value.encode()
        elif isinstance(value, int):
            raise Base64Error
        elif not isinstance(value, (bytes, bytearray, memoryview)):
            value = bytes(value)
        with change_exception(Base64Error, binascii.Error):
            v = base64.b64decode(value, validate=True)
        return cls(v)

json() should then use v.encode_str().

This would keep the type idempotent, both when using Model.parse_json(m.json()) and Model(**m.dict()).

It would however require re-encoding for json() but I think on balance that's preferable to keeping both the raw and base64 bytes.

I agree this might not be the most practical solution for everyone but I think on balance it's the best compromise and has the advantage of requiring explicit usage.

The other option is to close this issue and allow people to implement their own validators or custom types which work exactly as they wish?

That makes sense to me. Based on this discussion I think I'm inclined to just throw together a lightweight implementation of my own to handle my use case (it's basically already done :)), since it seems like it may be more atypical than I thought.

Let's leave this open for a week or two and see if we get anymore feedback.

No point in implementing the thing I suggested if it's not what anyone else wants.

Currently I have random binary data which I would like to pass between services, and also retrieve from the environment (which doesn't allow nulls, so can't just use UTF-8) using the BaseSettings class.

So far this sounds good to me, but I'll have more of a read and ponder tonight to see if my _sounds good to me_ is just me doing grabby hands because at a glance it looks like it solves my problem.

@samuelcolvin, I want this feature in pydantic.
I have an API where some fields are Base64-encoded data, and this data should be validated.

@JrooTJunior which of the above options would you like?

As I explained above, the reason we didn't proceed with this (yet) was because we weren't sure how it should work.

It took a while for me to get around to it, but I'm looking into this now and going through the codebase.

I'm currently looking at the existing implementation around bytes:

Given that it just uses bytes.decode() to make a string which uses UTF-8 by default, the approach doesn't allow any possible bytes object to be communicated. E.g. b"\xff".decode() will raise UnicodeDecodeError.
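To illustrate the limitation concretely:

```python
import base64

# Arbitrary bytes need not be valid UTF-8, so plain .decode() can fail:
try:
    b"\xff".decode()
except UnicodeDecodeError:
    print("not valid UTF-8")

# base64, by contrast, can represent any byte sequence as ASCII text:
encoded = base64.standard_b64encode(b"\xff")
assert encoded == b"/w=="
assert base64.standard_b64decode(encoded) == b"\xff"
```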

It would be a major version/breaking change (one might argue the existing implementation is broken anyway), but I'm thinking it could be good to change the out-of-box handling of bytes to accept only bytes or standard_b64decode-able strings, then set the encoder to standard_b64encode. It sounds all well and good to me (only a ~20 line change in itself), but I have yet to get stuck into the schema code to specify that the data is base64-encoded.

I'll have a go at coding this up tonight. Is there anything glaringly obvious (or subtle) that I am missing before I go off on this?

@Code0x58 thanks for offering to work on this. I'm not sure what you're proposing, so hard to say if it's missing anything.

Where you thinking of something like my implementation above, or something quite different?

One thing to say, would be that bytes.decode() is only used in the case of serialising to JSON, not so much elsewhere.

I was thinking of replacing the existing handling of bytes, rather than introduce a new class

I think that will be difficult, both in terms of backwards compatibility and clarity.

Let's say we have

class Foo(BaseModel):
    x: bytes

Now consider the following potential values:

  • b'aGVsbG8gd29ybGQ='
  • 'aGVsbG8gd29ybGQ='
  • b'aGVsbG8gd29ybGQ'
  • 'aGVsbG8gd29ybGQ'

(hint base64.b64decode('aGVsbG8gd29ybGQ=') == b'hello world')

What should the value of x be in these 4 cases? I don't see how we can parse base64 by default and not lead to very unexpected behaviour occasionally.
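Checking those values against the stdlib makes the ambiguity concrete:

```python
import base64
import binascii

# The padded forms decode cleanly, whether given as str or bytes:
assert base64.b64decode('aGVsbG8gd29ybGQ=', validate=True) == b'hello world'
assert base64.b64decode(b'aGVsbG8gd29ybGQ=', validate=True) == b'hello world'

# The unpadded forms are rejected by the strict decoder:
try:
    base64.b64decode('aGVsbG8gd29ybGQ', validate=True)
except binascii.Error:
    print('incorrect padding')
```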

I'm inclined to say any bytes should be the bytes, or a base64 encoded string (really just to handle the serialised format), requiring people to be wary of their types - hopefully the type annotations would help avoid misuse:

  • b'aGVsbG8gd29ybGQ=' → b'aGVsbG8gd29ybGQ='
  • 'aGVsbG8gd29ybGQ=' → b'hello world'
  • b'aGVsbG8gd29ybGQ' → b'aGVsbG8gd29ybGQ'
  • 'aGVsbG8gd29ybGQ' → raise Base64Error(PydanticValueError)
  • some other type → raise BytesError(PydanticTypeError)

It would be a major version change, but the existing handling of bytes doesn't feel particularly meaningful to me.

I guess I disagree.

I think pydantic should only decode base64 data if the user is explicit about wanting that.

I think occasionally decoding base64 data when it happens to be valid would be very confusing. E.g. is 'deaf' the bytes b'deaf' or the bytes b'u\xe6\x9f'?
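The 'deaf' example can be verified directly; a bytes field that auto-decoded base64 would silently transform perfectly ordinary text:

```python
import base64

# 'deaf' is 4 characters from the base64 alphabet, so it is also
# a valid base64 encoding of three unrelated bytes:
assert base64.b64decode('deaf', validate=True) == b'u\xe6\x9f'
```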

I'm not really a massive fan of coercing (base64) strings to bytes, so would like to work out another way while still using the bytes type.

I think my real target is the handling of bytes at the moment; it was probably what motivated me to look into something other than your solution, which seems good to me - but maybe with dropping the existing handling of bytes if something else isn't reasonably doable.

I've got some time tonight to explore only accepting bytes for bytes, but I'm not too optimistic about being able to in anything other than a hacky way.

_back story:_
The existing handling of bytes feels confusing/non-intuitive/non-transparent to me as a developer. Admittedly my first experience with the bytes type was when code did something like client.hmac_key: bytes = str(secrets.token_bytes()) (so you get "b'\x00...'"), which might have set me off with an emotional bias, but it did lead me to this issue. Allowing a string in at all was one of the issues.

I will offer my 2c on this as I have some production experience using Base64 in JSON. I have little to add though as you seem to have some great contributors already :)

Given that it just uses bytes.decode() to make a string which uses UTF-8 by default, the approach doesn't allow any possible bytes object to be communicated. E.g. b"\xff".decode() will raise UnicodeDecodeError.

Yes, JSON and bytes are problematic for the exact reason Code0x58 stated.

If you automatically convert string -> bytes (as above) before you base64.b64encode() the receiving end will have no way of knowing it should be decoded back into string. This would be unexpected behavior to me. So I would require the input type to be one of bytes, bytearray, memoryview. Or include 'contentMediaType': 'text/strings' in the schema if it originally came in as a UTF-8 string. (or is there a better media type option?)

Base64 is obviously illegible so anyone debugging will greatly appreciate it if their data is left unencoded as long as possible. BaseModel.dict() would then have the unencoded value to show.

Base64 encoded data can get long. Even a small JPEG is a huge wall of text. It is probably easiest to not show it in __str__ / __repr__ by default (much like SecretBytes). Show up to a certain amount? Watch out for character sequences that can corrupt a posix terminal! Maybe only show if unicode_safe=True (see below).

There is no way to know if the bytes/string has already been encoded, so always assume it has not been encoded. Also, base64 is usually greater in size than the unencoded bytes, so it makes sense to leave the data "compressed" for this reason as well.

Regarding the transmission of bytes in JSON. Me personally, I would not accept any merge request into our code base that transmitted bytes in JSON for the reason mentioned above.

Perhaps add a Field(unicode_safe=True) option for bytes fields (defaulting to True, to not break backwards compatibility). When False, the field would be transmitted as Base64 along with 'contentEncoding': 'base64'.

class Foo(BaseModel):
    data: bytes = Field(unicode_safe=False)  # transmitted as base64