pydantic does not have a Base64 type. However, Base64 is a standard data type: OpenAPI has a `base64` format, and JSON Schema defines it via the `contentEncoding` attribute.
I expect to use a Base64 type for tokens and for binary data like images (JPEG, PNG).
I think the Base64 type should encode and decode data:
```python
from pydantic import BaseModel, Base64

class User(BaseModel):
    name: str
    token: Base64

# give a base64-encoded string
user = User(name='user1', token='MTIzNDU2Nzg5MC09YXNkZmdoamtsOyc=')

# or encode data from str or bytes
user = User(name='user1', token=Base64.encode("1234567890-=asdfghjkl;'"))

print(user.token)
# Base64('MTIzNDU2Nzg5MC09YXNkZmdoamtsOyc=')

print(user.token.decode())
# "1234567890-=asdfghjkl;'"

print(User.schema())
# {'title': 'User',
#  'type': 'object',
#  'properties': {'name': {'title': 'Name', 'type': 'string'},
#                 'token': {'title': 'Token', 'type': 'string',
#                           'contentEncoding': 'base64', 'contentMediaType': 'image/png'}
#                },
#  'required': ['name', 'token']
# }
```
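As a sanity check (not part of the proposal itself), the example token above round-trips with the stdlib `base64` module:

```python
import base64

token = 'MTIzNDU2Nzg5MC09YXNkZmdoamtsOyc='
# decoding yields the original string, and re-encoding restores the token
assert base64.b64decode(token) == b"1234567890-=asdfghjkl;'"
assert base64.b64encode(b"1234567890-=asdfghjkl;'").decode() == token
```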
Sounds good to me. AFAIK there's no Base64 type in the standard library, so we should implement it.
Where does `contentMediaType` come from?
Also, just to clarify, what properties/methods does a `Base64` object have:

- `decode()`, which returns the raw bytes?
- `base64()` or `original()`, which is the base64-encoded string/byte string?
> where does `contentMediaType` come from?
Sorry, the example is invalid; I should not have written `contentMediaType` in this case.
But JSON Schema does allow `contentMediaType` to be defined, so I'm now thinking about how we would pass `contentMediaType` to the Base64 type.
Here is my idea for `contentMediaType`: the pydantic model would dump `contentMediaType` into the JSON schema.
What do you think?
```python
class User(BaseModel):
    token: Base64 = Schema(contentMediaType='image/png')
```
> Also just to clarify, what properties/methods does a Base64 object have:
I imagine this class:

```python
class Base64:
    def __init__(self, encoded_data: Union[str, bytes]) -> None:  # encoded_data is encoded
        pass

    @classmethod
    def encode(cls, data: Union[str, bytes]) -> 'Base64':  # data is raw
        pass

    def decode(self) -> bytes:  # raw bytes
        pass

    @property
    def base64(self) -> bytes:  # encoded data
        pass
```
I don't think this class design is the best.
I also want to add variants of `decode` and `base64` that return `str` data.
However, that approach may not be right, and it could get complicated.
I really like this idea, and would actually make use of it today if available.
Some thoughts:
- Accepted inputs shouldn't be limited to `str` and `int`; anything implementing `__bytes__` should work.
- A `str` should always be assumed to be encoded, and parsing should just fail if it isn't valid base64. (You could always encode the `str` to bytes before parsing if that was your goal.)
- An `int` should always fail to be parsed.

Here is an alternative implementation that is more similar to `UrlStr` than to `Color` (which is how I imagined it going if building on @koxudaxi's stubs). I'm not sure whether subclassing `bytes` might introduce unexpected issues:
```python
from typing import Any
import base64
import binascii

from pydantic.utils import change_exception
from pydantic import PydanticTypeError, BaseModel, ValidationError


class Base64Error(PydanticTypeError):
    msg_template = 'value is not valid base64'


class Base64Bytes(bytes):
    @classmethod
    def encode(cls, data: bytes) -> 'Base64Bytes':
        return Base64Bytes(base64.b64encode(data))

    @classmethod
    def __get_validators__(cls) -> 'CallableGenerator':
        yield cls.validate

    @classmethod
    def validate(cls, value: Any) -> 'Base64Bytes':
        if isinstance(value, (bytes, str, bytearray, memoryview)):
            with change_exception(Base64Error, binascii.Error):
                base64.b64decode(value, validate=True)
            return Base64Bytes(value)
        if isinstance(value, int):
            raise Base64Error
        with change_exception(Base64Error, TypeError):
            encoded = base64.b64encode(bytes(value))
        return Base64Bytes(encoded)
```
```python
# ##### Basic tests #####
class B64Model(BaseModel):
    encoded: Base64Bytes


encoded = Base64Bytes.encode(b'hello')
print(B64Model(encoded=encoded))
# B64Model encoded=b'aGVsbG8='

import numpy as np

print(B64Model(encoded=np.array([1])))
# B64Model encoded=b'AQAAAAAAAAA='

try:
    B64Model(encoded=b'hello')
except ValidationError as e:
    print(str(e))
"""
1 validation error
encoded
  value is not valid base64 (type=type_error.base64)
"""
```
@samuelcolvin @koxudaxi thoughts?
@dmontagu
Thank you for your implementation.
I hadn't thought of subclassing `bytes`.
I think it's very good and useful.
I have run your implementation, and found that the validation method must encode `str` to `bytes`:
```python
@classmethod
def validate(cls, value: Any) -> 'Base64Bytes':
    if isinstance(value, (bytes, str, bytearray, memoryview)):
        with change_exception(Base64Error, binascii.Error):
            base64.b64decode(value, validate=True)
        if isinstance(value, str):
            value = value.encode()
        return Base64Bytes(value)
```
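The reason for that fix: constructing a `bytes` subclass directly from a `str` raises `TypeError`, so the original `validate` broke on `str` input. A minimal illustration (standalone, outside pydantic):

```python
class Base64Bytes(bytes):
    pass

# without the .encode() step, str input fails: bytes(str) needs an encoding
raised = False
try:
    Base64Bytes('aGVsbG8=')
except TypeError:
    raised = True
assert raised

# after encoding the str to bytes, construction works
assert Base64Bytes('aGVsbG8='.encode()) == b'aGVsbG8='
```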
I'm still not sure about this. I think the implementation in #698 is relatively confusing.
I would expect the raw value of the attribute where the field was annotated with Base64Type to be the raw bytes resulting from base64.b64decode(...).
I think this is more like Color or JSON.
The user might want any of the following values from a `Base64Type` field (using your example from tests):

- `b'hello world'`
- `'hello world'` (in some cases this could raise a unicode decode error)
- `'aGVsbG8gd29ybGQ='`
- `b'aGVsbG8gd29ybGQ='`

I think therefore we should have:

- `Base64Bytes`, which sets the attribute as `b'hello world'`
- `Base64Str`, which sets the attribute as `'hello world'`

What do you think?
If users really just want the raw value `b'aGVsbG8gd29ybGQ='` or `'aGVsbG8gd29ybGQ='`, validated as a valid base64 encoding, they can implement that themselves using a simple validator.
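Such a validator might look like this (a sketch assuming pydantic v1's `@validator` API; `Message` and `check_base64` are hypothetical names — the field keeps the raw encoded value):

```python
import base64
import binascii

from pydantic import BaseModel, validator


class Message(BaseModel):
    # `token` stays as the raw base64-encoded string
    token: str

    @validator('token')
    def check_base64(cls, v: str) -> str:
        try:
            base64.b64decode(v, validate=True)  # result deliberately discarded
        except binascii.Error:
            raise ValueError('value is not valid base64')
        return v


m = Message(token='aGVsbG8gd29ybGQ=')
assert m.token == 'aGVsbG8gd29ybGQ='  # raw value preserved, not decoded
```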
@samuelcolvin that makes sense to me, either way would be useful for me. When I get the chance, I'll redo the PR refactored to be more similar to Color.
@samuelcolvin
I think both ways are great for me.
However, I'm worried about performance when the class handles large data.

> They can implement that themselves using a simple validator.

I agree.
In some cases, base64 data is large: photos, movies, collected big data.
If the validator does not set the raw data on an attribute, then the user will have to decode the encoded data again to use it, which wastes CPU. But if the validator assigns the raw data to an attribute, then we lose a lot of memory keeping two copies of the data.
I know it's a trade-off; I'm thinking about the best balance.
What do you think?
> PR refactored to be more similar to Color.

Sorry I wasn't that clear; it's actually not like `Color`. The type is just doing the parsing and is a very simple subtype of `str` or `bytes`.
I agree about memory/CPU; that's why we have distinct approaches for the two common cases (so we don't have to decode more than once or store two values):

- `Base64Bytes` just decodes to `bytes`
- `Base64Str` decodes to `bytes`, then decodes to `str`

If people want to check a string is a valid base64 encoding but keep the raw value, they can use a validator and discard the result of `base64.b64decode`, since I imagine this case will be rare.
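A rough sketch of how those two proposed types could behave (hypothetical; pydantic validation wiring omitted, only the parsing step shown):

```python
import base64


class Base64Bytes(bytes):
    """Sketch: the attribute holds the *decoded* bytes."""

    @classmethod
    def validate(cls, value) -> 'Base64Bytes':
        return cls(base64.b64decode(value, validate=True))


class Base64Str(str):
    """Sketch: decode to bytes, then decode those bytes to str."""

    @classmethod
    def validate(cls, value) -> 'Base64Str':
        return cls(base64.b64decode(value, validate=True).decode())


assert Base64Bytes.validate('aGVsbG8gd29ybGQ=') == b'hello world'
assert Base64Str.validate('aGVsbG8gd29ybGQ=') == 'hello world'
```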
@samuelcolvin
I thought more about this, and have a few points I wanted to bounce off of you before proceeding further:
I think there are three salient features of the type: 1) what it expects as parsing input, 2) the form of the raw value on the model, and 3) how it "serializes" when .json() (and ideally also .dict()) is called on a model with the field.
Based on the use cases proposed for this type (namely, the one provided by @koxudaxi above, and my own use cases), it seems like whether the raw value is encoded or not, it usually needs to change encoding state precisely once. I think this is fine to do manually.
My conclusions from the points above are that 1) a Base64Type would be most useful if it "serializes" to/from encoded bytes (or str), and 2) the form of the "raw value" isn't too important as long as there is a method to obtain either the encoded or decoded value.
Some additional concerns that I think should influence the design:
- ... (like the `Json` type; I think there were some GitHub issues where people found this confusing), so I would prefer that the raw value be either `bytes` or `str`.

Having put in some effort to elucidate my thoughts here, I'm more strongly convinced than before of the approach of having `Base64Bytes` be a subclass of `bytes` taking the value of the encoded bytes, with a `decode` method and an `encode` constructor. But I'm still open to counter-arguments.
If you remain unconvinced though, rather than subclassing a primitive and having the "raw value" be the decoded result (potentially causing double-decode/encode issues), I would argue for the implementation to just be a standalone class (like Color).
I could see an argument made against this approach on the grounds that the resulting type is too simple, but I think it is valuable for the following reasons:
- ... in a way a plain `bytes` or `str` would not.
- (... similar to how `urlstr` and `UrlStr` work.)

I agree about the idempotent argument.
However I strongly disagree with lazy validation/parsing: for my sanity (as well as the sanity of those using pydantic) parsing/validation should happen once when everything else is parsed.
I therefore think a good compromise would be a Base64Type type similar to Color, something like:
```python
class Base64Type:
    def __init__(self, decoded_bytes: bytes):
        self._decoded_bytes: bytes = decoded_bytes

    def encode(self) -> bytes:
        return base64.b64encode(self._decoded_bytes)

    def encode_str(self) -> str:
        return self.encode().decode()

    def decode(self) -> bytes:
        return self._decoded_bytes

    def decode_str(self) -> str:
        return self._decoded_bytes.decode()

    @classmethod
    def __get_validators__(cls) -> 'CallableGenerator':
        yield cls.validate

    @classmethod
    def validate(cls, value) -> 'Base64Type':
        if isinstance(value, Base64Type):
            return value
        if isinstance(value, str):
            value = value.encode()
        elif isinstance(value, int):
            raise Base64Error
        elif not isinstance(value, (bytes, bytearray, memoryview)):
            value = bytes(value)
        with change_exception(Base64Error, binascii.Error):
            v = base64.b64decode(value, validate=True)
        return Base64Type(v)
```
json() should then use v.encode_str().
This would keep the type idempotent, both when using Model.parse_json(m.json()) and Model(**m.dict()).
It would however require re-encoding for json() but I think on balance that's preferable to keeping both the raw and base64 bytes.
I agree this might not be the most practical solution for everyone but I think on balance it's the best compromise and has the advantage of requiring explicit usage.
The other option is to close this issue and allow people to implement their own validators or custom types which work exactly as they wish?
That makes sense to me. Based on this discussion I think I'm inclined to just throw together a lightweight implementation of my own to handle my use case (it's basically already done :)), since it seems like it may be more atypical than I thought.
Let's leave this open for a week or two and see if we get any more feedback.
No point in implementing the thing I suggested if it's not what anyone else wants.
Currently I have random binary data which I would like to pass between services, and also retrieve from the environment (which doesn't allow nulls, so can't just use UTF-8) using the BaseSettings class.
So far this sounds good to me, but I'll have more of a read and ponder tonight to see if my _sounds good to me_ is just me doing grabby hands because at a glance it looks like it solves my problem.
@samuelcolvin, I want this feature in pydantic.
I have an API where some fields are Base64-encoded data, and this data should be validated.
@JrooTJunior which of the above options would you like?
As I explained above, the reason we didn't proceed with this (yet) was because we weren't sure how it should work.
It took a while for me to get around to it, but I'm looking into this now and going through the codebase.
I'm currently looking at the existing implementation around bytes:
Given that it just uses bytes.decode() to make a string which uses UTF-8 by default, the approach doesn't allow any possible bytes object to be communicated. E.g. b"\xff".decode() will raise UnicodeDecodeError.
It would be a major version/~breaking change (might argue the existing implementation is broken anyway), but I'm thinking it could be good to change the out-of-box handling of bytes to accept only bytes or standard_b64decodeable strings, then set the encoder to standard_b64encode. It sounds all well and good to me (only a ~20 line change in itself), but I have yet to get stuck into the schema specifying code to specify that the data is base64 encoded.
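To illustrate the two halves of that proposal with plain stdlib calls (no pydantic changes shown): the default UTF-8 path fails on arbitrary bytes, while a `standard_b64encode`/`standard_b64decode` round-trip handles any bytes:

```python
import base64

# the current bytes -> str path breaks on non-UTF-8 data
utf8_failed = False
try:
    b"\xff".decode()
except UnicodeDecodeError:
    utf8_failed = True
assert utf8_failed

# whereas a base64 round-trip is lossless for any bytes
raw = b"\xff\x00binary"
encoded = base64.standard_b64encode(raw).decode()  # JSON-safe str
assert base64.standard_b64decode(encoded) == raw
```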
I'll have a go at coding this up tonight. Is there anything glaringly obvious (or subtle) that I am missing before I go off on this?
@Code0x58 thanks for offering to work on this. I'm not sure what you're proposing, so hard to say if it's missing anything.
Were you thinking of something like my implementation above, or something quite different?
One thing to say, would be that bytes.decode() is only used in the case of serialising to JSON, not so much elsewhere.
> I was thinking of replacing the existing handling of bytes, rather than introduce a new class

I think that will be difficult, both in terms of backwards compatibility and clarity.
Let's say we have
```python
class Foo:
    x: bytes
```
Now consider the following potential values:
- `b'aGVsbG8gd29ybGQ='`
- `'aGVsbG8gd29ybGQ='`
- `b'aGVsbG8gd29ybGQ'`
- `'aGVsbG8gd29ybGQ'`

(hint: `base64.b64decode('aGVsbG8gd29ybGQ=') == b'hello world'`)
What should the value of x be in these 4 cases? I don't see how we can parse base64 by default and not lead to very unexpected behaviour occasionally.
I'm inclined to say any bytes should be the bytes, or a base64 encoded string (really just to handle the serialised format), requiring people to be wary of their types - hopefully the type annotations would help avoid misuse:
- `b'aGVsbG8gd29ybGQ='` → `b'aGVsbG8gd29ybGQ='`
- `'aGVsbG8gd29ybGQ='` → `b'hello world'`
- `b'aGVsbG8gd29ybGQ'` → `b'aGVsbG8gd29ybGQ'`
- `'aGVsbG8gd29ybGQ'` → raise `Base64Error(PydanticValueError)`

It would be a major version change, but the existing handling of bytes doesn't feel particularly meaningful to me.
I guess I disagree.
I think pydantic should only decode base64 data if the user is explicit about wanting that.
I think occasionally decoding base64 data when it happens to be valid would be very confusing. E.g. is `'deaf'` the bytes `b'deaf'` or the bytes `b'u\xe6\x9f'`?
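The ambiguity is easy to verify with the stdlib:

```python
import base64

# 'deaf' is itself valid base64, so auto-decoding would silently change it
assert base64.b64decode('deaf') == b'u\xe6\x9f'
```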
I'm not really a massive fan of coercing (base64) strings to bytes, so would like to work out another way while still using the bytes type.
I think my real target is the handling of bytes at the moment; it was probably what motivated me to look into something other than your solution, which seems good to me - though maybe with dropping the existing handling of bytes if something else isn't reasonably doable.
I've got some time tonight to explore only accepting bytes for bytes, but I'm not too optimistic about being able to do it in anything other than a hacky way.
_back story:_
The existing handling of bytes feels confusing/non-intuitive/non-transparent to me as a developer. Admittedly my first experience with the bytes type was when code did something like `client.hmac_key: bytes = str(secrets.token_bytes())` (so you get `"b'\00...'"`), which might have set me off with an emotional bias, but it did lead me to this issue. Allowing a string in at all was one of the issues.
I will offer my 2c on this as I have some production experience using Base64 in JSON. I have little to add though as you seem to have some great contributors already :)
> Given that it just uses bytes.decode() to make a string which uses UTF-8 by default, the approach doesn't allow any possible bytes object to be communicated. E.g. b"\xff".decode() will raise UnicodeDecodeError.
Yes, JSON and bytes are problematic for the exact reason Code0x58 stated.
If you automatically convert string → bytes (as above) before you `base64.b64encode()`, the receiving end will have no way of knowing it should be decoded back into a string. This would be unexpected behavior to me. So I would require the input type to be one of `bytes`, `bytearray`, `memoryview`. Or include `'contentMediaType': 'text/strings'` in the schema if it originally came in as a UTF-8 string (or is there a better media type option?).
Base64 is obviously illegible so anyone debugging will greatly appreciate it if their data is left unencoded as long as possible. BaseModel.dict() would then have the unencoded value to show.
Base64 encoded data can get long. Even a small JPEG is a huge wall of text. It is probably easiest to not show it in __str__ / __repr__ by default (much like SecretBytes). Show up to a certain amount? Watch out for character sequences that can corrupt a posix terminal! Maybe only show if unicode_safe=True (see below).
There is no way to know if the bytes string has already been encoded, so always assume it has not been encoded. Also, base64 output is usually greater in size than the unencoded bytes, so it makes sense to leave it "compressed" for this reason as well.
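The size overhead is easy to quantify: base64 emits 4 output bytes for every 3 input bytes (plus padding), roughly a 33% increase:

```python
import base64

raw = bytes(range(256))
encoded = base64.b64encode(raw)
# ceil(256 / 3) * 4 == 344, about a third larger than the raw data
assert len(encoded) == 344
assert len(raw) == 256
```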
Regarding the transmission of bytes in JSON. Me personally, I would not accept any merge request into our code base that transmitted bytes in JSON for the reason mentioned above.
Perhaps add a `Field(unicode_safe=True)` type of option for bytes fields (defaulting to `True` so as not to break backwards compatibility). When `False`, it would transmit as Base64 along with `'contentEncoding': 'base64'`.
```python
class Foo(BaseModel):
    data: bytes = Field(unicode_safe=False)  # transmitted as base64
```