Describe the bug
Currently AvroSchema makes all fields allowNull, which makes the generated schema doesn't respect what POJO defines.
Expected behavior
AvroSchema should respect to what POJO defines
Additional context
N/A
@codelipenghui 's team is working on this issue.
@sijie I don't quite follow. If a class has a field "Integer" than the field should allow null vs if the field is "int" then it shouldn't:
public static class Foo {
int field1;
Integer field2;
}
log.info("{}", new String(AvroSchema.of(Foo.class).getSchemaInfo().getSchema()));
produces:
{"type":"record","name":"Foo","namespace":"org.company.Test4$","fields":[{"name":"field1","type":"int"},{"name":"field2","type":["null","int"],"default":null}]}
which is correct and adheres to primitive vs class semantics in Java . Am I missing something?
@jerrypeng
the main problem of this is : if we are serializing POJO into avro formats, it should be respecting to how AVRO generates schema for POJOs. Otherwise, if you are migrating other AVRO data from other data sources to Pulsar, or from Pulsar to other data destinations, it will raise schema incompatibility issue. Because Pulsar schema generated schema always allow null, while the schema generated by other systems don't allow null, even they are generated from same POJO.
AVRO is a pretty standard data format. If pulsar is using AVRO schema, we should respect to its rules, otherwise it is impossible for pulsar to be used for exchanging AVRO data.
@sijie I don't understand why you say we are not respecting the rules of AVRO, we are using the official java AVRO library to schema generation. And if another system is using AVRO java library to generate schemas from java classes then they will get the same result. Are you talking about cross language support? The only flag that could be settable is the allow null flag that gets set in Pulsar's Avro schema. We can expose that flag to users.
@jerrypeng usually we create POJO by defining avro schema file and create it by using avro maven plugin. Once we provide the generated POJO to pulsar producer the schema registered in registry not identical to schema we have written. This is because of allowNull as pointed out by sijieg . This came as an surprise . When we tried consuming with Python with the same schema it failed because of schema incompatibility. If people want null to all the fields then they can add it at the avro schema.
Exposing the flag may also work with default as allowNull to false .
@skyrocknroll @sijie I don't think the current API is the problem, I think we just need to separate API to specify a custom schema. We currently only allow users to derive a schema from a Java class using the Avro library. This generates an Avro schema that matches the semantics of java i.e. classes can be null and primitives are not. However, if I have my own custom avro schema, I can arbitrarily set one class to be nullable while the another class is not. There is not way for Avro to know that information from just a java class definition. Even by allowing users to set the flag "allowNull" will not cover all cases as user can arbitrarily create avro schemas will null or not null fields. I think we can do the following:
Irrespective of the allowNull=true/false question, I think:
The easiest solution would be to detect that a POJO is generated by Avro and use the getClassSchema() instead of doing the reflection based approach.
I don't understand why you say we are not respecting the rules of AVRO
The only flag that could be settable is the allow null flag that gets set in Pulsar's Avro schema. We can expose that flag to users.
@jerrypeng
That is the problem I am reporting this in the issue. Because Pulsar Avro Schema always set AllowNull and pulsar's users don't have the ability to bypass the AllowNull. My point here is Pulsar AVRO should not try to attach AllowNull.
If people want null to all the fields then they can add it at the avro schema.
@skyrocknroll exactly.
I don't think the current API is the problem, I think we just need to separate API to specify a custom schema.
Jerry, that's a different feature.
The point here for a given POJO, Pulsar should maintain a basic contract: the schema generated by Pulsar AVRO should be same/compatible with what other tools generated. Pulsar AVRO should not automatically attach AllowNull for all fields, otherwise we shouldn't call it AVRO schema, we should be calling it AlwaysAllowNullAvroSchema.
This seems to be a problem only with Avro generated code, in which there was a schema defined and we're extracting the wrong schema from the generated POJO
This shouldn't be an issue when one defines an ad-hoc POJO (without starting from Avro schema), because annotations can be used to request nullable/non-nullable.
@merlimat : that's not the only case. People can use ReflectData to generate a schema and write events in a different system (let's call it source). when people migrate data from source to pulsar, people use same POJO, these two schemas are not compatible, which will result in events are not able to be written to pulsar.
The easiest solution would be to detect that a POJO is generated by Avro and use the getClassSchema() instead of doing the reflection based approach.
That's a very limited solution to cover only one use case.
In summary, the questions is about "shall Pulsar automatically attach AllowNull to POJO generated schema". Attaching AllowNull changes the default behavior of how AVRO generates POJO and makes Pulsar inconsistent with how other systems use AVRO.
If the team decides that AllowNull should be the default behavior in AvroSchema, it should be well documented and highlighted to all avro schema users. and a separate AvroSchema or a flag to disable AllowNull should also be provided for users to not use AllowNull.
I don't think detecting if it is an Avro generated POJO is the right approach. We should just remove (or provide the ability to) AllowNull when constructing AvroSchema.
that's not the only case. People can use ReflectData to generate a schema and write events in a different system (let's call it source). when people migrate data from source to pulsar, people use same POJO, these two schemas are not compatible, which will result in events are not able to be written to pulsar.
Given a POJO generated by Avro, there is no way to determine whether this POJO was generated with a schema that allowed or not for null types.
That's a very limited solution to cover only one use case.
It's a 100% correct solution for that case. I don't see what's limited about that.
Though, I too think we should remove the AllowNull from the generator, but that's separate from the point that that alone won't guarantee we can generate the correct schema starting from the generated POJO.
The point here for a given POJO, Pulsar should maintain a basic contract: the schema generated by > Pulsar AVRO should be same/compatible with what other tools generated. Pulsar AVRO should not > > automatically attach AllowNull for all fields, otherwise we shouldn't call it AVRO schema, we should > > be calling it AlwaysAllowNullAvroSchema.
@sijie the contract you are creating when generating an avro schema using the ReflectData api is with the java class itself not some other tool or system. If you want to provide a custom schema, we should have the way to do so. These issues are connected.
Like I mentioned before, a user can create a Avro schema file with arbitrary fields set to null. There is no way to get such information from a java class along. Thus, the api currently does not suffice for general use cases when users are bring in generic avro schemas irregardless of whether allowNull is set by default.
To summarize, we can NOT set allowNull by default but it hardly solves the underlying issue.
Given a POJO generated by Avro, there is no way to determine whether this POJO was generated with a schema that allowed or not for null types.
but that's separate from the point that that alone won't guarantee we can generate the correct schema starting from the generated POJO.
@merlimat
The issue I am creating here is for AllowNull. We found AllowNull is a problem from 2 use cases: 1) the one reported by @skyrocknroll 2) the other use case that @codelipenghui hit.
The whole picture of @skyrocknroll 's problem is avro file => avro generated pojo. a schema generated by avro file is not compatible by schema generated by pulsar.AvroSchema(generated pojo). One of the problems is AllowNull completely changes the schema definitions. Whether removing AllowNull can address this problem or not is a separate issue to be addressed. Although I would expect Avro can handle this well. We shouldn't couple the discussion of this issue with a broad issue introduced by AllowNull.
It's a 100% correct solution for that case. I don't see what's limited about that.
@merlimat
Generated POJO is the use case reported and discussed at slack channel. If you just handle generated POJO by using getClassSchema, you are not covering many other data sources which generate AVRO schema using ReflectData.
the contract you are creating when generating an avro schema using the ReflectData api is with the java class itself not some other tool or system.
@jerrypeng
I am not creating any contract. They are from real use cases. Also the whole discussion is around POJO only, it is not even a cross-language issue or any user customized schema issue. (it was found when being used in cross-language)
1) if a user uses an AVRO schema file to generate a schema using avro tools (that's call this schema A), and generate a POJO class. then the user use the POJO class and use pulsar avro schema to generate another schema B. Ideally A and B should be compatible.
schema A => generated pojo => (pulsar AvroSchema) => schema B
||
\/
schema A => generated pojo => ReflectData.AllowNull.parse => schema B
AllowNull is the problem to prevent them being compatible. I don't know if removing AllowNull can fully address this problem or not. That's a specific issue to address for Avro generated POJO. I hope Avro provides the right tools to convert back-and-forth between schema and pojos, otherwise IMO it is a problem of Avro.
If we removed AllowNull, the flow will be changed to following. The scope of the problem is different - whether ReflectData is the right tool to handle Avro generated POJO or does Avro even provide tools to guarantee such conversations.
schema A => generated pojo => ReflectData.parse => schema B
2) Image I have a data system A (e.g. Spark or Flink) and Pulsar. I have a POJO class (e.g. UserProfile) defined across the whole organization. The schema generated in different systems are completely not compatible even they are using same POJO. When data is flowing between pulsar and other systems, the data might not be processed properly due to incompatible schema. Pulsar is the message bus for exchanging data between other system. If it produces an incompatible schema than other systems, IMO that's a very serious bug.
@sijie that not going to work. If you take look at the java code generated by avro tools with a given schema file its going to be not what you expect. And from that generated class there is no way for reflect data avro library to generate the same schema unless we do the trick @merlimat mentioned
It is kind of a problem with Avro
Edit: it does seem to work so please disregard my comment
@jerrypeng
And from that generated class there is no way for reflect data avro library to generate the same schema unless we do the trick @merlimat mentioned
we can do the trick for what @merlimat mentioned for addressing the specific problem of avro generated code. but that's the separate problem from AllowNull.
@sijie actually this does work:
schema A => generated pojo => ReflectData.parse => schema B
Schema A == . Schema B
Using the reflect data parser in avro will produce the same schema.
I agree we should remove AllowNull to be set by default, but we need a migration plan
@jerrypeng
Using the reflect data parser in avro will produce the same schema.
Glad that we are on the same page now.
I agree we should remove AllowNull to be set by default, but we need a migration plan
I don't think we need a migration plan. We just need a flag to tell whether AllowNull is used or not. Explained in #3752 . Migration is a nightmare. I don't want to touch it.
We are using avro because avro being an standard and guaranteed to work across different part of the whole data pipeline. Any system we use in the pipeline we will just check does it adhere to avro standards. To add more clarity , We force every engineer to first create an avro schema and get it reviewed by data platform team. We generate pojo for java , Python and go struts. Any system violating the avro standards will breaks the pipeline.
@skyrocknroll I guess, for Python it might make sense to also allow to pass the JSON avro definition
@merlimat Yes, that would be awesome!
@sijie @skyrocknroll I just experimented a little.
Given the schema that @skyrocknroll posted on the pulsar chat as an example:
{
"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "favorite_number",
"type": [
"int",
"null"
]
},
{
"name": "favorite_color",
"type": "string"
},
{
"name": "age",
"type": "int",
"default": 19
}
]
}
I did some tests using the generated java class with the ReflectData schema generator:
log.info("{}", new String(AvroSchema.of(User.class).getSchemaInfo().getSchema()));
log.info("{}", ReflectData.get().getSchema(User.class));
log.info("{}", ReflectData.AllowNull.get().getSchema(User.class));
They all produced the same result:
21:07:27.741 [main:org.company.Test4@44] INFO org.company.Test4 - {"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":"string"},{"name":"age","type":"int","default":19}]}
21:07:27.746 [main:org.company.Test4@46] INFO org.company.Test4 - {"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":"string"},{"name":"age","type":"int","default":19}]}
21:07:27.746 [main:org.company.Test4@48] INFO org.company.Test4 - {"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":"string"},{"name":"age","type":"int","default":19}]}
regardless of whether AllowNull is set or not the Avro schema generated is the same as the one in the Avro schema file. The ReflectData implementation is probably getting the schema from the generated java class via getClassSchema() as @merlimat suggested before. So my question still holds what exactly does removing AllowNull solve?
Here is actually the line of code that actually just gets the "SCHEMA" field regardless of the java class:
https://github.com/apache/avro/blob/release-1.8.2/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificData.java#L277
@jerrypeng
I am surprised to see {"name":"favorite_color","type":"string"}is not a union type with null after using AllowNull.
so my question is how do you do the test? Is the User any kind of POJO class, or avro generated POJO class?
@sijie The User class is an Avro generated class based on the schema file.
Here is the file:
https://gist.github.com/jerrypeng/6bd3501ec7f8f4e1f07aa328a06169b1
@sijie That is the workflow you were worried about right? i.e.
schema A => generated pojo => ReflectData.parse => schema B
and schema A should equal schema B
It should already work?
@jerrypeng I got a different result than yours
log.info("Pulsar AllowNull :{}", new String(AvroSchema.of(User.class).getSchemaInfo().getSchema()));
log.info("Avro Not AllowNull :{}", ReflectData.get().getSchema(User.class));
log.info("Avro AllowNull :{}", ReflectData.AllowNull.get().getSchema(User.class));
Pulsar AllowNull :{"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":["null","string"],"default":null},{"name":"favorite_number","type":["null","int"],"default":null},{"name":"favorite_color","type":["null","string"],"default":null},{"name":"age","type":"int"}]}
Avro Not AllowNull :{"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":"string"},{"name":"age","type":"int","default":19}]}
Avro AllowNull :{"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":"string"},{"name":"age","type":"int","default":19}]}
@sijie I think the reason why AvroSchema.of(User.class) is producing a different result is because we shade all the Avro dependencies and that is interfering it from detecting that its a generated class, but Avro Not AllowNull is the same as Avro Not AllowNull which we are using the exactly the later in AvroSchema. Maybe we should figure out a way to getting the shading to work with Avro?
Here is a Test project i have created:
https://github.com/jerrypeng/TestAvro
I verified the reason why the AvroSchema part of the shaded pulsar-client library outputs a different schema is because of shading
@jerrypeng correct. I added one more to complete the picture.
log.info("Pulsar AllowNull :{}", new String(AvroSchema.of(User.class).getSchemaInfo().getSchema()));
log.info("Pulsar shaded AVRO NOT AllowNull :{}", org.apache.pulsar.shade.org.apache.avro.reflect.ReflectData.get().getSchema(User.class));
log.info("Avro AllowNull :{}", ReflectData.AllowNull.get().getSchema(User.class));
log.info("Avro Not AllowNull :{}", ReflectData.get().getSchema(User.class));
Pulsar AllowNull :{"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":["null","string"],"default":null},{"name":"favorite_number","type":["null","int"],"default":null},{"name":"favorite_color","type":["null","string"],"default":null},{"name":"age","type":"int"}]}
Pulsar shaded AVRO NOT AllowNull :{"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":"int"},{"name":"favorite_color","type":"string"},{"name":"age","type":"int"}]}
Avro AllowNull :{"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":"string"},{"name":"age","type":"int","default":19}]}
Avro Not AllowNull :{"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":"string"},{"name":"age","type":"int","default":19}]}
As you described above, AllowNull doesn't matter if the POJO is generated by AVRO, because AVRO will try to use SCHEMA$ field to retrieve the schema. But If the POJO is not generated by AVRO, AVRO will attempt to parse it and AllowNull will generating two different incompatible schemas.
my tests covered all the cases.
1) Avro AllowNull and Avro Not AllowNull are using unshaded avro library to parse avro generated POJO. in this way, they will always produce the same schema no matter AllowNull used or not.
2) Pulsar AllowNull and Pulsar shade avro Not Allow Null are using shaded avro library to parse avro generated POJO. In this way, they will produce a completely different schema. AllowNull matters here.
The 2) indicates two problems:
a) for normal POJO (not generated by AVRO), AllowNull is producing incompatible schemas. so we need #3752
b) for avro generated POJO, shaded AVRO library (the one used by Pulsar) doesn't recognize it is an AVRO generated POJO. We need a separate fix to make sure Pulsar shaded AVRO working with avro generated POJO.
@sijie yup though I am not sure setting AllowNull is producing an incompatible schema. Incompatible to what? I guess if a user wrote some code:
ReflectData.get().getSchema(MyClass.class)
and they didn't set allow null then it wouldn't be compatible with the current schema generated from AvroSchema but my implementation could just have easily set allow null. Thus, this doesn't seem like big deal and arbitrary.
However in Java, classes can be null and primitives cannot be. Should we not preserve that semantic? Is Integer the same as int? e.g. should the following class:
public static class Foo {
int field1;
}
generate the same AVRO schema as
public static class Foo {
Integer field1;
?
I am fine with removing AllowNull by default if everyone wants to do it but I think the bigger problem here is the one with shading and how generated java classes are handled since that is a API/standard that should work across systems.
yup though I am not sure setting AllowNull is producing an incompatible schema. Incompatible to what?
@jerrypeng ReflectData.AllowNull and ReflectData are generating two different incompatible schemas. I have been explaining to you guys in previous comments: my broader concern is NOT about generated POJO, is about using a same user defined POJO class across many different systems using AVRO as the standard serialization format where people are using ReflectData not ReflectData.AllowNull (the second case explained at https://github.com/apache/pulsar/issues/3741#issuecomment-469506795).
my implementation could just have easily set allow null. Thus, this doesn't seem like big deal and arbitrary.
well it is a big deal when exchanging data between systems using same POJO.
Should we not preserve that semantic?
If Pulsar uses AVRO on POJOs, I don't think it is the responsibility for Pulsar to preserve its own semantic. It should preserve what AVRO preserves for everyone because this is the rules of all other systems follow when using AVRO. That's the most important thing when AVRO is used as a standard data format across data systems. Currently Pulsar's AllowNull implementation violates it.
but I think the bigger problem here is the one with shading
The shading bug is new to me. I agreed that we need to fix it.
Most helpful comment
Irrespective of the
allowNull=true/falsequestion, I think:The easiest solution would be to detect that a POJO is generated by Avro and use the
getClassSchema()instead of doing the reflection based approach.