Hi,
I am running a ClickHouse cluster on k8s and reading from Kafka topics. I am using ClickHouse 20.4.4.18 and version 0.9.9 of the ClickHouse operator. The topics are in AvroConfluent format, and everything worked well until I started getting an exception for one of the topics, where the complaint is: "Invalid namespace".
The namespace name is something like xxx-yyyy.prod.
Do you have any idea what this can be about?
Many thanks!
Just to follow up: it may be that the dash in the namespace name is actually not acceptable and the Avro serializer is doing its job. But we cannot remove it from the namespace at this point. If this is the case, is there a way to work around it and disable the name validation? We have successfully read from this topic using other Avro serializers, so it may be that those skip the validation.
Maybe kafka_skip_broken_messages may help?
I tried that but the topic does not allow it.
Not sure how that is possible. Can you post the query and the error?
Hi,
(I am from the same company as jmalicevic.) I investigated this a bit further, and this is basically a problem with the Avro namespace naming convention. According to the Avro docs:
The name portion of a fullname, record field names, and enum symbols must:
- start with [A-Za-z_]
- subsequently contain only [A-Za-z0-9_]
A namespace is a dot-separated sequence of such names. The empty string may also be used as a namespace to indicate the null namespace. Equality of names (including field names and enum symbols) as well as fullnames is case-sensitive.
However, the Java schema validation checks overlook this and let the '-' slip through.
The CH validation checks, however, do not (contrib/avro/lang/c++/impl/Node.cc):
static bool invalidChar1(char c)
{
    return !isalnum(c) && c != '_' && c != '.' && c != '$';
}
Since we can't really change the namespace naming (even if it is improper according to the Avro docs), maybe a flag can be added to turn off this check?
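To make the mismatch concrete, here is a minimal Python sketch of the two checks discussed above. `is_valid_namespace` follows the naming rule quoted from the Avro docs, and `invalid_char1` is a hand-written port of the quoted C++ predicate only (the rest of Node.cc is not reproduced here). Both function names are made up for illustration and are not part of any library.

```python
import re

# Rule quoted from the Avro docs: a name starts with [A-Za-z_] and then
# contains only [A-Za-z0-9_]; a namespace is a dot-separated sequence of
# such names, and the empty string denotes the null namespace.
_NAME = r"[A-Za-z_][A-Za-z0-9_]*"
_NAMESPACE_RE = re.compile(rf"(?:{_NAME}(?:\.{_NAME})*)?")


def is_valid_namespace(namespace: str) -> bool:
    """Return True if the namespace satisfies the quoted Avro naming rule."""
    return _NAMESPACE_RE.fullmatch(namespace) is not None


def invalid_char1(c: str) -> bool:
    """Illustrative port of the quoted C++ predicate invalidChar1."""
    return not (c.isascii() and c.isalnum()) and c not in "_.$"


for candidate in ("my-namespace.prod", "my_namespace.prod"):
    bad = [ch for ch in candidate if invalid_char1(ch)]
    print(candidate, is_valid_namespace(candidate), bad)
```

Under both checks the dash is the only offending character in a namespace such as my-namespace.prod, which matches the "Invalid namespace" error.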
DB::Exception: Invalid namespace: my-namespace: while fetching schema id = X: (Input format doesn't allow to skip errors): (at row 1)
: While executing SourceFromInputStream.
The query ends with:
ENGINE = Kafka()
SETTINGS kafka_broker_list = 'broker_host:port', kafka_topic_list = 'my-namespace.prod.topic', kafka_group_name = 'me.prod.test123', kafka_format = 'AvroConfluent', kafka_skip_broken_messages = 10
I anonymized the fields, but the schema ID is indeed that of the corresponding schema in our registry.
I looked into the codebase a bit. From what I see, in order to use the option suggested by @azat, allowSyncAfterError needs to be added to the input format. While this exists for some input formats (like JSONEachRowRowInputFormat and CSVRowInputFormat), it is not implemented for AvroRowInputFormat. In any case, it does not solve the real problem: even if it were implemented, we would just skip all messages.
src/Processors/Formats/IRowInputFormat.cpp
if (!allowSyncAfterError())
{
    e.addMessage("(Input format doesn't allow to skip errors)");
    throw;
}
Some ideas:
Put a proxy in front of http://schema-registry/schemas/ids/{id} that rewrites the schema. You could use http://nginx.org/en/docs/http/ngx_http_sub_module.html or just write a very simple app in golang/etc.
Maybe kafka_skip_broken_messages may help?
allowSyncAfterError needs to be added to the input format
Error processing in Kafka should happen in a completely different way than it does now.
Now we parse each message separately, so we really can skip broken messages individually (rather than broken rows somewhere in the middle of the stream).
Just to follow up, we have temporarily bypassed the issue by using a proxy, as suggested by @oandrew, and replacing the '-' with '_'.
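For anyone hitting the same problem, a proxy of this kind could look roughly like the Python sketch below. It is not the code we actually run. It assumes the registry answers GET /schemas/ids/{id} with a JSON object whose "schema" field holds the schema as a JSON-encoded string. The upstream address, the listening port, and all function and class names are hypothetical. Only "namespace" attributes are rewritten; dotted fullnames inside "name" or type references are left alone.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

UPSTREAM = "http://schema-registry:8081"  # assumption: address of the real registry


def fix_namespaces(node):
    """Recursively replace '-' with '_' in every "namespace" value."""
    if isinstance(node, dict):
        return {
            key: value.replace("-", "_")
            if key == "namespace" and isinstance(value, str)
            else fix_namespaces(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [fix_namespaces(item) for item in node]
    return node


def rewrite_registry_body(body: bytes) -> bytes:
    """Rewrite the schema embedded in a /schemas/ids/{id} response body."""
    payload = json.loads(body)
    schema = json.loads(payload["schema"])  # the schema is a JSON string inside JSON
    payload["schema"] = json.dumps(fix_namespaces(schema))
    return json.dumps(payload).encode("utf-8")


class RegistryProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the request to the real registry, then patch schema lookups.
        with urlopen(UPSTREAM + self.path) as upstream:
            body = upstream.read()
        if self.path.startswith("/schemas/ids/"):
            body = rewrite_registry_body(body)
        self.send_response(200)
        self.send_header("Content-Type", "application/vnd.schemaregistry.v1+json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8082) -> None:
    """Run the proxy until interrupted (port is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), RegistryProxy).serve_forever()
```

Calling serve() starts the proxy; ClickHouse is then pointed at the proxy address instead of the registry itself. Upstream errors are not handled in this sketch.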