Vector: Add `encoding.charset` option to sources and sinks

Created on 27 Mar 2020 · 5 comments · Source: timberio/vector

Currently, Vector doesn't handle character encodings different from UTF-8.

In order to be compatible with some existing systems, it is necessary to support reading and writing data in other popular encodings. The conversion cannot be done in a transform: for example, the file source would not be able to detect line endings if it expects UTF-8 but the file is encoded in UTF-16BE. Similarly, sinks encoding events as JSON would emit them as UTF-8, even if some of the fields store text in other encodings in the underlying Bytes structure.

I propose adding a new encoding.charset option to relevant sources and sinks. It would accept, case-insensitively, both the names and the aliases of supported character encodings as defined by IANA. Its default value should be utf-8.

It can be implemented using the encoding_rs crate, which is already present as an indirect dependency.
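To illustrate the source-side conversion, decoding UTF-16BE input into Vector's internal UTF-8 representation could look roughly like this. This is a minimal sketch using only the standard library and covering only UTF-16BE; a real implementation would dispatch on the configured charset via encoding_rs, which supports the full IANA set:

```rust
// Sketch: decode raw UTF-16BE bytes into a UTF-8 `String`.
// The function name and error handling here are illustrative only.
fn decode_utf16be(raw: &[u8]) -> Result<String, String> {
    if raw.len() % 2 != 0 {
        return Err("truncated UTF-16BE input".to_string());
    }
    // Reassemble big-endian byte pairs into u16 code units.
    let units: Vec<u16> = raw
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).map_err(|e| e.to_string())
}

fn main() {
    // "hi\n" encoded as UTF-16BE: each code unit is two bytes, big-endian.
    let raw = [0x00, b'h', 0x00, b'i', 0x00, b'\n'];
    let decoded = decode_utf16be(&raw).unwrap();
    assert_eq!(decoded, "hi\n");
}
```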

Relevant components:

  • Sources

    • [ ] [file](https://vector.dev/docs/reference/sources/file/)

    • [ ] [http](https://vector.dev/docs/reference/sources/http/) - caveat: if the client specifies an encoding in the Content-Type header, it should take precedence.

    • [ ] [kafka](https://vector.dev/docs/reference/sources/kafka/)

    • [ ] [socket](https://vector.dev/docs/reference/sources/socket/)

    • [ ] [stdin](https://vector.dev/docs/reference/sources/stdin/)

  • Sinks

    • [ ] [aws_s3](https://vector.dev/docs/reference/sinks/aws_s3/)

    • [ ] [console](https://vector.dev/docs/reference/sinks/console/)

    • [ ] [file](https://vector.dev/docs/reference/sinks/file/)

    • [ ] [gcp_cloud_storage](https://vector.dev/docs/reference/sinks/gcp_cloud_storage/)

    • [ ] [http](https://vector.dev/docs/reference/sinks/http/) - should not only use the specified encoding for the request body, but also set the charset parameter in the Content-Type header.

    • [ ] [kafka](https://vector.dev/docs/reference/sinks/kafka/)

    • [ ] [socket](https://vector.dev/docs/reference/sinks/socket/)
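If the proposal were adopted, usage in a Vector config might look like the sketch below. This is hypothetical: the option name, placement, and defaults are exactly what this issue proposes, not a shipped feature, and the surrounding option names may differ between Vector versions.

```toml
# Hypothetical: read UTF-16BE log files, write UTF-8 output to the console.
[sources.mssql_logs]
  type = "file"
  include = ["/var/log/mssql/*.log"]
  encoding.charset = "utf-16be"   # proposed option; would default to "utf-8"

[sinks.out]
  type = "console"
  inputs = ["mssql_logs"]
  encoding.charset = "utf-8"      # proposed option
```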

Labels: sinks, sources, should, feedback, enhancement


All 5 comments

This could be implemented incrementally, starting from file source and sink.

We are trying to ship Microsoft SQL Server logs with Vector (the log files are encoded in UTF-16BE) and would love to see this feature, especially for the file source.

If this is not something planned for the near future, I am happy to attempt a PR here, with guidance from the vector team :smile:

@anupdhml That'd be great. Ping @lukesteensen or @Hoverbear if you need guidance!

Thanks @jamtur01!

Before I proceed, I wanted to check whether the proposal by @a-rodin is still valid -- it would introduce a new encoding.charset option to only a few sources, and over Discord @Hoverbear mentioned that this is not fully agreed upon yet. If we choose not to go this route, what are the alternatives here?

FWIW filebeat (which we are using at the moment) has a similar config for its log source: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-log.html#_encoding_4

@anupdhml Thanks for taking this on!

The main thing to know here is that Vector's internal data representation doesn't actually care about strings or encoding. So with something like the kafka source, for example, we take the bytes directly from kafka and store them as-is. The only time we start to assume UTF-8 is when something wants to treat that data as a string. So if your pipeline looks like kafka -> kafka or something similar, we'll never need to do anything with the data's encoding.

It seems to me that the simplest way for this feature to work would be to do a conversion from (for sources) or to (for sinks) the specified charset iff this option is set, where we continue to assume that any string-like data is stored as UTF-8 internally. Does that make sense and align with your needs?
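The sink-side half of that boundary conversion could be sketched as follows, again using only the standard library and only for UTF-16BE (encoding_rs would handle arbitrary charsets in practice; the function and option handling here are illustrative, not Vector's actual API):

```rust
// Sketch: encode Vector's internal UTF-8 string at the sink boundary,
// converting only when the hypothetical `encoding.charset` option is set.
fn encode_output(text: &str, charset: Option<&str>) -> Vec<u8> {
    match charset {
        // Option unset: pass the UTF-8 bytes through unchanged.
        None => text.as_bytes().to_vec(),
        // Option set to UTF-16BE: re-encode each code unit big-endian.
        Some("utf-16be") => text
            .encode_utf16()
            .flat_map(|unit| unit.to_be_bytes())
            .collect(),
        Some(other) => unimplemented!("charset {} would need encoding_rs", other),
    }
}

fn main() {
    assert_eq!(encode_output("hi", None), b"hi".to_vec());
    assert_eq!(encode_output("A", Some("utf-16be")), vec![0x00, 0x41]);
}
```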

This could also involve some small tweaks to things like the file source, which currently uses a hardcoded b'\n' as the delimiter between lines.
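To make the delimiter point concrete: in UTF-16BE a newline occupies two bytes, so scanning for a single b'\n' byte would split records in the wrong places. A tiny illustration:

```rust
fn main() {
    // '\n' (U+000A) as a UTF-16BE code unit is two bytes: 0x00 0x0A.
    let newline_utf16be: Vec<u8> = "\n"
        .encode_utf16()
        .flat_map(|unit| unit.to_be_bytes())
        .collect();
    assert_eq!(newline_utf16be, vec![0x00, 0x0A]);
    // A hardcoded one-byte b'\n' delimiter does not match the encoded form.
    assert_ne!(newline_utf16be, b"\n".to_vec());
}
```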

