Vector: Feature request: Zstandard compression for sinks

Created on 12 Apr 2020 · 6 Comments · Source: timberio/vector

Vector mostly supports only gzip compression in its sinks, which is to say, a compressor specified in 1990, built on methods that were already 20 years old, that deduplicates strings over a tiny 32 KiB window. Deflate is quick neither to compress nor to decompress, has tragic ratios for bulk data, and over the past 30 years has been roundly beaten by numerous compressors on every metric except mass adoption.
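The window limit is easy to see with a small experiment. The sketch below repeats a 64 KiB block of random bytes twice, so the only redundancy sits 64 KiB back, outside Deflate's 32 KiB window. It uses Python's standard `zlib` (Deflate) and, as a stand-in for a larger-window compressor, the standard `lzma` module; Zstandard itself is not used here because it is not assumed to be available in the standard library. The block size and variable names are illustrative choices, not something taken from Vector.

```python
import lzma
import os
import zlib

# 64 KiB of incompressible data: twice Deflate's 32 KiB window.
BLOCK_SIZE = 64 * 1024
block = os.urandom(BLOCK_SIZE)

# The second copy is an exact repeat, but it lies 64 KiB behind the
# current position, so Deflate cannot reference it.
data = block + block

deflate_size = len(zlib.compress(data, 9))  # Deflate, max effort
lzma_size = len(lzma.compress(data))        # default dictionary is far larger than 64 KiB

print(f"input:   {len(data)} bytes")
print(f"deflate: {deflate_size} bytes")
print(f"lzma:    {lzma_size} bytes")
```

Deflate's output stays close to the full input size, while the larger-window compressor stores the second copy almost for free. Log data is not random, but the same effect applies to repeated content that recurs more than 32 KiB apart.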

Of those modern compressors, LZMA and Zstandard have some level of adoption and are fit for general use, but for log analysis in particular, Zstandard hits a massive sweet spot, combining state-of-the-art compression ratios with best-in-class decompression speed.

It's possible to get 20x compression of logs with Zstandard and decompress those logs for analysis at almost 2 GiB/s with a single thread. This allows a 20-core machine (theoretically) to process 20 × 2 GiB/s = 40 GiB/s of decompressed logs. At a 20x ratio that is 2 GiB/s of compressed input, which is exactly enough to saturate an underlying 2 GiB/s NVMe storage device (assuming no other work except decompression was being performed).
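The arithmetic behind that estimate can be written out as a short sketch. The function name and parameters are made up for illustration; the inputs are the figures quoted above (20 cores, about 2 GiB/s of decompressed output per thread, 20x ratio), and perfect scaling across cores is assumed.

```python
def decompression_budget(cores: int, per_core_gib_s: float, ratio: float):
    """Return (decompressed GiB/s, compressed GiB/s read from disk).

    Assumes perfect scaling across cores and no work other than
    decompression, as in the estimate above.
    """
    decompressed = cores * per_core_gib_s   # total output of all threads
    compressed_read = decompressed / ratio  # what the disk must deliver
    return decompressed, compressed_read


decompressed, compressed_read = decompression_budget(20, 2.0, 20.0)
print(f"decompressed logs: {decompressed} GiB/s")       # 40.0
print(f"compressed reads:  {compressed_read} GiB/s")    # 2.0
```

With these inputs the disk and the CPUs run out at the same point; a lower ratio or a slower per-thread rate would leave the NVMe device as the bottleneck or the cores idle, respectively.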

LZMA is competitive with Zstandard in ratio and overall performance, but Zstandard still enjoys a significant lead in absolute decompression performance, which for me is a major deciding factor in long-term log storage.

This is a request to consider modern gzip alternatives, or if there is no time for that, perhaps consider only my suggestion to go the Zstandard route. ;)

Thanks

compression networking sinks nice approval rfc

occasionallydavid

👍3

All 6 comments

We definitely want to build on our compression feature in the near future! I think giving folks the option can be done similarly to how we do encoding.

Hoverbear on 13 Apr 2020

For whoever wants to tackle this: I think adding an encoding.compression field might be the way to go?

Hoverbear on 13 Apr 2020

👍2

@bruceg before we begin work, we should identify sinks where this is compatible.

binarylogic on 20 Apr 2020

Sinks currently using gzip compression:

Sink | Allowed Methods | Status
-----|-----------------|-------
aws_s3 | any |
clickhouse | brotli, deflate, gzip (reference) |
elasticsearch | gzip (?) |
gcp_cloud_storage | any |
http | any |
kafka | gzip, lz4, snappy, zstd (reference) | supported via librdkafka
splunk_hec | gzip (?) |

bruceg on 23 Apr 2020

So it looks like aws_s3, gcp_cloud_storage, http, and kafka are good sinks to target first.
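That shortlist follows mechanically from the table above. A minimal sketch, with the table transcribed into a Python dict (the `"any"` marker, the `supports` helper, and the variable names are illustrative; the uncertain "gzip (?)" entries are taken at face value):

```python
# Allowed compression methods per sink, transcribed from the table above.
# "any" means the destination accepts arbitrary compressed payloads.
ALLOWED_METHODS = {
    "aws_s3": "any",
    "clickhouse": {"brotli", "deflate", "gzip"},
    "elasticsearch": {"gzip"},      # marked "(?)" in the table
    "gcp_cloud_storage": "any",
    "http": "any",
    "kafka": {"gzip", "lz4", "snappy", "zstd"},
    "splunk_hec": {"gzip"},         # marked "(?)" in the table
}


def supports(sink: str, method: str) -> bool:
    """True if the sink's destination can accept the given method."""
    allowed = ALLOWED_METHODS[sink]
    return allowed == "any" or method in allowed


zstd_candidates = [sink for sink in ALLOWED_METHODS if supports(sink, "zstd")]
print(zstd_candidates)
# ['aws_s3', 'gcp_cloud_storage', 'http', 'kafka']
```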

binarylogic on 23 Apr 2020

Just to be clear, I don't think we have to implement anything for kafka beyond passing the configs down and enabling the relevant features on the crate.
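As a rough sketch of what "passing the configs down" could look like: librdkafka selects the producer codec through its `compression.codec` property, so a sink option only needs to be validated and forwarded. The helper name and the shape of the returned mapping are hypothetical and are not Vector's actual code.

```python
# Codec names accepted by librdkafka's `compression.codec` property.
LIBRDKAFKA_CODECS = {"none", "gzip", "snappy", "lz4", "zstd"}


def kafka_compression_properties(compression: str) -> dict[str, str]:
    """Translate a (hypothetical) sink-level compression option into
    librdkafka client properties, without compressing anything ourselves."""
    if compression not in LIBRDKAFKA_CODECS:
        raise ValueError(f"unsupported kafka compression: {compression!r}")
    return {"compression.codec": compression}


print(kafka_compression_properties("zstd"))
# {'compression.codec': 'zstd'}
```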

lukesteensen on 23 Apr 2020

👍2
