Vector supports little beyond gzip compression in its sinks, which is to say, a compressor from the early 1990s built on methods that were already about 20 years old at the time, and one that performs string deduplication over a tiny 32 KiB window. Deflate is neither quick to compress nor quick to decompress, has tragic ratios for bulk data, and over the past 30 years has been roundly beaten by numerous compressors on every metric except mass adoption.
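To make the 32 KiB window point concrete, here is a minimal sketch using only Python's standard library. It builds a block of incompressible random bytes followed by an exact copy of itself, once with the copy 16 KiB back (inside Deflate's window) and once 40 KiB back (outside it). The block sizes and the helper name `repeated_block` are my own illustrative choices, and LZMA stands in for "a compressor with a larger window" only because it ships with Python.

```python
import lzma
import random
import zlib


def repeated_block(block_size: int, seed: int = 0) -> bytes:
    """Return an incompressible random block followed by an exact copy of itself."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    return block + block


# The copy starts 16 KiB after the original: within Deflate's 32 KiB window.
near = repeated_block(16 * 1024)
# The copy starts 40 KiB after the original: beyond Deflate's 32 KiB window.
far = repeated_block(40 * 1024)

near_deflate = len(zlib.compress(near, 9))
far_deflate = len(zlib.compress(far, 9))
far_lzma = len(lzma.compress(far))

print(f"16 KiB block x2, deflate: {len(near)} -> {near_deflate} bytes")
print(f"40 KiB block x2, deflate: {len(far)} -> {far_deflate} bytes")
print(f"40 KiB block x2, lzma:    {len(far)} -> {far_lzma} bytes")
```

Deflate should roughly halve the first input but get essentially nothing out of the second, while a compressor with a larger dictionary still finds the duplicate. Real logs repeat at far larger distances than 40 KiB, which is where the ratio gap comes from.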
Of those modern compressors, LZMA and Zstandard have some level of adoption and are fit for general use, but for log analysis in particular, Zstandard hits a massive sweet spot, combining state-of-the-art compression ratios with best-in-class decompression speed.
It's possible to get 20x compression of logs with Zstandard and decompress those logs for analysis at almost 2 GiB/s on a single thread. That (theoretically) allows a 20-core machine to process 40 GiB/s of decompressed logs while reading only 2 GiB/s of compressed data, which is exactly enough to saturate an underlying 2 GiB/s NVMe storage device (assuming no work other than decompression is being performed).
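The arithmetic behind those figures, as a small sketch. The function name and parameters are made up for illustration, and the inputs are the round numbers quoted above, not measurements.

```python
def decompression_budget(cores: int, per_core_output_gib_s: float, ratio: float):
    """Return (decompressed output, compressed input) in GiB/s for the whole machine.

    per_core_output_gib_s is the rate of *decompressed* bytes one thread produces;
    ratio is uncompressed size divided by compressed size.
    """
    total_output = cores * per_core_output_gib_s
    compressed_input = total_output / ratio
    return total_output, compressed_input


# Figures from the paragraph above: 20 cores, ~2 GiB/s per thread, 20x ratio.
total_output, compressed_input = decompression_budget(20, 2.0, 20.0)
print(f"decompressed output: {total_output} GiB/s")
print(f"compressed input read from disk: {compressed_input} GiB/s")
```

At a 20x ratio the 40 GiB/s of output needs only 2 GiB/s of reads, so the storage device and the CPUs run out at about the same time.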
LZMA is competitive with Zstandard in ratio and overall performance, but Zstandard still enjoys a significant lead in absolute decompression performance, which for me is a major deciding factor in long-term log storage.
This is a request to consider modern gzip alternatives, or if there is no time for that, perhaps consider only my suggestion to go the Zstandard route. ;)
Thanks
We definitely want to build on our compression feature in the near future! I think giving folks the option can be done similarly to how we do encoding.
For whoever wants to tackle this: I think adding an encoding.compression field might be the way to go?
@bruceg before we begin work, we should identify sinks where this is compatible.
Sinks currently using gzip compression:
Sink | Allowed Methods | Status
-----|-----------------|-------
aws_s3 | any |
clickhouse | brotli, deflate, gzip (reference) |
elasticsearch | gzip (?) |
gcp_cloud_storage | any |
http | any |
kafka | gzip, lz4, snappy, zstd (reference) | supported via librdkafka
splunk_hec | gzip (?) |
So it looks like aws_s3, gcp_cloud_storage, http, and kafka are good sinks to target first.
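As a rough illustration of how that list falls out of the table, here is a sketch that encodes the table as data and asks which sinks could accept a given method. The names `SINK_COMPRESSION`, `is_allowed`, and `sinks_supporting` are hypothetical and not part of Vector, and the rows marked "(?)" are taken at face value.

```python
# None means the sink stores opaque bytes, so any compression method is acceptable.
ANY = None

# Transcribed from the table above; "(?)" rows are unconfirmed.
SINK_COMPRESSION = {
    "aws_s3": ANY,
    "clickhouse": {"brotli", "deflate", "gzip"},
    "elasticsearch": {"gzip"},  # marked "(?)" in the table
    "gcp_cloud_storage": ANY,
    "http": ANY,
    "kafka": {"gzip", "lz4", "snappy", "zstd"},  # handled by librdkafka
    "splunk_hec": {"gzip"},  # marked "(?)" in the table
}


def is_allowed(sink: str, method: str) -> bool:
    """Return True if the table says `sink` can accept data compressed with `method`."""
    allowed = SINK_COMPRESSION[sink]  # raises KeyError for sinks not in the table
    return allowed is ANY or method in allowed


def sinks_supporting(method: str) -> list[str]:
    """List, in alphabetical order, every sink in the table that accepts `method`."""
    return sorted(sink for sink in SINK_COMPRESSION if is_allowed(sink, method))


print(sinks_supporting("zstd"))
```

For `zstd` this yields the same four sinks named above. A per-sink allow-list like this is also one way a config value could be validated at startup instead of failing at request time.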
Just to be clear, I don't think we have to implement anything for kafka beyond passing the configs down and enabling the relevant features on the crate.