I am just wondering if you have investigated a specialized mode for compressing data with fixed strides.
Some machine-generated data has a fixed stride, meaning each block consists of exactly n bytes.
I was wondering if you have investigated whether this could be exploited for fast, but still quite efficient, compression. For this compression mode, you would specify a start offset, to account for the possibility of a header, and a fixed stride at which you would expect matches.
The 'header' would of course be compressed by conventional means, but each block would then only be compared to some number of previous blocks. Those candidate blocks could be narrowed down further by a simple xor, counting the zero versus non-zero bytes. SSE2/AVX etc. can of course do this very fast.
The search would then entirely be done against the selected blocks and the output would be matches between them and literals. I imagine the use of repeat offsets could be quite efficient here.
This might be a silly idea, and I am not sure how much of this data there is "out there", but I wanted to know if you'd tried it before I gave it a shot.
These are indeed real cases.
I see two parts in your question:
1) How to compress faster, by only sampling positions of interest.
2) How to compress better, by applying a transformation to the data ("xor") before the classical compression stages.
For case 1, we haven't done any work on this yet.
One way to achieve something equivalent could be to use a negative compression level.
The rough idea is that level -2 will only sample once every 2 bytes, -4 will sample once every 4 bytes, and so on.
Among the problems: there is no way to select an "initial" offset, and the sampling is statistical rather than "rigid", meaning it can drift over time. So it's not exactly your request.
Also, negative levels disable Huffman compression by default, but this can be re-enabled explicitly, using advanced parameters.
It will clearly increase speed, though it's unclear how much compression ratio will be lost in the process; experience suggests the loss might be non-trivial. Even when data is aligned, there are often matches which happen to "cross boundaries", or which match only part of a block, so restricting the sampling to the beginning of each block can miss a lot of opportunities. I guess the outcome will be content dependent.
The second point is more interesting, as there are indeed some possibilities of important wins when block contents are similar.
However, there is no way we can retrofit the compression format now to perform such a transformation internally. The format is constrained by its published RFC, and we are restricted to implementing only features which are compatible with it. As a consequence, data transformations must be considered an external topic, achieved in a "user layer" in front of Zstandard. Such transformed data would not be decodable outside of its specific associated decoder.
Finally, yes, I would expect repeat offsets to provide important gains for such data types, and I believe that's already the case, even without prior transformation stages. In essence, if a previous offset happens to be the size of a "data block", it will catch similar sections and summarize them as a repeat offset plus a length code, which is efficient and effectively quite similar to what a "xor" could achieve, but without touching the dissimilar bytes, which is beneficial for the later statistics stages.
No, I am only talking about using xor to determine similarity between blocks, for finding good match candidates. This is meant to stay within the zstd spec, so I am aware that transformations are not possible.
What I probably didn't make very clear is that this would be for larger blocks, like 64 bytes and up.
A practical example: say you have blocks of 123 bytes; you would compare against the previous 16 blocks of 123 bytes each.
You find the 3 blocks that have the most matching bytes (PSADBW could help make this fast). These are your candidates for matching. Instead of using the hash table to find matches, you compare your current block against these 3 candidates.
The "16" and "3" are of course just numbers I picked. Memory would be read at 16x the input size, but the data would stay in L1 cache, SIMD could make up for some of that speed, and nothing would have to be written to hash tables. Even though it is pretty brute force, maybe it could be interesting.
I am mainly thinking this to be effective at the lower levels of compression (1-4).
OK, so that's indeed significantly different from what we are doing at low levels.
It's unclear what the speed benefit would be. Maybe it's large, maybe it's not; it's hard to tell.
We have no short-term plans to implement something like that.
A difficult point is that it's not "generic": at a minimum, size information must be provided by the caller.
But of course, if someone wants to try this approach, we'll be very glad to follow the outcome.
I will probably give it a shot once I've released "level 3" compression: https://github.com/klauspost/compress/pull/118
You could even store a bitmask with the matching bytes using PMOVMSKB, at least on x86.
Still not sure how much real data could use this, though.
Closing for now.
I've landed here while looking for some general information on compressing binary records of fixed length.
Just to throw some data in: I have files that hold (key, value) pairs where the key is 9 bytes and the value is 11 bytes, and the entries are ordered by key. Because they are ordered, adjacent keys are similar; and the nature of the values makes them kinda similar too (only about 8 bytes in each entry could be considered "random", apart from the lower bytes of the key). I took a real-life sample of a 45 MB file.
For this particular data, plain zstd gives compressed sizes between 88% and 68% of the original for compression levels 1 through 20.
When transposing the file such that the first bytes of all entries are contiguous, then the second bytes of all entries, and so on, I get a ratio of 65%(!) for all compression levels. This almost matches zip and bz2, so it may be close to the practical limit.
Now, if it were possible to specify a stride hint, this transposition could be done transparently when compressing a block (blocks would need to be of size stride*m?). The inverse would have to be applied on decompression; this may pose a problem because it requires storing an additional parameter in the metadata. I'm not sure about the internals/spec of zstd and how big of an issue this would be. A default of stride=1 would be identical to the current behaviour.
This is certainly something I can easily emulate outside of the library, but at least to me compression of fixed length records sounds like a common enough use-case to consider something in this direction, especially because this is also meant to be used as a standalone tool.
I've done some more research into this.
I tried transpositions at different granularities (bytewise and bitwise), in blocks of at most 128 KiB, floored to the required multiple (that's why the results are slightly worse than before, where the whole file was transposed).
I tried it on two similar datasets of small, fixed-size records, one of which has more bit-packed data.
Previous dataset:

New dataset, more bitpacked records.

Transposing allows a better compression ratio at very low compression levels, and sometimes even beyond (though sometimes it is a regression).
I used modified and optimized code from https://mischasan.wordpress.com/2011/10/03/the-full-sse2-bit-matrix-transpose-routine/ to perform the bitwise transposition. I haven't looked much into optimizing the bytewise transposition (the naive solution is almost 3 times slower than the optimized bitwise transposition). Ironically, I think it's easier to do bitwise than bytewise - movemask is great for this.
I can get speeds of ~600 MiB/s for blocked bitwise transposition, averaged over transposing both ways (with small strides, reversing the transposition [decompressing] is more costly due to memory access patterns), on an old i7-920. The algorithm can easily be implemented with AVX2/AVX-512 on wider registers.
I also tried the algorithm from Hacker's Delight; while it's about twice as slow on my PC, it shows better performance on quick-bench, so I'm not sure about that. https://quick-bench.com/q/obKoJDGaYQstMzHm-NtidjQYNA4