I am debugging an issue with the huffman compression and stumbled upon some weirdness centered around this line:
https://github.com/facebook/zstd/blob/dev/lib/compress/huf_compress.c#L162
For my test I am only using the huffman encoder and to mimic my failure case I supplied it with this input:
```go
for i := 0; i < 255; i++ {
	buf0 := make([]byte, BlockSizeMax)
	for j := range buf0 {
		// Each byte is i plus a random 0-3 offset, so only ~4 distinct values per block.
		buf0[j] = byte(int64(i) + (rng.Int63() & 3))
	}
	b, reUsed, err := Compress4X(buf0)
	_, _ = b, reUsed // only the error matters here
	if err != nil {
		panic(err) // fails once i reaches 126
	}
}
```
This works fine until i gets to 126. My input basically looks like this: [127 129 128 127 126 128 127...
Then the FSE encoder fails with a "use RLE" error, because its input (the weight table) is [1,1,1,1....]. In the reference code this means HUF_compressWeights returns 1, so the early return that would write FSE-compressed weights is not taken and execution falls through to the raw format. huffLog is 2.
Since my maxSymbolValue is now 129 (the biggest input byte value), the check on the linked line fails and it refuses to compress my input. The comment says this "should not happen" and "likely means source cannot be compressed" - but that doesn't seem correct to me.
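For reference, here is roughly how I read the decision around that line (a Go sketch of my understanding, not the actual zstd or huff0 code; the FSE-compressed weights are passed in directly so the sketch stays self-contained):

```go
package main

import (
	"errors"
	"fmt"
)

// Sketch of the table-header decision as I understand it. Not real code from
// either library: fseWeights stands in for the output of HUF_compressWeights.
func writeTableHeader(dst, fseWeights []byte, maxSymbolValue int) ([]byte, error) {
	if len(fseWeights) > 1 && len(fseWeights) < maxSymbolValue/2 {
		// headerByte < 128: an FSE-compressed weight stream of this size follows.
		dst = append(dst, byte(len(fseWeights)))
		return append(dst, fseWeights...), nil
	}
	// Fallback: raw 4-bit weights. headerByte = 127 + number of written weights,
	// so at most 255-127 = 128 weights fit.
	if maxSymbolValue > 128 {
		// The "should not happen" branch I am hitting with maxSymbolValue = 129.
		return nil, errors.New("too many weights for the raw weight format")
	}
	dst = append(dst, byte(127+maxSymbolValue))
	// ... two 4-bit weights per byte would follow here ...
	return dst, nil
}

func main() {
	// HUF_compressWeights produced a single byte for my all-ones weights, so the
	// FSE branch is skipped and the raw fallback rejects maxSymbolValue = 129.
	_, err := writeTableHeader(nil, []byte{0x01}, 129)
	fmt.Println(err)
}
```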
Am I doing something wrong here, or is it just a very special case - and is it missing a way to represent RLE content?
Hmmm... Seeing some weirdness local to my code... Maybe my test was better at finding the issue than I expected.
Checking some stuff.
Think I found it. My compression table was not being cleared properly when reuse was allowed but the table was not actually reused.
On reaching line 162, the decision is to use the "raw weight format", which is limited to representing up to 129 symbols (hence values 0-128) only. Going above this upper bound makes the table encoding fail.
In general, this case is reached when compressing the huffman weights produces a worse result than the direct "raw" format. This can happen when the alphabet size is very small. But as the alphabet size grows, it becomes less and less likely, and eventually downright impossible, if only because weight values are necessarily bounded between 0 and 11 (inclusive), and can therefore be compressed to better than 4 bits per symbol.
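For a rough sense of scale (an illustrative calculation, not taken from the code), the raw representation always costs one header byte plus half a byte per written weight, which is very hard to beat for tiny alphabets and increasingly easy to beat as the number of weights grows:

```go
package main

import "fmt"

// Cost of the raw weight representation: one header byte, then 4 bits per
// written weight, two weights packed per byte.
func rawHeaderSize(numWeights int) int {
	return 1 + (numWeights+1)/2
}

func main() {
	for _, n := range []int{4, 8, 16, 64, 128} {
		fmt.Printf("%3d written weights -> %2d bytes in raw form\n", n, rawHeaderSize(n))
	}
}
```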
Here, this is different. Compressing the weights is possible, but it seems to produce a long sequence of 1s, thus requiring the use of RLE mode.
Unfortunately, FSE decompression does not natively handle the RLE case. It is supposed to be routed to a different code path by the user layer, typically ending in a memset().
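To illustrate the kind of routing meant here, a deliberately simplified sketch (a generic pattern, not the zstd format): the caller detects the degenerate all-identical input, bypasses the entropy coder entirely, and the decoder side is a plain fill, i.e. the memset() equivalent.

```go
package main

import (
	"bytes"
	"fmt"
)

// encode routes around the entropy coder when every byte is identical, because
// the coder itself cannot represent that case.
func encode(src []byte) []byte {
	if len(src) > 0 && bytes.Count(src, src[:1]) == len(src) {
		// RLE: a marker, the repeated byte, and the count (assumes len(src) < 256).
		return []byte{'R', src[0], byte(len(src))}
	}
	// Otherwise hand the data to the real entropy coder (omitted here).
	return append([]byte{'C'}, src...)
}

// decode performs the memset() equivalent for the RLE case.
func decode(blk []byte) []byte {
	if blk[0] == 'R' {
		return bytes.Repeat([]byte{blk[1]}, int(blk[2]))
	}
	return blk[1:]
}

func main() {
	weights := bytes.Repeat([]byte{1}, 100) // the "all weights are 1" situation
	fmt.Println(bytes.Equal(decode(encode(weights)), weights))
}
```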
So FSE cannot compress these weights, and the encoder defers to the "raw" mode, which fails, hence the error.
It seems this is a case which would require FSE decompression to handle RLE "natively". That's not easily possible: the number of RLE symbols to regenerate must be written somewhere. The problem is that any change now must remain compatible with RFC 8478: it's no longer possible to _add_ a decoder requirement at this stage.
Having all weights at a value of 1 means that all symbols use the same number of bits. This happens when data is not compressible, but such data should not have reached this stage to begin with. That's in substance what the comment says.
Another possibility is that all symbols use the same number of bits, but that number is < 8, hence the data is compressible. That's possible, but the sum of 2^weight over these symbols must still be a power of 2, which means the number of symbols present in the alphabet must be 2, 4, 8, 16, etc. This includes 128 and 256, but 128 can be represented in the "raw" table mode, while 256 symbols all at the same depth means 8 bits per symbol, i.e. the data can't be compressed, so it should not have reached this stage. Critically, no value between 129 and 255 is possible, since the sum would no longer be a power of 2.
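As a quick check of that claim, a small enumeration (assuming every present symbol gets the same code length k, so the Kraft equality requires n * 2^-k = 1):

```go
package main

import "fmt"

// List the alphabet sizes n for which n symbols can all share the same code
// length k (1..11 bits) in an exact prefix code: n * 2^-k must equal 1.
func main() {
	var valid []int
	for n := 2; n <= 256; n++ {
		for k := 1; k <= 11; k++ {
			if n*(1<<(11-k)) == 1<<11 { // n * 2^-k == 1, scaled by 2^11
				valid = append(valid, n)
				break
			}
		}
	}
	fmt.Println(valid) // [2 4 8 16 32 64 128 256]: nothing between 129 and 255
}
```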
Now, there is another subtlety: the last weight is not expressed in the table. So we could have a situation where the data is actually compressible, while the weights to represent in the table are all 1.
But there is another condition: it must still represent a perfect huffman distribution, so the sum of 2^weight over all symbols (including the implicit last one) must be a power of 2.
This leaves very few options for the table.
Basically, the number of represented weights must be a power of 2, and the unrepresented last weight must weigh as much as all the preceding weights combined (its contribution equals the sum of theirs).
For example: 16 x 1, implying a last 17th weight of 5.
32 x 1, implying a last 33rd weight of 6. Etc.
The only dangerous case is 128 x 1, implying a last 129th weight of 8.
(256 x 1, with a 257th weight of 9, is not possible, since the alphabet size is necessarily <= 256.)
Now this is an alphabet of size 129.
But the thing is, we only represent 128 weights, so it still fits in the "raw" format, which can hold up to 128 weights (inclusive).
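To double-check these numbers, a small sketch applying the completion rule used when reading the table (each written weight w contributes 2^(w-1), the total is completed to the next power of 2, and the implied last weight is deduced from the missing amount, which must itself be a power of 2):

```go
package main

import (
	"fmt"
	"math/bits"
)

// impliedLastWeight applies the completion rule: each written weight w (w > 0)
// contributes 2^(w-1); the unwritten last weight must bring the total up to the
// next power of 2. If the missing amount is not itself a power of 2, no valid
// last weight exists and the table is invalid.
func impliedLastWeight(weights []int) (int, error) {
	total := 0
	for _, w := range weights {
		if w > 0 {
			total += 1 << (w - 1)
		}
	}
	next := 1 << bits.Len(uint(total)) // smallest power of 2 strictly above total
	missing := next - total
	if missing&(missing-1) != 0 {
		return 0, fmt.Errorf("last value not power of two (missing %d)", missing)
	}
	return bits.Len(uint(missing)), nil // log2(missing) + 1
}

func main() {
	for _, n := range []int{16, 32, 128} {
		ones := make([]int, n)
		for i := range ones {
			ones[i] = 1
		}
		w, _ := impliedLastWeight(ones)
		fmt.Printf("%3d x 1 -> implied last weight %d (alphabet size %d)\n", n, w, n+1)
	}
}
```

It prints implied last weights of 5, 6 and 8 for 16, 32 and 128 written weights of 1, respectively.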
Hence, if the analysis is correct, there should be no valid distribution of weights that can't be represented. If one nonetheless shows up, it might be because the distribution of weights is invalid, i.e. it cannot be completed to represent a valid huffman tree. That also should not happen, and if it does, it could mean that the code defining the huffman tree is buggy.
Thanks a lot. Yeah, I was debugging a "corrupt input: last value not power of two" error in huff0 table reading - this cropped up when I was writing a reproducer. Basically a side effect of the real problem.