I am finding that when I compress/decompress a large amount of data across a LAN, I receive the occasional ZSTD_error_corruption_detected error. This seems to happen when network congestion is high, which may lend to support the idea that streaming with varying number of bytes may have an issue somewhere.
I have been looking at all the places that this error can be thrown in the ZSTD_decompressStream( ), and there is a section that seems like it could throw off the input pointer by one because it is not resetting the hostageByte upon release.
if (!zds->hostageByte) { /* output not fully flushed; keep last byte as hostage; will be released when all output is flushed */
input->pos--; /* note : pos > 0, otherwise, impossible to finish reading last block */
zds->hostageByte=1;
}
It seems that once a the hostageByte is set to 1 and input->pos is decremented, hostageByte it is never set back to 0 after input->pos is incremented. Could this not cause pos to be erroneously incremented when there is not a hostageByte?
if (zds->hostageByte) {
...
input->pos++; /* release hostage */
zds->hostageByte = 0; <-- this line current does not exist, should it be added?
} /* zds->hostageByte */
Should zds->hostageByte = 0; be added to the above snippet from zstd_decompress.c?
I suspect your suggestion will not hurt,
but I also suspect it will not solve your problem.
On reaching this point, the function return 0,
which means it must be called again.
At new entry, function reads zds->streamStage,
which value is zdss_init,
so zds->hostageByte will be forced back to 0.
Some questions :
1) do you have a reproducible test case that triggers ZSTD_error_corruption_detected ?
2) if yes, could you proceed with your suggested change, and tell if it solves the problem ?
The major point, I believe, is the creation of a reproducible test case.
Since you mention that the network is congested, you should also make sure that the network layer is performing correctly, providing to decoder all packets emitted and in the correct order.
@Cyan4973 I think you're right. I didn't realize at first that zdss_read went back to zdss_init. I tried to find a potential path where it could somehow pass through that area twice and mess up the hostage situation, but it seems like all permutations are covered.
I added in DEBUGLEVEL 5 and changed the DEBUGLOG to write to a file. The stderr output was difficult for me to capture because I'm calling these routines from powershell, and it seemed easier than trying to figure out the correct way to capture the stderr... Yes, that file is growing fast, but I need to capture it to find out what is going on.
I should hopefully have some relevant DEBUGLOG entries soon...
It might be worth mentioning that I found the version I was using prior was 1.3.5, and I see that you have recently released 1.3.8. There is always the possibility that this issue was corrected in a previous release.
Have you ruled out corruption by the Ethernet cards?
We checksum each block to prove this.
@stati64 yes, they are intel server NICs, but I turned off all TCP offloading just in case. I was more suspect of my own code and ran tests at each layer, which did end up turning up something.
TLDR; I believe the issue is in my code. It has now streamed over 4 TB of continuing data without an error!
(end TLDR, now for details on what I believe was the issue)
Diagram of the stream path:
MSSQL -> VDI -> (Stream Write)
ZStd.NET Comp -> TCPStream -> LAN -> TCPStream -> EchoStream -> ZStd.NET Decomp
(Stream Read) -> VDI -> MSSQL
You may be wondering, what is that EchoStream component?
EchoStream is a System::Net::Stream class I created to Read/Write from/to the same stream. However, I discovered a bug in .NET 4.5 yesterday, where the built-in WriteAsync( )/ReadAsync( ) functions share the same state variables, because they don't expect you to call both of them simultaneously. I believe this was causing an issue where a buffer was lost if a ReadAsync was called when there wasn't any data available, causing the WriteAsync to fail because the read had to first complete (because they are sharing the same state variables). To fix this, I overrode the WriteAsync function and used my own Task variable.
The reason the EchoStream sits between the TCPStream and the ZStandard.NET Stream is because I needed a way to call both Read and Write on the same stream, because I may also Tee the incoming stream to a file, but I want to keep it compressed.
I have not received a further corruption error. Closing issue. Thanks!