Reported for version 6.7.1. The following errors were logged:
```
[2019-06-21T02:52:00,586][ERROR][CLengthEncodedInputParser.cc@169] Parsed field length 808333362 is suspiciously large - assuming corrupt input stream
[2019-06-21T02:52:00,586][ERROR][CLengthEncodedInputParser.cc@56] Failed to parse length encoded header from stream
[2019-06-21T02:52:00,586][FATAL][Main.cc@133] Failed to handle input to be normalized
[2019-06-21T02:52:00,588][ERROR][CLengthEncodedInputParser.cc@147] Incorrect number of fields in input stream record: expected 9 but got 0
[2019-06-21T02:52:00,588][ERROR][CLengthEncodedInputParser.cc@82] Failed to parse length encoded data record from stream
[2019-06-21T02:52:00,588][FATAL][Main.cc@133] Failed to handle input to be normalized
```
The corruption could have been caused by an OS-level error with the named pipe, or by two threads writing to the named pipe at the same time. There is some evidence to support the latter, as elsewhere in the file the following was reported:
```
[2019-06-20T11:27:01,130][ERROR][CNamedPipeFactory.cc@206] Unable to create named pipe /tmp/elasticsearch-3389134197385528191/normalize_mljob-XXX: File exists
```
The error comes from `mkfifo`, which is called when `lstat` fails, as it is expected to because the file should not exist. `lstat` may have failed for a different reason - e.g. no execute permission - or the file was created between the calls to `lstat` and `mkfifo`, which would suggest two normalize processes are at work. Note that the named pipe file names are based on the job ID, so different processes would create the same file for the same job, and on the Java side two threads could write to the same pipe.
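For illustration, here is a minimal Java sketch (hypothetical names, not the actual Elasticsearch code) of why pipe names derived only from the job ID collide: any two normalize launches for the same job compute identical paths, so the second `mkfifo` hits "File exists" and both sides can end up writing interleaved records to one pipe.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PipeNameCollisionSketch {
    // Hypothetical naming scheme mirroring the "normalize_<job id>" path in the log above.
    static Path normalizePipe(Path tmpDir, String jobId) {
        return tmpDir.resolve("normalize_" + jobId);
    }

    public static void main(String[] args) {
        Path tmp = Paths.get("/tmp/elasticsearch-tmpdir");
        // Two independent launches for the same job compute identical paths, so the
        // second attempt to create the pipe fails and both can write to one pipe.
        System.out.println(normalizePipe(tmp, "job-123"));
        System.out.println(normalizePipe(tmp, "job-123"));
    }
}
```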
However, `ShortCircuitingRenormalizer` has a semaphore protecting against concurrent invocations of `doRenormalizations`, and I cannot see a way that a job could have two normalize processes.
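For reference, a minimal sketch of that guard, assuming `ShortCircuitingRenormalizer` works roughly like this: a single-permit semaphore means only one `doRenormalizations` run is in flight at a time, and callers that cannot acquire the permit simply skip (short-circuit) that round.

```java
import java.util.concurrent.Semaphore;

public class RenormalizerGuardSketch {
    private final Semaphore semaphore = new Semaphore(1);

    public void maybeRenormalize() {
        // Callers that cannot get the permit skip (short-circuit) this round.
        if (!semaphore.tryAcquire()) {
            return;
        }
        try {
            doRenormalizations();
        } finally {
            semaphore.release();
        }
    }

    private void doRenormalizations() {
        // would launch the normalize process and feed it the latest quantiles
    }
}
```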
Pinging @elastic/ml-core
I think I understand this now.
There are some installations where the `normalize` process starts up _incredibly_ slowly, and we get a timeout on trying to connect the first named pipe. At that point the Java code assumes that `normalize` process failed to start, but it _did_ actually start, just incredibly slowly.
When the next set of quantiles is received the Java code will fire up a new `normalize` process, but because it's for the same job the named pipes will have exactly the same names as the named pipes for the first `normalize` process. This then opens up the possibility of files existing unexpectedly and/or weird corruption as multiple processes share the same files.
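A minimal sketch of that failure mode, with hypothetical names (not the actual Elasticsearch code): the caller waits a bounded time for the process side of the pipe to open, and on timeout concludes the process never started. If the process is merely very slow, nothing stops it, so it lingers and eventually opens the same pipes as the replacement process.

```java
import java.time.Duration;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SlowStartSketch {
    // Waits a bounded time for the process side of the named pipe to open.
    static boolean connectPipe(Future<?> processSideOpen, Duration timeout) {
        try {
            processSideOpen.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Concludes the process never started - but it may just be very slow,
            // and nothing here stops it, so it lingers and later opens the pipes.
            return false;
        } catch (InterruptedException | ExecutionException e) {
            return false;
        }
    }
}
```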
We can avoid the possibility of duplicates by adding an extra number into the named pipe file names used by `normalize`. This number can just be incremented each time the normalizer is called.
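Roughly, the idea is something like this (hypothetical names, a sketch rather than the actual change): a counter bumped on every normalizer invocation is folded into the pipe file names, so a lingering earlier `normalize` process can never share pipe files with a newly started one.

```java
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

public class UniquePipeNamesSketch {
    private static final AtomicLong INVOCATION_COUNTER = new AtomicLong();

    // Hypothetical naming scheme: the per-invocation counter guarantees a lingering
    // earlier normalize process can never share pipe files with a newly started one.
    static Path normalizePipe(Path tmpDir, String jobId) {
        return tmpDir.resolve("normalize_" + jobId + "_" + INVOCATION_COUNTER.incrementAndGet());
    }
}
```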