__System info:__
uname -a: "Linux cncftest.io 4.10.0-42-generic #46-Ubuntu SMP Mon Dec 4 14:38:01 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux"
influx -version: "InfluxDB shell version: 1.3.6"

__Steps to reproduce:__
Happens very rarely, and only under heavy load, with up to 48 threads (the machine has 48 cores).
Piece of code:

    q := client.Query{
        Command:  query,
        Database: ctx.IDBDB,
    }
    response, err := con.Query(q)
    FatalOnError(err)
    FatalOnError(response.Error()) // <--- this line fails with "engine is closed"
    return response.Results

__Expected behavior:__ No error
__Actual behavior:__ error
__Additional info:__ --- (was not able to find a way to reproduce, happens once per few days).
It would be great to at least know what "engine is closed" means. What causes that error?
"Engine is closed" usually means the system is just starting up or is shutting down. Can you check your logs for anything that looks like that? Each shard in the system is considered its own "engine", and a large system can sometimes take a while to open all of its shards.
I'll check next time when it happens (it didn't happen for 5+ days already).
I'm 100% sure nothing is shutting down or starting - unless Influxd does it without my knowledge sometimes?
Where should I look?
I would say the log file will give the best insight. If the server crashed for whatever reason and was starting up again, that would be a good indicator. If the "engine is closed" error happens a bunch of times and then just stops, that would be another sign. Since you say it hasn't happened for 5 days, I'm guessing the server has been running stably for at least 5 days, and that's why you haven't encountered it.
Yes, it is very rare and only under heavy load.
The "engine is closed" error occurs when writes or queries run against a shard that is not open/ready. If you are getting this during a query, it's likely that the planning step picked an old shard and, before the query ran on that shard, the retention service closed it and started removing it.
Since this is marked as "revisit" and "more info", this might help in reproducing the error or in understanding when it occurs:
I was getting a lot of "engine is closed" errors during a scripted online restore process. In my case this happened because I ran a SELECT query immediately after the influxd restore command finished. So I gave InfluxDB a second (by sleeping in my program) before running my SELECT statement, and this fixed the "engine is closed" error for me.
My script works as follows:
The error would always occur at step 2. So I added a one-second sleep between steps 1 and 2, and the error was gone.
I can reproduce these errors by concurrently removing independent continuous queries, measurements, and retention policies.
In my tests, I have multiple measurements, each with a separate retention policy (call each pair A). Each measurement has a continuous query which inserts into another measurement that has its own retention policy (call each triple B). For each group, I remove A (measurement, then RP) and B (CQ, measurement, then RP) concurrently. I get errors similar to these:
Statement error: shard 14: engine is closed
Statement error: shard 16: engine is closed
The error is returned when deleting the measurement in group B.