Influxdb: [Bug] "engine is closed" - reported by "response.Error()"

Created on 12 Jan 2018 · 8 Comments · Source: influxdata/influxdb

Bug report

__System info:__

  • Ubuntu 17: uname -a: "Linux cncftest.io 4.10.0-42-generic #46-Ubuntu SMP Mon Dec 4 14:38:01 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux"
  • Influx 1.3.6: influx -version: "InfluxDB shell version: 1.3.6"

__Steps to reproduce:__

Happens very rarely, and only under heavy load with up to 48 threads (the machine has 48 cores).

Piece of code:

  q := client.Query{
    Command:  query,
    Database: ctx.IDBDB,
  }
  response, err := con.Query(q)
  FatalOnError(err)
  FatalOnError(response.Error()) // <--- this line fails with "engine is closed"
  return response.Results

__Expected behavior:__ No error

__Actual behavior:__ error

__Additional info:__ --- (was not able to find a way to reproduce, happens once per few days).

Labels: area/storage, revisit in the future

All 8 comments

It would be great to at least know what "engine is closed" means. What causes that error?

"engine is closed" usually means the system is just starting up or is shutting down. Can you check your logs for anything that suggests a startup or shutdown? Each shard in the system is considered its own "engine", and a large system can sometimes take a while to open all of its shards.

I'll check next time when it happens (it didn't happen for 5+ days already).
I'm 100% sure nothing is shutting down or starting - unless Influxd does it without my knowledge sometimes?
Where should I look?

I would say the log file will give the best insight. If the server crashed for whatever reason and was starting up again, that would be a good indicator. If "engine is closed" happens a bunch of times and then just stops happening, that would be another sign. Since you say it hasn't happened for 5 days, I'm guessing that means the server has been running stably for at least 5 days, and that's why you haven't encountered it.

Yes, it is very rare and only under heavy load.

The engine is closed error occurs when writes or queries run against a shard that is not open/ready. If you are getting this during a query, it's likely that the planning step picked an old shard and before the query ran on the shard, the retention service closed and started removing the shard.

Since this is marked as "revisit" and "more info", this might be helpful in the reproduction of the error or in cases this error occurs:
I was getting a lot of "engine is closed" errors during a scripted online restore process. In my case this happened because I ran a SELECT query immediately after the influxd restore command finished. So I gave InfluxDB a second (by sleeping in my program) before running my SELECT statement, and this fixed the "engine is closed" error for me.
My script works as follows:

  1. restore a portable backup to a temporary database
  2. select the data of temporary database into existing database
  3. drop temporary database

The error would always occur at step 2. So I put a one-second sleep between steps 1 and 2, and the error was gone.

I can reproduce these errors by concurrently removing independent continuous queries, measurements, and retention policies.

In my tests, I have multiple measurements, each with a separate retention policy (call each pair A). Each measurement has a continuous query which inserts into another measurement which has its own retention policy (call each triple B). For each group, I remove A (measurement, then RP) and B (CQ, measurement, then RP) concurrently. I get errors similar to these:

Statement error: shard 14: engine is closed
Statement error: shard 16: engine is closed

The error is returned when deleting the measurement in group B.
