While running InfluxDB 1.7.5, some time after startup while ingesting data, the influxd daemon stops responding. Writes via the HTTP endpoint time out, SELECT queries cannot be run, and in the influx CLI any command that performs reads (show measurements, show tag keys, etc.) hangs. There are no log messages, no CPU usage, and no memory exhaustion when this happens.
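For context, the writes in question are plain line-protocol POSTs to the 1.x /write endpoint; in this state even a trivial request like the following (database and measurement names are just placeholders) never returns:
# minimal write against the 1.x HTTP write API; db and measurement are placeholders
curl -i -XPOST 'http://localhost:8086/write?db=mydb' \
  --data-binary 'cpu,host=server01 value=0.64'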
Stopping influxdb leads to a hard shutdown:
Mar 29 07:30:21 influx1 systemd: Stopping InfluxDB is an open-source, distributed, time series database...
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179652Z lvl=info msg="Signal received, initializing clean shutdown..." log_id=0ESuf_v0000
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179766Z lvl=info msg="Waiting for clean shutdown..." log_id=0ESuf_v0000
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179892Z lvl=info msg="Listener closed" log_id=0ESuf_v0000 service=snapshot
Mar 29 07:30:51 influx1 influxd: ts=2019-03-29T07:30:51.179939Z lvl=info msg="Time limit reached, initializing hard shutdown" log_id=0ESuf_v0000
Mar 29 07:30:51 influx1 systemd: Stopped InfluxDB is an open-source, distributed, time series database.
Note that this instance is in the process of backfilling with out-of-order data. I've now downgraded to 1.7.4 and so far it has not hung.
I second this. When I upgraded Influx from 1.7.4-alpine to 1.7.5-alpine, all of my writes returned a 500. Additionally, I saw a bunch of these log messages while bringing things up and down a few times:
Mar 29 01:00:12 myhost docker[30915]: ts=2019-03-29T08:00:12.110995Z lvl=info msg="Write failed" log_id=0EU5vEvG000 service=write shard=748 error="store is closed"
I tried backing up my current Influx dir and starting fresh like I had never spun up Influx before. Even with that things broke in the same way. Downgrading to 1.7.4-alpine brings things back to a working state.
I had the same experience: a couple of minutes after upgrading from 1.7.4, queries started hanging and data stopped being written; basically the entire service seemed to be blocked. Even internal stats stopped being written. I rolled back to 1.7.4. This is the second upgrade since 1.7.0 that has failed me.
Ran into the same issue here. I spin up an InfluxDB instance, set up some CQs and RPs on the database, and everything seems to work fine until a new host writes data in (the tags contain host names). At that point everything hangs until I restart the service. Every new host hangs the system.
@RedShift1 @conet @jmurrayufo @aaronjwood could you please provide details of the index you're running, and if possible a stack dump of goroutines? You can SIGQUIT the process once it deadlocks.
I'm running the default index (I did not change the default configuration file). I will try to capture a stack dump of goroutines.
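For reference, a quick way to confirm which index a 1.7 instance is actually using (paths assume a stock package install; the 1.7 default is inmem unless index-version = "tsi1" is set under [data]):
# check the configured index type; no match means the inmem default is in effect
grep index-version /etc/influxdb/influxdb.conf
# or dump the effective configuration and look for the same key
influxd config | grep index-version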
Update from our side.
We think we know which change introduced this deadlock. The best course of action if you encounter it is to roll back to 1.7.4.
However, we would really appreciate seeing a stack trace from someone who is deadlocked on this issue. You can send the process a SIGQUIT via kill -s SIGQUIT <process_id> or, if influxd is running in the foreground, with Ctrl+\.
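For example, on a typical Linux package install (process and systemd unit names assumed), capturing the dump could look like this:
# send SIGQUIT to the deadlocked daemon; the goroutine dump is written to its stderr
kill -s SIGQUIT "$(pgrep influxd)"
# under systemd, stderr ends up in the journal, so the dump can be saved from there
journalctl -u influxdb --since "5 minutes ago" > influxdb-goroutines.log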
Here it is: influxdb-stacktace.log
@conet Thank you for the stack trace. Are you seeing any panics in your log before the SIGQUIT?
@benbjohnson I can't find any occurrences of the word panic in the logs preceding the deadlock (or anywhere in the logs). I can approximate the moment of the deadlock based on the fact that write requests start to fail with a status of 500.
I've come across what appears to be the same issue.
influx release: 1.7.5 data and meta
ts=2019-04-03T14:54:05.632738Z lvl=info msg="InfluxDB Meta starting" log_id=0E_vw1G0000 version=1.7.5-c1.7.5 branch=1.7 commit=dae1326b1dde5c36d1dfdb754787acbe8b7447f8 tags=unknown
...
ts=2019-04-03T14:54:07.875194Z lvl=info msg="InfluxDB starting" log_id=0E_vwA0l000 version=1.7.5-c1.7.5 branch=1.7 commit=dae1326b1dde5c36d1dfdb754787acbe8b7447f8
telegraf release: checked with both 1.9.4 and 1.10.2
Use case
I'm updating the installation script to use 1.7.5. It creates a test cluster for InfluxDB Enterprise using Docker instances and adds the rest of the TICK stack. The current script can be found here -
Local version attached.
setup_test_environment_enterprise_version.sh.txt
After seeing httpd respond with 204 to telegraf, I run the first basic Selenium test, connection_wizard.
Roughly every second run, at the end of the connection_wizard test, httpd starts responding with 500 to telegraf writes.
Other Chronograf tests using data explorer or dashboards cannot be executed because queries to influxdb fail.
Console logs (taken from docker) attached.
console-tele-1_9_4.log
console-tele-1_10_2.log
Screenshots attached.
I hit this when spinning up a new InfluxDB instance on a new machine using Docker: docker run ... influxdb. I managed to resolve it by explicitly specifying the version tag influxdb:1.7.4. However, anyone running the InfluxDB Docker container with the tag latest (or no tag), 1.7, or 1.7-alpine will be liable to run into this. You might want to consider changing these tags to refer to 1.7.4 rather than 1.7.5 until the issue is fixed.
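For anyone else hitting this via Docker, pinning the tag is a one-line change; something like the following (container name, port, and volume path are illustrative):
# run the known-good 1.7.4 image instead of latest / 1.7
docker run -d --name influxdb -p 8086:8086 \
  -v /srv/influxdb:/var/lib/influxdb \
  influxdb:1.7.4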
@conet We added what we believe is a fix (#13150). Would you be able to build a369cf413e0b80c4ded1eab51fbd27be3cec4dd1 and verify that it fixes the issue?
@benbjohnson if you give me an RPM I can test that; building from source... is too much to ask.
@conet No problem. I built an RPM for that commit and posted it here:
@benbjohnson I'd say the fix is good: it has been running for an hour now without any issues, whereas previously it would hang within the first couple of minutes. Nevertheless, I'll leave it running and come back with another report tomorrow.
@conet Great, thanks for taking a look so quickly. I appreciate it.
The fix is still good after more than 10 hours; please go ahead and release it.
Thanks @conet!
When is 1.7.6 going to be released? 1.7.5 is still advertised as the latest stable release, and issues about 1.7.5 keep piling up; some examples: #13256 and this comment.
At least remove 1.7.5 so that others will not be affected by this.
We just burned a lot of time debugging this in production. Please update your Dockerhub and Github releases with at least a disclaimer.
We did land a notification of this in the release notes here: https://docs.influxdata.com/influxdb/v1.7/about_the_project/releasenotes-changelog/
Happy to take feedback on where else you would like to find this notification if the docs location is not somewhere you visit.
We are experiencing an error ERR: max-concurrent-queries limit exceeded(20, 20) in a prototype situation where it is highly unlikely that there are 20 concurrent queries running.
We have also configured INFLUXDB_COORDINATOR_QUERY_TIMEOUT=5s
There is nothing showing in the logs.
Could this also be caused by this bug?
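For reference, that limit comes from the coordinator settings. With the official Docker image they can be overridden via env vars following the same section/name mapping as the variable above; INFLUXDB_COORDINATOR_MAX_CONCURRENT_QUERIES is the assumed mapping of max-concurrent-queries, and the values below are only illustrative (0 means unlimited):
# illustrative overrides for the coordinator limits on the official image
docker run -d \
  -e INFLUXDB_COORDINATOR_MAX_CONCURRENT_QUERIES=0 \
  -e INFLUXDB_COORDINATOR_QUERY_TIMEOUT=5s \
  influxdb:1.7.4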
@DerMika Yes, it is likely that the lock issue is causing queries to block and back up.
Thanks, I don't seem to have the problem after downgrading to 1.7.4.
Please update your Dockerhub and Github releases with at least a disclaimer.
@timhallinflux
Thanks @bndw. 1.7.6 is being published now.
https://docs.influxdata.com/influxdb/v1.7/about_the_project/releasenotes-changelog/
@timhallinflux is there an ETA for this landing in Dockerhub?
Should be there now. There is usually a <24 hour delay between creating the build and the releases appearing on Docker Hub.
I see random "Internal server errors" while ingesting data with 1.7.7. If I retry writing points, the operation succeeds. /var/log/influxdb/ is empty.