While running InfluxDB 1.7.5, some time after startup while ingesting data, the influxd daemon stops responding. Writes via the HTTP endpoint time out, SELECT queries cannot be run, and in the influx CLI any command that performs reads (show measurements, show tag keys, etc.) hangs. There are no log messages, no CPU usage, and no memory exhaustion when this happens.
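For context, the writes in question are plain line-protocol POSTs to the 1.x /write endpoint; in this state even a trivial request like the following (database and measurement names are just placeholders) never returns:
# minimal write against the 1.x HTTP write API; db and measurement are placeholders
curl -i -XPOST 'http://localhost:8086/write?db=mydb' \
  --data-binary 'cpu,host=server01 value=0.64'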
Stopping influxdb leads to a hard shutdown:
Mar 29 07:30:21 influx1 systemd: Stopping InfluxDB is an open-source, distributed, time series database...
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179652Z lvl=info msg="Signal received, initializing clean shutdown..." log_id=0ESuf_v0000
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179766Z lvl=info msg="Waiting for clean shutdown..." log_id=0ESuf_v0000
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179892Z lvl=info msg="Listener closed" log_id=0ESuf_v0000 service=snapshot
Mar 29 07:30:51 influx1 influxd: ts=2019-03-29T07:30:51.179939Z lvl=info msg="Time limit reached, initializing hard shutdown" log_id=0ESuf_v0000
Mar 29 07:30:51 influx1 systemd: Stopped InfluxDB is an open-source, distributed, time series database.
Note that this instance is in the process of backfilling with out-of-order data. I've now downgraded to 1.7.4 and so far it has not hung.
I second this. When I upgraded Influx from 1.7.4-alpine to 1.7.5-alpine, all of my writes returned a 500. Additionally, I saw a bunch of these log messages while bringing things up and down a few times:
Mar 29 01:00:12 myhost docker[30915]: ts=2019-03-29T08:00:12.110995Z lvl=info msg="Write failed" log_id=0EU5vEvG000 service=write shard=748 error="store is closed"
I tried backing up my current Influx dir and starting fresh like I had never spun up Influx before. Even with that things broke in the same way. Downgrading to 1.7.4-alpine brings things back to a working state.
I had the same experience: a couple of minutes after upgrading from 1.7.4, queries started hanging and data stopped being written; basically the entire service seemed to be blocked. Even internal stats stopped being written. I rolled back to 1.7.4. This is the second upgrade since 1.7.0 that has failed me.
Ran into the same issue here. I spin up an InfluxDB instance, set up some CQs and RPs on the database, and everything seems to work fine until a new host writes data in (the tags contain host names). At that point everything hangs until I restart the service. Every new host hangs the system.
@RedShift1 @conet @jmurrayufo @aaronjwood could you please provide details of the index you're running, and if possible a stack dump of goroutines? You can SIGQUIT the process once it deadlocks.
I'm running the default index (I did not change the default configuration file). I will try to capture a stack dump of goroutines.
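For reference, a quick way to confirm which index a 1.7 instance is actually using (paths assume a stock package install; the 1.7 default is inmem unless index-version = "tsi1" is set under [data]):
# check the configured index type; no match means the inmem default is in effect
grep index-version /etc/influxdb/influxdb.conf
# or dump the effective configuration and look for the same key
influxd config | grep index-version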
Update from our side.
We think we know which change introduced this deadlock. The best course of action if you encounter it is to roll back to 1.7.4.
However, we would really appreciate seeing a stack trace from someone who is deadlocked on this issue. You can send the process a SIGQUIT via kill -s SIGQUIT <process_id> or, if influxd is running in the foreground, with Ctrl+\.
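For example, on a typical Linux package install (process and systemd unit names assumed), capturing the dump could look like this:
# send SIGQUIT to the deadlocked daemon; the goroutine dump is written to its stderr
kill -s SIGQUIT "$(pgrep influxd)"
# under systemd, stderr ends up in the journal, so the dump can be saved from there
journalctl -u influxdb --since "5 minutes ago" > influxdb-goroutines.log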
Here it is: influxdb-stacktace.log
@conet Thank you for the stack trace. Are you seeing any panics in your log before the SIGQUIT?
@benbjohnson I can't find any occurrences of the word panic in the logs preceding the deadlock (or anywhere in the logs). I can approximate the moment of the deadlock based on the fact that write requests start to fail with a status of 500.
I've come across what appears to be the same issue.
influx release: 1.7.5 data and meta
ts=2019-04-03T14:54:05.632738Z lvl=info msg="InfluxDB Meta starting" log_id=0E_vw1G0000 version=1.7.5-c1.7.5 branch=1.7 commit=dae1326b1dde5c36d1dfdb754787acbe8b7447f8 tags=unknown
...
ts=2019-04-03T14:54:07.875194Z lvl=info msg="InfluxDB starting" log_id=0E_vwA0l000 version=1.7.5-c1.7.5 branch=1.7 commit=dae1326b1dde5c36d1dfdb754787acbe8b7447f8
telegraf release: checked with both 1.9.4 and 1.10.2
Use case
I'm updating the installation script to use 1.7.5. It creates a test cluster for InfluxDB Enterprise using Docker instances and adds the rest of the TICK stack. The current script can be found here -
Local version attached.
setup_test_environment_enterprise_version.sh.txt
After seeing httpd respond with 204 to telegraf, I run the first basic Selenium test, connection_wizard.
Roughly every second run, at the end of the connection_wizard test, httpd starts responding with 500 to telegraf writes.
Other Chronograf tests using data explorer or dashboards cannot be executed because queries to influxdb fail.
Console logs (taken from docker) attached.
console-tele-1_9_4.log
console-tele-1_10_2.log
Screenshots attached.
I hit this when spinning up a new InfluxDB instance on a new machine using Docker: docker run ... influxdb. I managed to resolve it by explicitly specifying the version tag influxdb:1.7.4. However, anyone running the InfluxDB Docker container with the tag latest (or no tag), 1.7, or 1.7-alpine will be liable to run into this. You might want to consider changing these tags to refer to 1.7.4 rather than 1.7.5 until the issue is fixed.
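For anyone else hitting this via Docker, pinning the tag is a one-line change; something like the following (container name, port, and volume path are illustrative):
# run the known-good 1.7.4 image instead of latest / 1.7
docker run -d --name influxdb -p 8086:8086 \
  -v /srv/influxdb:/var/lib/influxdb \
  influxdb:1.7.4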
@conet We added what we believe is a fix (#13150). Would you be able to build a369cf413e0b80c4ded1eab51fbd27be3cec4dd1 and verify that it fixes the issue?
@benbjohnson if you give me an RPM I can test that; building from source... is too much to ask.
@conet No problem. I built an RPM for that commit and posted it here:
@benbjohnson I'd say the fix is good: it has been running for an hour now without any issues, whereas previously it would hang within the first couple of minutes. Nevertheless, I'll leave it running and come back with another report tomorrow.
@conet Great, thanks for taking a look so quickly. I appreciate it.
The fix is still good after more than 10 hours; please go ahead and release it.
Thanks @conet!
When is 1.7.6 going to be released? 1.7.5 is still advertised as the latest stable release, and issues about 1.7.5 keep piling up; some examples: #13256 and this comment.
At least remove 1.7.5 so that others will not be affected by this.
We just burned a lot of time debugging this in production. Please update your Dockerhub and Github releases with at least a disclaimer.
We did land a notification of this in the release notes here: https://docs.influxdata.com/influxdb/v1.7/about_the_project/releasenotes-changelog/
Happy to take feedback on where else you would like to find this notification if the docs location is not somewhere you visit.
We are experiencing an error ERR: max-concurrent-queries limit exceeded(20, 20) in a prototype situation where it is highly unlikely that there are 20 concurrent queries running.
We have also configured INFLUXDB_COORDINATOR_QUERY_TIMEOUT=5s
There is nothing showing in the logs.
Could this also be caused by this bug?
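For reference, that limit comes from the coordinator settings. With the official Docker image they can be overridden via env vars following the same section/name mapping as the variable above; INFLUXDB_COORDINATOR_MAX_CONCURRENT_QUERIES is the assumed mapping of max-concurrent-queries, and the values below are only illustrative (0 means unlimited):
# illustrative overrides for the coordinator limits on the official image
docker run -d \
  -e INFLUXDB_COORDINATOR_MAX_CONCURRENT_QUERIES=0 \
  -e INFLUXDB_COORDINATOR_QUERY_TIMEOUT=5s \
  influxdb:1.7.4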
@DerMika Yes, it is likely that the lock issue is causing queries to block and back up.
Thanks, I don't seem to have the problem after downgrading to 1.7.4.
Please update your Dockerhub and Github releases with at least a disclaimer.
@timhallinflux
Thanks @bndw. 1.7.6 is being published now.
https://docs.influxdata.com/influxdb/v1.7/about_the_project/releasenotes-changelog/
@timhallinflux is there an ETA for this landing in Dockerhub?
Should be there now. There is usually a <24 hour delay between creating the build and the releases appearing on Docker Hub.
I see random "Internal server errors" while ingesting data with 1.7.7. If I retry writing points, the operation succeeds. /var/log/influxdb/ is empty.