InfluxDB: tsm1_compact_group error - Wait exceeds limiter's burst

Created on 21 May 2018 · 11 comments · Source: influxdata/influxdb

May 21 21:49:23 <hostname> influxd[28202]: ts=2018-05-21T21:49:23.178413Z lvl=info msg="Error compacting TSM files" log_id=08DAIK40000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=08DAh3CG000 op_name=tsm1_compact_group error="rate: Wait(n=116648469) exceeds limiter's burst 50331648"

Trying to diagnose high memory/cpu use on a server I recently upgraded to 1.5. Any pointers on what this error message indicates? Did not find any previous issues on this topic from searching.

1.x


All 11 comments

A couple of addenda to this:

  • I traced the error to a call to s.compactor.CompactFull(group), where s is a *compactionStrategy and group is s.group, in func (s *compactionStrategy) compactGroup() in tsdb/engine/tsm1/engine.go - this is the only match for the string "Error compacting" in the v1.5.2 source code. However, I can't identify where the "rate: Wait(n=115868815) exceeds limiter's burst 50331648" part of the error message comes from.

  • I ended up deleting and recreating the database that was the subject of these errors. The error messages subsided for a while, but now they're back.

  • Is CPU usage expected to be much higher for 1.5 vs 1.4? Perhaps I'm trying to diagnose something that is expected behavior. My 1.5 install is using 20x the CPU a 1.4 install was using for the same workload. Memory usage is up significantly as well, although I haven't measured it.

additional steps I have tried:

  • upgrading tsm to tsi (using the migration tool), and setting the config file to use tsi
  • dropping the database that was the subject of this error (again)
  • pointing influx to a new data and meta/wal dir and starting fresh (with tsi enabled)

Each time, the error comes back after about 45 minutes of operation.

Investigating how to downgrade, but previous versions do not seem to be available via the PPA.

Hitting a similar issue in our 1.5.3 database.
What we did:

  • Upgraded 1.3.9 to 1.5.3
  • During the upgrade, moved from TSM to TSI (following the documented procedures)

Expected behavior:

  • The database works happily ever after.

What actually happens:

  • The database fails while compacting some shards with the error message "Wait(n=xxxxx) exceeds limiter's burst", then retries compacting these shards in an (apparently) infinite loop, using all the IO on the machine. This happens with several shards, so if we limit the number of simultaneous compactions to 2, it will continuously try to compact 2 shards.
  • The database server is unusable due to the aforementioned behavior.

Questions:

  • Any idea how to change this limiter's burst? We can't find it in the configuration.

Workaround (just for our case) and thoughts:

  • The affected shards contained old data. We applied a retention policy of 1 year and were lucky that the shards were deleted (no data means no compaction).
  • We do not dare to migrate our production system. If this happens when the data is current, there is neither a parameter to adjust nor a reliable workaround for the issue.

Breakthrough! I found where the limiter rate is set in the code!

in tsdb/store.go:

    // Env var to disable throughput limiter.  This will be moved to a config option in 1.5.
    if os.Getenv("INFLUXDB_DATA_COMPACTION_THROUGHPUT") == "" {
        s.EngineOptions.CompactionThroughputLimiter = limiter.NewRate(48*1024*1024, 48*1024*1024)
    } else {
        s.Logger.Info("Compaction throughput limit disabled")
    }

Guess they did not get around to putting this in the config options? haha.

I will play around with changing this number and building to see what happens.

Ok, so I added the following line to /etc/default/influxdb:

INFLUXDB_DATA_COMPACTION_THROUGHPUT=536870912

The original limit was 48 * 1024 * 1024 (which is why searching the source for the literal number shown in the error message turns up nothing), so I set it to 512 * 1024 * 1024, but I think setting it to anything at all just disables the entire limit. Which is what we want, since the limit broke influx for us.

Restarted the server. Got this log message:

Jun 22 15:40:15 <host> influxd[20820]: ts=2018-06-22T15:40:15.736564Z lvl=info msg="Compaction throughput limit disabled" log_id=08r2IP_l000 service=store

influx then proceeded to run through the huge backlog of compactions that this bug had been blocking for a long time. Now everything appears to be back to normal - my server is usable again!

Would have been nice if someone who knows the codebase could have chimed in, say, a month ago, when I first posted this issue!

I observed the same problem on 1.5.3 and 1.5.4. (I'm running InfluxDB on Kubernetes)

It was simple enough to define the environment variable INFLUXDB_DATA_COMPACTION_THROUGHPUT and set it to 536870912, though as noted, the source code indicates the actual value doesn't matter.

User beware: it seems this mystery setting has migrated to the influxdb.conf file in version 1.7, leaving you vulnerable to inexplicably terrible performance once again.

Default settings:

compact-throughput = "48m"
compact-throughput-burst = "48m"

(Note: regarding "exceeds limiter's burst 50331648" -- 1024 * 1024 * 48 == 50331648)

What I changed the settings to on a machine with 16 GB of RAM:

compact-throughput = "512m"
compact-throughput-burst = "1024m"

The initial compaction backlog pushed the system to its limits but made it through.

To reiterate: as of 1.7 it appears that setting INFLUXDB_DATA_COMPACTION_THROUGHPUT will no longer protect you from "exceeds limiter's burst" misery. Instead, change the new, badly chosen defaults in influxdb.conf.
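
For reference, a rough sketch of where these keys live in influxdb.conf; in the stock 1.7 config they sit under the [data] section, and the values below are simply what worked on the 16 GB machine above, not recommended defaults:

    [data]
      compact-throughput = "512m"
      compact-throughput-burst = "1024m"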

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

We have pinned this bug down to a quirk in the Go rate package (golang.org/x/time/rate). Specifically, rate.Limiter.WaitN(ctx, n) returns an error if n exceeds the Limiter's burst size.

InfluxDB uses WaitN() to rate limit writes in an io.Writer wrapper, passing the length of the byte slice for n. So when a TSM compaction writes a file larger than compact-throughput-burst, the write fails rather than being throttled. This causes the entire compaction to unwind, retry, and so on.

Notice that the log in the original comment says "exceeds limiter's burst 50331648", and that 50331648 == 1024 * 1024 * 48, the default value for compact-throughput-burst.
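
For illustration, here is a minimal sketch that reproduces the failure using golang.org/x/time/rate directly; the limitedWriter type is a simplified stand-in for InfluxDB's internal limiter wrapper, not the actual implementation:

    package main

    import (
        "context"
        "fmt"
        "io"
        "io/ioutil"

        "golang.org/x/time/rate"
    )

    // limitedWriter waits for len(p) tokens from the rate limiter before each
    // write -- a simplified stand-in for the wrapper used on compaction writes.
    type limitedWriter struct {
        w   io.Writer
        lim *rate.Limiter
    }

    func (lw *limitedWriter) Write(p []byte) (int, error) {
        // WaitN returns an error immediately when len(p) exceeds the burst
        // size, so an oversized write fails instead of being throttled.
        if err := lw.lim.WaitN(context.Background(), len(p)); err != nil {
            return 0, err
        }
        return lw.w.Write(p)
    }

    func main() {
        // Default compaction limit: 48 MB/s with a 48 MB burst, mirroring
        // limiter.NewRate(48*1024*1024, 48*1024*1024) in tsdb/store.go.
        lw := &limitedWriter{
            w:   ioutil.Discard,
            lim: rate.NewLimiter(48*1024*1024, 48*1024*1024),
        }

        // A single ~111 MB write exceeds the 48 MB burst and fails outright.
        _, err := lw.Write(make([]byte, 116648469))
        fmt.Println(err) // rate: Wait(n=116648469) exceeds limiter's burst 50331648
    }

Any single write larger than the burst hits this path, which is why raising the burst or disabling the limiter makes the error go away.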

Fixed in #14985, backported to 1.7 branch in #14989
