Influxdb: revert to go1.4.2 (and wait for go1.6+)

Created on 23 Dec 2015 · 14 Comments · Source: influxdata/influxdb

The recent shift to go1.5.2 has resulted in some drastic changes to overall load, throughput and utilization in our cluster. I know that we recently went back and forth between versions, finally settling on go1.5.2, but it looks like there may be some performance regressions that are difficult to overlook.

As an exercise, we compiled the same version of the influxdb client using each of go1.4.2, go1.5.2 and go1.6beta1 to varying degrees of 'success'. In terms of overall performance, the ranking shakes out as follows:

  1. go1.4.2
  2. go1.6beta1
  3. go1.5.2

Image at bottom of Issue Tracker...

Overall, we've noticed a dramatic increase in CPU utilization and system load when using go1.5.2. In our case we're lucky to have very powerful hardware to use, but handling a sustained load of ~40-50 is a lot, especially given that the same binary built with go1.4.2 hovers at ~5.

In addition, pointsRx variance drops by an order of magnitude going from go1.5.2 to go1.4.2/go1.6beta1 with a much smaller standard deviation over time. It seems as if the binary's ability to ingest metrics from the UDP sockets is drastically improved in both versions over go1.5.2.

RSS does stay relatively consistent over time across all versions, but we have not yet run with GODEBUG=gctrace=1,schedtrace=10000 on each build to truly see how things are working under the covers (that's next).

The good news on the query front is that all queries stay relatively consistent in terms of return times, averaging about 700ms for the image below (all versions of go yield a similar result). This means the changes largely impact the metrics ingestion portion of influxdb, and my strong suspicion, without any data (yet), is that it comes down to GC.

If you'd like any logs, let me know.

[Image: golang_comparison]

All 14 comments

FWIW, go 1.6-beta showed increased CPU time when run for a period of time, so we reverted our cluster back to 1.4.2, which has been stable and consistent under load for many days. I'm sure somebody with experience could use this as a useful test case for Go 1.6 beta testing, to try to get InfluxDB performance back closer to what 1.4.2 achieves (see https://groups.google.com/forum/#!topic/golang-nuts/24zV9JeBoEE), but sadly I don't have that expertise.

It's ironic that we see less jitter in response times (in and out) with the pre-new-GC Go, but we do! The cost of the higher CPU usage (an order of magnitude for us) far outstrips the benefit of reduced GC sweep times. For now our suggestion is to stick with 1.4.2 until somebody has time to dig into the regression revealed by 1.6 (skipping 1.5 is a no-brainer).

If anybody has any suggestions for environment vars, or output that we could provide, we can double-write to an identical machine running 1.6 fairly easily.

This is going to be resolved by #5331, so I'm closing this out. Thanks for all the benchmarks!

The 1.6 vs 1.4 perf issue is being looked at here: https://github.com/golang/go/issues/14189

@sebito91 If you get a chance, I'd be curious to see how things look with Go 1.4.3 and 1.6rc1 if you use GOGC=1600. We're tracking the issue above, as @jwilder mentioned, but it seems that the GC tuning improves performance on both versions in our internal testing.
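For context, GOGC sets the heap growth percentage that triggers the next collection (the default is 100), so GOGC=1600 makes the collector run far less often at the cost of a larger heap. The same setting can be applied from code with `debug.SetGCPercent`. The sketch below only illustrates that call; the `setGCPercent` wrapper is a hypothetical helper, and nothing here implies influxdb sets this value itself.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// setGCPercent applies a GC target percentage, the in-process equivalent
// of the GOGC environment variable, and returns the previous setting.
func setGCPercent(percent int) (previous int) {
	return debug.SetGCPercent(percent)
}

func main() {
	// 1600 mirrors the GOGC=1600 value suggested in the thread.
	prev := setGCPercent(1600)
	fmt.Printf("GC percent changed from %d to 1600\n", prev)
}
```

Setting the environment variable at launch, as suggested above, needs no rebuild and is the simpler option for a one-off comparison.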

@sebito91 @daviesalex We've been doing a lot of work to reduce allocations, and we're now seeing better performance on Go 1.6.2 than Go 1.4.3 for some synthetic tests. Would you guys be able to test Go 1.6.2 with master and let us know what performance looks like for you? Thanks!

Sure thing, we'll take a look now with current HEAD.

FYI, current HEAD may not be as GC heavy as before, since #5522 removes many pointers from the tsm1 buffer.

Each entry in tsm1 buffer had:

  1. time.Time which has *time.Location.
  2. Value interface.

I removed 1, but 2 remains, so the number of pointers in the tsm1 buffer is half of what it was before.

Confirmed that go1.6.2 is looking very, very good! As @methane pointed out, the GC numbers have virtually dropped to 0, which is legendary...no data loss, full capabilities to date. Great work!

Attached a screenshot of our setup, disregard that small window (admin work on the machine, unrelated to influxdb)

[Image: today]

@sebito91 Is that with any changes to GOGC or just default?

@sebito91 Do the Go 1.4.3 and Go 1.6.2 builds use the same version of influxdb?

@methane We removed many other uses of time.Time and have been aggressively removing allocations and pointers in the code since that change.

@jwilder no changes to GOGC whatsoever, purely the stock implementation. We built from HEAD, commit 1d9919a, using go1.6.2 instead of the standard go1.4.3 and dropped the binaries into place. Pretty sweet changes to be honest!

@methane, yes they are the same stock version but the more recent data is built from HEAD vs stock rpm.

@toddboom, would be awesome to hear your plans for upgrading to go1.6.2 now that things seem to have improved. Any thoughts on rough timing?

@sebito91 We just released v0.13.0 today, and it's all on Go 1.6.2 now. Thanks for all your help!
