K6: stats/cloud: TestCloudCollector is flaky

Created on 28 Oct 2019 · 8Comments · Source: loadimpact/k6

Running the test:

$ cd  stats/cloud
$ go test -count=100 -race

Sample error output:

--- FAIL: TestCloudCollector (0.07s)
        Error Trace:    collector_test.go:58
                        server.go:2007
                        server.go:2387
                        server.go:2802
                        server.go:1890
                        asm_amd64.s:1357
        Error:          "[{Points http_req_li_all %!s(*cloud.SampleDataMap=&{{143841000 63707865704 0x22e8cc0} 0 0xc000225ca0 map[http_req_blocked:100.20751 http_req_connecting:200.41502 http_req_duration:1503.112655 http_req_receiving:601.245062 http_req_sending:400.830041 http_req_tls_handshaking:300.622531 http_req_waiting:501.037552 http_reqs:1]})} {Points http_req_li_all %!s(*cloud.SampleDataMap=&{{143841000 63707865704 0x22e8cc0} 0 0xc000225e00 map[http_req_blocked:239.951032 http_req_connecting:479.902064 http_req_duration:3599.265483 http_req_receiving:1439.706193 http_req_sending:959.804129 http_req_tls_handshaking:719.853096 http_req_waiting:1199.755161 http_reqs:1]})} {AggregatedPoints http_req_li_all %!s(*cloud.SampleDataAggregatedHTTPReqs=&{{145000000 63707865704 0x22e8cc0} aggregated_trend 24 0xc000225fc0 {{0 0 0 1510.304694 1574.301317 1545.3932262083333} {0 0 0 100.686979 104.953421 103.02621479166667} {0 0 0 201.373959 209.906842 206.05242991666668} {0 0 0 302.060939 314.860263 309.07864512500004} {0 0 0 402.747918 419.813684 412.1048602083333} {0 0 0 503.434898 524.767106 515.131075375} {0 0 0 604.121878 629.720527 618.157290625}}})}]" should have 2 item(s), but has 3
        Test:           TestCloudCollector

high prio tests

Source

cuonglm

Most helpful comment

OK, so I think I know what's going on :slightly_smiling_face: and thankfully, it doesn't seem like a bug, rather the opposite - the outlier detection might be working too well for our test case :relieved: and @cuonglm, you are partially right, the problem occurs because we have such a low value for aggregationMinSamples.

To demonstrate the issue, I set it to something even lower, aggregationMinSamples=10. And then added some code to print the generated similarTrail.Duration values after sorting them. This are the values from a test run that failed:

1.505784819s
1.510385610s
1.543288910s
1.546911213s
1.548892804s
1.559255172s
1.560175732s
1.564243347s
1.564885599s
1.573093152s

And the results from reqDurations.SelectGetNormalBounds() on that particular test run were minReqDur=1.515903443s and maxReqDur=1.589868208s. So, basically, the code considered any metric with duration value below 1.515903443s or above 1.589868208s to be an outlier, which in this case meant that the first two values of 1.505784819s and 1.510385610s were too low to be aggregated:

[
    {Points http_req_li_all %!s(*cloud.SampleDataMap=&{{807386000 63708124343 0x16d0bc0} 0 0xc000396720 map[http_req_blocked:100.692374 http_req_connecting:201.384748 http_req_duration:1510.38561 http_req_receiving:604.154244 http_req_sending:402.769496 http_req_tls_handshaking:302.077122 http_req_waiting:503.46187 http_reqs:1]})}

    {Points http_req_li_all %!s(*cloud.SampleDataMap=&{{807386000 63708124343 0x16d0bc0} 0 0xc0003968c0 map[http_req_blocked:100.385654 http_req_connecting:200.771309 http_req_duration:1505.784819 http_req_receiving:602.313928 http_req_sending:401.542618 http_req_tls_handshaking:301.156964 http_req_waiting:501.928273 http_reqs:1]})}

    {Points http_req_li_all %!s(*cloud.SampleDataMap=&{{807386000 63708124343 0x16d0bc0} 0 0xc000396a80 map[http_req_blocked:272.170007 http_req_connecting:544.340015 http_req_duration:4082.550117 http_req_receiving:1633.020047 http_req_sending:1088.680031 http_req_tls_handshaking:816.510023 http_req_waiting:1360.850039 http_reqs:1]})}

    {AggregatedPoints http_req_li_all %!s(*cloud.SampleDataAggregatedHTTPReqs=&{{805000000 63708124343 0x16d0bc0} aggregated_trend 8 0xc000396c20 {{0 0 0 1543.28891 1573.093152 1557.593241125} {0 0 0 102.885927 104.872876 103.839548875} {0 0 0 205.771854 209.745753 207.679098375} {0 0 0 308.657782 314.61863 311.518648} {0 0 0 411.543709 419.491507 415.3581975} {0 0 0 514.429637 524.364384 519.197747} {0 0 0 617.315564 629.237261 623.037296625}}})}
]

should have 2 item(s), but has 4

And this intuitively makes some sense. Look at how the metrics are grouped, there's really a cluster around 1.54-1.56. It's very visually apparent when you see the plot, number line, or especially the box-and-whisker chart in WolframAlpha.

So, k6 aggregation seems to be working perfectly - it found the outliers... :sweat_smile: The problem is that when we generate random values, if we have too few of them, there's bound to be a gap/clustering somewhere in the data some of the time, and that means outliers... So, I can think of 2 different solutions:

Just increase the aggregationMinSamples to something like 50 or 60. This is still below the aggregationOutlierAlgoThreshold, so the sorting algorithm and interpolation would still be used (i.e. the test would be functionally identical), it's just going to be much more unlikely that there will be such big gaps and clusters in the metrics.
Reduce the smallSkew to something even smaller or generate more tightly grouped values in some other way, and at the same time, change the outlier detection algorithms so that SortGetNormalBounds() and SelectGetNormalBounds() have a minimum threshold for their returned min and max values of something like 1ms. That is, however tightly grouped the metrics are, we always consider values from 0.5ms before and 0.5ms after the median to be aggregatable. And mind you, even that might not be enough to fully fix the test...
Something else? I can't think of anything else sensible, but I might be missing something obvious.

I personally prefer the first approach, because it simpler and sufficient, but mostly because in the second one we're kind of modifying the aggregation implementation to fix its test... In the real-world, response times are very rarely so closely grouped together that there's a sub-millisecond grouping... So it doesn't make sense to add the additional complexity and cost of the min threshold checks.

So yeah, I'll make a PR in a bit that just does 1., and also adds a test case that would also test SelectGetNormalBounds().

na-- on 1 Nov 2019

🎉2

All 8 comments

I think this can wait until we improve the cloud aggregation, so 0.27.0 - it's not urgent to fix this specific test. I'll link to this issue when I get around to writing the one for the aggregation improvements

na-- on 28 Oct 2019

👍1

I take my words back - this test has been acting up a lot now, when it wasn't a problem before... So it's both annoying and it might be a signal that there's an actual issue that was recently introduced, so it's worth investigating and fixing it before 0.26.0 is released

na-- on 30 Oct 2019

@na-- After investigating, I guess the problem comes from aggregationMinSamples, which default to 25. Because of skewTrail uses rand.Float64(), sometimes, under heavy load or when rand.Float64() goes bad, it produces samples less/greater than aggregationMinSamples, causing aggregateHTTPTrails produces random output.

If I increase aggregationMinSamples to 50, then:

go test -count=500 -race

never fails.

If it makes sense to you, I will raise the PR.

cc @MStoykov @imiric

cuonglm on 31 Oct 2019

Sorry, it doesn't make sense to me, I'm particularly confused how rand.Float64() can "go bad" :confused:

na-- on 31 Oct 2019

@na-- I'm not sure, but the skewTrail sometimes produces bad trail, make it becomes an outlier. I did:

for _, trail := range httpTrails {
    if trail.ConnDuration < minConnDur ||
        trail.ConnDuration > maxConnDur ||
        trail.Duration < minReqDur ||
        trail.Duration > maxReqDur {
        // Seems like an outlier, add it as a standalone metric
        println("HERE")
        newSamples = append(newSamples, NewSampleFromTrail(trail))
    } else {
        // Aggregate the trail
        aggrData.Add(trail)
    }
}

When test fails, HERE is printed twice, while it's printed only once in normal.

cuonglm on 31 Oct 2019

1.505784819s
1.510385610s
1.543288910s
1.546911213s
1.548892804s
1.559255172s
1.560175732s
1.564243347s
1.564885599s
1.573093152s

[
    {Points http_req_li_all %!s(*cloud.SampleDataMap=&{{807386000 63708124343 0x16d0bc0} 0 0xc000396720 map[http_req_blocked:100.692374 http_req_connecting:201.384748 http_req_duration:1510.38561 http_req_receiving:604.154244 http_req_sending:402.769496 http_req_tls_handshaking:302.077122 http_req_waiting:503.46187 http_reqs:1]})}

    {Points http_req_li_all %!s(*cloud.SampleDataMap=&{{807386000 63708124343 0x16d0bc0} 0 0xc0003968c0 map[http_req_blocked:100.385654 http_req_connecting:200.771309 http_req_duration:1505.784819 http_req_receiving:602.313928 http_req_sending:401.542618 http_req_tls_handshaking:301.156964 http_req_waiting:501.928273 http_reqs:1]})}

    {Points http_req_li_all %!s(*cloud.SampleDataMap=&{{807386000 63708124343 0x16d0bc0} 0 0xc000396a80 map[http_req_blocked:272.170007 http_req_connecting:544.340015 http_req_duration:4082.550117 http_req_receiving:1633.020047 http_req_sending:1088.680031 http_req_tls_handshaking:816.510023 http_req_waiting:1360.850039 http_reqs:1]})}

    {AggregatedPoints http_req_li_all %!s(*cloud.SampleDataAggregatedHTTPReqs=&{{805000000 63708124343 0x16d0bc0} aggregated_trend 8 0xc000396c20 {{0 0 0 1543.28891 1573.093152 1557.593241125} {0 0 0 102.885927 104.872876 103.839548875} {0 0 0 205.771854 209.745753 207.679098375} {0 0 0 308.657782 314.61863 311.518648} {0 0 0 411.543709 419.491507 415.3581975} {0 0 0 514.429637 524.364384 519.197747} {0 0 0 617.315564 629.237261 623.037296625}}})}
]

should have 2 item(s), but has 4

Just increase the aggregationMinSamples to something like 50 or 60. This is still below the aggregationOutlierAlgoThreshold, so the sorting algorithm and interpolation would still be used (i.e. the test would be functionally identical), it's just going to be much more unlikely that there will be such big gaps and clusters in the metrics.
Reduce the smallSkew to something even smaller or generate more tightly grouped values in some other way, and at the same time, change the outlier detection algorithms so that SortGetNormalBounds() and SelectGetNormalBounds() have a minimum threshold for their returned min and max values of something like 1ms. That is, however tightly grouped the metrics are, we always consider values from 0.5ms before and 0.5ms after the median to be aggregatable. And mind you, even that might not be enough to fully fix the test...
Something else? I can't think of anything else sensible, but I might be missing something obvious.

So yeah, I'll make a PR in a bit that just does 1., and also adds a test case that would also test SelectGetNormalBounds().

na-- on 1 Nov 2019

🎉2

Flaky still happens https://circleci.com/gh/loadimpact/k6/7827

cuonglm on 7 Nov 2019