Jaeger: Flaky test: `TestReload`

Created on 9 Nov 2020  路  3Comments  路  Source: jaegertracing/jaeger

Describe the bug
https://travis-ci.com/github/jaegertracing/jaeger/jobs/428075939#L2104

=== RUN   TestReload
    cert_watcher_test.go:101: 
            Error Trace:    cert_watcher_test.go:101
            Error:          Not equal: 

To Reproduce
Steps to reproduce the behavior:

  1. Place the following code after cert_watcher_test.go:L88
    cert, err = tls.LoadX509KeyPair(filepath.Clean(clientCert), clientKey)
    assert.Equal(t, &cert, watcher.certificate())
  1. Run test.

Expected behavior
The test passes.

Note, the above reproduce step will always fail the test; the intention is to reproduce the symptoms and its purpose is to better illustrate the race condition. That is, that the assertion is made with only the client cert modification, without the private key modification.

Version (please complete the following information):

  • OS: MacOS
  • Jaeger version: master

What troubleshooting steps did you try?
Print the logObserver contents each time the cert and key files are modified.

Additional context
The root cause is similar to #2586 in that there is a race condition between when the private key file is update is written, to when the derived certificate is updated.

Specifically, in the failing travis job, the certificate is initially created with the server public/private key pair. The certificate is then modified with the client's public/private key pair, with an assertion that the modified certificate should resemble a pre-generated client certificate from the same client public+private key pair.

The race condition here is that the modified cert isn't yet updated with the client's private key, even though the key write completed. The reason why the assertion was made too early is because the waitUntil only checks for the existence of the certFile log entry when it should also be checking for _both_ the certFile and keyFile.

Proposed Fix
Modify the waitUntil block to:

    waitUntil(func() bool {
        return logObserver.FilterField(zap.String("certificate", certFile.Name())).Len() > 0 &&
            logObserver.FilterField(zap.String("certificate", keyFile.Name())).Len() > 0
    }, 100, time.Millisecond*200)

Validation
To prove this fix doesn't just pass due to timing out the waitUntil block (and hence slow down test execution time significantly), the test was run locally _without the fix_ which took 0.02s. _With the fix_, the execution time was 0.02-0.04s, which is well below the 20s timeout window.

bug needs-triage

All 3 comments

+1

I looked at this a few weeks ago (because my PR hit this issue), but I forgot that I had a local branch. I think the proposed fix reduces the flakyness significantly to close to zero, but the log message is still printed before the certificate is actually loaded.


To clean up the code completely I think a log statement should be create or moved until after the certificate is loaded:

--- a/pkg/config/tlscfg/cert_watcher.go
+++ b/pkg/config/tlscfg/cert_watcher.go
@@ -112,7 +112,12 @@ func (w *certWatcher) watchChangesLoop(rootCAs, clientCAs *x509.CertPool) {
                                w.mu.Unlock()
                                err = e
                        }
-                       if err != nil {
+                       if err == nil {
+                               w.logger.Info("Loaded modified certificate",
+                                       zap.String("certificate", event.Name),
+                                       zap.String("event", event.Op.String()))
+
+                       } else {
                                w.logger.Error("Failed to load certificate",
                                        zap.String("certificate", event.Name),
                                        zap.String("event", event.Op.String()),

and then use .FilterMessage("Loaded modified certificate").

But with the proposed fix the likelyhood seems very low, especially since the Sleep is so long. I tried locally with go test -test.count=5000 -test.run=TestReload$ ./pkg/config/tlscfg/ and a shorter sleep }, 100*200, time.Millisecond*1)), and only hit the issue once in 5 000 runs :smile: So I think the proposed fix will be enough in practice.

@chlunde thanks for calling this out and you're correct, the fix is working off event data that is logged before the related action takes place. I feel it's important to get this right, given that's our event signal. Your diff looks sensible to me, so I'll include it in the PR. 馃槃

Was this page helpful?
0 / 5 - 0 ratings