Vault: Potential for an open connection leak

Created on 1 Apr 2020 · 17 Comments · Source: hashicorp/vault

I was trying to analyze memory usage of my application when I noticed the following pprof output:

list RawRequestWithContext
Total: 180.79MB
ROUTINE ======================== github.com/hashicorp/vault/api.(*Client).RawRequestWithContext in github.com/hashicorp/vault/api/client.go
    1.12MB    42.63MB (flat, cum) 23.58% of Total
         .          .    795:   }
         .          .    796:
         .          .    797:   var result *Response
         .          .    798:   resp, err := client.Do(req)
         .          .    799:   if resp != nil {
         .     4.45MB    800:       result = &Response{Response: resp}
         .          .    801:   }
         .          .    802:   if err != nil {
         .          .    803:       if strings.Contains(err.Error(), "tls: oversized") {
         .          .    804:           err = errwrap.Wrapf(
         .          .    805:               "{{err}}\n\n"+
         .          .    806:                   "This error usually means that the server is running with TLS disabled\n"+
         .          .    807:                   "but the client is configured to use TLS. Please either enable TLS\n"+
         .          .    808:                   "on the server or run the client with -address set to an address\n"+
         .          .    809:                   "that uses the http protocol:\n\n"+
         .          .    810:                   "    vault <command> -address http://<address>\n\n"+
         .          .    811:                   "You can also set the VAULT_ADDR environment variable:\n\n\n"+
         .          .    812:                   "    VAULT_ADDR=http://<address> vault <command>\n\n"+
         .          .    813:                   "where <address> is replaced by the actual address to the server.",
         .          .    814:               err)
         .          .    815:       }
         .          .    816:       return result, err
         .          .    817:   }
         .          .    818:
         .          .    819:   // Check for a redirect, only allowing for a single redirect
         .          .    820:   if (resp.StatusCode == 301 || resp.StatusCode == 302 || resp.StatusCode == 307) && redirectCount == 0 {
         .     4.32MB    821:       // Parse the updated location
         .          .    822:       respLoc, err := resp.Location()
         .     3.43MB    823:       if err != nil {
         .          .    824:           return result, err
         .          .    825:       }
         .          .    826:
         .          .    827:       // Ensure a protocol downgrade doesn't happen
         .          .    828:       if req.URL.Scheme == "https" && respLoc.Scheme != "https" {
         .          .    829:           return result, fmt.Errorf("redirect would cause protocol downgrade")
         .          .    830:       }
         .          .    831:
         .          .    832:       // Update the request
         .          .    833:       r.URL = respLoc
         .          .    834:
         .          .    835:       // Reset the request body if any
         .          .    836:       if err := r.ResetJSONBody(); err != nil {
         .          .    837:           return result, err
         .          .    838:       }
         .          .    839:
    1.12MB     1.12MB    840:       // Retry the request
         .          .    841:       redirectCount++
         .          .    842:       goto START
         .          .    843:   }
         .    29.31MB    844:
         .          .    845:   if err := result.Error(); err != nil {
         .          .    846:       return result, err
         .          .    847:   }
         .          .    848:
         .          .    849:   return result, nil

I am trying to understand why 29.31MB was accumulated at line 844. My hypothesis is that at line 846 (https://github.com/hashicorp/vault/blob/master/api/client.go#L846) in RawRequestWithContext, a response object is created whose body is supposed to be closed by the caller of this function.

However, consider the case of a redirect, where the statement goto START at line 890 is executed. Line 846 will then be executed again, creating another response object. The previously created response object's body is never closed, causing a connection and memory leak on every redirect.

Labels: bug, core/client

All 17 comments

Hi! Thanks for the report! Can you provide further steps to reproduce this, such as what Vault version you're on, what API calls you're making, and any other information that might help us reproduce this locally on our own machines? Also, are you seeing ongoing growth in memory usage, or does memory tend to stabilize?

I am observing the same memory leak on my Vault 1.4.0 server running on Kubernetes.

You can see the memory usage graph below:

(screenshot: memory usage graph)

It might be useful to note that this replica became the active server some time on April 22.

I have a bunch of secret engines mounted:

  • GCP
  • AWS
  • KV
  • Postgres

And the following auth mounts:

  • Kubernetes
  • LDAP

I am also having Prometheus constantly scrape Vault server for metrics.

Is there anything more I can do to provide more information?

The pod will probably OOM soon and be killed by Kubernetes.

@lawliet89 You can run Vault debug (https://www.vaultproject.io/docs/commands/debug) which will, among other things, capture heap usage stats. We can help analyze these logs (they may be emailed or otherwise shared outside of GitHub).

Thanks @kalafut

May I know who I should email the logs to?

@lawliet89 You can send them to me (email is in Vault's commits).

@lawliet89 Thanks for sending those. There is almost 1GB in use by the write buffer for GCS storage (in a function in the GCS sdk). I'm not familiar enough with the GCS storage details to know why, however. I do wonder what effect (if any) https://www.vaultproject.io/docs/configuration/storage/google-cloud-storage#chunk_size would have.

(screenshot: pprof heap profile showing GCS write buffer usage)

@kalafut

I haven't really looked into it, but maybe this can be fixed by bumping the Google Cloud SDK.

An update was recently reverted in https://github.com/hashicorp/vault/commit/ea1564b3badd27f4e02cfd870d9759be903b91de due to https://github.com/etcd-io/etcd/issues/11154, whose fix should be released shortly.

I will try halving the chunk size to see if it makes any difference.

FWIW, I'm in the process of migrating to Raft storage. This might not be a fix for future readers, however.

(screenshot: memory usage graph)

Maybe this is too early to say, but halving the chunk size seems to have fixed the memory leak issue. I will report back if this is not the case next week.

EDIT: One whole week of data. Looks good.

(screenshot: one week of memory usage data)

Thanks, that is a useful datapoint.

I may also be seeing similar memory usage growth, though we've been running into memory exhaustion issues generally that make the graphs not super clear. Though I am on version 1.1.0.

@lawliet89 I assume you halved it to 4096?

@sidewinder12s Yes.
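For future readers, the chunk_size tuning discussed above lives in the gcs storage stanza of the Vault server config. A sketch (the bucket name is hypothetical; per the Vault docs the value is in kilobytes and defaults to 8192, so halving gives 4096):

```hcl
storage "gcs" {
  bucket     = "my-vault-bucket" # hypothetical bucket name
  chunk_size = "4096"            # default is "8192" (kilobytes); halved here
}
```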

Screenshot 2020-08-13 10:01:51
https://github.com/hashicorp/vault/blob/master/api/logical.go#L132

I experienced the same issue, and it was triggered by write API calls; reads did not cause it.

For other folks, I suspect this issue may be compounding any performance issues I'm having with GCS: https://github.com/hashicorp/vault/issues/8761

I've got 1000s of new nodes doing GCP logins daily that probably are not helping my GCS performance.

@kalafut I'm investigating a somewhat related memory leak and noticed the following using vault 1.5.4:

(pprof) list wrapGenericHandler
Total: 30.41GB
ROUTINE ======================== github.com/hashicorp/vault/http.wrapGenericHandler.func1 in vault/http/handler.go
      27MB     9.16GB (flat, cum) 30.12% of Total
         .          .    267:       maxRequestDuration = props.ListenerConfig.MaxRequestDuration
         .          .    268:       maxRequestSize = props.ListenerConfig.MaxRequestSize
         .          .    269:   }
         .          .    270:   if maxRequestDuration == 0 {
         .          .    271:       maxRequestDuration = vault.DefaultMaxRequestDuration
         .   396.62MB    272:   }
         .          .    273:   if maxRequestSize == 0 {
         .          .    274:       maxRequestSize = DefaultMaxRequestSize
         .          .    275:   }
         .          .    276:   return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         .          .    277:       // Set the Cache-Control header for all the responses returned
         .          .    278:       // by Vault
         .          .    279:       w.Header().Set("Cache-Control", "no-store")
         .          .    280:
         .   184.52MB    281:       // Start with the request context
         .          .    282:       ctx := r.Context()
         .          .    283:       var cancelFunc context.CancelFunc
         .          .    284:       // Add our timeout, but not for the monitor endpoint, as it's streaming
       8MB       50MB    285:       if strings.HasSuffix(r.URL.Path, "sys/monitor") {
         .          .    286:           ctx, cancelFunc = context.WithCancel(ctx)
      19MB       60MB    287:       } else {
         .   512.12kB    288:           ctx, cancelFunc = context.WithTimeout(ctx, maxRequestDuration)
         .   407.57MB    289:       }
         .          .    290:       // Add a size limiter if desired
         .          .    291:       if maxRequestSize > 0 {
         .          .    292:           ctx = context.WithValue(ctx, "max_request_size", maxRequestSize)
         .          .    293:       }
         .          .    294:       ctx = context.WithValue(ctx, "original_request_path", r.URL.Path)
         .          .    295:       r = r.WithContext(ctx)
         .          .    296:       r = r.WithContext(namespace.ContextWithNamespace(r.Context(), namespace.RootNamespace))
         .          .    297:
         .          .    298:       switch {
         .          .    299:       case strings.HasPrefix(r.URL.Path, "/v1/"):
         .          .    300:           newR, status := adjustRequest(core, r)
         .          .    301:           if status != 0 {
         .          .    302:               respondError(w, status, nil)
         .          .    303:               cancelFunc()
         .          .    304:               return
         .          .    305:           }
         .          .    306:           r = newR
         .          .    307:
         .     8.09GB    308:       case strings.HasPrefix(r.URL.Path, "/ui"), r.URL.Path == "/robots.txt", r.URL.Path == "/":
         .          .    309:       default:
         .          .    310:           respondError(w, http.StatusNotFound, nil)
         .          .    311:           cancelFunc()
         .          .    312:           return
         .          .    313:       }

Seems like the string matching is very expensive here

     .   184.52MB    281:     // Start with the request context

The per-line memory attributions, like the one above, seem to be off. Can you run pprof with the -source_path flag pointed at a local checkout of the Vault repository with the commit for 1.5.4 checked out (should be 1a730771ec70149293efe91e1d283b10d255c6d1)?

@calvn my bad, I was on master.

Total: 30.41GB
ROUTINE ======================== github.com/hashicorp/vault/http.wrapGenericHandler.func1 in vault/http/handler.go
      27MB     9.16GB (flat, cum) 30.12% of Total
         .          .    267:       maxRequestSize = DefaultMaxRequestSize
         .          .    268:   }
         .          .    269:   return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         .          .    270:       // Set the Cache-Control header for all the responses returned
         .          .    271:       // by Vault
         .   396.62MB    272:       w.Header().Set("Cache-Control", "no-store")
         .          .    273:
         .          .    274:       // Start with the request context
         .          .    275:       ctx := r.Context()
         .          .    276:       var cancelFunc context.CancelFunc
         .          .    277:       // Add our timeout, but not for the monitor endpoint, as it's streaming
         .          .    278:       if strings.HasSuffix(r.URL.Path, "sys/monitor") {
         .          .    279:           ctx, cancelFunc = context.WithCancel(ctx)
         .          .    280:       } else {
         .   184.52MB    281:           ctx, cancelFunc = context.WithTimeout(ctx, maxRequestDuration)
         .          .    282:       }
         .          .    283:       // Add a size limiter if desired
         .          .    284:       if maxRequestSize > 0 {
       8MB       50MB    285:           ctx = context.WithValue(ctx, "max_request_size", maxRequestSize)
         .          .    286:       }
      19MB       60MB    287:       ctx = context.WithValue(ctx, "original_request_path", r.URL.Path)
         .   512.12kB    288:       r = r.WithContext(ctx)
         .   407.57MB    289:       r = r.WithContext(namespace.ContextWithNamespace(r.Context(), namespace.RootNamespace))
         .          .    290:
         .          .    291:       switch {
         .          .    292:       case strings.HasPrefix(r.URL.Path, "/v1/"):
         .          .    293:           newR, status := adjustRequest(core, r)
         .          .    294:           if status != 0 {
         .          .    295:               respondError(w, status, nil)
         .          .    296:               cancelFunc()
         .          .    297:               return
         .          .    298:           }
         .          .    299:           r = newR
         .          .    300:
         .          .    301:       case strings.HasPrefix(r.URL.Path, "/ui"), r.URL.Path == "/robots.txt", r.URL.Path == "/":
         .          .    302:       default:
         .          .    303:           respondError(w, http.StatusNotFound, nil)
         .          .    304:           cancelFunc()
         .          .    305:           return
         .          .    306:       }
         .          .    307:
         .     8.09GB    308:       h.ServeHTTP(w, r)
         .          .    309:
         .          .    310:       cancelFunc()
         .          .    311:       return
         .          .    312:   })
         .          .    313:}

I'm happy to provide more debugging context and info on the issue, can I email you directly?

@elsesiy sure thing! You can send it to calvin{at}hashicorp{dot}com.
