envoy 🚀 - Per listener and per cluster memory overhead is too high

@PiotrSikora thanks for the writeup and data, this is a good start to understanding the problem. I'd address the solution differently for clusters/listeners.

For clusters, we already have a plan-of-record to do (3) in the long term, this has been formalized in the incremental xDS API additions in https://github.com/envoyproxy/envoy/blob/e341004f0f1800c7f2a44caa49ad1876f29db848/api/envoy/api/v2/discovery.proto#L97 that @nt and I worked on. This has not been implemented yet, but there is no barrier around eventual consistency I think.

I think (3) is where we want to shoot for in a lasting solution that will scale to very high numbers of clusters (e.g. 100k), (2) would only be a stop gap and not support some applications like serverless that inherently need on-demand xDS. We have internal demand from teams that want this.

I realize that the current goal is to reduce overheads when we are talking about 1-10k clusters. In that case, I think we should just fix stats for now and no other solution is needed. We also have the problem that the LB data structures (depending on LB type) are O(num clusters * num workers * num endpoints) in memory use, which also needs some work.

For listeners, how many listeners that are actually binding to ports do we require? I'd like to understand how the listener count is climbing > O(10) in the application, do we really have O(1k) listeners? Again, stats improvement should be helpful here, but not a silver bullet, where is the rest of the memory coming from?

htuch on 20 Aug 2018

I think (3) is where we want to shoot for in a lasting solution that will scale to very high numbers of clusters (e.g. 100k), (2) would only be a stop gap and not support some applications like serverless that inherently need on-demand xDS. We have internal demand from teams that want this.

Agreed, except that (3) is probably months away, and (1) and possibly (2) could be done in a few days.

For listeners, how many listeners that are actually binding to ports do we require? I'd like to understand how the listener count is climbing > O(10) in the application, do we really have O(1k) listeners? Again, stats improvement should be helpful here, but not a silver bullet, where is the rest of the memory coming from?

Istio is using iptables interception and use_original_dst, so we have 1 virtual listener for each TCP destination, but all HTTP destinations are sent to a wildcard listener, where they are routed based on the Host header, so we have only 1 of those for each destination port. Because of that, the listeners are not a huge problem for us (at least not yet), but the per cluster overhead is blocking deployment for our customers, and disabling per cluster stats (that we don't use anyway), which lowers the overhead from ~72kB to ~14kB (so by ~80%) is a huge win, so I'd like to purse this option, since that's the quickest thing to implement.

I realize that the current goal is to reduce overheads when we are talking about 1-10k clusters. In that case, I think we should just fix stats for now and no other solution is needed.

When you say "just fix stats", do you have anything particular in mind? Even if we optimize strings (each fully evaluated stat name resides 3x in memory) with something like #3927, the maximum savings for the service_google_xxxx cluster name used in my experiments are around ~12kB (~17%), compared to ~58kB (~80%) with disabled stats, so the difference is quite significant, and I think that disabling stats would be preferable solution for Istio until we have incremental xDS.

Alternatively, we could delay generateStats() until first access to stats(), which would save the memory for configured but unused clusters.

PiotrSikora on 21 Aug 2018

The way I think of #3927 https://github.com/envoyproxy/envoy/pull/3927 is
that it should be very impactful to stat name memory, once we knock off a
bunch of other copies of the stat names first. That 17% looks
disappointing now, but the absolute byte-count saved on James' benchmark
but will be a much higher percentage when the other copies have been
removed.

However there's still a non-zero per-stat overhead that is unavoidable,
even if the names are free. And depending on the deployment, aggregating
stats across clusters may be desirable anyway. In another product I know,
sharing stats is the default:
https://www.modpagespeed.com/doc/admin#virtual-hosts-and-stats

On Tue, Aug 21, 2018 at 8:12 AM Piotr Sikora notifications@github.com
wrote:

I think (3) is where we want to shoot for in a lasting solution that will
scale to very high numbers of clusters (e.g. 100k), (2) would only be a
stop gap and not support some applications like serverless that inherently
need on-demand xDS. We have internal demand from teams that want this.

Agreed, except that (3) is probably months away, and (1) and possibly (2)
could be done in a few days.

For listeners, how many listeners that are actually binding to ports do we
require? I'd like to understand how the listener count is climbing > O(10)
in the application, do we really have O(1k) listeners? Again, stats
improvement should be helpful here, but not a silver bullet, where is the
rest of the memory coming from?

Istio is using iptables interception and use_original_dst, so we have 1
virtual listener for each TCP destination, but all HTTP destinations are
sent to a wildcard listener, where they are routed based on the Host
header, so we have only 1 of those for each destination port. Because of
that, the listeners are not a huge problem for us (at least not yet), but
the per cluster overhead is blocking deployment for our customers, and
disabling per cluster stats (that we don't use anyway), which lowers the
overhead from ~72kB to ~14kB (so by ~80%) is a huge win, so I'd like to
purse this option, since that's the quickest thing to implement.

I realize that the current goal is to reduce overheads when we are talking
about 1-10k clusters. In that case, I think we should just fix stats for
now and no other solution is needed.

When you say "just fix stats", do you have anything particular in mind?
Even if we optimize strings (each fully evaluated stat name resides 3x in
memory) with something like #3927
https://github.com/envoyproxy/envoy/pull/3927, the maximum savings for
the service_google_xxxx cluster name used in my experiments are around
~12kB (~17%), compared to ~58kB (~80%) with disabled stats, so the
difference is quite significant, and I think that disabling stats would be
preferable solution for Istio until we have incremental xDS.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/envoyproxy/envoy/issues/4196#issuecomment-414652430,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB2kPWB9SwqSf2h3k8ri4C-S1p0mzFD0ks5uS_kogaJpZM4WChjD
.

jmarantz on 21 Aug 2018

@PiotrSikora (3) should not take months to implement on Envoy side, the implementation is O(1-2 weeks) IMHO. Support on the management side is probably where the bottleneck is.

I think we're in agreement that the biggest benefit will come from reducing cluster stats. I still think fundamentally, given what we're tracking in both key/value teams, the essential memory overhead is more like O(1kb), so we should be able to fix it.

htuch on 21 Aug 2018

@htuch it seems that the string optimizations didn't help as much as we hoped, and even with #4281, we're still looking at ~69kB overhead for each static cluster, which is way too high.

Could you please reconsider disabling stats (since we don't have any use for them anyway)? This could be achieved by adding either no-op or lazily-loaded Stats::Store.

cc @ambuc

PiotrSikora on 6 Sep 2018

@PiotrSikora lazy loading Stats::Store seems reasonable if you can do it with low complexity.

htuch on 6 Sep 2018

I think disabling stats via flag / build parameter / etc is the most expedient solution to Istio's problem, and is much easier than incremental XDS / lazy-loaded stats / whatever. There are probably lots of users who don't care about stats, and until we have the _right_ solution, this would work best.

ambuc on 6 Sep 2018

@ambuc yes, disabling per cluster (or all?) stats at compile-time via flags would be by far the fastest thing to implement (and is in fact what I did for the measurements). @htuch would such solution be acceptable to you?

PiotrSikora on 6 Sep 2018

@PiotrSikora doesn't Istio make use of cluster stats? It's basically impossible to diagnose what goes wrong on Envoy if you don't have cluster level stats. I think almost every user cares about these stats, even if they don't know it.

htuch on 6 Sep 2018

I thought a fair number of Envoy users use out-of-the-box binaries. Is it hard to do via flag?

Also is what we want to disable stats? Or just make them non-cluster-specific? Or both?

jmarantz on 6 Sep 2018

I think a flag to disable all stats (or just cluster stats) and a targeted lazy load (lazy loading as a rule for all stats would take a decent amount of work and might add some real overhead to the request path) would both be relatively easy. I think there's already a flag to disable/enable some subset of the stats today, but I can't remember which component.

@htuch I could see a use case where some users might care enough about memory consumption to only turn on stats on a small subset of Envoy instances to get just enough info to create an approximate sample of fleet-wide stats, but I'm not very aware of the typical Envoy deployment today.

mrice32 on 6 Sep 2018

I am not fully understanding the lazy-loading solution; that may be a great way to go. My first thought on how I would disable stats is I'd make a NullStatAllocator which would provide methods to generate NullCounter, NullGauge, NullHistogram, which would have all the methods and do none of the work and take minimal space, for some definition of 'minimal'.

Maybe truncating all stat names to zero characters would make the maps small and fold lots of stats together.

The advantage here is that you could implement this with one new class, and a few injection points during startup. There would be zero extra runtime overhead.

jmarantz on 6 Sep 2018

👍1

After some offline chat with @PiotrSikora, it seems that Istio has in practice not had to rely heavily on /stats due to the availability of a number of other diagnosis mechanisms (e.g. Mixer, access logs). A CLI flag to disable (default enable) seems a reasonable optional for now. It doesn't fix the problem - users shouldn't have to chose between stats and scalability, but it unblocks Istio and other folks who need scalability right now.

htuch on 6 Sep 2018

@jmarantz That's basically my thought as well. I think we could just create a single NullCounter, NullGauge, etc. Then, the disabled store could give back that one instance to all calls to counter(name), gauge(name), and histogram(name). Assuming these stats are never read back as a part of Envoy's core functionality, we should be able to make all of the increment/set calls into no-ops (maybe a bad assumption...).

I think the lazy loading for clusters could be done as @PiotrSikora suggests above - we could just change when generateStats() gets called (and load it into a unique_ptr). There is one issue with lazy loading, however. We'll need to put a lock around the initialization/check to make it thread-safe, which would add a short duration lock to the request path.

mrice32 on 6 Sep 2018

@jmarantz wrt lazily-loading (i.e. allocating on first use) - Istio currently loads routes to all the services in the cluster, which means that even though a particular service might be talking only to a service or two, it has routes and accompanying configuration to all the services, stats for which are allocated but never used, so the savings would be coming from the fact that Envoy is configured with a bunch of clusters that in never uses.

PiotrSikora on 6 Sep 2018

FWIW, we have debated about this internally as well and we have similar situation like Istio where our Envoys have all the clusters irrespective of whether a service uses a particular cluster or not. One of the options we discussed was to disable mgmt update related metrics like which we really do not care. It might be give us at least 30% reduction and still give the latency/other cluster related metrics for those who want instead of completely disabling entire cluster metrics. Some thing to consider...
membership_change
update_attempt
update_success
update_empty
update_no_rebuild

ramaraochavali on 7 Sep 2018

Possible solution: what if, in the config, you set the alt_stat_name to the same string for every cluster? Wouldn't that result in all clusters sharing a common set of stats?

ggreenway on 7 Sep 2018

Catching up here. In general I think allowing people to disable stats they don't want to save memory is acceptable. I would prefer a solution though that maybe is a bit more flexible (though might be slower). Something along the lines of allowing people to supply a list of stats they either want or don't want, possibly via regex? People that don't want any stats can just exclude them all? This would incur a penalty at state creation much like tagging, but I think could be done pretty efficiently otherwise, modulo the fact that there will still need to be maps for keeping track of stat names that point to some NullStat interface that goes to dev/null. Thoughts?

mattklein123 on 19 Sep 2018

All of that sounds fine to me. Here's a spectrum of options:

Aggressively re-work stat naming so they are not so expensive. E.g. there can be maps, but why do we need a new copy of the name in each map? I'm not at all convinced we've seen the limit of how compact we can get this.
Use a regex to specify which stats you want to retain, replacing them with Null stats.
Optionally share stats between clusters
Drop all stats

jmarantz on 19 Sep 2018

@PiotrSikora is there an easy way to repro the results you got with a static command line?

jmarantz on 19 Sep 2018

@ all, there is a WIP PR at #4469 which implements a regex-based exclusion list. This feature would be opt-in, so I feel better about adding a potential performance hit. This PR is not in lieu of a disable-all-stats feature.

ambuc on 19 Sep 2018

Aggressively re-work stat naming so they are not so expensive. E.g. there can be maps, but why do we need a new copy of the name in each map? I'm not at all convinced we've seen the limit of how compact we can get this.

FWIW, I think this is intellectually interesting and if people want to work on it, I'm clearly not going to object, but I think effort is better spent on the other efforts, as well as IMO better educating people on how to use Envoy properly (i.e., let's not just send 150K clusters to an Envoy...).

mattklein123 on 20 Sep 2018

@mattklein123 that makes sense...especially as there are relatively straightforward approaches that will yield immediate gains.

I just did a quick survey and I think most or all of the maps I was worried about are in admin.cc with brief lifetime, although they still might affect things in practice. And actually that's a very small local PR.

The question of whether an Envoy should ever scale to 100k of clusters is an interesting one, and worth getting a definitive answer on before embarking on scale to that. But it seems that even 10k clusters presents scaling challenges.

jmarantz on 20 Sep 2018

But it seems that even 10k clusters presents scaling challenges.

IMO even 10K clusters is on the outer edges of what we would expect, at least without moving to incremental with cluster TTLs, but I would be curious what people think about this. In any case, I think we are all in agreement that starting with allowing for stat pruning to get rid of stats people don't care about is probably the way to go? I'm guessing this will save a huge amount of memory, though I do worry about the perf impact but that is easy to measure.

mattklein123 on 20 Sep 2018

@jmarantz

Generate test files:

$ cat /tmp/generate-config.sh
echo "admin:
  access_log_path: /dev/null
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }

static_resources:
  listeners:
  - name: listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 40000 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: \*
              routes:
              - match: { prefix: / }
                route: { cluster: service_0000 }
          http_filters:
          - name: envoy.router
  clusters:"

for i in `seq -f "%04g" 0 $(($1-1))`; do
echo "
  - name: service_$i
    connect_timeout: 0.25s
    type: STATIC
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts: [{ socket_address: { address: 127.0.0.1, port_value: 60000 }}]"
done

$ sh /tmp/generate-config.sh 1 > /tmp/1_cluster.yaml
$ sh /tmp/generate-config.sh 1001 > /tmp/1001_clusters.yaml



md5-6f1985b0ffff8f04d6b860280fc3e182



$ cat /tmp/test-overhead.sh
bazel-bin/source/exe/envoy-static --concurrency 1 --disable-hot-restart -c /tmp/1_cluster.yaml >/dev/null 2>&1 &
sleep 30
M1=`curl -s 127.0.0.1:9901/memory | grep allocated | cut -d\" -f4`
curl -so /dev/null -XPOST 127.0.0.1:9901/quitquitquit
sleep 3

bazel-bin/source/exe/envoy-static --concurrency 1 --disable-hot-restart -c /tmp/1001_clusters.yaml >/dev/null 2>&1 &
sleep 30
M1001=`curl -s 127.0.0.1:9901/memory | grep allocated | cut -d\" -f4`
curl -so /dev/null -XPOST 127.0.0.1:9901/quitquitquit
sleep 3

bazel-bin/source/exe/envoy-static --concurrency 11 --disable-hot-restart -c /tmp/1001_clusters.yaml >/dev/null 2>&1 &
sleep 30
T1001=`curl -s 127.0.0.1:9901/memory | grep allocated | cut -d\" -f4`
curl -so /dev/null -XPOST 127.0.0.1:9901/quitquitquit
sleep 3

echo "Overhead per cluster: $(((M1001-M1)/1000)) + $(((T1001-M1001)/10000)) for each extra worker thread"



md5-3255be71c98be556d43ac47eeacddd7f



$ sh /tmp/test-overhead.sh
Overhead per cluster: 67439 + 1833 for each extra worker thread

PiotrSikora on 21 Sep 2018

(updated the script to calculate per worker thread overhead as well)

PiotrSikora on 21 Sep 2018

Awesome, thanks Piotr. I learned a bit messing around with these. Just to get a comparison I applied @ambuc's #4469 and used Piotr's script adding --filter-stats ".*" to compare against baseline. With 10001 stats I got:

baseline allocated: 671245080
no-stats allocate:  148864920

I was also curious how startup speed scaled. I used a 100001-cluster yaml file and "--mode init_only" (opt build of course) and timed startup:

baseline: 6.32 second startup  (does not include #4469)
no-stats: 6.08 second startup

jmarantz on 21 Sep 2018

Also with all stats filtered out, the per-cluster overhead measured by @PiotrSikora's script is reduced from 66923/cluster + 1861/thread to 14557/cluster + 1859/thread.

jmarantz on 21 Sep 2018

The intellectual challenge was too compelling. I spent some time poking around. Through a series of unreviewable hacks in https://github.com/jmarantz/envoy/commits/symbol-table-hacking I have brought the per-cluster overhead from 67439 bytes to 38522 bytes. Also note that disabling stats completely gives 14557 bytes per-cluster overhead. So from a stats-only perspective, those hacks reduce 52882 bytes of stat overhead to 23695 bytes. This is roughly 342 bytes per stat (including string storage). Still seems high.

There is a still running room:

unordered_map is not the most space-efficient map; we should use a better one. And there are a lot of maps.
I suspect Counter::pending_increment_ is not used for all counters and could be specialized

OTOH the last kick (from 44518 to 38522) was obtained by commenting out the tls_cache entirely, which might not be acceptable. But it does show you that the cache overhead is significant.

Of course none of this is as good as disabling stats entirely. And I'm not sure if this brings Envoy under a magic threshold goal either, so it might be good to get a sense of that.

jmarantz on 25 Sep 2018

Actually I should point out the hacks in that branch built on @ambuc's SymbolTable, which is table stakes; though there are some tweaks to that in the branch.

jmarantz on 25 Sep 2018

More specific ideas:

do we really need to dynamically delete stats individually, as opposed to by scope? if not, we could
arguably get rid of refcnt and the pointer from each stat back to the alloc. We could also potentially
hide the flags in the high-order bits of the value, I suppose.
do we really need to have a small map in each scope? Or could we do a scope lookup by prepending
the scope name and doing the lookup in the alloc ?
do we really need TLS cache for map lookups? Or could we use reader/writer locks on the alloc map
and just take the burst of contention when a new cluster is dynamically added.
currently for HeapAlloc we are allocating pending_increment_ for gauges but I don't think that's used.
we need it for RawStatData for uniform sizes.

And for RawStatData in particular:

could store (say) a 20-byte sha256 in shared-memory, and use a SymbolTable in-process so we get some compression for the hot-restart case too. Not sure if this would be worth it.

jmarantz on 25 Sep 2018

Also it seems like it might be worth looking at that 14557 bytes of per-cluster overhead that exists even with stats disabled. Perhaps in LB structures?

jmarantz on 25 Sep 2018

Update: I incorporated absl's just released flat_hash_map/set in a few places. This required some further refactoring (which was good to do anyway), and brought us to 40552 bytes per cluster with a TLS cache, and 35743 per cluster with TLS cache disabled. It's all in https://github.com/jmarantz/envoy/commits/symbol-table-hacking , which has a variety of pretty clean wins and some hacks that may be hard to justify.

FYI the refactoring needed was to (a) meet the requirement that absl::flat_hash_map keys be copyable (not unique_ptr which I had before) and (b) add a level of std::unique_ptr indirection for the tls-cache map. flat_hash_map stores the values directly in the hash array, so when values are large and the hash-map hasn't reached its capacity yet (e.g. was just resized) it pre-allocates space for the not-yet-populated values, and this blunted the size advantage offered by flat_hash_map.

and there's a dirty trick of using the low-order bit of a pointer to hide a bool. Probably cannot submit that to the Envoy codebase, but I was not letting that slow down this experimental effort for now.

jmarantz on 29 Sep 2018

@jmarantz re: TLS caching, I suspect that we do need it for dynamic stats that are high frequency (in particular the HTTP response code stats). However, I thought of potentially a nice optimization on that front that would be easy to implement. Basically, I don't think all stats need to be in the TLS cache. This is because many stats are allocated in a non-perf critical path and never looked up again. We could add a cache boolean parameter to the counter/gauge acquisition functions, and set this to false in all of the stat macro cases which is the majority of acquisition. All of these cases are acquired once on the main thread and don't need caching. Then, for the dynamic cases, we could set cache to true. This would lead to a tiny fraction of stats in the TLS cache. WDYT?

mattklein123 on 29 Sep 2018

@mattklein123 yes I love that idea; and agree that's the way to go. Of course there's a long journey to get all this stuff into reviewable shape; and it would be good to understand whether this would be of sufficient impact to projects besides Istio, as they will likely disable stats regardless of how small we make them.

jmarantz on 29 Sep 2018

Thinking more about this; I see around 3 categories of stats:

data-path-critical stats that are intrinsic to the core of Envoy (e.g. those in source/common/http/codes.cc as well as overload_manager, grpc, etc.
data-path-critical stats that are in optional filters, including extensions, e.g. hystrix, redix, mongo
stats that are not used in the data-path generally. Maybe guard_dog misses, stats.overflow? what else?

For the data-path critical core stats, can we store them in a struct, e.g. (abstractly ignoring which kind of map and how to represent stat names):

struct CriticaCoreStats : public ThreadLocal::ThreadLocalObject {
  struct ScopeCriticalStats {
    Counter& upstream_rq_completed_;
    Counter& upstream_rq_;
    Gauge& membership_total_;  // hystrix
    ...
  };
  map<uint64_t /* scope id */, ScopeCriticalStats> scope_critical_stats_;
};

The maintenance burden is that all core critical stats would have to be manually added to this structure (and its initializer helpers). You have some macro patterns that might help a bit to make that less cumbersome. Whenever a scope is introduced, all the stats would have to be looked up in the alloc in one burst, and then never looked up again.

An alternative that would be less centralized and almost almost as small and fast is to have each subsystem that wants to add an interesting stat could add register a name pattern e.g. "{}upstream_rq_completed" and get back an index, which would never be changed, and would have to be stored -- in filters it would go into their config class I think. The CriticalCoreStats would need to maintain separate arrays of counters, gauges, and histograms, but it would basically have the same dynamic as the first option above. I think to avoid taking locks during requests, we'd have to have an early stat-declaration phase where all subsystems register their stat name patterns, getting back indices they'd store to later use for fast runtime lookup.

struct CriticalStatPatternRegistry {
  std::vector<std::string> counter_patterns_;
  std::vector<std::string> gauge_patterns_;
  std::vector<std::string> histogram_patterns_;
};

struct CriticalStats : public ThreadLocal::ThreadLocalObject {
  struct ScopeCriticalStats {
    std::vector<Counter*> counters_:
    std::vector<Gauge*> gauges:
    std::vector<Histogram*> histograms_:
    ...
  };
  map<uint64_t  /* scope id */, CriticalStats> scope_critical_stats_;
};

And the fallback pattern for critical stats that can't be enumerated during Envoy startup is to use the current system with maps to individual stats.

jmarantz on 30 Sep 2018

@jmarantz I'm not sure I fully follow your last comment. We already store many/most data path critical stats in a struct which is looked up once. Are you talking about a way to also optimize the dynamic lookup cases?

mattklein123 on 2 Oct 2018

I don't have a broad enough understanding of the stats usage patterns yet and how they relate to the data path, but what I noticed was https://github.com/envoyproxy/envoy/blob/26e2a31650fe14db7e55a3136f9ae75db667bf5c/source/common/http/codes.cc#L21

Those look like they might be data-path critical and are looked up at request-time AFAICT.

Here the stat is per-scope, and thus looked up at runtime.

I've evolved my thinking a few iterations since writing that comment. One of the iterations is captured in https://github.com/jmarantz/envoy/commit/7656248b93741880e919ccd4bf6d7f12aaa4b5a7 where I add an int-based API, though this isn't really there yet -- it was just kind of a sketch.

The basic idea -- still iterating on it in my head -- is that a runtime lookup for a stat would just be an array index into an array held in the ScopeImpl. That array would be immutable after scope creation, so it wouldn't require a lock or a map lookup. We could still use maps for non-critical stats (though maybe not allocate a tls-cache for those).

jmarantz on 2 Oct 2018

Some of those stats could likely be lifted out, but some of them are fully dynamic and based on the HTTP response code which is why they are runtime looked up.

mattklein123 on 2 Oct 2018

Crazy impractical idea Matt -- could response-code stats be histograms? I'm not sure that would save any memory or be analogous to a latency-histogram or have any other real benefit :) But it seems unexpected to me to have a large potential space of stat names whose existence depends on traffic. Moreover, if you could enumerate all possible stat names for a configuration up front, and pre-allocate them in various nested scopes you'd might get some speed and latency benefit.

Those http stats from codes.cc are part of what I'm thinking about.

jmarantz on 6 Oct 2018

could response-code stats be histograms?

Maybe. TBH these are some of the most useful stats that Envoy provides and I think many users would be upset to see them change. Note that there is already a config option to turn them off in ultra high perf scenarios in which people don't care about them.

mattklein123 on 6 Oct 2018

Agreed. I think a big part of the design constraint is that stats are something that I expect there's a ton of ecosystem that's built around it.

My current thinking is to start making some of these experiments reviewable, since I've brought the per-cluster overhead from 68k to 40k, which is not enough but is a pretty good start. I'd like to make sure I'm not regressing data-plane performance first, though. Do you have suggestions on how I might start doing that?

jmarantz on 6 Oct 2018

Note that since I last run this test, the baseline per-cluster overhead has gone from 68k to 73k. I am not sure where that 5k went, unless a huge amount of stats were added.

Anyone listening have any ideas?

jmarantz on 27 Oct 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale[bot] on 26 Nov 2018

/wait

jmarantz on 26 Nov 2018

👍2

@jmarantz your PR #5321 closed this issue while stating to be part of a multi-PR effort to fix it.

loewenstein on 4 Jan 2019

We've seen a 15% regression in a cluster memory use scalability load test in the past 1-2 weeks, @snowp suggests potentially https://github.com/envoyproxy/envoy/pull/5367, but unclear.

I think we need to add a static load test for cluster memory use to presubmits to ensure we don't undo all the hard work being done in stats and elsewhere. It could be really simple, just spin up an Envoy with a large bootstrap with 10k clusters, measure tcmalloc memory use. This should only take a few seconds.

WDYT? @jmarantz @snowp @mattklein123

htuch on 1 Feb 2019

See @cmluciano 's #5519 for resolving #5201 . @cmluciano are you able to make progress on that? Happy to chat over Slack or whatever.

jmarantz on 1 Feb 2019

@htuch Do you know what Envoy shas the increases map back to? The linked PR was merged in late December, so if the regression started manifesting in the last few weeks it might be caused by something else.

+1 on having memory regression tests in CI. It would be awesome to also have historical perf/resource consumption graphs (kinda like how gRPC does it), but that's obviously a lot more work.

snowp on 2 Feb 2019

@snowp we do batched imports of PRs ~once a week, so it's hard to point to the specific SHA, we've lost that information by time out internal load tests run.

I'm not hugely concerned about the specific regression, since it's likely just a few fields here or there in endpoint or cluster data structures, we know there is some room to improve this via https://github.com/envoyproxy/envoy/issues/3278.

It would be great to be able to track this via a fixed watermark in CI and yeah, I agree, a historical view would be very cool to have. I actually don't think it would be that hard; once we have the basic test in CI that is emitting this information in logs, it just needs some script to crawl the history and plot.

htuch on 3 Feb 2019

See @cmluciano 's #5519 for resolving #5201 . @cmluciano are you able to make progress on that? Happy to chat over Slack or whatever.

Sorry I missed this notification. Yes I will progress on this today and reach out over Slack.

cmluciano on 4 Feb 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale[bot] on 18 Apr 2019

@jmarantz how would you like to proceed with this? Should we keep this open or do we have finer grained issues to track now?

htuch on 19 Apr 2019

I was considering this the 'master' bug which would be closed when all the other ones were, ultimately when #4980 is submitted.

jmarantz on 24 Apr 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale[bot] on 24 May 2019

/wait

jmarantz on 24 May 2019

👍1

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale[bot] on 23 Jun 2019

/wait

jmarantz on 23 Jun 2019

👍1

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale[bot] on 23 Jul 2019

/wait

hopefully not too much longer :)

jmarantz on 23 Jul 2019

👍1

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale[bot] on 21 Sep 2019

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

stale[bot] on 28 Sep 2019

I'm going to close this as we have made a lot of progress here thanks to @jmarantz. Let's open new issues on specific enhancements.

mattklein123 on 12 Dec 2020

Envoy: Per listener and per cluster memory overhead is too high

Most helpful comment

All 63 comments

Related issues