Heap usage reduction in Elasticsearch

Created on 20 Jun 2018 · 16 comments · Source: elastic/elasticsearch


This is a proposal to reduce heap allocations in Elasticsearch through object reuse. Elasticsearch allocates a number of per-document heap objects, such as Field (and derived Lucene) objects for metadata and data fields during indexing. We implement reuse of Field and ParseContext objects across documents during bulk indexing. The changes reduce the heap allocation rate by 30%, heap garbage by 30%, and the promotion rate by 25% during indexing, while maintaining the indexing rate. Max GC pause time drops 98%, from 13s to 0.3s, and API tail latencies drop significantly: by about 60% at the 100th percentile and by 50% at the 99.9th. All benchmarks were run with Rally's nyc_taxis dataset against a single-node i3.16xl cluster with a 128GB heap and the parallel GC (we see similar improvements with CMS, and the code does not degrade throughput on small instance types).

The patch URL: https://gist.github.com/aesgithub/cc5b54fc3cf5a3a13f1f5ad3139dfd00
The patch is on top of commit hash 7376c35960f8e08762d73aef53471678268d904b (dated Jun 1, 2018). It uses multiple maps to cache Field objects. We did not implement cache eviction policies for the prototype (the indexing, however, is fully functional). The caching is also optimized for NUMA instances (i.e., caches are local to a thread), since we found that NUMA contention limits indexing rates on multi-socket instance types.
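For readers who don't want to open the gist, the core pattern is roughly the following. This is a minimal sketch of per-thread Field reuse, shown for LongPoint only; the class name FieldCache and its wiring are illustrative, not the patch's actual code:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.LongPoint;

// Minimal sketch of per-thread Field reuse during bulk indexing.
final class FieldCache {
    // One cache per ingestion thread, so there is no cross-thread
    // (and, on multi-socket machines, cross-NUMA-node) contention.
    private static final ThreadLocal<FieldCache> LOCAL =
            ThreadLocal.withInitial(FieldCache::new);

    private final Map<String, LongPoint> longPoints = new HashMap<>();

    static FieldCache get() {
        return LOCAL.get();
    }

    // Reuse the Field instance for this field name if we have one;
    // otherwise allocate it once and keep it for later documents.
    LongPoint longPoint(String name, long value) {
        LongPoint field = longPoints.get(name);
        if (field == null) {
            field = new LongPoint(name, value);
            longPoints.put(name, field);
        } else {
            field.setLongValue(value); // mutate instead of reallocating
        }
        return field;
    }
}
```

The per-field map lookup is the CPU cost discussed later in this thread; the win is that O(documents × fields) allocations become O(threads × fields) cached instances.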

We want to start this as a discussion on improving some of the heap allocations in Elasticsearch. Making this patch production-ready will take some investment and we want to ensure that it is done after we get feedback.

Labels: :Core/Infra/Core, :Search/Mapping, >enhancement, high hanging fruit


All 16 comments

Pinging @elastic/es-search-aggs

@jpountz could you take a look at this one?

pinging back.. any thoughts on the proposal?

One of the biggest limitations of Elasticsearch in terms of data per node is the 1/48 to 1/96 ratio of assigned heap to index size. This optimization could drive major improvements in this respect, and we'd really like to hear what the Elastic team has to say.

Sorry for the lack of response, I had missed this issue. I agree we should look into reuse of Field instances, which is recommended by Lucene. I see you were careful with multi-valued fields, which are the main source of complexity here, since it is fine to reuse field instances across documents but not within the same document. I also liked that you didn't try to optimize fields for which this is less likely to help, like binary or percolator fields, which are less common than keywords and numerics.

This patch improves memory reuse but also adds a bit of CPU overhead because of the hashmap lookup for the cached field. Hashmap lookups are fast, but this parsing code is called in very tight loops. I'd be curious to know whether we can get rid of this lookup somehow to make this change more likely to be a net win for every user.
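On the multi-valued point above, a brief illustration may help: Lucene reads field values at the time the document is indexed, so a document carrying two values for the same field must hold two distinct live Field instances. A hedged sketch:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;

class MultiValuedExample {
    static Document twoValues() {
        Document doc = new Document();
        // Both instances stay live inside the same document until it is
        // indexed, so reusing one instance for both values would make
        // both entries point at the last value set.
        doc.add(new LongPoint("ids", 1L));
        doc.add(new LongPoint("ids", 2L)); // must be a distinct instance
        // Across documents reuse is safe: once this document is indexed,
        // either instance can be reset via setLongValue and added again.
        return doc;
    }
}
```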

Thanks for the support! Adrien, you are right that there is an expected CPU tradeoff here, since JVM heap allocations are relatively cheap to begin with. Based on our benchmarks, though, even on the beefy i3 instance types on AWS we have not seen any increase in CPU utilization. On the other hand, significantly cutting down young-gen allocations (and the allocation rate) did cut down promotions (and hence pauses) and the associated GC cycles.

We initially started with a single global map shared across all ingestion threads. This turned out to have two limitations: heavy contention on the data structure, and NUMA contention on multi-socket machines (both when operating on the map and in GC-related stall cycles). We moved to thread-local maps, which alleviated both problems (with O(threads) heap instead of O(documents) heap). The code has a few more optimizations around caching. The number of map operations remains proportional to the number of fields parsed, but the CPU cycles spent on them are insignificant relative to the cycles spent in bulk ingestion. Given that the Field objects depend on the field names, I'm not sure we could eliminate the lookup. Open to ideas!

Hi Adrien, do you see a path forward? Are there any PRs relevant to the patch?

I have some ideas, but I don't like them much due to the complexity they would introduce... Looking at your patch again, I'm wondering whether a simple improvement would be to stop keying the cache on the field name. Said otherwise: have a cache of an arbitrary number of SortedNumericDocValuesField, LongPoint, etc. instances, and don't enforce that a given field name always reuses the same instance. This is not going to work for fields that need a configured FieldType, but it might be good enough?

Hi Adrien, sorry for the delay. Any reduction in the O(documents × fields) heap allocations is better than the status quo! I'm not sure I understand your point yet, though. Given that Field objects have to be created with the name of the field (which depends on the current data), how would we create the array of field objects for the cache?
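For context on why the name is the sticking point: Lucene's Field keeps the name in a final member set at construction and exposes setters only for the value, so a cached instance can never be retargeted to a different field name. A small sketch (the field name here is just taken from the nyc_taxis dataset mentioned above):

```java
import org.apache.lucene.document.LongPoint;

class FieldNameExample {
    static void demo() {
        // Values are mutable; the name is not. Field stores the name in a
        // final member and has no setter for it.
        LongPoint field = new LongPoint("fare_amount", 12L);
        field.setLongValue(15L);       // fine: reuses the instance
        String name = field.name();    // read-only; there is no setName(...)
    }
}
```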

Hi Adrien, if we reuse a field based on its type rather than its name, won't that contradict the point you made in the previous comment, that "it is fine to reuse field instances across documents, but not within the same document"?

@jpountz @muralikpbhat One thing that can be done is to create a whitelist (maybe just a bit vector with a bit corresponding to each cache) and have a document use all caches except its own. Or maybe just tag cached objects with their corresponding documents in the cache itself. So you look at all Field objects except the ones in your own document when searching the cache for an instance to reuse (a rough sketch of this idea follows below).

Thoughts?
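A rough sketch of the tagging variant, assuming a per-thread cache and a monotonically increasing per-thread document sequence number; all names here are hypothetical, not from the patch:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.LongPoint;

// Tag each cached Field with the sequence number of the document that
// last used it, and refuse to reuse it within that same document
// (allocating a fresh instance instead, as for multi-valued fields).
final class TaggedFieldCache {
    private static final class Entry {
        LongPoint field;
        long lastDocSeq = -1;
    }

    private final Map<String, Entry> entries = new HashMap<>();

    LongPoint longPoint(String name, long value, long docSeq) {
        Entry e = entries.computeIfAbsent(name, n -> new Entry());
        if (e.field == null || e.lastDocSeq == docSeq) {
            // First use, or already handed out to the current document
            // (a multi-valued field): allocate a fresh instance. The old
            // instance stays referenced by the document itself.
            e.field = new LongPoint(name, value);
        } else {
            e.field.setLongValue(value);
        }
        e.lastDocSeq = docSeq;
        return e.field;
    }
}
```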

Given that Field objects have to be created with the name of the field (which depends on the current data), how would we create the array of field objects for the cache?

Oops good catch, this doesn't work.

if we reuse a field based on its type rather than its name, won't that contradict the point you made in the previous comment, that "it is fine to reuse field instances across documents, but not within the same document"?

I was thinking of only doing it for the first value of a field.

have a document use all caches except its own

Sorry I don't get the idea.

So you look at all Field objects except the ones in your own document when searching the cache for an instance to reuse.

One thing that I want to avoid is having to search a cache. Indexing a field doesn't use much CPU: when Lucene processes a LongPoint, for instance, it mostly appends the value to the end of a buffer. Saving an object allocation by introducing a hash lookup adds complexity and doesn't sound like a net win performance-wise to me.

One approach I had in mind was to replace FieldMapper#parse with something like FieldMapper#getPerThreadParser, which would return an object that is responsible for parsing but may only be called from a single thread. This object could cache some state locally in order to reuse fields. We would then have to propagate it up to DocumentParser, which would maintain ThreadLocal references to trees of per-thread parsers. This way we would still have only one thread-local object, but no hash lookups anymore. The thing I would watch if we explored that route is the complexity it introduces.
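A hedged sketch of what that API shape could look like; the interface and class names below are illustrative, not an actual Elasticsearch API, and multi-valued handling (a fresh instance within the same document) is omitted:

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.IndexableField;

// A per-thread parser that owns its reusable Field state directly,
// so no hash lookup is needed at parse time.
interface PerThreadFieldParser {
    IndexableField parse(long value);
}

final class LongPointPerThreadParser implements PerThreadFieldParser {
    private final String name;
    private LongPoint reusable; // single-threaded by contract

    LongPointPerThreadParser(String name) {
        this.name = name;
    }

    @Override
    public IndexableField parse(long value) {
        if (reusable == null) {
            reusable = new LongPoint(name, value);
        } else {
            reusable.setLongValue(value);
        }
        return reusable;
    }
}
```

Because each mapper's per-thread parser already holds its own reusable instance, the field-name-to-Field lookup disappears; the ThreadLocal access happens once per document at the DocumentParser level rather than once per field.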

A couple more points:
1) The patch lacks cache eviction.
2) I am not sure we should be creating a cache class for each type. It looks a bit odd and would be hard to maintain.

I believe @jpountz's idea of creating a field-name-independent cache should be a reasonable way of achieving a sizable improvement without too intrusive a change.

@jpountz we have tried the ThreadLocal idea, propagating it as far up the call stack as possible. It's been a while, so I don't remember the implementation complexity offhand, but there was a performance hit (a 10-20% drop in bulk throughput, IIRC) that we incurred due to ThreadLocal, especially under multi-socket NUMA contention.

Pinging @elastic/es-core-infra

One approach I had in mind was to replace FieldMapper#parse with something like FieldMapper#getPerThreadParser, which would return an object that is responsible for parsing but may only be called from a single thread. This object could cache some state locally in order to reuse fields. We would then have to propagate it up to DocumentParser, which would maintain ThreadLocal references to trees of per-thread parsers. This way we would still have only one thread-local object, but no hash lookups anymore. The thing I would watch if we explored that route is the complexity it introduces.

@jpountz Isn't what you suggested already done by the current PR (it could be refactored to look like your suggestion)? The object you are referring to is FieldObjectCache, which is a per-thread object that maintains field-level caches. I did not get the part about no hash lookups being needed if we maintain per-thread parsers: how would the cache state be accessed for any field in the same thread? We need a lookup anyway, unless we know upfront that the mappings are fixed. Field objects are created by name, which is an immutable member of the Field object; all you can change are the values. Am I missing anything here?
