Elasticsearch: Possible memory leak in IndicesQueryCache / LRUQueryCache

Created on 23 Jan 2017 · 16 comments · Source: elastic/elasticsearch

Elasticsearch version:
2.4.2

Plugins installed: [ cloud-aws, license, marvel-agent ]

JVM version:
java version "1.7.0_121"
OpenJDK Runtime Environment (amzn-2.6.8.1.69.amzn1-x86_64 u121-b00)
OpenJDK 64-Bit Server VM (build 24.121-b00, mixed mode)

OS version:
4.4.35-33.55.amzn1.x86_64 #1 SMP Tue Dec 6 20:30:04 UTC 2016 GNU/Linux

Description of the problem including expected versus actual behavior:
We recently migrated our 10-node, 5.5-billion-document cluster from 1.3.4 to 2.4.2 and started seeing continuously growing memory usage, eventually leading to OOM errors:

[screenshot: node heap usage graph showing continuous growth]
(This example node was restarted manually around 21:00)

Analyzing the heap dump suggests a possible memory leak in IndicesQueryCache$1:

[screenshot: heap dump analysis pointing at IndicesQueryCache$1 as a major memory retainer]

The cluster is fairly write-heavy (average indexing rate 600/s, search rate 6/s) and has ES_HEAP_SIZE set to 31g. The old cluster (1.3.4) was running with the same configuration and did not have this issue. Here's the elasticsearch.yml configuration from the new cluster:

cloud.aws.region: us-east-1

discovery.type: ec2
discovery.ec2.groups: cluster-security-group
discovery.ec2.any_group: false
discovery.zen.minimum_master_nodes: 6

network.host: [_local_, _ec2_]

cluster.name: my-cluster

gateway.recover_after_nodes: 8
gateway.expected_nodes: 10
gateway.recover_after_time: 5m

script.inline: true
script.indexed: true

GET /index_name/_stats/query_cache shows that memory_size_in_bytes grows steadily, but it doesn't exceed the default 10% limit. Manually clearing the query caches using POST /index_name/_cache/clear doesn't seem to have a clear effect on heap usage shown in the graphs.
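
For reference, this is roughly what those checks look like; index_name is a stand-in for the real index, the response is abridged, and the numbers are placeholders:

GET /index_name/_stats/query_cache

{
  "_all": {
    "total": {
      "query_cache": {
        "memory_size_in_bytes": 1234567890,
        "evictions": 0
      }
    }
  }
}

POST /index_name/_cache/clear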

Any ideas what might be preventing GC from freeing up the memory from IndicesQueryCache?

Label: discuss

All 16 comments

The cache assumes that queries are not worth accounting for in memory usage, since they are expected to be small. However, here you seem to have 887 term queries using 22GB of memory, which is roughly 25MB per query. I am curious: do you happen to run queries on gigantic terms?

Thanks for the quick response! I turned on query logging and found out that's exactly what a client application is doing:

{
  "bool": {
    "must": [
      {
        "wildcard": {
          "name_raw_lowercase": {
            "value": "something"
          }
        }
      },
      {
        "terms": {
          "other_id": [
            12345678901,
            12345678902,
            12345678903,
            ... (several hundred entries)
          ]
        }
      }
    ]
  }
}

Is there a way to prevent these queries from getting cached? Would it make a difference if the query used a terms filter instead?
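
Regarding the query logging mentioned above: one general way on 2.x to capture every query an index receives (not necessarily how it was done here) is to drop the search slow log threshold to zero with a dynamic settings update, again using index_name as a stand-in:

PUT /index_name/_settings
{
  "index.search.slowlog.threshold.query.info": "0s"
}

Every search then gets written to the search slow log at info level, which is noisy but handy for short debugging sessions.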

I am not convinced this query is the problem, since it should create either a single Lucene TermsQuery or many small TermQuery objects (note the s at the end of TermsQuery). But here we have a couple hundred TermQuery objects that are very large. Can your heap analysis tool tell us more about what makes these TermQuery objects so large?

Here's the uniqueQueries field of LRUQueryCache expanded:

[screenshot: the uniqueQueries field of LRUQueryCache expanded in the heap analysis tool]

Most of the entries are small, but there are some very big ones amongst them, all of which seem to have a perReaderTermState field retaining quite a bit of memory.

Thanks, I think I see the problem: queries in the cache are still referencing old segments. I'll work on a fix.

Thanks a lot! In the meantime, is there a way to e.g. disable the query cache altogether? Currently we need to manually restart each node in the cluster daily to avoid meltdown due to OOM errors 😄

On 2.x, you should be able to do that with the undocumented index.queries.cache.type index setting. You will need to close your index, set this setting to none, and finally open the index again for it to take effect.
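
For anyone else applying this workaround, the close / update settings / open sequence looks roughly like this, with index_name standing in for the real index:

POST /index_name/_close

PUT /index_name/_settings
{
  "index.queries.cache.type": "none"
}

POST /index_name/_open

The resulting settings can be checked with GET /index_name/_settings before sending traffic to the index again.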

Great, I'll give it a go. Thanks again!

Based on the first 6 hours of monitoring, it looks like the workaround of setting index.queries.cache.type to none has helped, and the heap is no longer constantly growing.

This will be fixed in the upcoming Elasticsearch 5.2.1 and 5.3.0 releases. We're still looking into whether/when we can get 2.x patched as well.

Hello @jpountz

Has the above fix made it into 5.2.1? I don't see it mentioned in the release notes.

Thanks.

Yes, it has. I guess it did not make it into the release notes since the bug was on the Lucene side, but it was fixed when we upgraded to Lucene 6.4.1: https://github.com/elastic/elasticsearch/pull/22978.

Thanks @jpountz

Will the fix be backported to the 2.x version? Thanks.

It has been backported but not released yet: https://github.com/elastic/elasticsearch/pull/23162.
