Elasticsearch: Possible memory leak in IndicesQueryCache / LRUQueryCache

Created on 23 Jan 2017 · 16 comments · Source: elastic/elasticsearch

Elasticsearch version:
2.4.2

Plugins installed: [ cloud-aws, license, marvel-agent ]

JVM version:
java version "1.7.0_121"
OpenJDK Runtime Environment (amzn-2.6.8.1.69.amzn1-x86_64 u121-b00)
OpenJDK 64-Bit Server VM (build 24.121-b00, mixed mode)

OS version:
4.4.35-33.55.amzn1.x86_64 #1 SMP Tue Dec 6 20:30:04 UTC 2016 GNU/Linux

Description of the problem including expected versus actual behavior:
We recently migrated our 10-node, 5.5-billion-document cluster from 1.3.4 to 2.4.2 and started seeing continuously growing memory usage, eventually leading to OOM errors:

[screenshot: node heap usage graph showing continuous growth]
(This example node was restarted manually around 21:00)

Analyzing the heap dump suggests a possible memory leak in IndicesQueryCache$1:

[screenshot: heap dump analysis pointing at IndicesQueryCache$1 as a major memory retainer]

The cluster is fairly write-heavy (average indexing rate 600/s, search rate 6/s) and has ES_HEAP_SIZE set to 31g. The old cluster (1.3.4) was running with the same configuration and did not have this issue. Here's the elasticsearch.yml configuration from the new cluster:

cloud.aws.region: us-east-1

discovery.type: ec2
discovery.ec2.groups: cluster-security-group
discovery.ec2.any_group: false
discovery.zen.minimum_master_nodes: 6

network.host: [_local_, _ec2_]

cluster.name: my-cluster

gateway.recover_after_nodes: 8
gateway.expected_nodes: 10
gateway.recover_after_time: 5m

script.inline: true
script.indexed: true

GET /index_name/_stats/query_cache shows that memory_size_in_bytes grows steadily, but it doesn't exceed the default 10% limit. Manually clearing the query caches using POST /index_name/_cache/clear doesn't seem to have a clear effect on heap usage shown in the graphs.
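
For reference, this is roughly what those checks look like; index_name is a stand-in for the real index, the response is abridged, and the numbers are placeholders:

GET /index_name/_stats/query_cache

{
  "_all": {
    "total": {
      "query_cache": {
        "memory_size_in_bytes": 1234567890,
        "evictions": 0
      }
    }
  }
}

POST /index_name/_cache/clear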

Any ideas what might be preventing GC from freeing up the memory from IndicesQueryCache?

Label: discuss

All 16 comments

The cache assumes that queries are not worth accounting for in memory usage, since they are expected to be small. However, here you seem to have 887 term queries using 22GB of memory, which is roughly 25MB per query. I am curious: do you happen to run queries on gigantic terms?

Thanks for the quick response! I turned on query logging and found out that's exactly what a client application is doing:

{
  "bool": {
    "must": [
      {
        "wildcard": {
          "name_raw_lowercase": {
            "value": "something"
          }
        }
      },
      {
        "terms": {
          "other_id": [
            12345678901,
            12345678902,
            12345678903,
            ... (several hundred entries)
          ]
        }
      }
    ]
  }
}

Is there a way to prevent these queries from getting cached? Would it make a difference if the query used a terms filter instead?
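
Regarding the query logging mentioned above: one general way on 2.x to capture every query an index receives (not necessarily how it was done here) is to drop the search slow log threshold to zero with a dynamic settings update, again using index_name as a stand-in:

PUT /index_name/_settings
{
  "index.search.slowlog.threshold.query.info": "0s"
}

Every search then gets written to the search slow log at info level, which is noisy but handy for short debugging sessions.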

I am not convinced this query is the problem, since it should create either a single Lucene TermsQuery or many small TermQuery objects (note the s at the end of TermsQuery). But here we have a couple hundred TermQuery objects that are very large. Can your heap analysis tool tell us more about what makes these TermQuery objects so large?

Here's the uniqueQueries field of LRUQueryCache expanded:

[screenshot: the uniqueQueries field of LRUQueryCache expanded in the heap analysis tool]

Most of the entries are small, but there are some very big ones amongst them, all of which seem to have a perReaderTermState field retaining quite a bit of memory.

Thanks, I think I see the problem: queries in the cache are still referencing old segments. I'll work on a fix.

Thanks a lot! In the meantime, is there a way to e.g. disable the query cache altogether? Currently we need to manually restart each node in the cluster daily to avoid meltdown due to OOM errors 😄

On 2.x, you should be able to do that with the undocumented index.queries.cache.type index setting. You will need to close your index, set this setting to none, and finally open the index again for it to take effect.
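
For anyone else applying this workaround, the close / update settings / open sequence looks roughly like this, with index_name standing in for the real index:

POST /index_name/_close

PUT /index_name/_settings
{
  "index.queries.cache.type": "none"
}

POST /index_name/_open

The resulting settings can be checked with GET /index_name/_settings before sending traffic to the index again.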

Great, I'll give it a go. Thanks again!

Based on the first 6 hours of monitoring, it looks like the workaround of setting index.queries.cache.type to none has helped, and the heap is no longer constantly growing.

This will be fixed in the upcoming Elasticsearch 5.2.1 and 5.3.0 releases. We're still looking into whether/when we can get 2.x patched as well.

Hello @jpountz

Has the above fix made it into 5.2.1? I don't see it mentioned in the release notes.

Thanks.

Yes, it has. I guess it did not make it into the release notes since the bug was on the Lucene side, but it was fixed when we upgraded to Lucene 6.4.1: https://github.com/elastic/elasticsearch/pull/22978.

Thanks @jpountz

Will the fix be backported to the 2.x version? Thanks.

It has been backported but not released yet: https://github.com/elastic/elasticsearch/pull/23162.
