Elasticsuite: High JVM heap usage resulting in GC loop and unresponsive ElasticSearch

Created on 23 Jan 2019 · 26 Comments · Source: Smile-SA/elasticsuite

Preconditions

Magento Version : 2.2.6 EE (Cloud)
ElasticSuite Version : 2.6.4
Environment : Production
JVM Heap Size : 8G
ElasticSuite Config : config

Steps to reproduce

After the Elasticsearch service has been running for a while, JVM heap usage reaches 100%, resulting in a GC loop and an unresponsive ES. ES has to be restarted and reindexed to get the site working again.

Are there any recommended settings that should be used on ES based on a number of products/categories/customers?

We were using Elasticsearch without ElasticSuite before, with a smaller heap size, and it worked without any issues, so we think the problem is related to the ElasticSuite module or a misconfiguration of it.

bug

adjogic

All 26 comments

Hello,

How many products and categories do you currently have?

I'm not sure you need 3 shards, unless you have a very large catalog.

Regards

romainruaud on 23 Jan 2019

Hey,

There are around 20k products and 300 categories in 2 stores.

3 shards were configured because Magento Cloud environment has 3 nodes and they suggested the setting so the load spreads on all nodes.

Another thing to note is there are ~50k API update product calls made per day. Does that affect the number of indices based on the Indices Name Pattern?

Thanks

adjogic on 24 Jan 2019

API calls should not bother Elasticsearch.

Could you please check (if you have the access to do so) how many indices you currently have in Elasticsearch?

This can be done via curl -XGET your-es-server:9200/_cat/indices?v

Regards

And paste the results here.
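Once the listing is available, it can be summarized rather than read row by row. The following is a minimal sketch, assuming the output of the _cat/indices?v call has been saved as text. The function name is hypothetical, and the "_tracking_log_" marker is an assumption based on the index names reported later in this thread, not an official ElasticSuite convention.

```python
def summarize_cat_indices(cat_output):
    """Count the indices listed in `_cat/indices?v` output.

    The first line is expected to be the header row; the position of the
    `index` column is read from it rather than hard-coded.
    """
    lines = [line.split() for line in cat_output.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    name_col = header.index("index")
    names = [row[name_col] for row in rows]
    # Assumption: tracking indices contain "_tracking_log_" in their name.
    tracking = [name for name in names if "_tracking_log_" in name]
    return {
        "total": len(names),
        "tracking": len(tracking),
        "other": len(names) - len(tracking),
    }


# Two sample rows in the shape returned by `_cat/indices?v`.
sample = """\
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open magento2_default_tracking_log_event_20190125 ZH42lwoQQkmGYTQqOOQoow 3 2 3582 0 6.1mb 2mb
green open magento2_default_cms_page_20190125_080611 bIyW4Vs4RtKCJc6ciX_4Mg 3 2 109 2 2.9mb 1012.5kb
"""

print(summarize_cat_indices(sample))
```

A large "tracking" count relative to "other" would point at the daily tracking indices as the main source of index growth.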

romainruaud on 25 Jan 2019

We are having the same issue on Magento Cloud and the site has just gone down again because of it. Here's our output:

~
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open magento2_default_thesaurus_20190125_080612 ao-3m0sNQZKlgUzrEAgjog 3 2 0 0 1.3kb 477b
green open magento2_default_catalog_product_20190125_080321 fe0ctvT2RHKJHvzG9E6OEw 3 2 78939 20651 255mb 84.7mb
green open staging_default_skywire_wordpress_20190125_171513 llTF3nuPTri472ow8JKDFA 1 0 434 0 2.6mb 2.6mb
green open magento2_default_catalog_category_20190125_080525 laD2mipLT_imoeb9SBmgAg 3 2 178 0 1mb 359.9kb
green open magento2_default_tracking_log_event_20190125 ZH42lwoQQkmGYTQqOOQoow 3 2 3582 0 6.1mb 2mb
green open staging_default_cms_page_20190125_171601 3KBARJWKRTyQE5-o5efuCA 1 0 95 1 830.1kb 830.1kb
green open staging_default_catalog_category_20190125_171511 QgJC0p-fQR6mui-1pauqbA 1 0 142 0 270.3kb 270.3kb
green open magento2_default_skywire_wordpress_20190125_080529 TTNITi_2QCqbfTMyWRnLWg 3 2 434 0 8.4mb 2.8mb
green open magento2_default_cms_page_20190125_080611 bIyW4Vs4RtKCJc6ciX_4Mg 3 2 109 2 2.9mb 1012.5kb
green open magento2_default_tracking_log_session_20190125 MU-y7AD6RHOpsgd9bhxpKQ 3 2 1179 13 1.4mb 492.5kb
green open staging_default_thesaurus_20190125_171604 f0vEokn3R8eh6FFb1V-cWg 1 0 0 0 130b 130b
~

valguss on 25 Jan 2019

The last time we had the issue, a week ago, there were more than 230 indexes.
After that, we disabled "Smile_ElasticsuiteTracker" module, updated Indices Name Pattern to {{YYYYMMdd}}, changed Shards per index to 1 and Replicas to 0. Not sure which one helped.

Thanks

adjogic on 29 Jan 2019

@adjogic : most probably, it is due to the number of tracking indices. Changing the pattern like you did is discouraged because the uniqueness of index names is not guaranteed if "dd" is the lowest level of your naming, and it has no effect on the number of indices. You should also not set replicas to 0, because if you lose a node, no other node will be able to take over.
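The uniqueness concern can be illustrated with a small sketch. It assumes Python strftime formats as stand-ins for the Indices Name Pattern placeholders; the helper name is hypothetical, and the time-based suffix simply mirrors the names seen in the listing above (for example _20190125_080321).

```python
from datetime import datetime


def index_name(alias, when, suffix_format):
    """Build an index name from an alias and a timestamp suffix."""
    return f"{alias}_{when.strftime(suffix_format)}"


alias = "magento2_default_catalog_product"
# Two full reindexes on the same day.
first = datetime(2019, 1, 25, 8, 3, 21)
second = datetime(2019, 1, 25, 17, 15, 13)

daily = "%Y%m%d"          # stops at the day, like {{YYYYMMdd}}
timed = "%Y%m%d_%H%M%S"   # goes down to the second

# With a day-level pattern both reindexes target the same name.
print(index_name(alias, first, daily) == index_name(alias, second, daily))
# With a second-level pattern each reindex gets its own name.
print(index_name(alias, first, timed))
print(index_name(alias, second, timed))
```

Either pattern still produces one new index per reindex, which is why changing the pattern does not reduce the number of indices.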

About the tracker, I'm not sure how we should deal with it. Either:

implementing a purge with a maximum retention delay (keeping 15 days or 1 month could be enough),
or just telling people to use Curator, but that would require additional system knowledge, and it could be a pain to configure such a tool in Cloud environments.

Most probably, the first solution will suit most users best.
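The purge idea could look roughly like the sketch below. This is not ElasticSuite's implementation: the function name, the regular expression, and the assumption that tracking indices end in a YYYYMMDD date are all inferred from the index names listed in this thread.

```python
import re
from datetime import date, timedelta

# Assumption: daily tracking indices end with "_tracking_log_<type>_<YYYYMMDD>".
TRACKING_NAME = re.compile(r"_tracking_log_(?:event|session)_(\d{8})$")


def expired_tracking_indices(names, today, retention_days):
    """Return the tracking indices whose date suffix is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        match = TRACKING_NAME.search(name)
        if match is None:
            continue  # not a daily tracking index, never a purge candidate
        stamp = match.group(1)
        created = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))
        if created < cutoff:
            expired.append(name)
    return expired


names = [
    "magento2_default_tracking_log_event_20190116",
    "magento2_default_tracking_log_session_20190130",
    "magento2_default_catalog_product_20190130_010110",
]
print(expired_tracking_indices(names, date(2019, 2, 5), 15))
```

The returned names would then be passed to the Elasticsearch delete index API (or closed instead of deleted). Catalog indices never match the pattern, so they are left alone.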

@valguss is it a full listing of indices? If yes, they are actually tiny and should not take all the memory of ES, unless the server is a really small one. I doubt it is a full listing because I do not see any "staging_default_catalog_product" index.

However, there is still a "problem" or misconception in the Magento Cloud architecture: as you can see, they are actually using the same ES cluster for both staging and production, which is a bad approach imho. This can lead to indices being overwritten if you do not properly configure the indices alias between prod and staging, but worse, it also means that if you run a resource-consuming task on staging (like a full reindex), production will suffer as well...

Regards

romainruaud on 30 Jan 2019

Here are the indices after a few days:
~
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open magento2_default_tracking_log_session_20190118 U5WDPek1RxWZFdZ53cA8MQ 3 2 6 0 29.3kb 29.3kb
green open magento2_default_catalog_product_20190130_010110 4svPbXaQTzK4NHAIpAzxwA 1 0 79043 34795 120.4mb 120.4mb
yellow open magento2_default_tracking_log_event_20190116 -9tAoPLXRt-J8ru-8Yvc4w 3 2 1383 0 726.8kb 726.8kb
yellow open magento2_default_tracking_log_event_20190124 cTKV6rBhRua7kALcrzXteA 1 1 22 0 69.8kb 69.8kb
yellow open magento2_default_tracking_log_event_20190123 qQEL_WnhSNGoiafic3QHvg 1 1 2 0 18.1kb 18.1kb
yellow open magento2_default_tracking_log_session_20190123 0OK6miI0QzGnTqpHm6nuEw 1 1 2 0 8.4kb 8.4kb
yellow open magento2_default_tracking_log_event_20190122 14yFG7gaRsevQYfFTCHXXw 1 1 26 0 95.5kb 95.5kb
yellow open magento2_default_skywire_wordpress_20190129_141701 lJg1ZCPkQxG-JYsEukFJ1g 3 2 435 0 2.8mb 2.8mb
yellow open magento2_default_tracking_log_event_20190118 NKKGr3yzSN-W7D6KqePsRA 3 2 65 0 152.2kb 152.2kb
yellow open magento2_default_tracking_log_session_20190129 1bF_z4s-S0iEKR0f72Sf3A 3 2 920 10 394.8kb 394.8kb
yellow open magento2_default_thesaurus_20190129_141702 xNpWpOB0SLmsTLhlMXdH0w 3 2 0 0 477b 477b
yellow open magento2_default_tracking_log_event_20190117 9wo1hriVQ4SWVebsSDE0kA 3 2 1 0 9.4kb 9.4kb
yellow open magento2_default_catalog_category_20190129_141700 gePlwS_wTtWrxMNnTgDXFA 3 2 178 0 359.9kb 359.9kb
green open magento2_default_tracking_log_event_20190130 c-a4jziSRYKHkH8k3KNdVQ 1 0 726 0 607.6kb 607.6kb
yellow open magento2_default_tracking_log_session_20190116 d-401QZUQ3q35v6MTma2AQ 3 2 218 14 163.8kb 163.8kb
yellow open magento2_default_tracking_log_session_20190124 1sC424qzRda_fN1vmw3RyA 1 1 4 0 13.9kb 13.9kb
yellow open magento2_default_cms_page_20190129_141701 npY3UmtmSnuq3JVu3jVFhA 3 2 109 0 978.3kb 978.3kb
yellow open magento2_default_tracking_log_event_20190128 usw1yzlbTNGQm4TnH1zMGA 3 2 4287 0 2.1mb 2.1mb
yellow open magento2_default_tracking_log_event_20190125 iRGX-AKiSkujln-IAopkYg 1 1 54 0 46.5kb 46.5kb
yellow open magento2_default_tracking_log_session_20190119 RcsCHaVPQjWu1Bdw2wksWw 3 2 1 0 4.8kb 4.8kb
yellow open magento2_default_tracking_log_session_20190122 qpcgXquQTqyz2cJNVUsALg 1 1 3 0 14.3kb 14.3kb
yellow open magento2_default_tracking_log_event_20190121 AuldVaNKT6KWR6oX0Qpo2A 3 2 52 0 219.4kb 219.4kb
yellow open magento2_default_tracking_log_session_20190125 IGTiamveT_6Vkrs3cs7u0A 1 1 5 2 24.1kb 24.1kb
green open magento2_default_tracking_log_session_20190130 P0bxFIjRTXuUVlmbP_rMMg 1 0 144 2 164.7kb 164.7kb
yellow open magento2_default_tracking_log_event_20190129 3dLX7pqwSmylxU0i5P5EWw 3 2 4669 0 2.1mb 2.1mb
yellow open magento2_default_tracking_log_session_20190121 HrAITcgoS0yjWteqnkUR4Q 3 2 1 0 6kb 6kb
yellow open magento2_default_tracking_log_event_20190119 5CuENB46STyarp67RmGCPA 3 2 2 0 24kb 24kb
yellow open magento2_default_tracking_log_session_20190128 zFpEYDL-S3GUkd24sZPMag 3 2 795 10 357.3kb 357.3kb
yellow open magento2_default_tracking_log_session_20190117 yQy2EgdUSqyqrRR7EUZQuQ 3 2 1 0 4.6kb 4.6kb
~

valguss on 30 Jan 2019

I suspect the issue is around the amount of indexes being created for tracking

valguss on 30 Jan 2019

Yep, I agree

Since tracking data is not used for now, you can disable tracking and delete the indices.

But in the upcoming 2.8.0 version, tracking data will be required to benefit from the search analytics dashboard ( #589 ).

We'll also deliver a feature to enable automatic cleaning of these indices (maybe just closing them would be enough to drastically reduce their impact).

Regards

romainruaud on 30 Jan 2019

Thanks. I've now disabled tracking and will look out for the new feature in the future. I'll report back here if we continue to see any issues.

Cheers

Tom

valguss on 30 Jan 2019

@romainruaud thanks for the answers!
We will change the indices pattern back to the recommended one and update shards and replicas, with tracking still disabled, to see if everything still works as expected.

Based on the changelog, disabling tracking was fixed in version 2.6.6, so we won't need to disable the module in config.php anymore?

For us, even if the tracking indices were purged after 15 days, it would still cause an issue, as the site was crashing at least once a day.

We also didn't disable tracking on staging, which contributed to the number of indices in ES.

Thanks

adjogic on 30 Jan 2019

Yes, tracking can now be disabled properly via the back-office, and the module can remain enabled in config.php.

Self-reminder : https://www.elastic.co/fr/blog/elastic-stack-6-6-0-released

Probably the Frozen Indices feature is a good way to keep old data by reducing memory usage. Too bad it's only available with Elasticsearch 6.6.

romainruaud on 30 Jan 2019

so, 9 days on and no crash. But to be fair we are no longer on the cloud. Hopefully I haven't spoken too soon
Thanks
Tom

valguss on 8 Feb 2019

To be honest, having several dozens of indices of <1Mb each should be nothing for a properly sized Elasticsearch cluster... not sure how it's handled in Magento Cloud but it might need improvements...

Another strange thing is the high % of indices indicated as "yellow" on your previous listing, meaning these indices are trying to have a replica but are not able to get it. It might be due to the fact that the ES "cluster" is in fact ... a simple node... which is not really suitable for a production environment.
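The reasoning about yellow indices can be written down as a small check. This is a simplified sketch with a hypothetical function name: it only models the rule that a replica is never allocated on the node holding its primary, and ignores other allocation constraints such as disk watermarks or allocation filters.

```python
def expected_index_health(replicas, data_nodes):
    """Estimate index health from the replica count and the number of data nodes.

    Each copy of a shard (1 primary + N replicas) needs its own data node,
    so an index can only be green with at least N + 1 data nodes.
    """
    if data_nodes < 1:
        return "red"  # nowhere to allocate even the primaries
    return "green" if data_nodes >= replicas + 1 else "yellow"


# On a single node, any index with at least one replica stays yellow.
print(expected_index_health(replicas=2, data_nodes=1))
print(expected_index_health(replicas=1, data_nodes=1))
# With no replicas, a single node is enough.
print(expected_index_health(replicas=0, data_nodes=1))
# The advertised 3-node cluster could hold 2 replicas.
print(expected_index_health(replicas=2, data_nodes=3))
```

This matches the earlier listing, where the only green indices are the ones configured with 0 replicas.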

Anyway, we'll stick to the plan and make it possible to configure a retention delay for automatically cleaning these indices, but imho a properly sized stock Elasticsearch server should be able to store plenty of them without any harm.

romainruaud on 12 Feb 2019

Hello,
We are experiencing similar problems too (also hosted on Magento Cloud).
We currently have 173 indices in ES (with prod and staging indices on the same instance).

dooblem on 25 Feb 2019

We disabled tracking yesterday. The number of indices is now about 80. But it went down again this morning. As I'm writing this, I'm waiting for Magento Cloud to restart ES.
What is weird is that the total size of the indices is rather small (below 300M). And we have a cluster of 3 nodes with 10G each! All full.
If you know of any command, or what to look at, to get more insight into ES memory usage, you'll make my day.
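One place to look is the nodes stats API (GET /_nodes/stats/jvm), which reports heap usage per node. The sketch below assumes the JSON response has already been fetched and parsed; the function names are hypothetical, and the node names and numbers in the sample are made up for illustration.

```python
def heap_usage_by_node(nodes_stats):
    """Map node name to heap usage (percent) from a `_nodes/stats/jvm` response."""
    usage = {}
    for node in nodes_stats["nodes"].values():
        usage[node["name"]] = node["jvm"]["mem"]["heap_used_percent"]
    return usage


def nodes_over_threshold(nodes_stats, threshold=85):
    """List the nodes whose heap usage is at or above the given percentage."""
    usage = heap_usage_by_node(nodes_stats)
    return sorted(name for name, percent in usage.items() if percent >= threshold)


# Hypothetical, trimmed-down response: only the fields read above are kept.
sample_stats = {
    "nodes": {
        "node-id-1": {"name": "es-1", "jvm": {"mem": {"heap_used_percent": 97}}},
        "node-id-2": {"name": "es-2", "jvm": {"mem": {"heap_used_percent": 42}}},
    }
}

print(heap_usage_by_node(sample_stats))
print(nodes_over_threshold(sample_stats))
```

The 85% threshold is an arbitrary choice for the example, not an official recommendation.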

dooblem on 26 Feb 2019

If you check elasticsearch.log, you'll probably see something similar to this:

[2019-02-04T10:56:42,298][WARN ][i.n.c.n.NioEventLoop     ] Unexpected exception in the selector loop.
[2019-02-04T10:56:42,307][WARN ][o.e.m.j.JvmGcMonitorService] [Lu9] [gc][old][30170][10] duration [1.3m], collections [4]/[53.3s], total [1.3m]/[2.9m], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [32.1mb]->[33.2mb]/[33.2mb]}{[old] [7.6gb]->[7.6gb]/[7.6gb]}
[2019-02-04T10:56:42,307][WARN ][o.e.m.j.JvmGcMonitorService] [Lu9] [gc][30170] overhead, spent [1.3m] collecting in the last [53.3s]
[2019-02-04T10:56:42,298][WARN ][i.n.c.n.NioEventLoop     ] Unexpected exception in the selector loop.
[2019-02-04T10:56:42,352][INFO ][o.e.d.z.ZenDiscovery     ] [Lu9] master_left [{-xxx}{-xxxxxx}{zzz}{192.168.x.x}{192.168.x.x:9300}], reason [transport disconnected]
[2019-02-04T10:56:42,372][WARN ][o.e.d.z.ZenDiscovery     ] [Lu9] master left (reason = transport disconnected), current nodes: nodes:
[2019-02-04T10:56:42,401][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [Lu9] connection exception while trying to forward request with action name [cluster:monitor/health] to master node [{-xxx}{-xxxxxx}{zzz}{192.168.x.x}{192.168.x.x:9300}], scheduling a retry. Error: [org.elasticsearch.transport.NodeNotConnectedException: [-xxx][192.168.x.x:9300] Node not connected]
[2019-02-04T10:56:42,404][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [Lu9] connection exception while trying to forward request with action name [cluster:monitor/health] to master node [{-xxx}{-xxxxxx}{zzz}{192.168.x.x}{192.168.x.x:9300}], scheduling a retry. Error: [org.elasticsearch.transport.NodeNotConnectedException: [-xxx][192.168.x.x:9300] Node not connected]
[2019-02-04T10:56:42,395][WARN ][i.n.b.ServerBootstrap    ] Failed to register an accepted channel: [id: 0x865eb837, L:0.0.0.0/0.0.0.0:9200]
[2019-02-04T10:56:42,407][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [Lu9] failed to execute on node [V-yyy]
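The JvmGcMonitorService line above is the one that shows the GC loop: the heap is the same size before and after the collection, and equal to the maximum. A minimal sketch for spotting such lines in elasticsearch.log follows; the function name and the 95% cutoff are assumptions for illustration.

```python
import re

# Matches the "memory [before]->[after]/[max]" part of a JvmGcMonitorService line.
GC_MEMORY = re.compile(
    r"memory \[([\d.]+)(b|kb|mb|gb)\]->\[([\d.]+)(b|kb|mb|gb)\]/\[([\d.]+)(b|kb|mb|gb)\]"
)
UNIT_BYTES = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def gc_reclaimed_nothing(log_line, full_ratio=0.95):
    """True when a GC log line shows a nearly full heap that did not shrink."""
    match = GC_MEMORY.search(log_line)
    if match is None:
        return False  # not a GC memory line
    before = float(match.group(1)) * UNIT_BYTES[match.group(2)]
    after = float(match.group(3)) * UNIT_BYTES[match.group(4)]
    maximum = float(match.group(5)) * UNIT_BYTES[match.group(6)]
    return after >= before and after >= full_ratio * maximum


# Fragment copied from the log excerpt above.
line = (
    "[gc][old][30170][10] duration [1.3m], collections [4]/[53.3s], "
    "total [1.3m]/[2.9m], memory [7.9gb]->[7.9gb]/[7.9gb]"
)
print(gc_reclaimed_nothing(line))
```

Repeated matches over a short period suggest the node spends its time collecting garbage without freeing anything, which is consistent with the "transport disconnected" messages that follow in the log.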

adjogic on 26 Feb 2019

I got exactly the same this morning. But although heap usage was high, I see no OutOfMemoryError in this morning's log.
Looking at the logs you sent and the ones I got, it really seems like a network problem: "transport disconnected".
If so, this issue has nothing to do with ElasticSuite...

Some interesting thread I found: https://discuss.elastic.co/t/possible-causes-for-transport-disconnected-errors-in-node-discovery/13651/3

dooblem on 26 Feb 2019

I agree. As I said before, hundreds of indices containing 1 or 2 MB each is nothing for Elasticsearch to handle, especially on a three-node cluster.

But this is only true if the cluster is properly sized and designed by the hosting company.

From our own experience at Smile (we are also a hosting company), we do not face such issues for our customers hosted on our "bare metal" servers; that's why I'm curious to know how the Magento Cloud Elasticsearch cluster is built.

romainruaud on 27 Feb 2019

Sorry to be flippant but I'd say "badly"

valguss on 27 Feb 2019

One of the nodes went down this morning again, saturating the heap space (10G).

I found an interesting thread: https://discuss.elastic.co/t/overhead-and-heap-issues/117888

Then I read this insightful page: https://www.elastic.co/fr/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

I should also have read the comments in this issue more carefully. It turns out that we had 3 shards per index configured, with 2 replicas. With staging and production hosted on the same cluster, the cluster had more than 800 shards to handle.

So this is what we did:

disable tracking, remove the tracking indices manually
for production: in the ElasticSuite config, set the number of shards to 1, and the number of replicas to 2
for staging, as it is stored in the same ES, set the number of shards to 1, and the number of replicas to 0 (no need for redundancy/performance on staging)
reindex staging+production

The number of shards in the cluster is now below 200.
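The effect of these settings on the shard count is simple arithmetic: each index holds its primaries plus one full copy of them per replica. A minimal sketch, with a hypothetical helper name and a hypothetical count of 100 indices:

```python
def shards_per_index(primaries, replicas):
    """Total shards for one index: the primaries plus one copy of them per replica."""
    return primaries * (1 + replicas)


print(shards_per_index(3, 2))  # 9 shards per index with the original settings
print(shards_per_index(1, 2))  # 3 shards per index with the new production settings
print(shards_per_index(1, 0))  # 1 shard per index with the new staging settings

# For a hypothetical 100 indices, that is 900 shards before and 300 after.
print(100 * shards_per_index(3, 2), 100 * shards_per_index(1, 2))
```

With one new pair of tracking indices per day, the original settings added 18 shards daily per environment.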

So for magento cloud customers, I would recommend those settings.

Would it also be possible to add something in the doc, or a comment next to the shards and replicas settings in the Magento back-office? I think users should change those settings only if they really know what they are doing.

(and our developers were also using those settings for their local Elasticsearch instances. Obviously their ES instances were consuming a lot of memory and were very slow)

dooblem on 7 Mar 2019

Hi @dooblem, did your changes make a significant improvement in your case? We had exactly the same issue on Magento Cloud and tried what you suggested, but although the crashes happen less frequently, they still happen, and for the same reason (heap space exhaustion).

Interested in getting your feedback after 20 days.

For information, here's our ES health:

$ curl -sS localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 93,
  "active_shards" : 195,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

xi-ao on 27 Mar 2019

Funny thing: @xi-ao and I work at the same company. We are both talking about the same project.

But yes, as stated by @xi-ao, we are still experiencing the ES crashes on Magento Cloud even with the reduced number of shards.

Magento Cloud support keeps telling us that since we are not using the official M2 Elasticsearch module, they can't help us.

The thing is, we are using ElasticSuite on many other projects hosted elsewhere with no problem at all.

We will try upgrading to ES 6 to see if it helps, or try to use an external ES server on AWS, or fall back to the standard Magento module with a loss of features for our customer.

Any other M2 Cloud users? @adjogic are you still experiencing issues, or did they go away with your config tuning?

dooblem on 4 Apr 2019

So !

According to Magento Cloud dev team :

if you are using ElasticSuite on a Magento Cloud project, you have to update your ece-tools package to version .19+ (for those who don't know, it's the package used for 'provisioning' the Magento Cloud environments).

On previous versions there was a misconfiguration that caused performance issues on the Elasticsearch server: it forced it to run with 1 shard on 1 node, even though the standard Magento Cloud architecture is a 3-node cluster. That has been fixed by the Magento Cloud team.

This is also stated in their official documentation now :

https://devdocs.magento.com/guides/v2.3/cloud/project/project-conf-files_services-elastic.html#elasticsearch-plugins

What appeared in version .19 of the ECE-Tools is a dedicated configuration wrapper for Elasticsuite :

https://github.com/magento/ece-tools/blob/develop/src/Config/SearchEngine/ElasticSuite.php

romainruaud on 20 Jun 2019

Thanks so much @romainruaud for your response and your help.
Unfortunately we could not wait that long and were forced to go back to the standard ES module a few weeks ago. But that's really good to know; we may need it in the future anyway.

xi-ao on 24 Jun 2019

I'm closing this one since the Magento Cloud team handled it.

Thanks everybody for participating!

romainruaud on 8 Jul 2019
