Elasticsearch: verbose logging in Master creating log volume

Created on 8 Jun 2020 · 11 comments · Source: elastic/elasticsearch

### Problem: Too much logging of long exception stack traces causes the master node's log files to be rotated every few minutes.

Context:
When a node connects to the cluster, the master node performs an AsyncShardFetch for primary and replica shards. If the node disconnects during this async fetch, the master node prints a stack trace for every shard. This is a particular issue with a high number of shards per node and with multiple nodes joining and leaving at the same time.

Printing the same stack trace for each shard does not seem beneficial, so we are considering sampling-based logging. Alternatively, since this stack trace is not very useful for debugging, we could also trim it.

Repro cluster configuration:
ES version : 7.7
Data nodes = 3
Master nodes = 3
Number of shards = 110950
Number of primary shards = 38949
Each shard holds ~10-20 KB of data

Repro Steps:
  1. Enabled TRACE logging in the cluster to see when shard fetching starts.
  2. Repeatedly killed one of the data nodes so that the AsyncFetch enters an exception state.

Replica (shard_store) logs:

```
[2020-06-04T22:33:36,085][TRACE][o.e.g.G.InternalReplicaShardAllocator] [c063a0601ff59493187b8174cf1d7674] [testlogidx13][762] processing failure FailedNodeException[Failed node [AGjLbt01ShOlykPZjz8_Cw]]; nested: NodeDisconnectedException[[f58fa56a9f0968ad07c38dc9b75ed022][10.212.61.242:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected]; for [shard_store]
[2020-06-04T22:33:36,085][WARN ][o.e.g.G.InternalReplicaShardAllocator] [c063a0601ff59493187b8174cf1d7674] [testlogidx13][762]: failed to list shard for shard_store on node [AGjLbt01ShOlykPZjz8_Cw]
org.elasticsearch.action.FailedNodeException: Failed node [AGjLbt01ShOlykPZjz8_Cw]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:221) [elasticsearch-7.7.0.jar:7.7.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$100(TransportNodesAction.java:142) [elasticsearch-7.7.0.jar:7.7.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:196) [elasticsearch-7.7.0.jar:7.7.0]
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) [elasticsearch-7.7.0.jar:7.7.0]
    at org.elasticsearch.transport.TransportService$8.run(TransportService.java:1001) [elasticsearch-7.7.0.jar:7.7.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.0.jar:7.7.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [f58fa56a9f0968ad07c38dc9b75ed022][10.212.61.242:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected
```


Primary (shard_started) logs:

```
[2020-06-04T21:03:39,922][WARN ][o.e.g.G.InternalPrimaryShardAllocator] [c063a0601ff59493187b8174cf1d7674] [testlogidx6][215]: failed to list shard for shard_started on node [MowY2Vj5T4mJ2buZo4k8Wg]
org.elasticsearch.action.FailedNodeException: Failed node [MowY2Vj5T4mJ2buZo4k8Wg]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:221) [elasticsearch-7.7.0.jar:7.7.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$100(TransportNodesAction.java:142) [elasticsearch-7.7.0.jar:7.7.0]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:196) [elasticsearch-7.7.0.jar:7.7.0]
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) [elasticsearch-7.7.0.jar:7.7.0]
    at org.elasticsearch.transport.TransportService$8.run(TransportService.java:1001) [elasticsearch-7.7.0.jar:7.7.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.0.jar:7.7.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [d4694e2b17768ef86d7a97d68f8d1fc8][10.212.18.205:9300][internal:gateway/local/started_shards[n]] disconnected
[2020-06-04T21:03:39,922][TRACE][o.e.g.G.InternalPrimaryShardAllocator] [c063a0601ff59493187b8174cf1d7674] [testlogidx6][215] scheduling reroute for post_response
```

Previous related issues:
https://github.com/elastic/elasticsearch/issues/57501

Labels: :Distributed/Allocation, >bug, Distributed

All 11 comments

@DaveCTurner I am evaluating approaches based on trimming the exception or on sampling-based logging.
Please let me know your thoughts.

Option-1 [Trimming down the stack trace]
Since the exception trace is not very useful here, we can just print the exception's metadata (see the sketch after this list), such as:

  1. Shard ID
  2. Node ID
  3. Type of Fetch such as [shard_stores, shard_store or shard_started]
  4. Fetching round
  5. Detailed Message [message and class names of nested exceptions]
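
A minimal sketch of what Option-1 could look like, as a plain-Java helper; `summarizeFailure` and its parameter list are illustrative names, not existing Elasticsearch API:

```java
// Hypothetical Option-1 helper: build a one-line summary of the failure
// instead of logging the whole stack trace.
final class FailureSummaries {
    /** Walks the cause chain, keeping only class names and messages,
     *  mirroring the "nested:" style seen in the logs above. */
    static String summarizeFailure(String shardId, String nodeId, String fetchType,
                                   long fetchingRound, Throwable failure) {
        StringBuilder sb = new StringBuilder()
            .append(shardId).append(": failed to list shard for ").append(fetchType)
            .append(" on node [").append(nodeId).append(']')
            .append(" in fetching round ").append(fetchingRound).append(": ");
        for (Throwable t = failure; t != null; t = t.getCause()) {
            sb.append(t.getClass().getSimpleName())
              .append('[').append(t.getMessage()).append(']');
            if (t.getCause() != null) {
                sb.append("; nested: ");
            }
        }
        return sb.toString();
    }
}
```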

Option-2 [Sampling of the exception based on uniqueness]
Another option is to maintain an ExceptionCache in AsyncShardFetch.java that tracks the unique exceptions seen so far, so we do not print the stack trace again and again for the same exception. However, this adds complexity to AsyncShardFetch.java, and once the cluster is stable we would also need to clean up the ExceptionCache.
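
A rough sketch of such a cache, keyed on exception class plus message; `ExceptionCache` is a hypothetical class, not something that exists in AsyncShardFetch.java today:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical Option-2 cache: the first occurrence of each distinct failure
// logs the full stack trace; repeats would log only a one-liner.
final class ExceptionCache {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true only the first time this class+message pair is observed. */
    boolean shouldLogStackTrace(Throwable failure) {
        return seen.add(failure.getClass().getName() + ':' + failure.getMessage());
    }

    /** Must be called once fetching stabilizes, per the cleanup concern above. */
    void clear() {
        seen.clear();
    }
}
```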

Option-3 [Sampling of the exception based on frequency]
Log the exception only on every nth consecutive occurrence.
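
A sketch of Option-3, assuming a shared counter; the class name and sampling interval are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical Option-3 sampler: log the full stack trace only on the
// 1st, 101st, 201st, ... consecutive occurrence.
final class SampledLogging {
    private static final int SAMPLE_EVERY_N = 100;
    private final AtomicLong occurrences = new AtomicLong();

    boolean shouldLogStackTrace() {
        return occurrences.incrementAndGet() % SAMPLE_EVERY_N == 1;
    }
}
```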

Option-4 [Log once per AsyncShardFetch object]
An AsyncShardFetch object is created for every unassigned shard and is reused until the shard is started or failed; the next time the shard becomes unassigned, a new object is created for it. Logging once per object would reduce spamming across multiple fetching rounds but would not reduce spamming across multiple shards.
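
Option-4 could be as simple as a once-per-instance guard inside each AsyncShardFetch object; the names below are illustrative:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical Option-4 guard: each AsyncShardFetch instance logs the stack
// trace at most once; a fresh instance (a new unassignment) logs again.
final class OncePerFetchLogging {
    private final AtomicBoolean loggedFailure = new AtomicBoolean();

    boolean shouldLogStackTrace() {
        return loggedFailure.compareAndSet(false, true);
    }
}
```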

@DaveCTurner, I am more inclined towards Option-1 as it captures all the information and will also be quick to implement.

Pinging @elastic/es-distributed (:Distributed/Allocation)

I think we could reasonably drop to DEBUG level logging for certain exceptions. NodeClosedException, NodeDisconnectedException, NodeNotConnectedException are all possible here and I don't think it's worth reporting them at all at the shard level - we record that the node left the cluster anyway. It looks like you can also get an AlreadyClosedException but I think we can avoid that and I opened https://github.com/elastic/elasticsearch/issues/57797 to investigate that further.

CircuitBreakingException is of questionable value to report at the shard level too. Raising this one for discussion with the team as we might not want to lose that completely.

@mohit0193 would you fix the formatting in your OP to include line breaks in the stack traces? It's pretty unreadable as it stands; it looks like you have tried to comment on things in the logs but it's just lost in the noise. I tried to fix it myself but it's too messed up. Use triple-backticks to format blocks of text:


```
formatted text goes here
and includes newlines
```

@DaveCTurner Thanks, I have updated the stack traces.

> I think we could reasonably drop to DEBUG level logging for certain exceptions. NodeClosedException, NodeDisconnectedException, NodeNotConnectedException are all possible here and I don't think it's worth reporting them at all at the shard level - we record that the node left the cluster anyway. It looks like you can also get an AlreadyClosedException but I think we can avoid that and I opened #57797 to investigate that further.
>
> CircuitBreakingException is of questionable value to report at the shard level too. Raising this one for discussion with the team as we might not want to lose that completely.

@DaveCTurner Agreed, we can safely drop to DEBUG level for NodeClosedException, NodeDisconnectedException, and NodeNotConnectedException. Since this stack trace is not very useful, I am wondering whether we should solve it for all types of exceptions.

On the contrary, a stack trace is vital for understanding unexpected exceptions like the AlreadyClosedException I mentioned above, for which we should definitely be emitting a WARN log.

If you're preparing a PR for this I suggest dropping to DEBUG on the exceptions you listed (but keeping the stack trace), suppressing the stack trace on a CircuitBreakingException (but keeping it at WARN), and leaving all other exceptions as they are. We may not get to discuss this internally for some time but we may decide to do something better with CircuitBreakingException in future.
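
A hedged sketch of the policy this suggests, using real Elasticsearch exception types but an illustrative method shape; this is a sketch of the idea, not the actual PR code:

```java
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.message.ParameterizedMessage;
import org.elasticsearch.ExceptionsHelper;
import org.elasticsearch.common.breaker.CircuitBreakingException;
import org.elasticsearch.index.shard.ShardId;
import org.elasticsearch.node.NodeClosedException;
import org.elasticsearch.transport.NodeDisconnectedException;
import org.elasticsearch.transport.NodeNotConnectedException;

// Illustrative shape of the suggested policy: quiet DEBUG (with stack trace)
// for expected node-left failures, WARN without a stack trace for circuit
// breaking, and WARN with the full stack trace for everything else.
final class ShardFetchFailureLogging {
    static void logListShardFailure(Logger logger, ShardId shardId, String nodeId, Throwable failure) {
        final Throwable cause = ExceptionsHelper.unwrapCause(failure);
        if (cause instanceof NodeClosedException
            || cause instanceof NodeDisconnectedException
            || cause instanceof NodeNotConnectedException) {
            // Expected when a node leaves the cluster, which is recorded elsewhere.
            logger.debug(new ParameterizedMessage(
                "{}: failed to list shard on node [{}]", shardId, nodeId), failure);
        } else if (cause instanceof CircuitBreakingException) {
            // Keep the WARN but suppress the (unhelpful) stack trace.
            logger.warn("{}: failed to list shard on node [{}]: {}", shardId, nodeId, cause.getMessage());
        } else {
            // Unexpected failures keep the full stack trace at WARN.
            logger.warn(new ParameterizedMessage(
                "{}: failed to list shard on node [{}]", shardId, nodeId), failure);
        }
    }
}
```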

Thanks @DaveCTurner, I will raise a PR shortly.

> CircuitBreakingException is of questionable value to report at the shard level too. Raising this one for discussion with the team as we might not want to lose that completely.

We discussed this today and opened #58259 to propose using the batched reroute mechanism to record fewer of these failures too. In the first instance, let's just drop the stack traces as I suggested above, and then we can follow up with this new idea.

@DaveCTurner, thanks, this is interesting. I will explore BatchedRerouteService.
