Elasticsearch: TokenAuthIntegTests#testRefreshingMultipleTimesWithinWindowSucceeds failures

Created on 27 Apr 2020 · 16 Comments · Source: elastic/elasticsearch

This has failed a few times already in 7.x and 7.7: https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(branch),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:testRefreshingMultipleTimesWithinWindowSucceeds),sort:!(time,desc))

Example failure: https://gradle-enterprise.elastic.co/s/m5xyt3as2qtv2/tests/g5u3xltc4r2ds-a52ia4rmdb2ji

This does not appear to be failing in master (after 450 iterations), but it reproducibly fails in 7.7 and 7.x once every 5 to 15 runs. The difference between master and 7.x/7.7 is in how we check that we can authenticate after the token is refreshed: the test in master uses a rest client, while the test in 7.7 uses a transport client. It's not clear at this time whether this is a test issue in master that fails to exercise the scenario properly or a test bug in 7.x.

I will mute this for now until we can get to the bottom of it

:Security/Authentication >test-failure Security

All 16 comments

Pinging @elastic/es-security (:Security/Authentication)

Found an example of it failing in master: https://gradle-enterprise.elastic.co/s/y3t5c3uc3fpni

I'm going to mute it in master as well.

This is likely due to a dirty read on the security index.

Assume node A holds the primary shard of the security index. Thread 1 hits node A and begins to index the new token document. While the indexing is still happening, thread 2 also hits node A and tries to get the document. Thread 2 may succeed and happily return, since the document could already be visible on node A's primary shard. But the indexing operation has not yet completed, so the document is not yet available on node B's replica shard. Now, if thread 3 hits node B and tries to authenticate with the new token, it will fail because the replica shard does not have the document yet. The described scenario is specific, but in general the failure can happen whenever two reads happen to hit shards that are not fully synced.
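To make the race concrete, here is a schematic of the concurrency the test exercises (a simplified sketch, not the actual test code; `refreshToken`, `authenticateWith` and `originalRefreshToken` are hypothetical stand-ins for the test's helpers):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
CountDownLatch startGate = new CountDownLatch(numberOfThreads);
for (int i = 0; i < numberOfThreads; i++) {
    pool.submit(() -> {
        startGate.countDown();
        startGate.await();                 // all threads kick off at roughly the same time
        // Refresh within the window, then authenticate with the access token that came back.
        // The authenticate call may be served by a node whose replica shard has not yet seen
        // the freshly indexed token document, so authentication fails intermittently.
        String accessToken = refreshToken(originalRefreshToken);  // hypothetical helper
        authenticateWith(accessToken);                            // hypothetical helper
        return null;
    });
}
pool.shutdown();
```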

I can reproduce the failure reliably in 7.7 by changing numberOfThreads to 46 (a higher number is more prone to trigger the failure). It can be fixed by re-using the same threadClient to perform the authentication, i.e. threadClient.filterWithHeader(BEARER_TOKEN...).... The change basically ensures that the authentication talks to the same server that answered the refresh call.
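A minimal sketch of that fix, assuming the access token returned by the refresh call is held in a variable named refreshedAccessToken (the variable name and the cluster-health check are illustrative, not the actual test assertions):

```java
// Wrap the SAME client that performed the refresh, so the follow-up authentication is
// answered by the node that acknowledged the new token document.
Client tokenClient = threadClient.filterWithHeader(
        Collections.singletonMap("Authorization", "Bearer " + refreshedAccessToken));

// Any authenticated request works as a check; a cluster health call is used here only
// for illustration.
ClusterHealthResponse health = tokenClient.admin().cluster().prepareHealth().get();
assertThat(health.isTimedOut(), is(false));
```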

As for why it happens much more often with TransportClient compared to HLRC, I can only guess that TransportClient is faster and hence more prone to this error. And thanks to @not-napoleon, we now know it can also happen with HLRC, just with a very low chance.

According to @jasontedor, there is no way to tell whether the read is stale or not. A possible fix might be swapping the order of "update existing token document" and "create new token document". This would guarantee that the new document is fully in place by the time the update to the existing document is visible.

PS: @gwbrown believes that this behaviour should be called "stale read" instead of "dirty read". I am using "dirty read" here since it is the term used in #52400.

Fixed in #61714

In master, we use the rest client, which round-robins between cluster nodes, and thus the same client might refresh the token on one node and then immediately attempt to authenticate against another. Occasionally this still triggers the issue at hand.

The new failures in 7.x are not the same as before

The failure is:

org.elasticsearch.xpack.security.authc.TokenAuthIntegTests > testRefreshingMultipleTimesWithinWindowSucceeds FAILED
    java.lang.AssertionError: 
    Expected: <false>
         but: was <true>
        at __randomizedtesting.SeedInfo.seed([2D3C8E6FF712A328:C32EEADBAD7235C]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.security.authc.TokenAuthIntegTests.testRefreshingMultipleTimesWithinWindowSucceeds(TokenAuthIntegTests.java:620)

with the repro line:

REPRODUCE WITH: ./gradlew ':x-pack:plugin:security:internalClusterTest' --tests "org.elasticsearch.xpack.security.authc.TokenAuthIntegTests.testRefreshingMultipleTimesWithinWindowSucceeds" -Dtests.seed=2D3C8E6FF712A328 -Dtests.security.manager=true -Dtests.locale=nl-BE -Dtests.timezone=Europe/Tiraspol -Druntime.java=8 -Dtests.fips.enabled=true

and the actual exception that triggers the assertion error is:

  1> [2020-10-20T18:25:00,838][INFO ][o.e.x.s.a.l.LoggingAuditTrail] [node_s0]org.elasticsearch.xpack.security.audit action="cluster:monitor/health" authentication.type="REALM" event.action="access_granted" event.type="transport" indices="[]" origin.address="127.0.0.1:48636" origin.type="transport" request.id="xKr2AwAAQACdAHjq_____w" request.name="ClusterHealthRequest" user.name="test_user" user.realm="file" user.roles="["user"]"
  1> [2020-10-20T18:25:00,843][ERROR][o.e.x.s.a.TokenAuthIntegTests] [node_s3] caught exception
  1> java.lang.IllegalArgumentException: Cannot decode access token
  1>    at org.elasticsearch.xpack.security.authc.TokenService.lambda$authenticateToken$6(TokenService.java:433) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.xpack.security.authc.TokenService.lambda$getUserTokenFromId$11(TokenService.java:497) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:89) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:261) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:247) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1171) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1249) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1229) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:52) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:43) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:27) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) [elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.11.0-SNAPSHOT.jar:7.11.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_241]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_241]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_241]

@BigPandaToo this seems to have started failing after you merged this PR. It doesn't fail in master only because it is already muted there for other reasons. I'll look into the failure in due time, but if you have any insights on what in the PR might have caused this, feel free to share.

#62490 did introduce a new behaviour for token refresh. Once a token is successfully refreshed, it immediately authenticates with the returned access token to get the authentication object of the token:

https://github.com/elastic/elasticsearch/blob/b6db0942abe27e65e46f165a81a59f7016bb126e/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/action/token/TransportRefreshTokenAction.java#L36-L39

This may be what is causing the current failure.

From the stacktrace, the actual failure is because the access token document cannot be found as in the following:
https://github.com/elastic/elasticsearch/blob/b6db0942abe27e65e46f165a81a59f7016bb126e/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java#L496-L497

But I don't understand how it could happen. The refresh and the subsequent authentication are sequential. Also, when the refresh returns, both the primary and replica shards should have acknowledged the new document, so the authentication (and the GET of the token document) should be able to fetch the new token document. I may be missing something.

I will take a look at it today.

I think this is still caused by the same "dirty read" problem described in the earlier comment. A contrived scenario could be as follows:

Say we have a 3-node cluster where the token index has its primary on node_A, its replica on node_B, and nothing on node_C. Thread 1 is refreshing the token: the new token document has been processed by the primary but has not made it into the replica yet. Now thread 2 comes along to refresh the token as well, and it hits node_C.

Because node_C does not have a shard copy, it needs to route the request. Since the new token document has not been fully acknowledged yet, both primary and replica are considered equal candidates for the GET-token-document request. For the refresh request, it happens to be routed to the primary shard (node_A) and successfully gets the document back. After the refresh, the new code attempts to authenticate with it. Although both the refresh and the authenticate happen on the same node, the GET request can be routed to a different node; this time it gets routed to the replica shard (node_B). The replica shard does not have the document yet (because thread 1 is still processing it), and so it fails.

The above scenario can actually happen in production. But having an extra authenticate call straight after refresh in server side code definitely increases the likelihood of it.

To fix it, I think it is either to have TokenService#refreshToken give back the token authentication object as well, or to retry TokenService#authenticateToken (either in TransportRefreshTokenAction or within TokenService itself). @jkakavas Do you have a preference or other alternatives?
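As a rough illustration of the retry option (a sketch only; the authenticateToken signature and the surrounding plumbing are assumptions, not the actual TokenService API):

```java
// Hypothetical sketch: if the lookup of the freshly created access token fails (e.g. the
// GET was served by a shard copy that has not caught up yet), retry once before giving up.
// Real code would likely bound and/or schedule the retry rather than firing it immediately.
private void authenticateWithOneRetry(String accessToken, ActionListener<Authentication> listener) {
    tokenService.authenticateToken(accessToken, ActionListener.wrap(
        listener::onResponse,
        firstFailure -> tokenService.authenticateToken(accessToken, listener)));
}
```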

The above scenario can actually happen in production. But having an extra authenticate call straight after refresh in server side code definitely increases the likelihood of it.

Agreed. In most cases we've seen in production, concurrent refreshes come from the same client, so they will (most probably) be hitting the same node. In your example, thread 1 and thread 2 will most probably both hit node_A. In that sense the change we recently made increases the chances of this happening, but it could still happen without it, e.g. if the two threads/clients both hit node_C? We know we want to rework the concurrent refresh logic to accommodate these cases, but I don't think this constitutes a significant reason to drop other work in favor of it (especially if we can reduce the recently increased probability).

I think it is either to have TokenService#refreshToken give back the token authentication object as well

Strong suggestion for this one from my side

IIUC, the issue can _in theory_ happen even when both threads hit node_A, because we don't guarantee 100% routing to the local shard copy even with preference=_local for GetRequest (I just learnt this today). This means the two threads can still fetch the document from different shard copies on different nodes, and one of them could fail.
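For reference, this is the kind of GET involved; a minimal sketch where the index name and document id are illustrative:

```java
// Even with preference("_local"), the GET is only *hinted* towards the local shard copy;
// it may still be served by another copy, so two back-to-back reads can observe different
// states while a write is still in flight.
GetRequest getRequest = new GetRequest(".security-tokens-7", tokenDocumentId)
        .preference("_local")
        .realtime(true);
client.get(getRequest, ActionListener.wrap(
        response -> { /* use the token document */ },
        e -> { /* document not visible on this copy yet */ }));
```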

The chance is probably low enough to be of no practical concern. I agree this is not the time to revisit the whole refreshing logic. This issue should be fixed for the new behaviour only.

I agree with @ywangd's analysis, and that we have to reduce the probability of these dirty reads until we rework the concurrent refresh logic. I will work on the fix.

This should now be fixed by #64178
