Kibana version: 6.3
Elasticsearch version: 6.3
A user authenticates to Kibana via an Active Directory service, and can use Kibana for the day. When they stop using Kibana for a long time (usually when returning the next day), the next time they open Kibana they see an error on the screen:
{
statusCode: 500,
error: "Internal Server Error",
message: "An internal server error occurred"
}
Elasticsearch logs will show:
[illegal_state_exception] token document is missing and must be present
One can work around the problem by clearing browser cookies and refreshing the page.
Expected result:
If the token used for authentication is no longer valid, the user should be asked to log in again.
@kobelb in https://github.com/elastic/kibana/blob/master/x-pack/plugins/security/server/lib/authentication/providers/saml.js#L43, Kibana handles invalid refresh tokens with some hardcoded response body strings. Is this a case of ES not returning the proper message body for an error?
@elastic/es-security Kibana currently has two error descriptions hard-coded to determine whether the SAML refresh token is invalid: "token has already been refreshed" and "refresh token is expired".
We're seeing an additional error description, "token document is missing and must be present". Should we be hard-coding this error message as well, or is there a better way to determine whether the error thrown by Elasticsearch is indeed because the refresh token is expired?
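For reference, the kind of string matching being discussed boils down to comparing the error description against a fixed list. A minimal sketch of that pattern, where the error-body property path and the helper name are illustrative assumptions rather than the actual Kibana code:

// Sketch only: the real logic lives in x-pack/plugins/security/server/lib/authentication/providers/saml.js.
// The path used to read the description from the Elasticsearch error is an assumption for illustration.
const INVALID_REFRESH_TOKEN_DESCRIPTIONS = [
  'token has already been refreshed',
  'refresh token is expired',
];

function looksLikeInvalidRefreshToken(err) {
  const description =
    (err && err.body && err.body.error && err.body.error.reason) || '';
  return INVALID_REFRESH_TOKEN_DESCRIPTIONS.some((candidate) =>
    description.includes(candidate)
  );
}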
We're seeing an additional error description, "token document is missing and must be present". Should we be hard-coding this error message as well, or is there a better way to determine whether the error thrown by Elasticsearch is indeed because the refresh token is expired?
No, this might be a bug. Did the user log out (i.e. call ES /_xpack/security/saml/invalidate), but somehow still has the cookie?
We would be grateful for:
- Debug logs from the authentication package (org.elasticsearch.xpack.security.authc):
curl -u elastic http://localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger": {
      "org.elasticsearch.xpack.security.authc": "DEBUG"
    }
  }
}
'
- elasticsearch.log with the stacktrace (the token should be in the logs).
- The .security index contents:
curl -u elastic -X GET "localhost:9200/.security/doc/_search"
Maybe someone else from @elastic/es-security has a better idea that might not require work from the reporter.
@dmlittle we've had trouble recreating this issue that you're seeing, would you mind elaborating on your SAML configuration? Which identity provider are you using? Do you use single logout?
I had missed this originally :/ The behavior described here can happen as we are removing tokens periodically. For this to manifest, all of the following needs to happen:
- The user gets their tokens at time X and leaves the session be until Y, with Y > X + 24h.
- A user logs out or someone calls the Invalidate Token API at time Z, with X + 24h < Z < Y (we potentially trigger the removal of old tokens on any token invalidation, while also honoring xpack.security.authc.token.delete.interval).
The event of a user logging out or someone using the invalidation API can trigger the deletion of the now expired access and refresh tokens that the user got at X. When Kibana sends a request with this now deleted access token at Y, ES replies with the aforementioned exception as it can't find the associated token document any more.
Arguably, ES should respond with a 401 instead of a 500 here.
@jkakavas that makes sense, we can update the SAML auth provider in Kibana to take this into consideration, thanks!
@kobelb I'm treating this as an ES bug, so hopefully this will be resolved for the next minor. I'll update this issue with the solution once it's merged
@kobelb I'm treating this as an ES bug, so hopefully this will be resolved for the next minor. I'll update this issue with the solution once it's merged
Gotcha, this is in regard to the 500 vs 401, right? There's the potential that even with this change, we might have to augment our auth providers to handle this specific error situation differently.
If I understand the manifestation above correctly, Kibana gets the 500 (which will be made into a 401) when the access token is used in a normal API request. I thought that Kibana expects a 401 in this case and will then attempt to use the refresh token.
It was originally thought that the 500 was a response to the refresh token API request, but I don't believe this is the case.
I'll try to reproduce locally and see if there is anything we'll need to do in Kibana once we start getting 401 instead of 500 from Elasticsearch.
Thanks Aleh! To manually cause this to occur, I was running a query to explicitly delete the tokens from Elasticsearch. Otherwise, we'd have to wait until a certain time period after the refresh token has expired for ES to delete them itself.
Otherwise, we'd have to wait until a certain time period after the refresh token has expired for ES to delete them itself.
Thanks for the suggestion! This time I just tweaked Elasticsearch source code a bit :)
So yeah, if ES starts responding with 401 when the access token document is missing, we'll clear the session cookie, drop the refresh token and re-initiate login [*]. We try to use the refresh token only if we get 401 token expired.
Refresh token errors are different though; we get a 400 error in most (all?) of the cases:
- token has already been refreshed
- refresh token is expired
- could not refresh the requested token (e.g. when the document with the refresh token has been deleted)
For the first two we'll start a new SAML handshake, but for the last one the user will get a 400 Bad Request and be stuck with an invalid cookie.
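For illustration, the mapping just described could be sketched as follows; the action strings and the idea of receiving the extracted description directly are assumptions made for the sketch, not the actual Kibana code:

// Sketch: map the 400 error description from the refresh call to the behavior described above.
// The description argument stands in for whatever Kibana reads out of the Elasticsearch error body.
function handleRefreshFailure(description) {
  if (
    description.includes('token has already been refreshed') ||
    description.includes('refresh token is expired')
  ) {
    return 'start-new-saml-handshake';
  }
  // e.g. 'could not refresh the requested token' (the refresh token document was deleted):
  // today this surfaces as 400 Bad Request and the user is stuck with an invalid cookie.
  return 'respond-400-bad-request';
}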
Hey @jkakavas,
- Is there any case other than 401 token expired when the access token is rejected and we should try to use the refresh token?
- Is it possible to have the expired access token without a document for the refresh token (401 token expired for the access token + 400 could not refresh the requested token when we try to refresh it)?
Just trying to figure out whether Kibana should implement special handling for refresh token errors other than 400 token has already been refreshed and 400 refresh token is expired.
[*] If the user hits this error during a full page reload they will see a blank page with a JSON error, so the user will have to refresh the page once again. We can improve that if we want to, e.g. start a SAML handshake if access token validation fails with 401 or something like that.
is there any case other than 401 token expired when access token is rejected and we should try to use refresh token?
No. For both explicitly invalidated and expired access tokens, we return:
HTTP/1.1 401 Unauthorized
Content-Length: 540
Content-Type: application/json; charset=UTF-8
Www-Authenticate: Bearer realm="security", error="invalid_token", error_description="The access token expired"
Date: Thu, 21 Mar 2019 10:30:14 GMT
{
"error" : {
"root_cause" : [
{
"type" : "security_exception",
"reason" : "token expired",
"header" : {
"WWW-Authenticate" : "Bearer realm=\"security\", error=\"invalid_token\", error_description=\"The access token expired\""
}
}
],
"type" : "security_exception",
"reason" : "token expired",
"header" : {
"WWW-Authenticate" : "Bearer realm=\"security\", error=\"invalid_token\", error_description=\"The access token expired\""
}
},
"status" : 401
}
There is the question of what happens when you use a wrong/malformed access token (this now returns 500 but will be fixed to return 401), what should be returned as the error description in that 401, and how you should handle it in Kibana. I think that we should still try to refresh, as there is no way of knowing whether the access token is _wrong_ (as in we can't find the relevant document) because it is actually a wrong attempt, or a bit flipped, or the access token was expired and then deleted but its associated refresh token is still good to use. So in summary I think we can say that if you try auth with an access token and you get a 401, try to use the refresh token you have.
is it possible to have the expired access token without document for the refresh token (401 token expired for access token + 400 could not refresh the requested token when we try to refresh it)?
(Implementation detail: it is the same document for the access token and the refresh token.) Yes, this can happen (when we fix the behavior of getting a 401 for invalid access tokens). Even now, there is no guarantee that you won't get a 400 when refreshing, even if you got 401 token expired milliseconds before.
Just trying to figure out whether Kibana should implement special handling for refresh token errors other than 400 token has already been refreshed and 400 refresh token is expired
As we're discussing in https://github.com/elastic/elasticsearch/pull/39808, we should be able to handle all 400s in Kibana as "the refresh token can't be used, re-initiate the auth flow".
So in summary I think we can say that if you try auth with an access token and you get a 401, try to use the refresh token you have.
Nice, these error_description string-based checks really bothered me; great that we can fully rely on status codes now.
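Putting the two rules together (any 401 on the access token means "try the refresh token", and any 400 on the refresh call means "re-initiate the auth flow"), a status-code-only flow could look roughly like the sketch below. The three callbacks are hypothetical stand-ins for Kibana's internals and Elasticsearch's token APIs; only the status-code handling matters here.

// Sketch of the status-code-based flow; authenticateWithAccessToken, refreshTokens and
// startSamlHandshake are hypothetical stand-ins, not real Kibana or Elasticsearch client calls.
async function authenticate(tokens, { authenticateWithAccessToken, refreshTokens, startSamlHandshake }) {
  try {
    return await authenticateWithAccessToken(tokens.accessToken);
  } catch (err) {
    if (err.statusCode !== 401) {
      throw err; // not a token problem, surface it as-is
    }
  }
  // Any 401 on the access token: try the refresh token we still have.
  try {
    const refreshed = await refreshTokens(tokens.refreshToken);
    return await authenticateWithAccessToken(refreshed.accessToken);
  } catch (err) {
    if (err.statusCode === 400) {
      // The refresh token can't be used any more: re-initiate the auth flow.
      return startSamlHandshake();
    }
    throw err;
  }
}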
Fixed via https://github.com/elastic/kibana/pull/33777. The fix should be available starting from 7.1.
@azasypkin we will still return 500 for a token that is deleted (https://github.com/elastic/elasticsearch/issues/38866), does #33777 handle this? (We do plan to fix it for 7.1, but I think we'd better leave this issue open for tracking until then.)
does #33777 handle this ?
Yes, we added a workaround for this specific error that we'll remove once elastic/elasticsearch#38866 is merged.
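Conceptually, a workaround of that sort amounts to recognizing the one specific 500 and treating it like an invalid session. A minimal sketch of such a check, where the statusCode/message shape is an assumption for illustration and not the actual Kibana code:

// Sketch only: detect the specific 500 that Elasticsearch returns when the token document
// has been deleted, so it can be handled like a 401 (clear the session and re-initiate login).
function isMissingTokenDocumentError(err) {
  return (
    err != null &&
    err.statusCode === 500 &&
    typeof err.message === 'string' &&
    err.message.includes('token document is missing and must be present')
  );
}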
I know this issue has been closed for a while, but we seem to be encountering it again on a 7.5 cluster: https://github.com/elastic/infra/issues/17590
Should we re-open this issue, or create a new one?
@mindbat The original problem has been addressed. This sounds similar and might well be related (the error message is the same, but the part of the code that throws it is different), but I think we should handle this in a separate issue. If @elastic/kibana-security feels otherwise, I will open an ES issue to track this.
Since I can see in the infra ticket that this is reproducible, it would be immensely helpful if we can get the ES logs when that happens (with logger.org.elasticsearch.xpack.security.authc.TokenService set to TRACE).
I think we should handle this in a separate issue
Gotcha.
If @elastic/kibana-security feels otherwise, I will open an ES issue to track this.
Makes sense to me, and thanks!
it would be immensely helpful if we can get the ES logs
Agreed! Since this is an ESS cluster, we don't have direct access to the logs...I'll see what we can do, though.
If @elastic/kibana-security feels otherwise, I will open an ES issue to track this.
No objections from our side, that looks weird indeed. In the worst case I'd expect to see [esaggs] > Request to Elasticsearch failed: {"statusCode":400,"error":"Bad Request","message":"Both access and refresh tokens are expired."}, but not 500.
The only place we throw that error message is https://github.com/elastic/elasticsearch/blob/7.5/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java#L1503, where we search for a document in the tokens index. However, to get there we have already called decodeToken, which has read that document from that same index just before, so unless something has deleted the document explicitly in the meantime, this makes no sense.
There is always the possibility of a bug (a race condition somewhere), but the logs will be very helpful in troubleshooting, so I'll defer further analysis until we get them.
@jkakavas What about https://github.com/elastic/elasticsearch/blob/7.5/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java#L444 ?
I think we need the ES server logs for this, and should open a separate issue to track it.
Apologies everyone, I mixed up versions. This is a well-known issue that we fixed in 7.6 (and not in 7.5).
Kibana, on the other hand, should have dealt with the 500 in 7.5 too (and would until 7.7, where the handling is removed), right? This issue was closed when Kibana added that handling to compensate for our bug in 7.1.
Kibana, on the other hand, should have dealt with the 500 in 7.5 too (and would until 7.7, where the handling is removed), right?
That's correct. And if this happens, since 7.2 we restart the SAML handshake for non-AJAX requests or return 400 Bad Request for AJAX requests (users will have to manually refresh the browser page to trigger the SAML handshake in this case). But Kibana isn't supposed to return 500: 'token document is missing and must be present' to the user. I just checked this on the latest 7.5 (removed tokens manually) and it seems to work like we expect.
@mindbat can we get verbose Kibana logs (logging.verbose: true + logging.quiet: false + logging.silent: false) when this happens?
can we get verbose Kibana logs (logging.verbose: true + logging.quiet: false + logging.silent: false) when this happens?
@azasypkin It's an ESS cluster, so we don't have direct access to logs.
I'll ping Cloud ~today~ this week and see if we can get the logs from them.
Ok, I don't have a copy of all the logs, but I did find an Internal Server Error being logged over and over again on the day we experienced this for the first time (I've stripped out some non-essential ESS-only info from the below).
The query it's logging is one of the ones we use in the Kibana Dashboard that's throwing the token document is missing error for us:
Internal Server Error
{
"_index": "cluster-kibana-filebeat-6.3.2-gcp-us-central1-2020.01.30",
"_type": "doc",
"_id": "***",
"_version": 1,
"_score": null,
"_source": {
"pid": 23,
"error": {
"stack": "Error: Internal Server Error\n at HapiResponseAdapter.toError (/kibana/src/core/server/http/router/response_adapter.js:119:19)\n at HapiResponseAdapter.toHapiResponse (/kibana/src/core/server/http/router/response_adapter.js:80:19)\n at HapiResponseAdapter.handle (/kibana/src/core/server/http/router/response_adapter.js:75:17)\n at Object.interceptAuth [as authenticate] (/kibana/src/core/server/http/lifecycle/auth.js:77:36)\n at process._tickCallback (internal/process/next_tick.js:68:7)",
"message": "Internal Server Error",
"name": "Error"
},
"@version": "1",
"type": "error",
"input": {
"type": "log"
},
"source": "kibana.log",
"docker": {
"container": {
"image": "docker.elastic.co/cloud-assets/kibana:7.5.0-0",
"labels": {
"org": {
"label-schema": {}
},
"co": {
"elastic": {
"cloud": {
"allocator": {
"user_id": "***",
"type_version": "7.5.0",
"instance_id": "***",
"cluster_id": "***",
"zone": "us-central1-b",
"type": "kibana",
"kind": "kibana",
"id": "***"
}
}
}
}
},
"name": "***,
"id": "***"
}
},
"url": {
"query": {
"ignore_unavailable": "true",
"ignore_throttled": "true",
"preference": "1580399421715",
"rest_total_hits_as_int": "true"
},
"pathname": "/elasticsearch/.ent-search-app-logs-*/_search",
"href": "/elasticsearch/.ent-search-app-logs-*/_search?rest_total_hits_as_int=true&ignore_unavailable=true&ignore_throttled=true&preference=1580399421715",
"search": "?rest_total_hits_as_int=true&ignore_unavailable=true&ignore_throttled=true&preference=1580399421715",
"path": "/elasticsearch/.ent-search-app-logs-*/_search?rest_total_hits_as_int=true&ignore_unavailable=true&ignore_throttled=true&preference=1580399421715"
},
"host": {
"name": "***"
},
"beat": {
"hostname": "***",
"name": "***",
"version": "6.3.2"
},
"prospector": {
"type": "log"
},
"level": "error",
"message": "Internal Server Error",
"tags": [
"kibana",
"logging"
],
"@timestamp": "2020-01-30T21:02:42.000Z",
"meta": {
},
"input_type": "kibana"
},
"fields": {
"@timestamp": [
"2020-01-30T21:02:42.000Z"
]
},
"sort": [
***
]
}
Unfortunately we can't get much value from this log, we really need full Kibana verbose logs to see the entire request authentication flow to know what's going on.
@azasypkin Gotcha. I'll see about getting verbose logging turned on, and then replicating the error.