Kibana version: 6.3
Elasticsearch version: 6.3
A user authenticates to Kibana via an Active Directory service, and can use Kibana for the day. When they stop using Kibana for a long time (usually when returning the next day), the next time they open Kibana they see an error on the screen:
{
statusCode: 500,
error: "Internal Server Error",
message: "An internal server error occurred"
}
Elasticsearch logs will show:
[illegal_state_exception] token document is missing and must be present
One can work around the problem by clearing browser cookies and refreshing the page.
Expected result:
If the token used for authentication is no longer valid, the user should be asked to log in again.
@kobelb in https://github.com/elastic/kibana/blob/master/x-pack/plugins/security/server/lib/authentication/providers/saml.js#L43, Kibana handles invalid refresh tokens with some hardcoded response body strings. Is this a case of ES not returning the proper message body for an error?
@elastic/es-security Kibana currently has two error descriptions hard-coded to determine whether the SAML refresh token is invalid: "token has already been refreshed" and "refresh token is expired".
We're seeing an additional error description, "token document is missing and must be present". Should we be hard-coding this error message as well, or is there a better way to determine whether the error thrown by Elasticsearch is indeed because the refresh token is expired?
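For reference, the kind of string matching being discussed boils down to comparing the error description against a fixed list. A minimal sketch of that pattern, where the error-body property path and the helper name are illustrative assumptions rather than the actual Kibana code:

// Sketch only: the real logic lives in x-pack/plugins/security/server/lib/authentication/providers/saml.js.
// The path used to read the description from the Elasticsearch error is an assumption for illustration.
const INVALID_REFRESH_TOKEN_DESCRIPTIONS = [
  'token has already been refreshed',
  'refresh token is expired',
];

function looksLikeInvalidRefreshToken(err) {
  const description =
    (err && err.body && err.body.error && err.body.error.reason) || '';
  return INVALID_REFRESH_TOKEN_DESCRIPTIONS.some((candidate) =>
    description.includes(candidate)
  );
}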
We're seeing an additional error description, "token document is missing and must be present". Should we be hard-coding this error message as well, or is there a better way to determine whether the error thrown by Elasticsearch is indeed because the refresh token is expired?
No, this might be a bug. Did the user log out (i.e. call ES /_xpack/security/saml/invalidate), but somehow still has the cookie?
We would be grateful for:
- Debug logs from the authentication package (org.elasticsearch.xpack.security.authc):
curl -u elastic http://localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger": {
      "org.elasticsearch.xpack.security.authc": "DEBUG"
    }
  }
}
'
- elasticsearch.log with the stacktrace (the token should be in the logs).
- The .security index contents:
curl -u elastic -X GET "localhost:9200/.security/doc/_search"
Maybe someone else from @elastic/es-security has a better idea that might not require work from the reporter.
@dmlittle we've had trouble recreating this issue that you're seeing, would you mind elaborating on your SAML configuration? Which identity provider are you using? Do you use single logout?
I had missed this originally :/ The behavior described here can happen as we are removing tokens periodically. For this to manifest, all of the following needs to happen:
- The user gets their tokens at time X and leaves the session be until Y, with Y > X + 24h.
- A user logs out or someone calls the Invalidate Token API at time Z, with X + 24h < Z < Y (we potentially trigger the removal of old tokens on any token invalidation, while also honoring xpack.security.authc.token.delete.interval).
The event of a user logging out or someone using the invalidation API can trigger the deletion of the now expired access and refresh tokens that the user got at X. When Kibana sends a request with this now deleted access token at Y, ES replies with the aforementioned exception as it can't find the associated token document any more.
Arguably, ES should respond with a 401 instead of a 500 here.
@jkakavas that makes sense, we can update the SAML auth provider in Kibana to take this into consideration, thanks!
@kobelb I'm treating this as an ES bug, so hopefully this will be resolved for the next minor. I'll update this issue with the solution once it's merged
@kobelb I'm treating this as an ES bug, so hopefully this will be resolved for the next minor. I'll update this issue with the solution once it's merged
Gotcha, this is in regard to the 500 vs 401, right? There's the potential that even with this change, we might have to augment our auth providers to handle this specific error situation differently.
If I understand the manifestation above correctly, Kibana gets the 500 (which will be made into a 401) when the access token is used in a normal API request. I thought that Kibana expects a 401 in this case and will then attempt to use the refresh token.
It was originally thought that the 500 was a response to the refresh token API request, but I don't believe this is the case.
I'll try to reproduce locally and see if there is anything we'll need to do in Kibana once we start getting 401 instead of 500 from Elasticsearch.
Thanks Aleh! To manually cause this to occur, I was running a query to explicitly delete the tokens from Elasticsearch. Otherwise, we'd have to wait until a certain time period after the refresh token has expired for ES to delete them itself.
Otherwise, we'd have to wait until a certain time period after the refresh token has expired for ES to delete them itself.
Thanks for the suggestion! This time I just tweaked Elasticsearch source code a bit :)
So yeah, if ES starts responding with 401 when the access token document is missing, we'll clear the session cookie, drop the refresh token and re-initiate login [*]. We try to use the refresh token only if we get 401 token expired.
Refresh token errors are different though; we get a 400 error in most (all?) of the cases:
- token has already been refreshed
- refresh token is expired
- could not refresh the requested token (e.g. when the document with the refresh token has been deleted)
For the first two we'll start a new SAML handshake, but for the last one the user will get a 400 Bad Request and be stuck with an invalid cookie.
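For illustration, the mapping just described could be sketched as follows; the action strings and the idea of receiving the extracted description directly are assumptions made for the sketch, not the actual Kibana code:

// Sketch: map the 400 error description from the refresh call to the behavior described above.
// The description argument stands in for whatever Kibana reads out of the Elasticsearch error body.
function handleRefreshFailure(description) {
  if (
    description.includes('token has already been refreshed') ||
    description.includes('refresh token is expired')
  ) {
    return 'start-new-saml-handshake';
  }
  // e.g. 'could not refresh the requested token' (the refresh token document was deleted):
  // today this surfaces as 400 Bad Request and the user is stuck with an invalid cookie.
  return 'respond-400-bad-request';
}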
Hey @jkakavas,
- Is there any case other than 401 token expired when the access token is rejected and we should try to use the refresh token?
- Is it possible to have the expired access token without a document for the refresh token (401 token expired for the access token + 400 could not refresh the requested token when we try to refresh it)?
Just trying to figure out whether Kibana should implement special handling for refresh token errors other than 400 token has already been refreshed and 400 refresh token is expired.
[*] If the user hits this error during a full page reload they will see a blank page with a JSON error, so the user will have to refresh the page once again. We can improve that if we want to, e.g. start a SAML handshake if access token validation fails with 401 or something like that.
is there any case other than 401 token expired when access token is rejected and we should try to use refresh token?
No. For both explicitly invalidated and expired access tokens, we return:
HTTP/1.1 401 Unauthorized
Content-Length: 540
Content-Type: application/json; charset=UTF-8
Www-Authenticate: Bearer realm="security", error="invalid_token", error_description="The access token expired"
Date: Thu, 21 Mar 2019 10:30:14 GMT
{
"error" : {
"root_cause" : [
{
"type" : "security_exception",
"reason" : "token expired",
"header" : {
"WWW-Authenticate" : "Bearer realm=\"security\", error=\"invalid_token\", error_description=\"The access token expired\""
}
}
],
"type" : "security_exception",
"reason" : "token expired",
"header" : {
"WWW-Authenticate" : "Bearer realm=\"security\", error=\"invalid_token\", error_description=\"The access token expired\""
}
},
"status" : 401
}
There is the question of what happens when you use a wrong/malformed access token (this now returns 500 but will be fixed to return 401), what should be returned as the error description in that 401, and how you should handle it in Kibana. I think that we should still try to refresh, as there is no way of knowing whether the access token is _wrong_ (as in we can't find the relevant document) because it is actually a wrong attempt, or a bit flipped, or the access token was expired and then deleted but its associated refresh token is still good to use. So in summary I think we can say that if you try auth with an access token and you get a 401, try to use the refresh token you have.
is it possible to have the expired access token without document for the refresh token (401 token expired for access token + 400 could not refresh the requested token when we try to refresh it)?
(Implementation detail: it is the same document for the access token and the refresh token.) Yes, this can happen (when we fix the behavior of getting a 401 for invalid access tokens). Even now, there is no guarantee that you won't get a 400 when refreshing, even if you got 401 token expired milliseconds before.
Just trying to figure out whether Kibana should implement special handling for refresh token errors other than 400 token has already been refreshed and 400 refresh token is expired
As we're discussing in https://github.com/elastic/elasticsearch/pull/39808, we should be able to handle all 400s in Kibana as "the refresh token can't be used, re-initiate the auth flow".
So in summary I think we can say that if you try auth with an access token and you get a 401, try to use the refresh token you have.
Nice, these error_description string-based checks really bothered me; great that we can fully rely on status codes now.
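Putting the two rules together (any 401 on the access token means "try the refresh token", and any 400 on the refresh call means "re-initiate the auth flow"), a status-code-only flow could look roughly like the sketch below. The three callbacks are hypothetical stand-ins for Kibana's internals and Elasticsearch's token APIs; only the status-code handling matters here.

// Sketch of the status-code-based flow; authenticateWithAccessToken, refreshTokens and
// startSamlHandshake are hypothetical stand-ins, not real Kibana or Elasticsearch client calls.
async function authenticate(tokens, { authenticateWithAccessToken, refreshTokens, startSamlHandshake }) {
  try {
    return await authenticateWithAccessToken(tokens.accessToken);
  } catch (err) {
    if (err.statusCode !== 401) {
      throw err; // not a token problem, surface it as-is
    }
  }
  // Any 401 on the access token: try the refresh token we still have.
  try {
    const refreshed = await refreshTokens(tokens.refreshToken);
    return await authenticateWithAccessToken(refreshed.accessToken);
  } catch (err) {
    if (err.statusCode === 400) {
      // The refresh token can't be used any more: re-initiate the auth flow.
      return startSamlHandshake();
    }
    throw err;
  }
}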
Fixed via https://github.com/elastic/kibana/pull/33777. The fix should be available starting from 7.1.
@azasypkin we will still return 500 for a token that is deleted (https://github.com/elastic/elasticsearch/issues/38866), does #33777 handle this? (We do plan to fix it for 7.1, but I think we'd better leave this issue open for tracking until then.)
does #33777 handle this ?
Yes, we added a workaround for this specific error that we'll remove once elastic/elasticsearch#38866 is merged.
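Conceptually, a workaround of that sort amounts to recognizing the one specific 500 and treating it like an invalid session. A minimal sketch of such a check, where the statusCode/message shape is an assumption for illustration and not the actual Kibana code:

// Sketch only: detect the specific 500 that Elasticsearch returns when the token document
// has been deleted, so it can be handled like a 401 (clear the session and re-initiate login).
function isMissingTokenDocumentError(err) {
  return (
    err != null &&
    err.statusCode === 500 &&
    typeof err.message === 'string' &&
    err.message.includes('token document is missing and must be present')
  );
}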
I know this issue has been closed for a while, but we seem to be encountering it again on a 7.5 cluster: https://github.com/elastic/infra/issues/17590
Should we re-open this issue, or create a new one?
@mindbat The original problem has been addressed. This sounds similar and might well be related (the error message is the same, but the part of the code that throws it is different), but I think we should handle this in a separate issue. If @elastic/kibana-security feels otherwise, I will open an ES issue to track this.
Since I can see in the infra ticket that this is reproducible, it would be immensely helpful if we can get the ES logs when that happens (with logger.org.elasticsearch.xpack.security.authc.TokenService set to TRACE).
I think we should handle this in a separate issue
Gotcha.
If @elastic/kibana-security feels otherwise, I will open an ES issue to track this.
Makes sense to me, and thanks!
it would be immensely helpful if we can get the ES logs
Agreed! Since this is an ESS cluster, we don't have direct access to the logs...I'll see what we can do, though.
If @elastic/kibana-security feels otherwise, I will open an ES issue to track this.
No objections from our side, that looks weird indeed. In the worst case I'd expect to see [esaggs] > Request to Elasticsearch failed: {"statusCode":400,"error":"Bad Request","message":"Both access and refresh tokens are expired."}, but not 500.
The only place we throw that error message is https://github.com/elastic/elasticsearch/blob/7.5/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java#L1503, where we search for a document in the tokens index. However, to get there we have already called decodeToken, which has read that document from that same index just before, so unless something has deleted the document explicitly in the meantime, this makes no sense.
There is always the possibility of a bug (a race condition somewhere), but the logs will be very helpful in troubleshooting, so I'll defer further analysis until we get them.
@jkakavas What about https://github.com/elastic/elasticsearch/blob/7.5/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/TokenService.java#L444 ?
I think we need the ES server logs for this, and should open a separate issue to track it.
Apologies everyone, I mixed up versions. This is a well-known issue that we fixed in 7.6 (and not in 7.5).
Kibana, on the other hand, should have dealt with the 500 in 7.5 too (and would until 7.7, where the handling is removed), right? This issue was closed when Kibana added that handling to compensate for our bug in 7.1.
Kibana, on the other hand, should have dealt with the 500 in 7.5 too (and would until 7.7, where the handling is removed), right?
That's correct. And if this happens, since 7.2 we restart the SAML handshake for non-AJAX requests or return 400 Bad Request for AJAX requests (users will have to manually refresh the browser page to trigger the SAML handshake in this case). But Kibana isn't supposed to return 500: 'token document is missing and must be present' to the user. I just checked this on the latest 7.5 (removed tokens manually) and it seems to work like we expect.
@mindbat can we get verbose Kibana logs (logging.verbose: true + logging.quiet: false + logging.silent: false) when this happens?
can we get verbose Kibana logs (logging.verbose: true + logging.quiet: false + logging.silent: false) when this happens?
@azasypkin It's an ESS cluster, so we don't have direct access to logs.
I'll ping Cloud ~today~ this week and see if we can get the logs from them.
Ok, I don't have a copy of all the logs, but I did find an Internal Server Error being logged over and over again on the day we experienced this for the first time (I've stripped out some non-essential ESS-only info from the below).
The query it's logging is one of the ones we use in the Kibana Dashboard that's throwing the token document is missing error for us:
Internal Server Error
{
"_index": "cluster-kibana-filebeat-6.3.2-gcp-us-central1-2020.01.30",
"_type": "doc",
"_id": "***",
"_version": 1,
"_score": null,
"_source": {
"pid": 23,
"error": {
"stack": "Error: Internal Server Error\n at HapiResponseAdapter.toError (/kibana/src/core/server/http/router/response_adapter.js:119:19)\n at HapiResponseAdapter.toHapiResponse (/kibana/src/core/server/http/router/response_adapter.js:80:19)\n at HapiResponseAdapter.handle (/kibana/src/core/server/http/router/response_adapter.js:75:17)\n at Object.interceptAuth [as authenticate] (/kibana/src/core/server/http/lifecycle/auth.js:77:36)\n at process._tickCallback (internal/process/next_tick.js:68:7)",
"message": "Internal Server Error",
"name": "Error"
},
"@version": "1",
"type": "error",
"input": {
"type": "log"
},
"source": "kibana.log",
"docker": {
"container": {
"image": "docker.elastic.co/cloud-assets/kibana:7.5.0-0",
"labels": {
"org": {
"label-schema": {}
},
"co": {
"elastic": {
"cloud": {
"allocator": {
"user_id": "***",
"type_version": "7.5.0",
"instance_id": "***",
"cluster_id": "***",
"zone": "us-central1-b",
"type": "kibana",
"kind": "kibana",
"id": "***"
}
}
}
}
},
"name": "***,
"id": "***"
}
},
"url": {
"query": {
"ignore_unavailable": "true",
"ignore_throttled": "true",
"preference": "1580399421715",
"rest_total_hits_as_int": "true"
},
"pathname": "/elasticsearch/.ent-search-app-logs-*/_search",
"href": "/elasticsearch/.ent-search-app-logs-*/_search?rest_total_hits_as_int=true&ignore_unavailable=true&ignore_throttled=true&preference=1580399421715",
"search": "?rest_total_hits_as_int=true&ignore_unavailable=true&ignore_throttled=true&preference=1580399421715",
"path": "/elasticsearch/.ent-search-app-logs-*/_search?rest_total_hits_as_int=true&ignore_unavailable=true&ignore_throttled=true&preference=1580399421715"
},
"host": {
"name": "***"
},
"beat": {
"hostname": "***",
"name": "***",
"version": "6.3.2"
},
"prospector": {
"type": "log"
},
"level": "error",
"message": "Internal Server Error",
"tags": [
"kibana",
"logging"
],
"@timestamp": "2020-01-30T21:02:42.000Z",
"meta": {
},
"input_type": "kibana"
},
"fields": {
"@timestamp": [
"2020-01-30T21:02:42.000Z"
]
},
"sort": [
***
]
}
Unfortunately we can't get much value from this log, we really need full Kibana verbose logs to see the entire request authentication flow to know what's going on.
@azasypkin Gotcha. I'll see about getting verbose logging turned on, and then replicating the error.