Synapse: Unable to parse SAML2 response: Unsolicited response

Created on 10 Mar 2020 · 25 Comments · Source: matrix-org/synapse

Sometimes, when authenticating with passwordless login on Mozilla's SSO, the user's browser gets told to POST to /authn_response with a SAML AuthN response (as expected), but that call seems to fail with the error "Unable to parse SAML2 response: Unsolicited response: id-XXXXXXXXXXXXX".

I'm currently not sure why this happens.

bug mozilla

babolivier

All 25 comments

possibly we're hitting /authn_response twice for the same request somehow?

richvdh on 17 Mar 2020

Is there any consistent way to reproduce this or just logging in a bunch of times?

clokep on 23 Apr 2020

Is there any consistent way to reproduce this or just logging in a bunch of times?

If it helps, there is a bunch of these daily in the Modular Synapse sentry.

jaywink on 23 Apr 2020

Summary:

I think this might only be an issue in worker mode, where the response can come back to a different worker from the one that issued the request. The request is tracked in in-memory storage, so the second worker has no knowledge of it. I see a few ways forward:

Persist the information into the database so all workers have it.
Disable the check for unsolicited SAML responses.
Ensure that SAML requests and callbacks always go to the same worker (#7530 has some thoughts on that).

More details

I think #7530 might actually be a duplicate of this!? My thought is the following happens:

User does something to initiate a SAML flow.
Synapse returns a redirect to the IdP and stores the SAML session ID in memory.
The user goes through the SAML flow, blah blah.
The user gets directed back to Synapse, but a different worker. 💥 You get the error from above.

Note that the SAML response is valid, just that the worker knows nothing about it.
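
A minimal sketch of the hypothesised failure mode, assuming each worker keeps its own in-memory record of the requests it issued. The `Worker` class and its methods are hypothetical stand-ins, not Synapse or pysaml2 code.

```python
import uuid


class UnsolicitedResponse(Exception):
    """Raised when a worker sees a response for a request ID it never issued."""


class Worker:
    """Hypothetical stand-in for a Synapse worker holding SAML state in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Request IDs issued by this process only; other workers cannot see them.
        self.outstanding_requests = {}

    def start_login(self) -> str:
        # Steps 1 and 2: remember the request ID, then redirect to the IdP.
        request_id = "id-" + uuid.uuid4().hex
        self.outstanding_requests[request_id] = True
        return request_id

    def handle_response(self, in_response_to: str) -> str:
        # Step 4: the IdP response names the request it answers.
        if in_response_to not in self.outstanding_requests:
            raise UnsolicitedResponse(f"Unsolicited response: {in_response_to}")
        del self.outstanding_requests[in_response_to]
        return "ok"


worker_a = Worker("a")
worker_b = Worker("b")

request_id = worker_a.start_login()

# The callback lands on a different worker, which has never seen the ID.
try:
    worker_b.handle_response(request_id)
    outcome_on_other_worker = "ok"
except UnsolicitedResponse:
    outcome_on_other_worker = "unsolicited"

# The same response is accepted by the worker that issued the request.
outcome_on_same_worker = worker_a.handle_response(request_id)
```

In this model the response itself is identical in both cases; only the receiving process differs.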

I suspect the solution is to store this information in the database, similar to what we did for #6877.

There's already a table (user_external_ids) which has auth_provider, external_id, user_id as columns. We could likely add a table with the columns auth_provider, request_id, creation_time, ui_auth_session_id.
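
A rough sketch of what such a table could look like, using an in-memory SQLite database so it runs standalone. The table name `saml_outstanding_requests`, the column types, and both helper functions are assumptions for illustration; none of this is an existing Synapse schema.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE saml_outstanding_requests (
        auth_provider TEXT NOT NULL,
        request_id TEXT NOT NULL,
        creation_time INTEGER NOT NULL,  -- milliseconds since the epoch
        ui_auth_session_id TEXT,         -- NULL when this is a plain login
        PRIMARY KEY (auth_provider, request_id)
    )
    """
)


def store_request(provider, request_id, ui_auth_session_id=None):
    """Record a request so that any worker sharing the database can find it."""
    conn.execute(
        "INSERT INTO saml_outstanding_requests VALUES (?, ?, ?, ?)",
        (provider, request_id, int(time.time() * 1000), ui_auth_session_id),
    )
    conn.commit()


def lookup_request(provider, request_id):
    """Return (creation_time, ui_auth_session_id), or None if the ID is unknown."""
    return conn.execute(
        "SELECT creation_time, ui_auth_session_id FROM saml_outstanding_requests"
        " WHERE auth_provider = ? AND request_id = ?",
        (provider, request_id),
    ).fetchone()


store_request("saml", "id-example", ui_auth_session_id="session-example")
row = lookup_request("saml", "id-example")
missing = lookup_request("saml", "id-never-issued")
```

Keeping `creation_time` in the row means expiry could still be enforced at lookup time, whichever worker handles the callback.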

I was curious how OIDC handled this, and that doesn't seem to persist anything in memory, so I took another look at why we have this _outstanding_requests_dict:

It is passed into pysaml2 and used to ensure that this is not an "unsolicited query" (you can disable this check, I'm not sure if we should however).
It is used to get back to the UI auth session ID after the redirection is all done.
(We also prune items from this list after a period of time, which is why we store creation time on it.)
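
The three roles listed above can be sketched as a small container. This is an illustrative model only: the class and field names are hypothetical, and the five-minute lifetime is the default mentioned elsewhere in this thread.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OutstandingRequest:
    creation_time_ms: int
    ui_auth_session_id: Optional[str] = None


class OutstandingRequests:
    """Hypothetical in-memory store keyed by SAML request ID."""

    def __init__(self, lifetime_ms: int) -> None:
        self.lifetime_ms = lifetime_ms
        self._requests: Dict[str, OutstandingRequest] = {}

    def add(self, request_id, now_ms, ui_auth_session_id=None):
        self._requests[request_id] = OutstandingRequest(now_ms, ui_auth_session_id)

    def expire(self, now_ms):
        # Prune anything older than the lifetime; this is why the creation
        # time is stored alongside each request.
        cutoff = now_ms - self.lifetime_ms
        expired = [
            rid for rid, req in self._requests.items()
            if req.creation_time_ms < cutoff
        ]
        for rid in expired:
            del self._requests[rid]

    def pop(self, request_id, now_ms):
        # Returns the stored request (and with it the UI auth session ID),
        # or None if the ID is unknown or has been pruned.
        self.expire(now_ms)
        return self._requests.pop(request_id, None)


MINUTE_MS = 60 * 1000
requests = OutstandingRequests(lifetime_ms=5 * MINUTE_MS)
requests.add("id-fast", now_ms=0, ui_auth_session_id="ui-session")
requests.add("id-slow", now_ms=0)

fast = requests.pop("id-fast", now_ms=4 * MINUTE_MS)  # still within the lifetime
slow = requests.pop("id-slow", now_ms=6 * MINUTE_MS)  # pruned before lookup
```

A request that comes back after the lifetime looks exactly like one that was never issued, which is what makes the resulting error hard to interpret.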

Note that we could actually pass the UI auth session ID in the RelayState (since that's unused for UI auth, see #7484), so my question is: do we need to protect against unsolicited requests like this? I'm unsure of the security ramifications of disabling this.
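
As a sketch of the RelayState idea, assuming the IdP echoes RelayState back unchanged alongside the response: the URL, the placeholder `SAMLRequest` value, and both helper functions are made up for illustration, and the real request encoding is not reproduced here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_redirect(idp_sso_url, saml_request, ui_auth_session_id):
    """Build a redirect to the IdP carrying the UI auth session ID as RelayState."""
    query = urlencode(
        {"SAMLRequest": saml_request, "RelayState": ui_auth_session_id}
    )
    return f"{idp_sso_url}?{query}"


def session_from_callback(form_body):
    """Recover the UI auth session ID from the form body POSTed back to us."""
    values = parse_qs(form_body).get("RelayState")
    return values[0] if values else None


redirect_url = build_redirect(
    "https://idp.example.com/sso",  # hypothetical IdP endpoint
    "PHNhbWxwPg==",                 # placeholder, not a real encoded request
    "ui-session-123",
)

# Simulate the IdP echoing RelayState back in its POST to /authn_response.
sent = parse_qs(urlsplit(redirect_url).query)
callback_body = urlencode(
    {"SAMLResponse": "PHNhbWxwPg==", "RelayState": sent["RelayState"][0]}
)
recovered = session_from_callback(callback_body)
```

This only covers getting back to the UI auth session; it does nothing to detect unsolicited responses, which is the open question here.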

I'd be curious if other people have ideas on how to fix this!

clokep on 22 May 2020

so my question is: do we need to protect against unsolicited requests like this?

I've never really understood its purpose. Possibly a defence in depth against CSRF attacks? We can probably remove it if that solves any problems.

However, I'm not sure that the hypothesis fits the symptoms. It's reported against mozilla's deployment of synapse, which only has one of each type of worker (except synchrotrons). If this were a problem with requests going to different workers, I would expect it to either always work or never work. I can't see how we'd end up with an intermittent bug.

richvdh on 23 May 2020

However, I'm not sure that the hypothesis fits the symptoms. It's reported against mozilla's deployment of synapse, which only has one of each type of worker (except synchrotrons). If this were a problem with requests going to different workers, I would expect it to either always work or never work. I can't see how we'd end up with an intermittent bug.

A similar symptom would happen during a restart of services, I'm unsure how often that would happen on Modular instances.

clokep on 26 May 2020

It might also be worth fixing the known situation and seeing if it still happens, but ideally we'd want to ensure the solution works in all cases...

clokep on 1 Jun 2020

possibly. I'm hoping we'll be able to get some logs out of the modular instance to help understand what is going on.

richvdh on 1 Jun 2020

ok well, I got logs for two instances of this this morning on the mozilla instance.

The first one is a complete mystery tbh. A client, from an IP address we've never seen before, suddenly pops up with a SAML session ID we've never heard of (or at least, I couldn't find in some brief grepping of the logs). I guess it's just an old session, and the user used an old link in their browser history or something. The main source of regret here is that the error message isn't better ("oops something went wrong" isn't terribly informative.)

The second one is much clearer: the user took 6 minutes to validate their email address and come back. We expire the SAML session dict after only 5 minutes. Particularly given auth0's email validation links are valid for 15 minutes, this seems... silly.

richvdh on 2 Jun 2020

I wonder if the expiry time should be configurable?

clokep on 2 Jun 2020

it is. But I think the default is probably too short.

richvdh on 2 Jun 2020

Looks like this is already configurable, the default is 5 minutes:

https://github.com/matrix-org/synapse/blob/b2b86990705de8a099093ec141ad83e09f182034/synapse/config/saml2_config.py#L283-L287

Edit: Doh, you already said it is configurable. 😢

clokep on 2 Jun 2020

I put up #7664 to increase the timeout. Might not be an ideal solution, but should fix a concrete case we've seen.

clokep on 9 Jun 2020

This is happening much less after the changes in #7664. Not sure if the remaining cases are people taking longer than the 15 minutes to finish validation or not. I'm unclear what the next steps might be here: try to improve the error message maybe?

clokep on 25 Jun 2020

are we still getting reports of this? I'd be inclined to close it if not.

Otherwise yes, probably need to remember where the "oops something went wrong" error message is coming from and try to make it give more clues as to what went wrong.

richvdh on 29 Jun 2020

Yes, it looks like we're still seeing this (around 5-10x/day on Modular).

babolivier on 29 Jun 2020

gosh. it was only a couple a day back when I investigated a few weeks ago (mind you, there was some brokenness in logging at the time).

ok then I would like to suggest a two-pronged approach:

investigate the logs for a representative sample of the failures to see if we can understand why they are continuing to fail
assuming it turns out that it is just lots of people turning up with old saml session ids, try to improve the error handling.

richvdh on 29 Jun 2020

it was only a couple a day back when I investigated a few weeks ago

Sentry seems to be bucketing some of these separately. In this case I looked at two separate issues that each had roughly between 2 and 5 occurrences per day; maybe that explains why you were seeing fewer of them?

babolivier on 29 Jun 2020

investigate the logs for a representative sample of the failures to see if we can understand why they are continuing to fail

I spent some time with these logs and with Sentry and couldn't really figure out if there was a correlation between old requests or something else happening.

I think improving the error handling might be useful, I'm guessing that the concern with that is that we're missing a "real" bug?

clokep on 7 Jul 2020

Now that we have better logging I looked back over the last 7 days of this error occurring on the Mozilla instance:

7 were due to the SAML session being used outside the 15 minute timeout.
4 were due to a SAML session ID being re-used (see below).
1 was due to a server restart (which is more similar to #7530 than to this issue, in my opinion).

Note that we remove the outstanding request once a response for it is received -- this seems correct, but I'm unsure if SAML allows for a single session to be completed multiple times (assuming that they are all within the proper timeout and such).

I'm not sure what, if anything, should be done to handle these cases? Maybe we can improve the error page to say something like "Your SAML session might have timed out or already been completed. Please try again." Or something to that effect?
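
One way to tell these cases apart before choosing an error message, sketched under two assumptions that go beyond what is described in this thread: that IDs of completed requests are remembered for a while, and that expired entries are still available at lookup time. All names and message wording below are hypothetical.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Failure(Enum):
    EXPIRED = "expired"
    ALREADY_COMPLETED = "already completed"
    UNKNOWN = "unknown"


def classify(
    request_id: str,
    outstanding: Dict[str, int],  # request ID -> creation time in ms
    completed: Set[str],          # request IDs that already got a response
    now_ms: int,
    lifetime_ms: int,
) -> Optional[Failure]:
    """Return why a response should be rejected, or None if it is acceptable."""
    if request_id in completed:
        return Failure.ALREADY_COMPLETED
    created = outstanding.get(request_id)
    if created is None:
        # Never issued by us, or lost (for example across a restart).
        return Failure.UNKNOWN
    if now_ms - created > lifetime_ms:
        return Failure.EXPIRED
    return None


MESSAGES = {
    Failure.EXPIRED: "Your SAML session has timed out. Please try again.",
    Failure.ALREADY_COMPLETED: (
        "Your SAML session has already been completed. Please try again."
    ),
    Failure.UNKNOWN: (
        "Your SAML session might have timed out or already been completed. "
        "Please try again."
    ),
}

MINUTE_MS = 60 * 1000
outstanding = {"id-live": 0, "id-old": 0}
completed = {"id-done"}

live = classify("id-live", outstanding, completed, 10 * MINUTE_MS, 15 * MINUTE_MS)
old = classify("id-old", outstanding, completed, 16 * MINUTE_MS, 15 * MINUTE_MS)
done = classify("id-done", outstanding, completed, 1 * MINUTE_MS, 15 * MINUTE_MS)
lost = classify("id-lost", outstanding, completed, 1 * MINUTE_MS, 15 * MINUTE_MS)
```

If neither assumption holds, all three causes collapse into the `UNKNOWN` case and only the generic message is possible.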

clokep on 31 Aug 2020

do we have any idea why people would re-use the SAML session ID?

Improving the error text seems sensible either way.

richvdh on 1 Sep 2020

do we have any idea why people would re-use the SAML session ID?

My guess is that it is due to reloading a page? Or if e-mail verification is in the workflow it could be clicking on a link twice? I should note that the "re-used" SAML session IDs were within the 15 minute timeout period (and all from 2 users).

clokep on 1 Sep 2020

Since I couldn't remember what behavior the user sees here: they currently just get an internal server error sent back to them (since it is part of the redirect flow, the client isn't involved).

Steps to reproduce this sanely: