Synapse: User-Interactive Auth sessions need to be replicated across the processes of a worker-mode deployment

Created on 14 Jan 2020 · 17Comments · Source: matrix-org/synapse

Description

So I was trying to figure out why the /auth/xxx/fallback/web was working on a local homeserver instance but not on matrix.org.

In this flow, there's a sessions dict which governs the current state of your login/registration progress. When you complete a auth stage, your session in this dict is modified. This sessions dict is available to registration and login methods.

Turns out that in the recommended worker setup, you can end up with two workers, each with their own instance of sessions. One tracks your login progress, and another that gets modified when you successfully complete web fallback. However, because they are separate, you don't actually see any progress when you query /register after successfully completing web fallback registration stages.

Essentially /auth/xxx/fallback/web needs to get routed to the same worker as /register.

Synapse v1.8.0

bug p1

Source

anoadragon453

Most helpful comment

sticking it in the db sounds good

richvdh on 2 Apr 2020

👍2 🎉1 😄1

All 17 comments

Also slightly worrying as I think mobile use this to do recaptcha...?

anoadragon453 on 14 Jan 2020

We should investigate if this is actually breaking clients.

richvdh on 23 Feb 2020

Yes, it is breaking clients. I can confirm that, as I just implemented the user interactive auth part of registration in nheko. Basically we try the first stage, but after the user completed the challenge and clicks okay, synapse still returns completed: []. I'm pretty sure I fixed our bugs and the fault is now with synapse/matrix.org in this case. If you want to verify that, it should be possible to test it with the latest nightly build of nheko (still building atm).

deepbluev7 on 23 Feb 2020

thinking about this again: it doesn't need any more investigation. It just needs an update to the documentation, and an update to the matrix.org deployment (and worker-based modular.im deployments).

richvdh on 6 Mar 2020

I'm going to do this.

clokep on 7 Mar 2020

Thanks for picking this up @clokep. Just a reminder to us both that the update needs deploying on the matrix.org and modular.im infrastructure - I suggest you ping @benbz and @jaywink next week to see about getting that done.

richvdh on 7 Mar 2020

Does this fix all cases, where auth fallbacks are used? I'd imagine all requests, that use interactive authentification, need to be routed to the same endpoint as the request, that requires auth. For example I believe that the auth fallback for device deletion needs to be routed to the same worker, that handles that request. Or am I missing something like this not being handled by a worker, so it should work fine?

deepbluev7 on 7 Mar 2020

ohhhhh gosh that's a good point. this is going to be much harder to fix.

richvdh on 7 Mar 2020

I took another look at what uses the sessions object from the auth handler (nothing directly uses it outside of the class, but there's a variety of indirect users via _get_session_info and _save_session).

I think the following patterns use it:

/register$
/auth/.*/fallback/web$
/account/password$
/account/deactivate$
/account/3pid/add$
/delete_devices
/devices/.+$ (deleting a device)
/keys/device_signing/upload$
/login$ (for non-JWT and non-token logins)

The first two on this list are now documented, not sure if we just need to document the rest of these or if there's more to do. Just documenting them seems fairly fragile though (we will likely run into this again in the future).

clokep on 9 Mar 2020

❤1

Yeah, those endpoints match my expectation, of where I know UI auth is used. I think just documenting them is a bit brittle. I can't even find some of them in the doc, so I guess those are handled by the master process? I think it would make the most sense, if all of those could be handled by the client_reader and successive requests from the same client to the client reader were routed to the same instance of the client_reader, but I don't really know where the cut-off for that would be (5 minutes?). I think there should be some way for workers to cross-talk maybe, so that they can continue fallback auth, if needed. Since this only affects requests that contain /auth/.*/fallback/web or auth inside the request body, maybe a better solution than documenting it could be found? Maybe if redis is used for communication between master processes, that could be extended to UI auth?

deepbluev7 on 10 Mar 2020

Maybe if redis is used for communication between master processes, that could be extended to UI auth?

This is where my mind went to as well, especially with the work that @erikjohnston has been doing.

clokep on 10 Mar 2020

interim solutions to this could be:

just document that everything that does UI auth has to go to the master - though that is unsatisfactory because some of them can be quite high-cpu (particularly the bcrypting part of password-setting/registration)
create an internal http API on the master, which just handles UI auth, and which workers can call.

richvdh on 17 Mar 2020

Somewhat related, https://github.com/matrix-org/synapse/pull/649#issuecomment-197387749 notes that UI auth sessions get dropped over restarts.

clokep on 27 Mar 2020

In addition to https://github.com/matrix-org/synapse/issues/6705#issuecomment-600005895 we could attempt to solve this by persisting the data more permanently:

Use Redis (this could be nice since I think it has utilities to auto-expire content), but I'm unsure of the final plan for Redis for workers in the future. It also means that the implementation for worker vs. non-worker will diverge more heavily, but I don't think it would be terrible to abstract.
Persist the data into SQL. This means we'll need to run some sort of periodic task to prune data, but besides that I don't see a ton of downsides.

clokep on 1 Apr 2020

sticking it in the db sounds good

richvdh on 2 Apr 2020

👍2 🎉1 😄1

Fixed by #7302.

clokep on 30 Apr 2020