Phoenix: Race condition when presence is left shortly after presence state retrieval

Created on 17 Apr 2017  路  31Comments  路  Source: phoenixframework/phoenix

Environment

  • Elixir version (elixir -v): Elixir 1.4.0, erts-8.2)
  • Phoenix version (mix deps): phoenix 1.3.0-rc.1
  • NodeJS version (node -v): v7.7.2
  • NPM version (npm -v): 4.4.1
  • Operating system: OS X 10.12.3

Expected behavior

Using the default setup for channels and presence, any leaves happening concurrently with a new channel's connection should result in a client side presence state without the left presence (either through it not being included in presence_state or it being removed shortly thereafter through a presence_diff.

Actual behavior

Some presence changes, that are applied after the newly joined channel has retrieved its current presence state, will be sent to the client before the presence_state event is sent. When this happens, any such presence leaves will be ignored by the default js syncDiff. Once the presence_state event is received, the left presences will be kept even though they are not present anymore.

Easiest way to reproduce locally

  1. Introduce a delay in a Presence.fetch/2 callback, that scales linearly with the number of presences. Example: def fetch(_topic, presences) do Process.sleep(1000 * Enum.count presences); presences end
  2. Open a few tabs to make sure the presence state contains 5 or more items.
  3. Reload a tab, triggering a new channel connection.
  4. Within a few seconds, close one or more of the other tabs.
  5. Observe that the newly opened tab recevies the leave events before it's presence_state, and that it keeps those left presences until the next reload.

A runnable example of the above is available at https://github.com/alvinlindstam/phoenix_social_presence/tree/race-condition

I understand that the one second delay is an extreme example, but a fetch function that takes longer for many presences than for a small diff is hardly unlikely. I'm not sure if this would be possible to reproduce without a fetch function, but I can't se any reason why a presence change handled just after a presence state fetch by a new channel couldn't be sent to and handled by the transport process before the channel process itself sends the state to the transport.

Most helpful comment

Thanks for bearing with me on this @alvinlindstam . The overlap with the race on the Presence server side when this was initially reported threw me off, and in retrospect you provided all the information I could ask for with an exact project to reproduce, so I really really appreciate it :) This has been fixed in https://github.com/phoenixframework/phoenix/pull/2763 , using your example project as a means to expose the race. Now fixed in master and will be out in the next release. In the meantime you could vendor the phoenix.js. Thank you!!! 鉂わ笍鉂わ笍鉂わ笍馃惀馃敟

All 31 comments

Short of abandoning the idea that getting an initial state and then handling any diffs is sufficient, I'm guessing that the appropriate fix is to handle this client side. The syncDiff could be able to return or handle any unmatched leaves, so that they may be re-applied after the next syncState. This would however require client side API-changes.

You could also make sure that they are synced by intercepting any presence_diff events, but that is of course undesirable from a performance standpoint.

You could also eliminate the ignore leaves issue by retrieving the presence state in the join callback instead of after the join, but apart from any other delay issues, you would instead get the same race condition possibility but for joins.

I got some understanding for the client side lib when porting the presence functions to https://www.npmjs.com/package/phoenix-presence-immutable, so I wouldn't mind creating a PR for that if you agree that it's a good idea.

Thanks for the report Alvin! Let me dig into things a bit to see where best to solve this (client vs server).

+1 - we get multiple zombie presences on our prod server right now. It might related to this

Can you provide more info? How many replicas? Stacktraces? What tracked processes are doing, etc?

@alvinlindstam note, I am working on an internal change to the channel wire protocol that will resolve these races. It will require changes on the client, which will be supported in our 1.3 phoenix.js release. The server changes are backwards compatible, but the race will only be fixed by clients speaking the new protocol. I should have something out soon!

I'm trying to understand exactly why it happens for us, so we have a chat application and rely on presence to determine if we should send push notifications or not.
We started noticing we were not getting any new push notifications and by looking at the presence list we can see some zombie connections:

Presence.list("user:49")
%{"49" => %{metas: [%{client: "web", phx_ref: "PYR4iqp9blg=", since: 1492794605,
       status: "online"},
     %{client: "web", phx_ref: "oRjKL9vS5nw=", phx_ref_prev: "Z5A1WbC22g8=",
       since: 1492796622, status: "away"},
     %{client: "web", phx_ref: "j+z+7qY9Fj8=", phx_ref_prev: "jrSTX5LuQKY=",
       since: 1492796623, status: "away"},
     %{client: "web", phx_ref: "zOXh0cqEf5k=", since: 1493267718,
       status: "online"}]}}

On the last two ones (set as away) the socket connection is terminated. Shouldn't that take care of actually removing those instances?
In Alvin's example he says that the presence only goes away when he refreshes the tab, but for us, if the user closed the tab the zombie presence would never go away?
Is there any way we can manually clean those for now just to get the current system to work?
Right now we're running Phx 1.2, in the process of updating to 1.3.

yes, once the socket connection terminates, the presences should be removed. Are you absolutely positive the socket transport (and therefore its channel pids) are actually terminated? If a physical connection closes ungracefully in some cases, the transport connection will not be immediately closed and may remain alive until tcp keep-alive finally nukes it. For example, if a client closes their laptop lip, or hard disconnects where the browser does not send a proper close frame for the websocket connection. If you can indeed confirm the processes have gone away but the presences remain it's definitely a bug, but we need more sanity checks first. Thanks!

We're investigating this right now. Initial tests on local work perfectly. We're wondering if this could be related to our deployment.. Right now we deploy using docker containers, one container always stays up whenever a new build comes in - it seems that way the user never loses the ws connection, but we're wondering if this could break presence somehow.
I'll investigate further and keep you posted.

Are you using distributed elixir with the pg2 pubsub adapter or redis?

We know the ip of every server we bring up, then we just call connect manually:

      {:ok, ips} ->
#        IO.puts "Connecting to #{name}: #{inspect ips}"
        for {a,b,c,d} <- ips do
          Node.connect :"server@#{a}.#{b}.#{c}.#{d}"
        end

Is that a problem? It is similar to what you do here https://dockyard.com/blog/2016/01/28/running-elixir-and-phoenix-projects-on-a-cluster-of-nodes

Closed by accident, reopened the issue.

I've also seen zombie users in Phoenix presence for a long time. We use the pg2 adapter with the latest stable versions (phoenix 1.2.1, phoenix_pubsub 1.0.1), and run inside containers with EPMD-less distribution. During deployments, we bring up a new set of nodes, make them join the cluster (:net_adm.ping all nodes), and then bring down the old nodes. We also used Redis pubsub in the past though, and had similar issues.

Here's an example. First I start a hidden node and connect to the cluster, which has two nodes. I confirm they see each other fine:

iex([email protected])26> :recon.named_rpc nodes, fn -> Node.list() end
{["[email protected]": [:"[email protected]"],
  "[email protected]": [:"[email protected]"]], []}

If I query Presence on both nodes, they agree that users 7 and 118 are in the "global" channel right now. Note how "online_at" for user 7 has a month-old timestamp, which is impossible, because our last deployment was a week ago (all nodes report :erlang.statistics(:wall_clock) of about 7 days).

iex([email protected])23> :recon.named_rpc nodes, fn -> MyApp.Presence.list("global") end
{["[email protected]": %{"118" => %{metas: [%{online_at: "1493621693",
         phx_ref: "KUVDXzwNm6k="}]},
    "7" => %{metas: [%{online_at: "1490859335", phx_ref: "Np+kT2qhiVA="}]}},
  "[email protected]": %{"118" => %{metas: [%{online_at: "1493621693",
         phx_ref: "KUVDXzwNm6k="}]},
    "7" => %{metas: [%{online_at: "1490859335", phx_ref: "Np+kT2qhiVA="}]}}],
 []}

I know user 7 is not actually online, and if I query the ETS tables directly, I only see entries for the one user who's really online (118).

iex([email protected])22> :recon.named_rpc nodes, fn -> :ets.tab2list MyApp.PubSub.GC0 end
{["[email protected]": [{#PID<16737.17371.8>, "global"},
   {#PID<16737.17370.8>, "users:118"},
   {#PID<16737.456.0>, "phx_presence:Elixir.MyApp.Presence"}],
  "[email protected]": [{#PID<16738.453.0>,
    "phx_presence:Elixir.MyApp.Presence"}]], []}

I'm not sure what would be interesting to check in the tracker state, but I poked around, and "values" only mentions the one user that's online for real:

       values: %{{{:"[email protected]", 1493013769920286},
          2614} => {#PID<9671.17371.8>, "global", 118,
          %{online_at: "1493621693", phx_ref: "KUVDXzwNm6k="}}}},

If I kill MyApp.Presence on all nodes using RPC, things look good again (for a while).

I've also seen a variation on this issue before, where nodes don't agree about who's online: they can talk to each other just fine (we use :global for some things), but MyApp.Presence("global") returns a zombie user on one node and not on the other node. At the time I thought this was something related to https://github.com/phoenixframework/phoenix_pubsub/issues/55, but perhaps it wasn't.

I'll debug more when time allows - any pointers would be appreciated!

Alright, we did a lot of testing today and got some answers and a bunch of questions 馃槀

So first, about our architecture:

  • We use PG2 for pub/sub
  • Latest stable 1.2.1 Phoenix
  • Our deployment is done through CircleCI, we use Distillery and build a docker container that we then manage using Rancher.

    How we reproduce the problem

First we connect a bunch of users to the application, then we trigger a deploy.
Under the hood, this will build containers with the new code, start them, and then when everything is good it will bring down the old containers.
We manually connect nodes because rancher will give us one ip per container, so we got them to talk to each other (Node.list shows all connected nodes correctly).
The problem is that when the old containers go down we can verify zombies on the presence list of the servers.

More info on the problem

First, we added the node information to the metadata on the presence list, which actually shows that the zombies are "connected" to instances that no longer exist:

...
%{client: "web", node: :"server@*.*.*.128", phx_ref: "KaiuIglEXFw=",
       since: 1493686595, status: "online"},
...

And the Node list (btw, lucas@local takes over the ip, but it is different that 128 as well):

`[:"server@*.*.*.74", :lucas@local]`

This is very interesting in itself, since from what I understand presence uses a heartbeat system to determine if the user is online, correct? Why wouldn't it timeout in this case? (For sanity, we pinged, and tried to access the old containers and they sure seem to be gone).
Now, to also confirm @dmorneau's case, we spinned up a bunch of servers (like 4, which temporarily scales to 8 during a deploy) and the problem with having different Presence lists is also replicated. We notice that some servers have more zombies than others, which again, is very weird. If the new nodes inherited the presence list from the old ones, how is it possible they have different values?

Our assumptions so far

Problem is definitely related to deployment and possibly docker, we think it could be something related to Rancher( and/or docker ) not gracefully shutting down our containers. Which in turn raises the question if that could cause a bug in Phoenix presence.
We've been trying (unsuccessfully) to better control the shutting down of the containers, right now rancher kills the container completely after 10s of a SIGTERM). But even then we don't know if that could fix the issue.

Right now our only solution is to scale containers to 0 during deploy, but that causes downtime :(
We can also use the metadata above to check if the user in the presence list is part of our current Node.list, which makes it reliable again, but leaves a memory leak :(

We're willing to put as much time as needed on this, for our use case Presence is considered critical, but it would be awesome if you could shed some light in the direction we should be going.
Any help is very welcome and appreciated! Thank you!

@alvesl and @dmorneau: Your issues seem to be related to syncing the presence between different nodes server side, while this issue is about syncing the server state to the client.

Could you create a new issue regarding your syncing problems?

@alvinlindstam I am on vacation this week, but I'll be looking into this as soon as I return next week. Thanks for the info!

@alvinlindstam can you please try moving the client push to AFTER you start tracking presences? https://github.com/alvinlindstam/phoenix_social_presence/blob/race-condition/lib/social_presence/web/channels/user_channel.ex#L21

Otherwise between sending the list to the client and tracking, you will lose notifications about process leaving.

@dmorneau can you see any information about the user 7 anywhere in the tracker state?

@alvinlindstam actually, please ignore me. Changing the order should not affect your issue.

Your issue seems to be that the client is receiving the "presence_diff" events before it receives its first "presence_state". In such cases, we should probably store the diffs in the client until the first "presence_state" is done. I am not sure if the client currently does that but that should provide a direction for fixing the issue.

@josevalim: Precisely. Given that the subscription to the topic is done before :after_join message is handled, other user's presence_diffs might be pushed before the presence_state. Storing the presence_diffs client side would be my fix as well, but @chrismccord seem to be looking for some higher level fix for this.

This is not a high priority fix for me, I don't use this in production yet, so I don't mind waiting for Chris' fix.

Trivia: I actually found this when experimenting with https://medium.com/@alvinlindstam/phoenix-presence-for-social-networks-5fb67143f0ad, where my presence_state was generated by merging many calls to Presence.list.

@alvinlindstam after talking to @chrismccord I think he is also settling in the same solution as we propose (store the diffs until we receive the state) so if you would like to give it a try, please go ahead.

I will. However, I'd like some input on the proper API for it.

As it is today, the suggested way to handle the presence state client side is to store it in a plain object, and use the syncState and syncDiff functions to update it when the given events occur. Storing unhandled diffs would require an extra item of state to keep around (it can't be contained in the channel abstraction since that doesn't know of the presence state, it only passes the events on).

One option: Make this extra state explicit:

let presences = {} // client's initial empty presence state
let unhandledDiffs = {}

// receive initial presence data from server, sent after join
myChannel.on("presence_state", state => {
  presences = Presence.syncState(presences, state, onJoin, onLeave, unhandledDiffs)
  displayUsers(Presence.list(presences))
})
// receive "presence_diff" from server, containing join/leave events
myChannel.on("presence_diff", diff => {
  presences = Presence.syncDiff(presences, diff, onJoin, onLeave, unhandledDiffs)
  this.setState({users: Presence.list(room.presences, listBy)})
})

The solution above would require clients to pass the extra state object around. Here, we would be mutating the unhandledDiffs, breaking the purely functional properties of syncState/syncDiff. We could of course return a mutated copy instead, but that would return some composite object that is different from the presences object used today.

Another option, if we do need to change the client side api, is to embed the state in some object that phoenix.js would control:

let presences = new Presence(onJoin, onLeave)  // client's initial empty presence state


// receive initial presence data from server, sent after join
myChannel.on("presence_state", state => {
  presences.syncState(state)
  displayUsers(presences.list())
})
// receive "presence_diff" from server, containing join/leave events
myChannel.on("presence_diff", diff => {
  presences.syncDiff(diff)
  this.setState({users: presences.list(listBy)})
})

The option above would mutate the state in an existing object, which maybe isn't what we want.

A third, less invasive option, would be to maintain a global set of unmatched leaves hidden from the application. The current API could work without changes, but with the drawback that the syncState and syncDiff would not be pure anymore. In order for it to work, we'd also have to be sure that phx_ref is unique for all presence metas, even on different topics. Am I correct in understanding that this should generate random refs that are supposed to be globally unique?

Or did you think that the channel should store the events until the first presence_state? How would we handle reconnects in such a case?

@alvinlindstam unfortunately I am not well versed in JS to be able to judge the solutions. It is probably best to wait for @chrismccord's input then.

I've been playing around with this: https://github.com/phoenixframework/phoenix/compare/master...alvinlindstam:presence-js-fixes

I've bundled it with an overhaul of the client side API for presences, which I'm sure isn't something to do lightly. I'd think that such an API would match the style of the Channel and Socket classes, and could clarify the code somewhat.

It also makes it possible to add checks such as that for unhandled leaves without changing the API.

Any input is appreciated :)

I believe this issue has been fixed by https://github.com/phoenixframework/phoenix_pubsub/pull/80 as the client side should not have to handle any deduping itself. Thanks!

I have not tested this since https://github.com/phoenixframework/phoenix_pubsub/pull/80, but as I understand it that change only modifies the syncing of the presence state between nodes. This issue is about syncning between the local state of each node and the client side state of it's connected presence channels (it can be observed with just a single node).

The phoenix.js helpers and the default suggested implementation of sending presence_state and presence_diff events does currently depend on the correct order of events, but that order does not seem to be guaranteed:

The join and leave events are generally sent directly to the client as soon as the channel is joined. I assume that Presence.list is atomic, so that it always returns the full presence state of the topic at one moment in time, but there is not guarantee that leaves or joins that happen just efter that state was fetched is transmitted to the client after the presence.

I think it's premature to close this, unless we've confirmed that it's fixed.

Ah sorry I jumped the gun on the issue. I thought it was caused by the presence ghosting issue, but I see the issue is server side presence_diff racing the presence_state. This will be fixed in 1.4 Thanks!

Thanks for bearing with me on this @alvinlindstam . The overlap with the race on the Presence server side when this was initially reported threw me off, and in retrospect you provided all the information I could ask for with an exact project to reproduce, so I really really appreciate it :) This has been fixed in https://github.com/phoenixframework/phoenix/pull/2763 , using your example project as a means to expose the race. Now fixed in master and will be out in the next release. In the meantime you could vendor the phoenix.js. Thank you!!! 鉂わ笍鉂わ笍鉂わ笍馃惀馃敟

I understand the confusion :)

Thanks for looking at this again, the fix seems looks god for the reported issue. I hate to be a nitpicker, but does not the use of the synced flag in https://github.com/phoenixframework/phoenix/commit/0b0828cd23b3b6ebad2331bd72fe4b2b78851756#diff-e1d1e2ecdbf26e31a7a537204bd1081bR1191 mean that this might still happen whenever the client reconnects/rejoins a presence topic/channel? On a reconnect, such as after a network interruption, the client joins the channel again and is resent the presence state, but state.synced == true so it will not save diffs that arrive between the reconnect and the new state.

Maybe we could reset the synced flag on each reconnect from the client?

Bah, you are right. I need to do some thinking about this to see if we can resolve it without requiring change to existing end-user code.

Was this page helpful?
0 / 5 - 0 ratings