Phoenix: :pool_size ets.lookup in Tracker

Created on 6 May 2019 · 18Comments · Source: phoenixframework/phoenix

Environment

Elixir version (elixir -v): Elixir 1.8.1
Phoenix version (mix deps): phoenix 1.3.4
NodeJS version (node -v): v8.15.1
NPM version (npm -v): 6.4.1
Operating system: GKE/k8s, Docker (Elixir 1.8 image)

Expected behavior

No errors.

Actual behavior

We recently moved from Heroku to Kubernetes (via GKE). We're using rolling updates, with libcluster to connect the Erlang nodes. When the app first launches, we get a flurry of :ets_lookup/2 errors, with the arguments XXXWeb.Presence and :pool_size. Via Sentry, the breadcrumbs like this:

ets in :ets.lookup/2
arg0 | XXXWeb.Presence
arg1 | :pool_size

lib/phoenix/tracker.ex in Phoenix.Tracker.pool_size/1 at line 279
lib/phoenix/tracker.ex in Phoenix.Tracker.track/5 at line 140
lib/xxx_web/channels/xxx_channel.ex in XXXWeb.XXXChannel.handle_info/2 at line 123
lib/phoenix/channel/server.ex in Phoenix.Channel.Server.handle_info/2 at line 258
gen_server.erl in :gen_server.try_dispatch/4 at line 637
gen_server.erl in :gen_server.handle_msg/6 at line 711
proc_lib.erl in :proc_lib.init_p_do_apply/3 at line 249

For context, Presence.track is being called within an :after_join handler (line 123 above).

I'm thinking it's related to the new distributed setup, but I'm not exactly how. Is there a kind of delay with ETS lookup that Presence isn't accounting for? Anything we can do to mitigate this?

Thanks in advance.

Source

bjunc

👍1

All 18 comments

Are you positive the MyApp.Presence supervise is started in the sup tree? You also need to ensure that the deployed nodes are on the same phoenix_pubsub versions, and critically, they have the same configured pool_size. We used to default to the number of schedulers, but in recent versions the default is 1. So ensure you are also on the latest phoenix_pubsub on each deployment, and you have not explicitly configured different pool sizes.

chrismccord on 6 May 2019

I believe we're good on the making sure the supervisor has started (see below).

def start(_type, _args) do
    import Supervisor.Spec

    # Define workers and child supervisors to be supervised
    children = [
      # Start the Ecto repository
      supervisor(MyApp.Repo, []),

      # Start the endpoint when the application starts
      supervisor(MyAppWeb.Endpoint, []),

      # Start the socket channel presence supervisor
      supervisor(MyAppWeb.Presence, []),

      # ...

I'm thinking that part is correct, because Presence does work, it's just that we get a flurry of errors when first starting up. Maybe it's the pubsub version issue? Would it make sense that the errors would go away after old containers are drained, and therefore no conflict between versions?

How do I determine the pubsub version and the configured pool_size? I don't think we are explicitly setting those anywhere.

bjunc on 6 May 2019

By version, are you referring to the dep version? It appears we are consistently using {:hex, :phoenix_pubsub, "1.1.2", ...} through the last few deployments (according to committed mix.lock).

bjunc on 6 May 2019

Sorry for the triple post. Current config.exs looks like this:

config :myapp, MyAppWeb.Endpoint,
  pubsub: [name: MyApp.PubSub, adapter: Phoenix.PubSub.PG2]

Should we maybe set an explicit amount moving forward? In doing this, are we accepting that we will get one more round of errors on the next deployment due to a possible mismatch in pool size?

bjunc on 6 May 2019

Oh I see – it's only a flurry of errors on boot, then everything is happy and working as expected? It looks like a race between the endpoint starting before the tracker (channels asking to be tracked), before the tracker is up and running. Unfortunately moving the tracker above the Endpoint doesn't help, because we need to have the pubs started first. This is one reason we are decoupling the starting of the PubSub server from the endpoint in the future. To verify if this race is indeed occurring, you can try this in your endpoint config and sup tree:

```elixir
config :myapp, MyAppWeb.Endpoint,
pubsub: [name: MyApp.PubSub] # note, adapter has been removed so it won't be started

application.ex

children = [
  supervisor(MyApp.Repo, []),
  supervisor(MyApp.Pubsub,
    name: MyApp.PubSub,
    adapter: Phoenix.PubSub.PG2,
    fastlane: Phoenix.Channel.Server
  ),
  supervisor(MyAppWeb.Presence, []),
  supervisor(MyAppWeb.Endpoint, [])
]

chrismccord on 6 May 2019

👍1

Yeah, only a flurry of errors on boot, then everything works as expected. Sounds like you don't think it's a version or pool size issue, correct?

For additional context, we currently have two containers being load balanced, with a max surge of 1. So the logs show ~10 ets_lookup errors over the span of 1 second, then about 20s later, another ~10 ets_lookup errors over the span of a second (the second container?).

Hacky, but would preceding send(self(), :after_join) (which calls Presence.track) with a sleep also test the race theory?

bjunc on 6 May 2019

Sounds like you don't think it's a version or pool size issue, correct?

Correct. I believe the race is the way the sup tree is started and you may experience it under high load when the instance first boots. Sleeping is a good test, so if you are able to change the sup tree and report back, that would be very helpful. Thanks!

chrismccord on 6 May 2019

Oops, I mean't to stay, sleeping is not a good test

chrismccord on 6 May 2019

I'm getting module not available for the supervisor module. Am I missing something?

bjunc on 6 May 2019

It looks like I fat-fingered MyAppWeb.PubSub and it should be MyApp.PubSub. Can you verify the mod names?

chrismccord on 6 May 2019

Same issue. I'm using the same module name referenced in the config (eg. pubsub: [name: MyApp.PubSub]) and in the Presence module (eg. pubsub_server: MyApp.PubSub). Correct?

bjunc on 6 May 2019

Derp, sorry! It should be:

supervisor(Phoenix.PubSub.PG2,
  name: MyApp.PubSub,
  fastlane: Phoenix.Channel.Server
),

chrismccord on 6 May 2019

👍1

No worries, but now I'm getting:

failed to start child: Phoenix.PubSub.PG2
(FunctionClauseError) no function clause matching in Module.concat/2
(elixir) lib/module.ex:685: Module.concat({:name, MyApp.PubSub}, Supervisor)
(phoenix_pubsub) lib/phoenix/pubsub/pg2.ex:50: Phoenix.PubSub.PG2.start_link/2

bjunc on 7 May 2019

Bah, I was mixing the new child spec style. This will work, I promise :)

      supervisor(Phoenix.PubSub.PG2, [[
        name: MyApp.PubSub,
        fastlane: Phoenix.Channel.Server
      ]]),

chrismccord on 7 May 2019

👍1

Okay, so no compile issues with that. I deployed it, and I got the same flurry of errors. But here's the thing (and I totally should have caught this before), but the errors occur on the _draining_ nodes, not the new ones!

For example, I just pushed v7.12.0 of our app, and Sentry logged the ets_lookup errors for v7.11.0. Since the errors always happens when I deploy, and the errors seem to trace back to _joining_ a channel (not disconnecting), I (wrongfully) thought it was the new servers booting up that were throwing the errors. But no, it's old ones coming down.

In looking at the k8s logs and comparing to the Sentry timestamps, it appears the sequence goes like this:

v2 server C ↑ no errors
v1 server A ↓ flurry of ets_lookup errors
v2 server D ↑ no errors
v1 server B ↓ no errors

Any idea what would cause that?

bjunc on 11 May 2019

Update.

So I just pushed a patch (v7.12.1), and NO errors! v7.12.0 was the first time I tried your fix, and as I've since realized that it's during k8s _draining_ when the errors occur, then it appears your fix worked! The weird thing is that it seems like the app is actually booting up during the draining process!? Regardless, I think that's another topic. It seems your fix worked.

Since your fix isn't the default, are there any caveats / quirks I should be aware of?

bjunc on 12 May 2019

No caveats whatsoever. The Endpoint simply adds the supervisor as a child itself as shown above. Since you've now narrowed down the issue to be happening while draining, I'm not entirely sure where the problem is since we add the pubsub sup above the webserver child, so shutting things down in reverse order should not allow new connections after the pubsub server is down.

chrismccord on 13 May 2019

Yeah, I'm not really sure what's causing it. The app seems to be "starting" while the pod is draining. I'm not enough of a Kubernetes wizard to know how the draining process works, but that seems odd to me. Either way, the steps above fix the flurry of errors, so thank you! If I figure out what causes it, I'll post back here for future generations.

bjunc on 16 May 2019

Was this page helpful?

0 / 5 - 0 ratings