Elixir 1.8.1phoenix 1.3.4v8.15.16.4.1GKE/k8s, Docker (Elixir 1.8 image)No errors.
We recently moved from Heroku to Kubernetes (via GKE). We're using rolling updates, with libcluster to connect the Erlang nodes. When the app first launches, we get a flurry of :ets_lookup/2 errors, with the arguments XXXWeb.Presence and :pool_size. Via Sentry, the breadcrumbs like this:
ets in :ets.lookup/2
arg0 | XXXWeb.Presence
arg1 | :pool_size
lib/phoenix/tracker.ex in Phoenix.Tracker.pool_size/1 at line 279
lib/phoenix/tracker.ex in Phoenix.Tracker.track/5 at line 140
lib/xxx_web/channels/xxx_channel.ex in XXXWeb.XXXChannel.handle_info/2 at line 123
lib/phoenix/channel/server.ex in Phoenix.Channel.Server.handle_info/2 at line 258
gen_server.erl in :gen_server.try_dispatch/4 at line 637
gen_server.erl in :gen_server.handle_msg/6 at line 711
proc_lib.erl in :proc_lib.init_p_do_apply/3 at line 249
For context, Presence.track is being called within an :after_join handler (line 123 above).
I'm thinking it's related to the new distributed setup, but I'm not exactly how. Is there a kind of delay with ETS lookup that Presence isn't accounting for? Anything we can do to mitigate this?
Thanks in advance.
Are you positive the MyApp.Presence supervise is started in the sup tree? You also need to ensure that the deployed nodes are on the same phoenix_pubsub versions, and critically, they have the same configured pool_size. We used to default to the number of schedulers, but in recent versions the default is 1. So ensure you are also on the latest phoenix_pubsub on each deployment, and you have not explicitly configured different pool sizes.
I believe we're good on the making sure the supervisor has started (see below).
def start(_type, _args) do
import Supervisor.Spec
# Define workers and child supervisors to be supervised
children = [
# Start the Ecto repository
supervisor(MyApp.Repo, []),
# Start the endpoint when the application starts
supervisor(MyAppWeb.Endpoint, []),
# Start the socket channel presence supervisor
supervisor(MyAppWeb.Presence, []),
# ...
I'm thinking that part is correct, because Presence does work, it's just that we get a flurry of errors when first starting up. Maybe it's the pubsub version issue? Would it make sense that the errors would go away after old containers are drained, and therefore no conflict between versions?
How do I determine the pubsub version and the configured pool_size? I don't think we are explicitly setting those anywhere.
By version, are you referring to the dep version? It appears we are consistently using {:hex, :phoenix_pubsub, "1.1.2", ...} through the last few deployments (according to committed mix.lock).
Sorry for the triple post. Current config.exs looks like this:
config :myapp, MyAppWeb.Endpoint,
pubsub: [name: MyApp.PubSub, adapter: Phoenix.PubSub.PG2]
Should we maybe set an explicit amount moving forward? In doing this, are we accepting that we will get one more round of errors on the next deployment due to a possible mismatch in pool size?
Oh I see – it's only a flurry of errors on boot, then everything is happy and working as expected? It looks like a race between the endpoint starting before the tracker (channels asking to be tracked), before the tracker is up and running. Unfortunately moving the tracker above the Endpoint doesn't help, because we need to have the pubs started first. This is one reason we are decoupling the starting of the PubSub server from the endpoint in the future. To verify if this race is indeed occurring, you can try this in your endpoint config and sup tree:
```elixir
config :myapp, MyAppWeb.Endpoint,
pubsub: [name: MyApp.PubSub] # note, adapter has been removed so it won't be started
children = [
supervisor(MyApp.Repo, []),
supervisor(MyApp.Pubsub,
name: MyApp.PubSub,
adapter: Phoenix.PubSub.PG2,
fastlane: Phoenix.Channel.Server
),
supervisor(MyAppWeb.Presence, []),
supervisor(MyAppWeb.Endpoint, [])
]
Yeah, only a flurry of errors on boot, then everything works as expected. Sounds like you don't think it's a version or pool size issue, correct?
For additional context, we currently have two containers being load balanced, with a max surge of 1. So the logs show ~10 ets_lookup errors over the span of 1 second, then about 20s later, another ~10 ets_lookup errors over the span of a second (the second container?).
Hacky, but would preceding send(self(), :after_join) (which calls Presence.track) with a sleep also test the race theory?
Sounds like you don't think it's a version or pool size issue, correct?
Correct. I believe the race is the way the sup tree is started and you may experience it under high load when the instance first boots. Sleeping is a good test, so if you are able to change the sup tree and report back, that would be very helpful. Thanks!
Oops, I mean't to stay, sleeping is not a good test
I'm getting module not available for the supervisor module. Am I missing something?
It looks like I fat-fingered MyAppWeb.PubSub and it should be MyApp.PubSub. Can you verify the mod names?
Same issue. I'm using the same module name referenced in the config (eg. pubsub: [name: MyApp.PubSub]) and in the Presence module (eg. pubsub_server: MyApp.PubSub). Correct?
Derp, sorry! It should be:
supervisor(Phoenix.PubSub.PG2,
name: MyApp.PubSub,
fastlane: Phoenix.Channel.Server
),
No worries, but now I'm getting:
failed to start child: Phoenix.PubSub.PG2
(FunctionClauseError) no function clause matching in Module.concat/2
(elixir) lib/module.ex:685: Module.concat({:name, MyApp.PubSub}, Supervisor)
(phoenix_pubsub) lib/phoenix/pubsub/pg2.ex:50: Phoenix.PubSub.PG2.start_link/2
Bah, I was mixing the new child spec style. This will work, I promise :)
supervisor(Phoenix.PubSub.PG2, [[
name: MyApp.PubSub,
fastlane: Phoenix.Channel.Server
]]),
Okay, so no compile issues with that. I deployed it, and I got the same flurry of errors. But here's the thing (and I totally should have caught this before), but the errors occur on the _draining_ nodes, not the new ones!
For example, I just pushed v7.12.0 of our app, and Sentry logged the ets_lookup errors for v7.11.0. Since the errors always happens when I deploy, and the errors seem to trace back to _joining_ a channel (not disconnecting), I (wrongfully) thought it was the new servers booting up that were throwing the errors. But no, it's old ones coming down.
In looking at the k8s logs and comparing to the Sentry timestamps, it appears the sequence goes like this:
v2 server C ↑ no errors
v1 server A ↓ flurry of ets_lookup errors
v2 server D ↑ no errors
v1 server B ↓ no errors
Any idea what would cause that?
Update.
So I just pushed a patch (v7.12.1), and NO errors! v7.12.0 was the first time I tried your fix, and as I've since realized that it's during k8s _draining_ when the errors occur, then it appears your fix worked! The weird thing is that it seems like the app is actually booting up during the draining process!? Regardless, I think that's another topic. It seems your fix worked.
Since your fix isn't the default, are there any caveats / quirks I should be aware of?
No caveats whatsoever. The Endpoint simply adds the supervisor as a child itself as shown above. Since you've now narrowed down the issue to be happening while draining, I'm not entirely sure where the problem is since we add the pubsub sup above the webserver child, so shutting things down in reverse order should not allow new connections after the pubsub server is down.
Yeah, I'm not really sure what's causing it. The app seems to be "starting" while the pod is draining. I'm not enough of a Kubernetes wizard to know how the draining process works, but that seems odd to me. Either way, the steps above fix the flurry of errors, so thank you! If I figure out what causes it, I'll post back here for future generations.