When updating many targets of the same upstream concurrently, there is a chance that some workers enter a bad local state and cannot recover; they keep logging errors such as "timeout waiting for balancer xxx".

2019/10/29 13:52:37 [error] 40#0: *331090 [lua] healthcheck.lua:989: log(): [healthcheck] (*) Error loading initial target list: failed to acquire lock: timeout, client: 127.0.0.1, server: kong_admin, request: "DELETE /upstreams/a70227e9-24e3-4a94-9690-c8296f8adb79/targets/34575e29-5058-488c-8bbb-6e1d9ee431d8 HTTP/1.1", host: "localhost:8444"
/usr/local/share/lua/5.1/kong/api/routes/upstreams.lua:146: in function 'fn'
/usr/local/share/lua/5.1/kong/db/dao/targets.lua:103: in function 'delete_entity'
/usr/local/share/lua/5.1/kong/db/dao/init.lua:867: in function 'insert'
/usr/local/share/lua/5.1/kong/db/dao/init.lua:1062: in function 'post_crud_event'
/usr/local/share/lua/5.1/resty/worker/events.lua:272: in function 'post_local'
/usr/local/share/lua/5.1/resty/worker/events.lua:198: in function 'do_event'
/usr/local/share/lua/5.1/resty/worker/events.lua:166: in function 'do_handlerlist'
/usr/local/share/lua/5.1/kong/runloop/balancer.lua:596: in function 'on_target_event'
/usr/local/share/lua/5.1/kong/runloop/balancer.lua:470: in function 'check_target_history'
/usr/local/share/lua/5.1/kong/runloop/balancer.lua:406: in function 'create_balancer'
/usr/local/share/lua/5.1/kong/runloop/balancer.lua:333: in function 'create_healthchecker'
/usr/local/share/lua/5.1/kong/runloop/balancer.lua:196: in function 'populate_healthchecker'
/usr/local/share/lua/5.1/resty/healthcheck.lua:390: in function 'get_target_status'
"October 29th 2019, 13:52:40.161","2019/10/29 13:52:40 [error] 40#0: *331090 [lua] events.lua:175: do_handlerlist(): worker-events: event callback failed; source=balancer, event=targets, pid=48 error='/usr/local/share/lua/5.1/resty/healthcheck.lua:390: attempt to index field 'targets' (a nil value)"
"October 29th 2019, 13:52:45.164","2019/10/29 13:52:45 [error] 40#0: *331090 [lua] balancer.lua:598: on_target_event(): failed checking target history for *: failed to get from node cache: could not acquire callback lock: timeout, client: 127.0.0.1, server: kong_admin, request: ""DELETE /upstreams/a70227e9-24e3-4a94-9690-c8296f8adb79/targets/34575e29-5058-488c-8bbb-6e1d9ee431d8 HTTP/1.1"", host: ""localhost:8444"""
"October 29th 2019, 13:53:15.317","2019/10/29 13:53:15 [error] 40#0: *331090 [lua] balancer.lua:598: on_target_event(): failed checking target history for *: timeout waiting for balancer for a70227e9-24e3-4a94-9690-c8296f8adb79, client: 127.0.0.1, server: kong_admin, request: ""DELETE /upstreams/a70227e9-24e3-4a94-9690-c8296f8adb79/targets/34575e29-5058-488c-8bbb-6e1d9ee431d8 HTTP/1.1"", host: ""localhost:8444"""
targets=nil is returned, and a nil-index error is then raised inside create_healthchecker (https://github.com/Kong/kong/blob/8ee57d885756322bc89e31fa997132d74491beee/kong/runloop/balancer.lua#L444-L450). Because that error is never caught, creating[upstream.id] = nil on L450 is never executed, so subsequent calls to wait() never see the flag cleared and eventually fail with "timeout waiting for balancer".
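To illustrate the failure mode, here is a minimal, self-contained sketch (illustrative Lua only; the names mirror kong/runloop/balancer.lua but this is not the actual source):

```lua
-- The guard flag is set before work that can raise an error, and is only
-- cleared on the success path, so a single hard error wedges later callers.
local creating = {}

local function create_healthchecker(upstream)
  -- stands in for the real healthchecker setup; simulate the hard error
  -- seen in the logs above
  error("attempt to index field 'targets' (a nil value)")
end

local function create_balancer(upstream)
  creating[upstream.id] = true     -- "a balancer for this upstream is being built"

  create_healthchecker(upstream)   -- raises, so the next line never runs

  creating[upstream.id] = nil      -- never reached on the error path
end

-- other requests poll the flag and eventually give up with
-- "timeout waiting for balancer ..."
local function balancer_ready(upstream)
  return creating[upstream.id] == nil
end

local upstream = { id = "a70227e9-24e3-4a94-9690-c8296f8adb79" }
print(pcall(create_balancer, upstream))  --> false   ...: attempt to index field 'targets' (a nil value)
print(balancer_ready(upstream))          --> false, and it stays false forever
```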
Actionable items:
- add error handling to the code that runs between creating[upstream.id] = true and creating[upstream.id] = nil, so an error there does not leave the flag set forever

@fffonion Some amazing work in debugging here! Thank you for getting to the bottom of this!
@locao Since @fffonion's PR was merged to lua-resty-healthcheck, can we close this one now?
Looks like it, so I'm closing this one for now! (Please reopen if there's anything else pending.) @fffonion, great work tracking this down!
@hishamhm One last thing before closing this loop : )
In create_healthchecker, where a lua-resty-healthcheck instance is created, the error is returned (https://github.com/Kong/kong/blame/master/kong/runloop/balancer.lua#L360) but not caught. It would be awesome if we could fail gracefully in this case (and it might be even better to add error handling to the other functions between creating[upstream.id] = true and creating[upstream.id] = nil).
@fffonion thanks for the heads up! we'll make those additions!
@fffonion I was in the process of making these changes when I remembered the reason why we're not catching the error there: a balancer shouldn't fail to build just because a healthchecker can't be built, for whatever reason. That is, if Kong can load-balance traffic but cannot health-check it, it should still load-balance. An error creating the healthchecker should be logged, but it should not cause create_balancer to fail.
There is a fix to be made there to release creating correctly though, so I'm making a PR to address that!
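A minimal sketch of what such a fix could look like, given the two points above (illustrative names only, not the actual PR; ngx.log assumes an OpenResty/Kong worker):

```lua
-- Sketch: a healthchecker error is logged but does not fail the balancer,
-- and the `creating` guard is released on every path.
local creating = {}

-- stub standing in for the real healthchecker creation, which may raise
local function create_healthchecker(balancer, upstream)
  error("failed to acquire lock: timeout")
end

local function create_balancer(upstream)
  creating[upstream.id] = true

  local balancer = {}  -- stands in for the real balancer object

  local ok, err = pcall(create_healthchecker, balancer, upstream)
  if not ok then
    -- Kong can still load-balance without health checks, so just log
    ngx.log(ngx.ERR, "failed creating healthchecker for ", upstream.name, ": ", err)
  end

  creating[upstream.id] = nil  -- released whether or not the healthchecker failed
  return balancer
end
```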
@hishamhm That makes sense to me. Thanks for the explanation and for taking care of this! I'll look at the linked PR shortly.