Kong: Broken worker local state while handling concurrent requests to same upstream

Created on 31 Oct 2019 · 8 comments · Source: Kong/kong

Summary

While updating many targets for the same upstream concurrently,
there is a possibility that some workers enter a bad local state and cannot recover.

Steps To Reproduce

  1. Start a Kong cluster with 20 workers serving the admin API, using a Postgres database.
  2. Send 20 concurrent requests to delete targets belonging to the same upstream.
  3. One or more workers throw a stack trace (attached below), and any further CRUD request to that
    worker that updates that upstream will always fail (errored with timeout waiting for balancer xxx).

Additional Details & Logs

  • Kong version: 1.2.1
  • Kong error logs:
"October 29th 2019, 13:52:37.148","2019/10/29 13:52:37 [error] 40#0: *331090 [lua] healthcheck.lua:989: log(): [healthcheck] (*) Error loading initial target list: failed to acquire lock: timeout, client: 127.0.0.1, server: kong_admin, request: ""DELETE /upstreams/a70227e9-24e3-4a94-9690-c8296f8adb79/targets/34575e29-5058-488c-8bbb-6e1d9ee431d8 HTTP/1.1"", host: ""localhost:8444"""

October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/kong/api/routes/upstreams.lua:146: in function 'fn'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/kong/db/dao/targets.lua:103: in function 'delete_entity'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/kong/db/dao/init.lua:867: in function 'insert'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/kong/db/dao/init.lua:1062: in function 'post_crud_event'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/resty/worker/events.lua:272: in function 'post_local'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/resty/worker/events.lua:198: in function 'do_event'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/resty/worker/events.lua:166: in function 'do_handlerlist'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/kong/runloop/balancer.lua:596: in function 'on_target_event'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/kong/runloop/balancer.lua:470: in function 'check_target_history'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/kong/runloop/balancer.lua:406: in function 'create_balancer'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/kong/runloop/balancer.lua:333: in function 'create_healthchecker'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/kong/runloop/balancer.lua:196: in function 'populate_healthchecker'
October 29th 2019, 21:52:40.161  /usr/local/share/lua/5.1/resty/healthcheck.lua:390: in function 'get_target_status'

"October 29th 2019, 13:52:40.161","2019/10/29 13:52:40 [error] 40#0: *331090 [lua] events.lua:175: do_handlerlist(): worker-events: event callback failed; source=balancer, event=targets, pid=48 error='/usr/local/share/lua/5.1/resty/healthcheck.lua:390: attempt to index field 'targets' (a nil value)"
"October 29th 2019, 13:52:45.164","2019/10/29 13:52:45 [error] 40#0: *331090 [lua] balancer.lua:598: on_target_event(): failed checking target history for *:  failed to get from node cache: could not acquire callback lock: timeout, client: 127.0.0.1, server: kong_admin, request: ""DELETE /upstreams/a70227e9-24e3-4a94-9690-c8296f8adb79/targets/34575e29-5058-488c-8bbb-6e1d9ee431d8 HTTP/1.1"", host: ""localhost:8444"""
"October 29th 2019, 13:53:15.317","2019/10/29 13:53:15 [error] 40#0: *331090 [lua] balancer.lua:598: on_target_event(): failed checking target history for *:  timeout waiting for balancer for a70227e9-24e3-4a94-9690-c8296f8adb79, client: 127.0.0.1, server: kong_admin, request: ""DELETE /upstreams/a70227e9-24e3-4a94-9690-c8296f8adb79/targets/34575e29-5058-488c-8bbb-6e1d9ee431d8 HTTP/1.1"", host: ""localhost:8444"""
  • Because concurrent requests for the same upstream are balanced to different workers, each worker competes for the same lock.
  • Since the timeout for acquiring the lock is 5 seconds, running 20 workers means each worker needs to acquire and release the lock within 250ms on average.
  • There is a possibility that an unlucky worker times out acquiring the lock and goes down this branch (https://github.com/Kong/lua-resty-healthcheck/blob/38a9802e090824f5fc727993d5cb29f3b4837b13/lib/resty/healthcheck.lua#L1326-L1328).
  • A healthchecker table with targets=nil is returned, and a nil-index error is then raised in create_healthchecker (https://github.com/Kong/kong/blob/8ee57d885756322bc89e31fa997132d74491beee/kong/runloop/balancer.lua#L444-L450), so creating[upstream.id] = nil on L450 is never executed.
  • Because the flag is never cleared, any following request to this worker that updates this upstream will always fail in wait() (see the sketch after this list).
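
The stuck flag can be pictured with a minimal sketch. All names below (creating, wait, build_balancer_and_healthchecker) are simplified stand-ins for the logic in kong/runloop/balancer.lua, not the actual Kong code:

```lua
-- Illustrative sketch only; not the real Kong implementation.

local creating = {}   -- per-worker "balancer creation in progress" flags

-- Stand-in for the real balancer/healthchecker construction; it simply
-- simulates the crash seen in the logs above.
local function build_balancer_and_healthchecker(upstream)
  error("attempt to index field 'targets' (a nil value)")
end

-- Other requests poll the flag until the creator finishes, or give up.
local function wait(upstream_id)
  local deadline = ngx.now() + 5
  while creating[upstream_id] do
    ngx.sleep(0.01)
    if ngx.now() > deadline then
      return nil, "timeout waiting for balancer for " .. upstream_id
    end
  end
  return true
end

local function create_balancer(upstream)
  if creating[upstream.id] then
    return wait(upstream.id)      -- another request is already building it
  end

  creating[upstream.id] = true

  -- If this call throws (e.g. get_target_status() indexing a nil 'targets'
  -- field after the shm lock timed out), execution never reaches the
  -- cleanup line below.
  local balancer = build_balancer_and_healthchecker(upstream)

  creating[upstream.id] = nil     -- skipped on error, so every later request
                                  -- for this upstream times out in wait()
  return balancer
end
```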


All 8 comments

Actionable items:

  • The issue originally came from a user of the Ingress Controller, which under the hood issues concurrent requests to Kong. @hbagdi confirmed a couple of weeks ago that official deployments already set the worker count on admin nodes to 1 by default.
  • Fix the healthcheck library in https://github.com/Kong/lua-resty-healthcheck/pull/34
  • Be protective around the foreign function calls made between creating[upstream.id] = true and creating[upstream.id] = nil (see the sketch below).
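
One possible shape of that last item, reusing the simplified stand-in names from the earlier sketch (again hypothetical, not actual Kong code): wrapping the foreign calls in pcall guarantees the in-progress flag is released even when a callee throws.

```lua
-- Hypothetical defensive variant: the flag is always cleared, and a thrown
-- error is converted into a (nil, err) return value.
local function create_balancer_protected(upstream)
  creating[upstream.id] = true

  local ok, result = pcall(build_balancer_and_healthchecker, upstream)

  creating[upstream.id] = nil     -- always runs, even if the pcall failed

  if not ok then
    local err = "failed creating balancer for " .. upstream.id .. ": " .. tostring(result)
    return nil, err
  end
  return result
end
```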

@fffonion Some amazing work in debugging here! Thank you for getting to the bottom of this!

@locao Since @fffonion's PR was merged into lua-resty-healthcheck, can we close this one now?

Looks like it, so I'm closing this one for now! (Please reopen if there's anything else pending.) @fffonion, great work tracking this down!

@hishamhm One last thing before closing this loop : )
In create_healthchecker, where a lua-resty-healthcheck instance is created, the error is returned (https://github.com/Kong/kong/blame/master/kong/runloop/balancer.lua#L360) but not caught. It would be awesome if we could fail gracefully in this case (and it might be even better to add error handling to the other functions called between creating[upstream.id] = true and creating[upstream.id] = nil).

@fffonion thanks for the heads up! we'll make those additions!

@fffonion I was in the process of making these changes and then I remembered the reason why we're not catching the error there. A balancer shouldn't fail to build if a healthchecker can't be built for any reason. That is, if Kong can load-balance traffic but cannot health-check it, it should still load-balance. An error creating the healthchecker should be logged, but not cause create_balancer to fail.

There is a fix to be made there to release creating correctly, though, so I'm making a PR to address that!
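
A rough illustration of that behaviour, reusing the simplified names from the earlier sketches (new_balancer, the error-return conventions and the log messages are assumptions here, not the actual PR):

```lua
-- Rough sketch, not the actual fix: a healthchecker failure is logged and
-- swallowed, the balancer is still returned, and the flag is still released.
local function create_balancer_log_and_continue(upstream)
  creating[upstream.id] = true

  local balancer = new_balancer(upstream)   -- assumed to succeed for brevity

  -- create_healthchecker is assumed to either throw or return nil plus an error
  local pok, hc, err = pcall(create_healthchecker, balancer, upstream)
  if not pok then
    ngx.log(ngx.ERR, "error creating healthchecker for ", upstream.name, ": ", tostring(hc))
  elseif not hc then
    ngx.log(ngx.ERR, "failed creating healthchecker for ", upstream.name, ": ", tostring(err))
  end
  -- no early return: load balancing continues without active health checks

  creating[upstream.id] = nil
  return balancer
end
```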

@hishamhm That makes sense to me. Thanks for the explanation and for taking care of this! I'll look at the linked PR shortly.
