Celery worker getting stuck and consuming a lot of resident memory.
Version: celery 3.1.17
Strace
celery]# strace -p 8401
Process 8401 attached - interrupt to quit
read(10,
celery]# lsof -n -p 8401 | egrep -v '(DIR|REG)'
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 8401 dsl 0r FIFO 0,8 0t0 124716100 pipe
python 8401 dsl 1w FIFO 0,8 0t0 124716101 pipe
python 8401 dsl 2w FIFO 0,8 0t0 124716101 pipe
python 8401 dsl 6r FIFO 0,8 0t0 124716462 pipe
python 8401 dsl 7w FIFO 0,8 0t0 124716462 pipe
python 8401 dsl 8r FIFO 0,8 0t0 124716463 pipe
python 8401 dsl 9w FIFO 0,8 0t0 124716463 pipe
python 8401 dsl 10r FIFO 0,8 0t0 124716464 pipe
python 8401 dsl 13w FIFO 0,8 0t0 124716465 pipe
python 8401 dsl 14r FIFO 0,8 0t0 124716466 pipe
python 8401 dsl 15r CHR 1,3 0t0 3662 /dev/null
python 8401 dsl 16w FIFO 0,8 0t0 124716467 pipe
Pstack dump
celery]# pstack 8401
It's stuck on a pipe read.
proc]# ls -l /proc/8401/fd
total 0
lr-x------ 1 dsl dsl 64 May 5 17:26 0 -> pipe:[124716100]
l-wx------ 1 dsl dsl 64 May 5 17:26 1 -> pipe:[124716101]
lr-x------ 1 dsl dsl 64 May 5 17:26 10 -> pipe:[124716464]
l-wx------ 1 dsl dsl 64 May 5 17:26 13 -> pipe:[124716465]
@girishbin
Are you using redis as the broker? Seeing a similar symptom with the redis broker on celery 3.1.8 and billiard 3.3.0.16. No high memory consumption though.
Same here. @joostdevries it happens quite often to us; hard to say under what conditions. We have 4 workers using the redis backend.
Workers' log before they get stuck:
[2015-08-07 16:50:40,140: INFO/MainProcess] Task feeds.transformers.rss_atom.by[6dbd5f0d-222b-4c5c-bd22-5e05bb63447b] succeeded in 0.0153002970037s: {}
[2015-08-07 16:50:40,141: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[8a582425-456d-4e49-93c8-eb375967cac5]
[2015-08-07 16:50:40,155: INFO/MainProcess] Task feeds.transformers.rss_atom.by[3f60a721-2a6e-4494-bda7-b7b939efe66a] succeeded in 0.00693402900652s: {}
[2015-08-07 16:50:40,157: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[486cc62d-d330-467f-b30c-02005c2038b6]
[2015-08-07 16:50:40,171: INFO/MainProcess] Task feeds.transformers.rss_atom.by[8a582425-456d-4e49-93c8-eb375967cac5] succeeded in 0.0071912699932s: {}
[2015-08-07 16:50:40,173: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[538a71a6-0c33-494a-adfe-575f7465e9d4]
[2015-08-07 16:50:40,188: INFO/MainProcess] Task feeds.transformers.rss_atom.by[486cc62d-d330-467f-b30c-02005c2038b6] succeeded in 0.0155014329939s: {}
[2015-08-07 16:50:40,189: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[2ad9294a-7553-468a-a3dc-73a8f2cea188]
[2015-08-07 16:50:40,203: INFO/MainProcess] Task feeds.transformers.rss_atom.by[538a71a6-0c33-494a-adfe-575f7465e9d4] succeeded in 0.0153862849984s: {}
[2015-08-07 16:50:40,205: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[f7371090-aa05-4194-a327-cb41d1165b7e]
[2015-08-07 16:50:40,220: INFO/MainProcess] Task feeds.transformers.rss_atom.by[2ad9294a-7553-468a-a3dc-73a8f2cea188] succeeded in 0.0158518639946s: {}
[2015-08-07 16:50:40,222: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[6d35c0b2-9c5a-425f-9405-9f7e1fb3aa41]
[2015-08-07 16:50:40,236: INFO/MainProcess] Task feeds.transformers.rss_atom.by[f7371090-aa05-4194-a327-cb41d1165b7e] succeeded in 0.00751440098975s: {}
[2015-08-07 16:50:40,238: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[ed664157-a065-4b15-9bbe-633edf96d230]
[2015-08-07 16:50:40,252: INFO/MainProcess] Task feeds.transformers.rss_atom.by[6d35c0b2-9c5a-425f-9405-9f7e1fb3aa41] succeeded in 0.00709322700277s: {}
[2015-08-07 16:50:40,254: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[683bc593-44c1-4145-b698-1f2ba66a43bd]
[2015-08-07 16:50:40,260: INFO/MainProcess] Task feeds.transformers.rss_atom.by[ed664157-a065-4b15-9bbe-633edf96d230] succeeded in 0.015573162993s: {}
[2015-08-07 16:50:40,275: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[d8018f2b-a1f0-4112-97ba-335cd597be1c]
[2015-08-07 16:50:40,282: INFO/MainProcess] Task feeds.transformers.rss_atom.by[683bc593-44c1-4145-b698-1f2ba66a43bd] succeeded in 0.0205264859978s: {}
[2015-08-07 16:50:40,292: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[92f938ce-894a-49f4-8c89-a090c37b71c8]
[2015-08-07 16:50:40,299: INFO/MainProcess] Task feeds.transformers.rss_atom.by[d8018f2b-a1f0-4112-97ba-335cd597be1c] succeeded in 0.016343981988s: {}
[2015-08-07 16:50:40,302: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[ff12882d-61b8-403e-97ac-b096580de5f0]
[2015-08-07 16:50:40,318: INFO/MainProcess] Task feeds.transformers.rss_atom.by[92f938ce-894a-49f4-8c89-a090c37b71c8] succeeded in 0.0184662840038s: {}
[2015-08-07 16:50:40,330: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[3e0ff326-cd83-4d88-8dee-f9c167a44f6a]
[2015-08-07 16:50:40,338: INFO/MainProcess] Task feeds.transformers.rss_atom.by[ff12882d-61b8-403e-97ac-b096580de5f0] succeeded in 0.0187683639961s: {}
[2015-08-07 16:50:40,341: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[f12efd44-b2be-42fd-8872-5c979421bea3]
[2015-08-07 16:50:40,357: INFO/MainProcess] Task feeds.transformers.rss_atom.by[3e0ff326-cd83-4d88-8dee-f9c167a44f6a] succeeded in 0.0182138609962s: {}
[2015-08-07 16:50:40,359: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[57d8ba02-db1f-40d4-88f9-1d3d83939c2a]
[2015-08-07 16:50:40,374: INFO/MainProcess] Task feeds.transformers.rss_atom.by[f12efd44-b2be-42fd-8872-5c979421bea3] succeeded in 0.0165290409932s: {}
[2015-08-07 16:50:40,385: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[8ae76799-d19d-4acf-b2b4-cfe9d77e3000]
[2015-08-07 16:50:40,393: INFO/MainProcess] Task feeds.transformers.rss_atom.by[57d8ba02-db1f-40d4-88f9-1d3d83939c2a] succeeded in 0.0185826079978s: {}
[2015-08-07 16:50:40,405: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[ca5e9d23-f660-4f88-b7d1-d30b91750727]
[2015-08-07 16:50:40,414: INFO/MainProcess] Task feeds.transformers.rss_atom.by[8ae76799-d19d-4acf-b2b4-cfe9d77e3000] succeeded in 0.0205218030023s: {}
[2015-08-07 16:50:40,417: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[5cfdbbe8-3da1-45b5-b83e-599303ad02eb]
[2015-08-07 16:50:40,433: INFO/MainProcess] Task feeds.transformers.rss_atom.by[ca5e9d23-f660-4f88-b7d1-d30b91750727] succeeded in 0.0175381449953s: {}
[2015-08-07 16:50:40,436: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[cb864c28-b3c5-4def-a5de-8c15049c6d29]
[2015-08-07 16:50:40,453: INFO/MainProcess] Task feeds.transformers.rss_atom.by[5cfdbbe8-3da1-45b5-b83e-599303ad02eb] succeeded in 0.0190771400084s: {}
[2015-08-07 16:50:40,456: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[cc85e444-7969-45ed-98a4-f9ab074db260]
[2015-08-07 16:50:40,473: INFO/MainProcess] Task feeds.transformers.rss_atom.by[cb864c28-b3c5-4def-a5de-8c15049c6d29] succeeded in 0.0191195910011s: {}
[2015-08-07 16:50:40,476: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[4696c784-819d-4278-87fd-b74c2aab4c57]
[2015-08-07 16:50:40,491: INFO/MainProcess] Task feeds.transformers.rss_atom.by[cc85e444-7969-45ed-98a4-f9ab074db260] succeeded in 0.0098425810138s: {}
[2015-08-07 16:50:40,494: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[bce5ad3e-dd4a-4bbc-bb69-9585597ae010]
[2015-08-07 16:50:40,501: INFO/MainProcess] Task feeds.transformers.rss_atom.by[4696c784-819d-4278-87fd-b74c2aab4c57] succeeded in 0.0174191809929s: {}
[2015-08-07 16:50:40,512: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[cfefb0bb-399c-423e-9032-add27dabd6df]
[2015-08-07 16:50:40,526: INFO/MainProcess] Task feeds.transformers.rss_atom.by[bce5ad3e-dd4a-4bbc-bb69-9585597ae010] succeeded in 0.0162074130058s: {}
[2015-08-07 16:50:40,528: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[51f11154-3dbc-448a-a3a4-00c2fe7ab782]
[2015-08-07 16:50:40,543: INFO/MainProcess] Task feeds.transformers.rss_atom.by[cfefb0bb-399c-423e-9032-add27dabd6df] succeeded in 0.0163563999959s: {}
[2015-08-07 16:50:40,545: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[fb900a0c-403f-4941-aa6a-f242312e5d85]
[2015-08-07 16:50:40,560: INFO/MainProcess] Task feeds.transformers.rss_atom.by[51f11154-3dbc-448a-a3a4-00c2fe7ab782] succeeded in 0.0162344239943s: {}
[2015-08-07 16:50:40,563: INFO/MainProcess] Received task: feeds.transformers.rss_atom.by[330fff18-71e0-4d9e-9ad1-0111d44f7e03]
No CPU usage; memory usage as follows (output from top):
31059 celery 20 0 517176 64388 3456 S 0.0 1.7 0:14.86 celery
31058 celery 20 0 517204 64308 3456 S 0.0 1.7 0:14.32 celery
31062 celery 20 0 517044 64308 3456 S 0.0 1.7 0:14.88 celery
31061 celery 20 0 516912 63508 3064 S 0.0 1.7 0:14.32 celery
31046 celery 20 0 369344 63396 6964 S 0.0 1.7 1:09.70 celery
16967 celery 20 0 366648 57396 7612 S 0.0 1.6 0:04.04 celery
Will try setting BROKER_TRANSPORT_OPTIONS = {'socket_timeout': 10} and see if it helps.
@domenkozar Did it help?
Maybe it's not:
[root@ip-172-30-0-183 ec2-user]# lsof -p 9253|grep 6379
celery 9253 celery 10u IPv4 2738405 0t0 TCP ip-172-30-0-183.ec2.internal:37155->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 9253 celery 23u IPv4 2737905 0t0 TCP ip-172-30-0-183.ec2.internal:37090->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 9253 celery 33u IPv4 2737956 0t0 TCP ip-172-30-0-183.ec2.internal:37098->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 9253 celery 36u IPv4 2737990 0t0 TCP ip-172-30-0-183.ec2.internal:37105->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 9253 celery 46u IPv4 2739860 0t0 TCP ip-172-30-0-183.ec2.internal:37297->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
[root@ip-172-30-0-183 ec2-user]# lsof -p 9253|grep CLOSE_WAIT
celery 9253 celery 11u IPv4 2739163 0t0 TCP ip-172-30-0-183.ec2.internal:45206->wordpress.com:http (CLOSE_WAIT)
celery 9253 celery 16u IPv4 2737802 0t0 TCP ip-172-30-0-183.ec2.internal:53865->205.251.242.33:https (CLOSE_WAIT)
celery 9253 celery 32u IPv4 2764510 0t0 TCP ip-172-30-0-183.ec2.internal:48242->ec2-52-2-102-195.compute-1.amazonaws.com:http (CLOSE_WAIT)
celery 9253 celery 37u IPv4 2739073 0t0 TCP ip-172-30-0-183.ec2.internal:45198->wordpress.com:http (CLOSE_WAIT)
celery 9253 celery 38u IPv4 2738052 0t0 TCP ip-172-30-0-183.ec2.internal:36962->230-156-220-74-available.ilandcloud.com:http (CLOSE_WAIT)
celery 9253 celery 39u IPv4 2738313 0t0 TCP ip-172-30-0-183.ec2.internal:46067->qb-in-f118.1e100.net:http (CLOSE_WAIT)
celery 9253 celery 40u IPv4 2739202 0t0 TCP ip-172-30-0-183.ec2.internal:43252->r2.ycpi.vip.dcb.yahoo.net:http (CLOSE_WAIT)
celery 9253 celery 42u IPv4 2739382 0t0 TCP ip-172-30-0-183.ec2.internal:45228->wordpress.com:http (CLOSE_WAIT)
celery 9253 celery 43u IPv4 2739488 0t0 TCP ip-172-30-0-183.ec2.internal:38920->wordpress.com:http (CLOSE_WAIT)
celery 9253 celery 45u IPv4 2739667 0t0 TCP ip-172-30-0-183.ec2.internal:57721->ec2-54-165-198-100.compute-1.amazonaws.com:https (CLOSE_WAIT)
Flower is showing all workers as offline, workers appear to be just waiting for a task.
New hypothesis: I think the major culprit was that redis ran out of memory (there are some OOM traces), and the master is unable to send new tasks to workers.
The strace you are seeing looks normal to me; is it not just waiting for more tasks? I guess it's possible that it was sent incomplete data (seeing as there is something in the buffer).
Yes, workers appear to just be waiting for more tasks, so something is going on with the master celery process. The task queue has 100k items.
So I've cleared the celery redis queue, and celery beat on the master process is not scheduling new tasks. Looking at gdb, it's doing what it should:
(gdb) py-list
115 buf.seek(self.bytes_written)
116 marker = 0
117
118 try:
119 while True:
>120 data = self._sock.recv(socket_read_size)
121 # an empty string indicates the server shutdown the socket
122 if isinstance(data, bytes) and len(data) == 0:
123 raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
124 buf.write(data)
125 data_length = len(data)
Aha, found which of the connections was the problem, from CLIENT LIST:
id=497 addr=172.30.0.183:45591 fd=20 name= age=1142 idle=1142 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL
It's interesting that it shows cmd=NULL, and the connection is coming from the master celery worker. @ask any ideas?
So the way I see it, the redis connection pool gets exhausted. If I kill some random connections on the redis side, they go to the CLOSE_WAIT state, but the celery master process keeps them open.
We have many, many small tasks, so tasks are getting picked up really fast.
You can see there are 11 connections, and 2 of them were killed from the redis side with the CLIENT KILL command:
[root@ip-172-30-0-183 ec2-user]# lsof -p 24747|grep TCP
celery 24747 celery 10u IPv4 3619402 0t0 TCP ip-172-30-0-183.ec2.internal:45535->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 11u IPv4 3619425 0t0 TCP ip-172-30-0-183.ec2.internal:45544->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 16u sock 0,6 0t0 3113866 protocol: TCP
celery 24747 celery 34u IPv4 3619406 0t0 TCP ip-172-30-0-183.ec2.internal:45537->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 35u IPv4 3619421 0t0 TCP ip-172-30-0-183.ec2.internal:45543->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 36u IPv4 3619417 0t0 TCP ip-172-30-0-183.ec2.internal:45542->ip-172-30-3-169.ec2.internal:6379 (CLOSE_WAIT)
celery 24747 celery 37u IPv4 3619431 0t0 TCP ip-172-30-0-183.ec2.internal:45545->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 38u IPv4 3619435 0t0 TCP ip-172-30-0-183.ec2.internal:45546->ip-172-30-3-169.ec2.internal:6379 (CLOSE_WAIT)
celery 24747 celery 39u IPv4 3619437 0t0 TCP ip-172-30-0-183.ec2.internal:45547->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 40u IPv4 3703870 0t0 TCP ip-172-30-0-183.ec2.internal:46884->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 41u IPv4 3619441 0t0 TCP ip-172-30-0-183.ec2.internal:45549->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
celery 24747 celery 42u IPv4 3619443 0t0 TCP ip-172-30-0-183.ec2.internal:45550->ip-172-30-3-169.ec2.internal:6379 (ESTABLISHED)
So it's also not the connection pool being exhausted. I've increased it to 20, and currently it got stuck with 8 connections.
This is the traceback that happens when I kill the cmd=NULL connection:
[2015-09-06 11:34:44,336: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/opt/cmgd/venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 278, in start
blueprint.start(self)
File "/opt/cmgd/venv/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
step.start(parent)
File "/opt/cmgd/venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 821, in start
c.loop(*c.loop_args())
File "/opt/cmgd/venv/lib/python2.7/site-packages/celery/worker/loops.py", line 70, in asynloop
next(loop)
File "/opt/cmgd/venv/lib/python2.7/site-packages/kombu/async/hub.py", line 267, in create_loop
tick_callback()
File "/opt/cmgd/venv/lib/python2.7/site-packages/kombu/transport/redis.py", line 942, in on_poll_start
[add_reader(fd, on_readable, fd) for fd in cycle.fds]
File "/opt/cmgd/venv/lib/python2.7/site-packages/kombu/async/hub.py", line 201, in add_reader
return self.add(fds, callback, READ | ERR, args)
File "/opt/cmgd/venv/lib/python2.7/site-packages/kombu/async/hub.py", line 152, in add
self.poller.register(fd, flags)
File "/opt/cmgd/venv/lib/python2.7/site-packages/kombu/utils/eventio.py", line 78, in register
self._epoll.register(fd, events)
IOError: [Errno 9] Bad file descriptor
[2015-09-06 11:34:44,339: WARNING/MainProcess] Restoring 1 unacknowledged message(s).
[2015-09-06 11:34:44,398: INFO/MainProcess] Connected to redis://mas-re-13j3y6r7dd9ui.vfavzi.0001.use1.cache.am
I think I found a way for everyone to reproduce this: schedule a few thousand dummy tasks that don't do anything.
4 workers process around 170 tasks/s and then get stuck.
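For illustration, a minimal reproduction along those lines might look like this (module name, broker URL and task count are assumptions, not something from this thread):
from celery import Celery

app = Celery('repro', broker='redis://localhost:6379/0')

@app.task
def noop():
    # Dummy task that does nothing.
    return None

if __name__ == '__main__':
    # Enqueue a few thousand no-op tasks as fast as possible.
    for _ in range(5000):
        noop.delay()
Run a few workers against it (for example: celery -A repro worker -c 4) and watch whether they stall after some tens of thousands of tasks.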
And flower graphs showing the hang: http://i.imgur.com/1q0A8DN.png
We're seeing the same problems here.
RabbitMQ does the job just fine at the moment.
We prefer redis for performance reasons.
Our workers hang after a short period with redis; no error messages or anything that looks suspicious.
Setting socket_timeout did not resolve the issue.
@domenkozar What configuration are you using? I execute 100k tasks using redis often with the stress test suite.
If it's consuming a lot of memory, do you have the fanout_prefix and fanout_patterns transport options set? http://docs.celeryproject.org/en/latest/getting-started/brokers/redis.html#caveats
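For reference, those caveat settings live in the broker transport options; a sketch of what that looks like (values here are only an example):
BROKER_TRANSPORT_OPTIONS = {
    'fanout_prefix': True,    # see the redis caveats document linked above
    'fanout_patterns': True,
}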
@ask unfortunately I don't have access to that environment anymore, but we didn't change the fanout_* settings.
In my case, I see the worker's main process in a tight loop at 100% CPU while the job-doing processes sit quietly doing nothing.
strace indicates that fd=3 (always the same fd) is being continuously polled, and the poll returns POLLIN | POLLHUP.
strace -p 9988
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
lsof says:
celeryd: 9988 sentry 3u sock 0,6 0t0 566337941 can't identify protocol
I've read somewhere this is an indication of a half-closed socket connection.
But that's as far as I've gotten. Reading this thread gives me more hints to look into the next time the issue kicks in.
I'm on the same project that @domenkozar was working on, and we're seeing this problem again. strace shows the worker processes hanging in a read() syscall.
Interestingly, the hangs went away for months and only started to recur recently when we added a bunch more work to the system by processing more tasks, so in some way it's timing- or load-dependent.
Package versions: amqp 1.4.9, billiard 3.3.0.23, kombu 3.0.35, celery 3.1.23.
How should I go about debugging this? Should I be trying to instrument the socket operations in billiard, in kombu, or in celery?
@akuchling can you use redis-cli and post the output of CLIENT LIST?
Also, I'd double-check that all servers are using kombu 3.0.35.
The kombu version is in fact 3.0.35. CLIENT LIST returns the following: http://dpaste.com/2GGCECS (IP addresses slightly redacted).
Interesting! In that dpaste, there isn't a client listed where cmd=NULL, so maybe this is a new failure mode. It doesn't seem to me that we're filling up Redis's memory at the moment; our data is around 3 GB on a 6 GB AWS machine.
We do have socket_timeout set to 10, but that doesn't fix the problem. Instead, celery ends up doing a few tasks every 10 seconds and then freezes again. Note how the seconds figure is incrementing (37, 47, 57) in this dpaste: http://dpaste.com/2AZYEH4
I found myself looking at on_stop_not_started() in celery/concurrency/asynpool.py because it showed up in a GDB stack trace, and was puzzled by the use of pending_remove_fd.
It's initially given the value of an empty set. Then the code does for fd in outqueues: self._flush_outqueue(fd, pending_remove_fd.discard, ...) and afterwards removes the contents of pending_remove_fd from outqueues.
Bug: the set is initially empty, and _flush_outqueue() is only ever calling pending_remove_fd.discard(). So how does the pending_remove_fd set ever get filled in? Was it supposed to be initialized as set(outqueues) to make a copy of the starting set of fds? Or have I misread something?
@akuchling good catch, I'm guessing it should be pending_remove_fds.add instead of pending_remove_fds.discard there.
So at the end of that loop it will remove the flushed outqueues when it does the outqueues.difference_update(pending_remove_fd).
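To make that concrete, here is a minimal sketch of the pattern being discussed (simplified names, not the actual celery source):
def flush_all(outqueues, flush_outqueue):
    pending_remove_fd = set()  # starts out empty
    for fd in outqueues:
        # Passing .discard as the callback means nothing is ever recorded in
        # the set; passing .add collects each flushed fd for removal below.
        flush_outqueue(fd, pending_remove_fd.add)
    # Drop every flushed fd from the set of queues still to be handled.
    outqueues.difference_update(pending_remove_fd)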
Thanks! Should this change also be committed to the 3.1 branch (which I've done in a fork so that I can test it)?
Unfortunately the pending_remove_fd patch doesn't fix all hanging for us, though it doesn't seem to cause any problems. With the patch, I've also gotten a worker/master hanging: the master is doing a read() from the worker, and the worker is in a read() in Billiard_conn_recv(), being called from pickle.load(). Any suggestions for what this might be?
Closing this, as we don't have the resources to complete this task.
It may be fixed in master; let's see if it comes back after the 4.0 release.
@akuchling @domenkozar - was the discussion on this issue continued anywhere / did you find any resolution?
Trawling through the issues list, there's a lot of subjective "something is broken", but this appears to be the closest to the issues we're seeing.
Sadly none of us work at the company anymore where this was happening in production.
@akuchling took over my previous efforts; maybe he can give the latest update on whether this bug was really fixed or not.
@JonPeel: I don't remember if we ever found a resolution. To summarize the changes we applied:
BROKER_TRANSPORT_OPTIONS = {
    'socket_timeout': 10,  # timeout in case of network errors
    'fanout_prefix': True,
    'fanout_patterns': True,
}
@JonPeel Now I'm running 4.0.22 (and planning to update to 4.1). I haven't seen this issue anymore.
I do find, from time to time, one worker with -c1 (for serialization) being unresponsive after reaching the maximum number of jobs (2000). It seems that when trying to kill and respawn the child worker it gets stuck. But I haven't explored deeply. Oddly enough, restarting the worker makes it work OK, even recycling the child worker after reaching another 2000 jobs. So I don't think it's related to Celery but to our app.
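For context, the setup described above roughly corresponds to a configuration like this (a sketch with assumed names; the 2000-job limit is the only figure taken from the comment):
from celery import Celery

# Hypothetical app; the broker URL is an assumption.
app = Celery('myapp', broker='redis://localhost:6379/0')
app.conf.worker_concurrency = 1              # the -c1 worker mentioned above
app.conf.worker_max_tasks_per_child = 2000   # child is recycled after 2000 tasks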
Thanks for the replies; will open a new ticket if our issue re-occurs. (For reference: Celery 4.1, Redis broker and result backend, redis-py 2.10.5, Redis 3.2.0, SSL enabled.)
From what I've seen, we're getting workers stuck listening on connections that are doing nothing (in the CLOSE_WAIT state), and a restart resumes business as usual.
From that, https://github.com/andymccurdy/redis-py/issues/306 looks more relevant to us than it was above. If so, it's a) an issue in redis-py, not celery, and b) seems to be positively affected by the socket_timeout in BROKER_TRANSPORT_OPTIONS as suggested above. A restart was required every 24 hours or so; it no longer appears to be a problem.
I am running into this issue on:
Celery 4.1.0
redis-py 2.10.6
Celery hangs in redis-py's connection.py on return sock.recv(*args, **kwargs). Here is the stack:
[2017-11-24 04:11:08,423: WARNING/MainProcess] File "/usr/bin/celery", line 11, in <module>
sys.exit(main())
[2017-11-24 04:11:08,423: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/__main__.py", line 14, in main
_main()
[2017-11-24 04:11:08,423: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/celery.py", line 326, in main
cmd.execute_from_commandline(argv)
[2017-11-24 04:11:08,424: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
super(CeleryCommand, self).execute_from_commandline(argv)))
[2017-11-24 04:11:08,425: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
return self.handle_argv(self.prog_name, argv[1:])
[2017-11-24 04:11:08,425: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/celery.py", line 480, in handle_argv
return self.execute(command, argv)
[2017-11-24 04:11:08,425: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/celery.py", line 412, in execute
).run_from_argv(self.prog_name, argv[1:], command=argv[0])
[2017-11-24 04:11:08,426: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/worker.py", line 221, in run_from_argv
return self(*args, **options)
[2017-11-24 04:11:08,426: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/base.py", line 244, in __call__
ret = self.run(*args, **kwargs)
[2017-11-24 04:11:08,427: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bin/worker.py", line 256, in run
worker.start()
[2017-11-24 04:11:08,427: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
[2017-11-24 04:11:08,428: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
[2017-11-24 04:11:08,428: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bootsteps.py", line 370, in start
return self.obj.start()
[2017-11-24 04:11:08,429: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 320, in start
blueprint.start(self)
[2017-11-24 04:11:08,429: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
[2017-11-24 04:11:08,429: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
[2017-11-24 04:11:08,430: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/worker/loops.py", line 88, in asynloop
next(loop)
[2017-11-24 04:11:08,430: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/async/hub.py", line 354, in create_loop
cb(*cbargs)
[2017-11-24 04:11:08,431: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/transport/redis.py", line 1040, in on_readable
self.cycle.on_readable(fileno)
[2017-11-24 04:11:08,431: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/transport/redis.py", line 337, in on_readable
chan.handlers[type]()
[2017-11-24 04:11:08,431: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/transport/redis.py", line 667, in _receive
ret.append(self._receive_one(c))
[2017-11-24 04:11:08,432: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/transport/redis.py", line 678, in _receive_one
response = c.parse_response()
[2017-11-24 04:11:08,432: WARNING/MainProcess] File "/usr/local/cedar/utils/redis/client.py", line 2448, in parse_response
return self._execute(connection, connection.read_response)
[2017-11-24 04:11:08,432: WARNING/MainProcess] File "/usr/local/cedar/utils/redis/client.py", line 2437, in _execute
return command(*args)
[2017-11-24 04:11:08,433: WARNING/MainProcess] File "/usr/local/cedar/utils/redis/connection.py", line 604, in read_response
response = self._parser.read_response()
[2017-11-24 04:11:08,433: WARNING/MainProcess] File "/usr/local/cedar/utils/redis/connection.py", line 259, in read_response
response = self._buffer.readline()
[2017-11-24 04:11:08,434: WARNING/MainProcess] File "/usr/local/cedar/utils/redis/connection.py", line 185, in readline
traceback.print_stack()
This happens when a connection is reset by a timeout (a load balancer in my case). An error event is sent on the socket, but then celery tries to read from the socket. Since the socket is closed, no data is available and it hangs forever. It specifically happens on celery's pubsub redis connection, since that is a long-running connection.
A socket_timeout param will fix the blocking issue and will throw an error after the timeout, but there is an underlying issue: celery should not read from a closed socket.
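As a sketch of that workaround (using the Celery 4 lowercase setting name; the app object and the value are assumptions):
from celery import Celery

app = Celery('myapp', broker='redis://localhost:6379/0')
app.conf.broker_transport_options = {
    # Raise an error instead of blocking forever if the broker socket dies.
    'socket_timeout': 30,
}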
The imperfect fix I made was to use this patch in redis-py: https://github.com/andymccurdy/redis-py/pull/886
as well as this custom fix for pubsub: https://github.com/cedar-team/redis-py/compare/f9348968dec581830a32e8469228c74ce443bd88...cedar-team:6aae12c96b733ad0b6e896001f0cf576fa26280a
+1. Hitting this issue as well: strace shows it blocked on recvfrom, and cmd=NULL in Redis for the connection.
@ericholscher If you're using hiredis, try without it. See #4321 and #3898.
@mvaled We are -- I will try without it. Thanks.
Reporting back: removing hiredis seems to have fixed the issue. Thanks! 👍
Thanks, Eric.