Salt: Minion did not return. [No response] appears occasionally, but once it happens, the minion never returns

Created on 28 Mar 2020  ·  25 Comments  ·  Source: saltstack/salt

Description of Issue

Setup

(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)

  • The minion and master config files are as follows; sensitive info is replaced by *. Because more than one minion has been set up, we specify our config file path as /etc/salt2019 (a start-command sketch follows the config listings).

cat minion | grep -v "^#" | grep -v "^$"

master: ***
user: root
pidfile: /var/run/salt2019-minion.pid
conf_file: /etc/salt2019/minion
pki_dir: /etc/salt2019/pki/minion
id: ***
cachedir: /var/cache/salt2019/minion
sock_dir: /var/run/salt2019/minion
file_roots: /srv/salt2019
log_file: /var/log/salt2019/minion
key_logfile: /var/log/salt2019/key
log_level: debug
tcp_keepalive: True
  • master config file
    cat master | grep -v "^#" | grep -v "^$"
user: ***
publisher_acl:
  user***:
    - test.*
    - state.*
    - cmd.*
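For context, a minimal sketch of how such a second minion is typically started against a non-default configuration directory; the binary path is taken from the minion versions report below, and the flags are standard salt-minion options:

    # Start the second minion against the /etc/salt2019 tree, with debug logging as in the report
    /root/miniconda3/bin/salt-minion --config-dir=/etc/salt2019 -l debug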

Steps to Reproduce Issue

(Include debug logs if possible and relevant.)

The following is the debug log; the master and minion IPs are replaced by 1.2.3.4 and 2.2.2.2. We are executing an SLS file. The Salt minion ran very well for a couple of days before it suddenly stopped returning data to the master. I traced the source code and found that the minion just stopped at "Connecting the Minion to the Master URI (for the return server)", and no more debug log info was printed.

  • Here is the log from when the minion does not return:
 2020-03-20 14:50:40,465 [salt.minion                                                      :1465][INFO    ][1397] User saltapi Executing command state.sls with jid 20200320145808931182
2020-03-20 14:50:40,493 [salt.minion                                                      :1472][DEBUG   ][1397] Command details {'fun': 'state.sls', 'ret': '', 'tgt': '1.2.3.4', 'user': 'saltapi', 'tgt_type': 'glob', 'jid': '20200320145808931182', 'arg': ['------sensitive info is replaced ----------', 'pillar={"tasktype":"ent-serial-collect"}', {'queue': True, '__kwarg__': True}]}
2020-03-20 14:50:40,526 [salt.minion                                                      :1605][INFO    ][17093] Starting a new job 20200320145808931182 with PID 17093
2020-03-20 14:50:40,550 [salt.utils.lazy                                                  :107 ][DEBUG   ][17093] Could not LazyLoad {0}.allow_missing_func: '{0}.allow_missing_func' is not available.
2020-03-20 14:50:40,570 [salt.utils.lazy                                                  :104 ][DEBUG   ][17093] LazyLoaded state.sls
2020-03-20 14:50:40,582 [salt.utils.lazy                                                  :104 ][DEBUG   ][17093] LazyLoaded saltutil.is_running
2020-03-20 14:50:40,587 [salt.utils.lazy                                                  :104 ][DEBUG   ][17093] LazyLoaded grains.get
2020-03-20 14:50:40,589 [salt.loader.2.2.2.2.int.module.config                       :398 ][DEBUG   ][17093] key: test, ret: _|-
2020-03-20 14:50:40,611 [salt.transport.zeromq                                            :132 ][DEBUG   ][17093] Initializing new AsyncZeroMQReqChannel for ('/etc/salt2019/pki/minion', '1.2.3.4', 'tcp://2.2.2.2:4506', 'aes')
2020-03-20 14:50:40,611 [salt.crypt                                                       :463 ][DEBUG   ][17093] Initializing new AsyncAuth for ('/etc/salt2019/pki/minion', '1.2.3.4', 'tcp://2.2.2.2:4506')
2020-03-20 14:50:40,613 [salt.transport.zeromq                                            :203 ][DEBUG   ][17093] Connecting the Minion to the Master URI (for the return server): tcp://2.2.2.2:4506
2020-03-20 14:50:41,995 [salt.minion                                                      :1465][INFO    ][1397] User saltapi Executing command state.sls with jid 20200320145810461355
2020-03-20 14:50:41,995 [salt.minion                                                      :1472][DEBUG   ][1397] Command details {'fun': 'state.sls', 'ret': '', 'tgt': '1.2.3.4', 'user': 'saltapi', 'tgt_type': 'glob', 'jid': '20200320145810461355', 'arg': ['------sensitive info is replaced ----------', 'pillar={"tasktype":"ent-serial-collect"}', {'queue': True, '__kwarg__': True}]}
2020-03-20 14:50:42,006 [salt.minion                                                      :1605][INFO    ][17213] Starting a new job 20200320145810461355 with PID 17213
2020-03-20 14:50:42,010 [salt.utils.lazy                                                  :107 ][DEBUG   ][17213] Could not LazyLoad {0}.allow_missing_func: '{0}.allow_missing_func' is not available.
2020-03-20 14:50:42,014 [salt.utils.lazy                                                  :104 ][DEBUG   ][17213] LazyLoaded state.sls
2020-03-20 14:50:42,018 [salt.utils.lazy                                                  :104 ][DEBUG   ][17213] LazyLoaded saltutil.is_running
2020-03-20 14:50:42,020 [salt.utils.lazy                                                  :104 ][DEBUG   ][17213] LazyLoaded grains.get
2020-03-20 14:50:42,022 [salt.loader.2.2.2.2.int.module.config                       :398 ][DEBUG   ][17213] key: test, ret: _|-
2020-03-20 14:50:42,038 [salt.transport.zeromq                                            :132 ][DEBUG   ][17213] Initializing new AsyncZeroMQReqChannel for ('/etc/salt2019/pki/minion', '1.2.3.4', 'tcp://2.2.2.2:4506', 'aes')
2020-03-20 14:50:42,038 [salt.crypt                                                       :463 ][DEBUG   ][17213] Initializing new AsyncAuth for ('/etc/salt2019/pki/minion', '1.2.3.4', 'tcp://2.2.2.2:4506')
2020-03-20 14:50:42,040 [salt.transport.zeromq                                            :203 ][DEBUG   ][17213] Connecting the Minion to the Master URI (for the return server): tcp://2.2.2.2:4506
2020-03-20 15:01:15,337 [salt.utils.schedule                                              :1627][DEBUG   ][1397] schedule: Job __mine_interval was scheduled with jid_include, adding to cache (jid_include defaults to True)
2020-03-20 15:01:15,337 [salt.utils.schedule                                              :1630][DEBUG   ][1397] schedule: Job __mine_interval was scheduled with a max number of 2
2020-03-20 15:01:15,337 [salt.utils.schedule                                              :1647][INFO    ][1397] Running scheduled job: __mine_interval
2020-03-20 15:01:15,458 [salt.utils.schedule                                              :689 ][DEBUG   ][10760] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'pid': 10760, 'id': '1.2.3.4', 'jid': '20200320150115452799', 'schedule': '__mine_interval', 'fun_args': []}
2020-03-20 15:01:15,461 [salt.utils.lazy                                                  :104 ][DEBUG   ][10760] LazyLoaded mine.update
2020-03-20 15:01:15,463 [salt.utils.lazy                                                  :104 ][DEBUG   ][10760] LazyLoaded config.merge
2020-03-20 15:01:15,463 [salt.utils.schedule                                              :836 ][DEBUG   ][10760] schedule.handle_func: Removing /var/cache/salt2019/minion/proc/20200320150115452799

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

  • master version

[root@localhost chenyanyan]# salt-master -V

Salt Version:
           Salt: 2018.3.4

Dependency Versions:
           cffi: 1.6.0
       cherrypy: 5.6.0
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: 0.8.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Aug 27 2018, 16:21:36)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist:
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-693.21.1.el7.x86_64
         system: Linux
        version: Not Installed
  • minion version

linux-jc57:/var/log # /root/miniconda3/bin/salt-minion -V

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.1
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.1.5

System Versions:
           dist: SuSE 11 x86_64
         locale: UTF-8
        machine: x86_64
        release: 3.0.76-0.11-default
         system: Linux
        version: SUSE Linux Enterprise Server  11 x86_64
Labels: Bug, info-needed

Most helpful comment

We found out that Python 3.5 had a bug that caused the problem; here's the bug report:
https://bugs.python.org/issue29386
Please check your Python version first. Also, what exactly were you doing when it happened: did you use Salt to execute some nested shell scripts?


All 25 comments

How many minions do you have connected to your master? And what I'm reading here: is it that _some_ minions will no longer connect to the master after a few days, or that _all_ minions will no longer connect after a few days?

Also, I noticed your master is on 2018.3.4, which is a version we are no longer officially supporting. Is there any way you'd be able to try this setup in another environment with your master on 3000 or also on 2019.2?

In our test environment we have about 6 minions across 3 types of OS: SUSE and two others developed by our company. Only the minions installed on the SUSE servers occasionally hit that 'not return' problem; the others run very well.

Yes, we can set up an environment with the latest version and see if this occurs.

I should point out that this problem happens only very occasionally, and we have no idea when it will happen again; but in our production environment we have thousands of minion boxes, so it could be a serious problem.

Thank you very much for your reply.


Are you able to run basic commands such as ip ping <address> and also check the ports on the suse minions? It could be a configuration issue on the suse minion end since it seems your other in-house minions are running fine without any hiccups!

After the minion stops returning, test.ping also doesn't return, but IP ping is OK. When we restart the minion manually, it runs well again.
I didn't check the ports on the SUSE box, and I am now waiting for it to happen again.
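A quick, hedged way to do that port check on an affected SUSE minion: the ZeroMQ transport uses the master's publish port 4505 and return port 4506, so the established connections can be listed while the minion is hung (netstat is assumed available on SLES 11; 2.2.2.2 stands in for the master as elsewhere in this issue):

    # On the hung minion: are the connections to the master's 4505/4506 still ESTABLISHED?
    netstat -tnp | grep -E ':(4505|4506)'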


I found some system logs:
Mar 20 14:50:40 linux kernel: [89579.613557] python[17093]: segfault at 5f436 ip 000000000005f436 sp 00007fff29e20658 error 14 in python3.5[400000+1000]
Mar 20 14:50:42 linux kernel: [89581.040805] python[17213]: segfault at 5f436 ip 000000000005f436 sp 00007fff29e20658 error 14 in python3.5[400000+1000]

At the time 'Mar 20 14:50:40', the minion's log is:
2020-03-20 14:50:40,613 [salt.transport.zeromq :203 ][DEBUG ][17093] Connecting the Minion to the Master URI (for the return server): tcp://xxxxx

Is there any connection?
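Worth noting: the segfaulting PIDs in those kernel messages (17093 and 17213) are exactly the job PIDs in the minion debug log above, so the job processes appear to die right after the "Connecting the Minion to the Master URI" line. A hedged sketch for confirming that correlation, using the log path from the minion config above and the usual SLES syslog location (adjust if different):

    # Minion-side entries for the crashed job PID (log_file from the minion config)
    grep 17093 /var/log/salt2019/minion
    # Kernel segfault entries for the same PID
    grep 17093 /var/log/messages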

I am facing the same issue on Salt 2019.2.5, with the salt master and minion running inside CentOS 7 containers. Roughly the first 24 test.ping calls work; after that, even after several retries, I keep getting "Minion did not return. [No response]".

If I restart the salt-minion, the next ~24 test.ping calls work and then the problem occurs again.

What's your Python version?


Python 2.7.5

Found the observations below in the salt-minion trace logs for a test.ping that got no response (a master-side job lookup sketch follows them):

  • It looks like the salt-minion responded to that test.ping:
    [DEBUG ] minion return: {u'fun_args': [], u'jid': u'20200715071230671893', u'return': True, u'retcode': 0, u'success': True, u'fun': u'test.ping'}

  • It looks like the salt-master did not receive the above response, so it asks the salt-minion for that jid again:
    [DEBUG ] Command details {u'tgt_type': u'list', u'jid': u'20200715071235735268', u'tgt': [u' node-name'], u'ret': u'', u'user': u'root', u'arg': [u'20200715071230671893'], u'fun': u'saltutil.find_job'}

  • Since the salt-minion had already finished and returned that jid, its find_job response is the empty result below:
    [DEBUG ] minion return: {u'fun_args': [u'20200715071230671893'], u'jid': u'20200715071235735268', u'return': {}, u'retcode': 0, u'success': True, u'fun': u'saltutil.find_job'}
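A hedged sketch of the master-side lookup referenced above: when the master seems to miss a return, the jid can be checked in the master's job cache with the jobs runner. Assuming the default local job cache, an empty result for the original jid would mean the master really never stored the minion's return, while a populated one would point at the CLI/timeout side instead:

    # On the master: what, if anything, did the job cache record for the test.ping jid?
    salt-run jobs.lookup_jid 20200715071230671893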

I have the same problem in Salt 3000.3 (master and minions) too, and it's been going on for quite some time across prior versions as well. I've struggled to find any ideas on how to identify it or how to even write a bug report for it.

We're running Python 2.7 too, on AWS Linux.

What I can determine is that there appear to be two cases:

  1. The minion responds to an event, and the master accepts that the event was returned.
  2. The minion responds (or at least says it responds), but the master never receives the event.

You can see it play out by using salt-run state.event pretty=True.

In the happy case, the following happens:

20200716191834527792    {
    "_stamp": "2020-07-16T19:18:34.528083",
    "minions": [
        "the-minion,"
    ]
}
salt/job/20200716191834527792/new   {
    "_stamp": "2020-07-16T19:18:34.528555",
    "arg": [],
    "fun": "test.ping",
    "jid": "20200716191834527792",
    "minions": [
        "the-minion,"
    ],
    "missing": [],
    "tgt": "the-minion,",
    "tgt_type": "glob",
    "user": "root"
}
salt/job/20200716191834527792/ret/the-minion, {
    "_stamp": "2020-07-16T19:18:34.642001",
    "cmd": "_return",
    "fun": "test.ping",
    "fun_args": [],
    "id": "the-minion,",
    "jid": "20200716191834527792",
    "retcode": 0,
    "return": true,
    "success": true
}

The job is created, the return is received and all is right with the world.

In the unhappy case:

20200716191533486948    {
    "_stamp": "2020-07-16T19:15:33.487146",
    "minions": [
        "the-minion,"
    ]
}
salt/job/20200716191533486948/new   {
    "_stamp": "2020-07-16T19:15:33.487613",
    "arg": [],
    "fun": "test.ping",
    "jid": "20200716191533486948",
    "minions": [
        "the-minion,"
    ],
    "missing": [],
    "tgt": "the-minion,",
    "tgt_type": "glob",
    "user": "root"
}
20200716191538601450    {
    "_stamp": "2020-07-16T19:15:38.601653",
    "minions": [
        "the-minion,"
    ]
}
salt/job/20200716191538601450/new   {
    "_stamp": "2020-07-16T19:15:38.602129",
    "arg": [
        "20200716191533486948"
    ],
    "fun": "saltutil.find_job",
    "jid": "20200716191538601450",
    "minions": [
        "the-minion,"
    ],
    "missing": [],
    "tgt": [
        "the-minion,"
    ],
    "tgt_type": "list",
    "user": "root"
}
salt/job/20200716191538601450/ret/the-minion, {
    "_stamp": "2020-07-16T19:15:38.711512",
    "cmd": "_return",
    "fun": "saltutil.find_job",
    "fun_args": [
        "20200716191533486948"
    ],
    "id": "the-minion,",
    "jid": "20200716191538601450",
    "retcode": 0,
    "return": {},
    "success": true
}

It's broadly the same, but you can see a find_job instance is created as @SaravanakumarSivasankaran suggests.

The minion logs similarly in both cases:

In the happy case:

2020-07-16 19:18:34,536 [salt.minion                                           :1491][INFO    ][8630] User root Executing command test.ping with jid 20200716191834527792
2020-07-16 19:18:34,619 [salt.minion                                           :1621][INFO    ][5442] Starting a new job 20200716191834527792 with PID 5442
2020-07-16 19:18:34,637 [salt.minion                                           :1949][INFO    ][5442] Returning information for job: 20200716191834527792

In the unhappy case, there is the additional find_job:

2020-07-16 19:15:33,494 [salt.minion                                           :1491][INFO    ][8630] User root Executing command test.ping with jid 20200716191533486948
2020-07-16 19:15:33,595 [salt.minion                                           :1621][INFO    ][5192] Starting a new job 20200716191533486948 with PID 5192
2020-07-16 19:15:33,616 [salt.minion                                           :1949][INFO    ][5192] Returning information for job: 20200716191533486948
2020-07-16 19:15:38,610 [salt.minion                                           :1491][INFO    ][8630] User root Executing command saltutil.find_job with jid 20200716191538601450
2020-07-16 19:15:38,701 [salt.minion                                           :1621][INFO    ][5195] Starting a new job 20200716191538601450 with PID 5195
2020-07-16 19:15:38,706 [salt.minion                                           :1949][INFO    ][5195] Returning information for job: 20200716191538601450

Note there are two returns - one for the first test.ping and one for the following find_job.

Some print debugging shows that the minion is, as best as I can tell, returning sensible messages. I modified the minion function here:

https://github.com/saltstack/salt/blob/45efc4c1423f6ad233552cb7efbc9c5ff9f73a3b/salt/transport/zeromq.py#L389-L395

to:

        def _do_transfer():
            log.debug("_do_transfer")
            # assumes `import base64` and `import pickle` are available in this module
            log.debug(base64.b64encode(pickle.dumps(load)))
            # Yield control to the caller. When send() completes, resume by populating data with the Future.result
            data = yield self.message_client.send(
                self._package_load(self.auth.crypticle.dumps(load)),
                timeout=timeout,
                tries=tries,
            )

And used this to dump the output in JSON:

print(json.dumps(pickle.loads(base64.b64decode("...."))))

Here are the successful and unsuccessful cases from two runs.

Successful case:

 {"fun_args": ["sleep 10"], "jid": "20200716224848367462", "return": "", "retcode": 0, "success": true, "cmd": "_return", "fun": "cmd.run", "id": "the-minion"}

Unsuccessful case:

{"fun_args": ["sleep 10"], "jid": "20200716224718720155", "return": "", "retcode": 0, "success": true, "cmd": "_return", "fun": "cmd.run", "id": "ip-172-20-5-15.eu-west-1.compute.internal"}
{"fun_args": ["20200716224718720155"], "jid": "20200716224723832936", "return": {}, "retcode": 0, "success": true, "cmd": "_return", "fun": "saltutil.find_job", "id": "the-minion"}

Nothing to suggest that the minions are behaving differently.

tcpdump shows that messages are going from the minion to the master - I can see no obvious differences between the two cases.
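For reference, a hedged sketch of the kind of capture described above, run on the minion while reproducing a test.ping; 2.2.2.2 stands in for the master as elsewhere in this issue, and the interface/filter may need adjusting for the environment:

    # Watch traffic from the minion to the master's return port (4506) during a test.ping
    tcpdump -nn -i any host 2.2.2.2 and tcp port 4506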

From running with -ltrace and some print debugging, it looks to me like the minions are doing exactly what they're supposed to, and the problem lies with the master not picking up the response from the minion, or at least not doing anything with it.

I find that restarting the master almost always resolves the problem, which makes tracking it down really hard. As far as I can tell, my best guess is that something about the socket-receiving logic means the messages aren't received correctly. I found that the following code, https://github.com/saltstack/salt/blob/master/salt/ext/tornado/iostream.py#L1051-L1054,
if modified to:

    def read_from_fd(self):
        try:
            chunk = self.socket.recv(self.read_chunk_size)
            print(chunk) # print the chunk out when received
        except socket.error as e:

would print the chunk from the minion immediately when test.ping was sent and working, but would never print those messages when it wasn't. But I'm unfamiliar with how this socket is created, what it's used for, and whether this is indicative of an error or not. I'm running out of ideas on how to track this down; my socket, Python, Tornado and IPC knowledge is very limited, so I'm struggling to determine where to even look.

I don't think there's anything unusual about our environment: we have around 50-70 nodes. Most are static, with ~15 auto-scaling; the rest don't change. It's all in AWS within peered VPCs, and as noted above, per tcpdump the traffic seems to traverse the network perfectly fine. Sometimes it works, sometimes it doesn't, and it's proving impossible to tell why.

Any thoughts or recommendations on how best to proceed would be utterly and completely magnificent.

@edhgoose I apologize it took me so long to follow up here. I'm wondering if we should schedule a session with a Core Team member to attempt to hammer on this. I can help schedule something; is there a time that is best for you?

@sagetherage, that'd be great.

If you want to take a look at https://calend.ly/edhgoose, that's probably a good start? Generally evenings (UK time) are pretty good too.

@sagetherage - hey, I'd really love to get that help. This issue is driving me mad, sorry to chase - can we try and setup that session with a team member?

yes! I have assigned @DmitryKuzmenko who can help here.

@DmitryKuzmenko have you been able to get with @edhgoose on this issue?

@sagetherage @DmitryKuzmenko and I had a conversation about turning on certain debugging, but I'm on holiday at the moment and haven't seen the issue for a little while. Unfortunately it seems extremely intermittent, but I'll be sure to report what the logs show if it comes up again.

@edhgoose no worries! I was checking in because I didn't want things to be left undone, thank you for the extra effort and have a nice holiday.

Unfortunately, I'm getting this same problem, and I manage thousands of servers.

[root@master]# salt-master -V
Salt Version:
           Salt: 3001.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.1
        libgit2: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.8 (default, Apr  2 2020, 13:34:55)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 17.0.0
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.1.4

System Versions:
           dist: centos 7 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1127.19.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7 Core
[root@server salt]# salt-minion -V
Salt Version:
           Salt: 3001.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.1
        libgit2: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.8 (default, Apr  2 2020, 13:34:55)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 17.0.0
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.1.4

System Versions:
           dist: centos 7 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1127.19.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7 Core

Like @edhgoose mentioned, a restart usually fixes it; however, it's been happening more and more lately. The company I work for adds new dedicated servers weekly, and the issue persists even on newly provisioned, fully updated servers.

We use the Python 3 versions of Salt.

@marilyn6483 is there any chance you could provide the corresponding minion debug log?

We found out that Python 3.5 had a bug that caused the problem; here's the bug report:
https://bugs.python.org/issue29386
Please check your Python version first. Also, what exactly were you doing when it happened: did you use Salt to execute some nested shell scripts?
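A hedged sketch of that check, reusing the epoll probe that appears later in this thread; run both commands under the same interpreter the minion uses, and interpret the result against the linked bug report:

    # Which Python does the minion run under?
    python3 -V
    # The epoll probe used later in this thread (the strace run below shows it blocking in epoll_wait on 3.6)
    python3 -c 'import select; select.epoll().poll(timeout=-1)'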


Well, I'm running python3.6 and the issue persists.

I do run a few cmd.script states that execute shell scripts; however, it's never been this bad.

Just a follow-up as it's been 10 days without a reply.

  • I'm running CentOS 7 x64.
  • Python 3.6
  • Salt is running on python3.6.

I checked this bug and ran:

strace python3.6 -c 'import select; select.epoll().poll(timeout=-1)' &
stat("/usr/lib64/python3.6", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/usr/lib64/python3.6/lib-dynload", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/lib/python3.6/site-packages", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/lib64/python3.6/site-packages", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/lib/python3.6/site-packages", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
brk(NULL)                               = 0x1184000
brk(0x11a5000)                          = 0x11a5000
getcwd("/root", 1024)                   = 6
stat("/root", {st_mode=S_IFDIR|0550, st_size=246, ...}) = 0
stat("/root", {st_mode=S_IFDIR|0550, st_size=246, ...}) = 0
stat("/root", {st_mode=S_IFDIR|0550, st_size=246, ...}) = 0
openat(AT_FDCWD, "/root", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
getdents(3, /* 17 entries */, 32768)    = 528
getdents(3, /* 0 entries */, 32768)     = 0
close(3)                                = 0
stat("/usr/lib64/python3.6", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/usr/lib64/python3.6/lib-dynload", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/lib64/python3.6/lib-dynload/select.cpython-36m-x86_64-linux-gnu.so", {st_mode=S_IFREG|0755, st_size=33536, ...}) = 0
futex(0x7f3acc3a80d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/usr/lib64/python3.6/lib-dynload/select.cpython-36m-x86_64-linux-gnu.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000 \0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=33536, ...}) = 0
mmap(NULL, 2127544, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f3ac5187000
mprotect(0x7f3ac518d000, 2093056, PROT_NONE) = 0
mmap(0x7f3ac538c000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x7f3ac538c000
close(3)                                = 0
mprotect(0x7f3ac538c000, 4096, PROT_READ) = 0
epoll_create1(EPOLL_CLOEXEC)            = 3
epoll_wait(3,

Just waits...

You mentioned nested shell scripts. We do something like this:

install_backups:
  cmd.script:
    - name: install_backups.sh
    - source: salt://backups/files/install_backups.sh
  • This happens on new servers too with all the updates applied.

@JustOneMoreBlock I've asked before and am asking again: could you please provide a hung minion debug log?

I've got the same problem. My OS is CentOS, the Python version is 3.6.8, and the salt-master version is 3001.1.

It seems like we might have a few different issues all in the same ticket here:

@marilyn6483 Just to confirm, was the original issue you posted related to Python 3.5, and has it been resolved with an upgrade?

@JustOneMoreBlock Following on from the comments by @DmitryKuzmenko, can you provide a minion log with information from when the issue occurs?

@edhgoose Is the issue you're seeing still persisting? Does it appear to be a timeout issue between master and minion? Have you tried increasing the timeout values? Are you running Salt within a single AWS region, or between regions? Or from outside AWS into AWS?

Thanks!
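If the timeout angle is pursued, here is a hedged sketch of ruling it out from the CLI; -t/--timeout and --gather-job-timeout are standard salt options, but their availability should be confirmed with salt --help on the versions in use:

    # Give the minion more time to return and the master longer to poll find_job results
    salt -t 120 --gather-job-timeout=30 '*' test.ping   # '*' is a placeholder target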
