Salt: Log filled with "Exception occurred while Subscriber handling stream: Already reading"

Created on 16 Aug 2018 · 47 Comments · Source: saltstack/salt

Description of Issue/Question

Hi, on a server running both salt-minion and salt-master, the minion's logfile filled up (quite quickly) today with the following message:

2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,762 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,763 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-12 06:25:10,764 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading

Stopping the minion service didn't stop the process writing to the logfile, and neither did SIGTERM; I only got rid of it with SIGKILL (PID 25008 here):

$ ps aux | grep [p]ython
root     24851  0.0  0.1 101600  6048 ?        S    Aug10   0:00 /usr/bin/python /usr/bin/salt-minion
root     24854  0.0  0.2 548420 11264 ?        Sl   Aug10   0:09 /usr/bin/python /usr/bin/salt-minion
root     25005  0.0  0.1 101600  6412 ?        S    Aug10   0:00 /usr/bin/python /usr/bin/salt-minion
root     25008 99.9  0.6 549964 27508 ?        Sl   Aug10 8403:05 /usr/bin/python /usr/bin/salt-minion

Setup

I don't know what's relevant here...

Steps to Reproduce Issue

Don't know either :/

Versions Report

Salt Version:
           Salt: 2017.7.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.6 (default, Nov 23 2017, 15:49:48)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: Ubuntu 14.04 trusty
         locale: ANSI_X3.4-1968
        machine: x86_64
        release: 3.19.0-80-generic
         system: Linux
        version: Ubuntu 14.04 trusty

Same version of minion, master, and common.

Bug P3 ZD fixed-pending-your-verification severity-medium

All 47 comments

Was there possibly anything before these log messages appeared that might be more telling as to why this occurred?

Any other reason as to why this would have occurred? Were you running a particular command, state, etc.?

As far as I know, no manual activity caused this. This looks like the beginning, there don't seem to be any earlier messages:

2018-08-10 13:01:57,661 [salt.log.setup   ][WARNING ][24854] Temporary logging is already configured
2018-08-10 13:01:57,720 [salt.log.setup   ][WARNING ][24854] Console logging already configured
2018-08-10 13:01:57,721 [salt.log.setup   ][WARNING ][24854] Console logging already configured
2018-08-10 13:01:57,723 [salt.log.setup   ][WARNING ][24854] Logfile logging already configured
2018-08-10 13:01:57,727 [salt.log.setup   ][ERROR   ][24854] An un-handled exception was caught by salt's global exception handler:
IOError: [Errno 5] Input/output error
Traceback (most recent call last):
  File "/usr/bin/salt-minion", line 26, in <module>
    salt_minion()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 204, in salt_minion
    minion.start()
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 340, in start
    super(Minion, self).start()
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 1032, in start
    self.prepare()
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 299, in prepare
    self.shutdown(1)
  File "/usr/lib/python2.7/dist-packages/salt/cli/daemons.py", line 397, in shutdown
    self.__class__.__name__, (exitmsg or '')).strip()))
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 1045, in shutdown
    self.exit(exitcode, exitmsg)
  File "/usr/lib/python2.7/dist-packages/salt/utils/parsers.py", line 268, in exit
    optparse.OptionParser.exit(self, status, msg)
  File "/usr/lib/python2.7/optparse.py", line 1573, in exit
    sys.stderr.write(msg)
IOError: [Errno 5] Input/output error
2018-08-10 13:02:01,413 [salt.log.setup   ][WARNING ][25008] Temporary logging is already configured
2018-08-10 13:02:01,473 [salt.log.setup   ][WARNING ][25008] Console logging already configured
2018-08-10 13:02:01,473 [salt.log.setup   ][WARNING ][25008] Console logging already configured
2018-08-10 13:02:01,475 [salt.log.setup   ][WARNING ][25008] Logfile logging already configured
2018-08-10 13:02:01,477 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading
2018-08-10 13:02:01,477 [salt.transport.ipc][ERROR   ][25008] Exception occurred while Subscriber handling stream: Already reading

The master's log file contains many stack traces because it's unable to create temp files, as well as messages like these:

2018-08-14 11:57:41,801 [salt.utils.gitfs ][WARNING ][13836] Update lock file is present for gitfs remote 'http://XXX', skipping. If this warning persists, it is possible that the update process was interrupted, but the lock could also have been manually set. Removing /var/cache/salt/master/gitfs/67557cee86485c172576f530bfb03bbdb4c87c5971858ce20b19388e07b85606/.git/update.lk or running 'salt-run cache.clear_git_lock gitfs type=update' will allow updates to continue for this remote.
2018-08-14 11:58:42,687 [salt.utils.gitfs ][WARNING ][13836] gitfs_global_lock is enabled and update lockfile /var/cache/salt/master/gitfs/67557cee86485c172576f530bfb03bbdb4c87c5971858ce20b19388e07b85606/.git/update.lk is present for gitfs remote 'http://XXX'.

Sorry, I could've added that right at the start, hope this helps!

Ping @saltstack/team-transport, any ideas here? Or follow-up questions, since this might be difficult to replicate.

This is a race condition somewhere in the event bus client usage on the minion side. We use IPCMessageSubscriber to read the event bus. It's a singleton, so there can be two instances of the bus client sharing one subscriber, and since it's asynchronous, one tries to read while the other's read isn't done yet. I can't say more without a detailed analysis.
@furgerf is it easily reproducible on your side? Would it be possible for you to reproduce this on patched minion code to get better logging?
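
To make the race concrete, here is a minimal standalone sketch (not Salt's code; the socketpair setup and names are purely illustrative) of what happens when two concurrent read_bytes() calls hit a single Tornado IOStream, which is what the minion's IPC transport is built on. The second call fails with "Already reading", matching the message in the log above.

import socket

from tornado import gen, ioloop, iostream

@gen.coroutine
def main():
    local_sock, remote_sock = socket.socketpair()
    reader = iostream.IOStream(local_sock)

    first = reader.read_bytes(1)       # first read starts waiting for data
    try:
        yield reader.read_bytes(1)     # second concurrent read on the same stream
    except Exception as exc:
        print("second read failed: %s" % exc)   # -> "Already reading"

    remote_sock.send(b"x")
    yield first                        # the first read still completes normally
    reader.close()

if __name__ == "__main__":
    ioloop.IOLoop.current().run_sync(main)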

Hi @DmitryKuzmenko, no, unfortunately I have no idea how to reproduce it. From your description, it sounds like you might be able to trigger this more easily by creating a bunch of bus clients and having them read from the same subscriber, if that's an option in the relevant code.
Let me know if I can help with more info.

@furgerf it's OK.
As for the problem, I think we could try to catch this by code review. There are not many places where that subscriber is used.
Did you do something on that minion right before? I see a strange exception report in your log 4 seconds before the issue happened. Did you restart the minion manually or by cron? Did the minion execute some state or scheduled task?

@Ch3LL I propose to mark it as a bug and put it in the backlog. If we get more reports here, we can increase the severity.

@DmitryKuzmenko I've looked into the logs and see that there was some activity before the problem started: someone ran salt '*' cmd.run 'curl ifconfig.me'. There were no restarts, though.

Like I mentioned before, this server contains both a minion and a master; the master manages around 40 minions in total. About half of them are currently connected.

@furgerf thank you for the info. I'll review the related code when I have time. If you reproduce it again, any new information is very much appreciated.

Any news on this? I was just bitten by it on a couple of minions, e.g.

# salt-call --versions-report
Salt Version:
           Salt: 2018.3.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.2
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.9 (default, Sep 25 2018, 20:42:16)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.3
            ZMQ: 4.0.5

System Versions:
           dist: debian 8.11 
         locale: UTF-8
        machine: x86_64
        release: 3.16.0-7-amd64
         system: Linux
        version: debian 8.11 

Getting the same here in our setup; about 10 or so of our minions are hitting this and filling up drive space rapidly.

FYI: we have about 300 or so minions, so it's not global, and reinstalling did not fix the issue.

Started getting this as well. Only a few minions, but it fills up the drive rather quickly causing other issues.

@DmitryKuzmenko it seems more people are reporting this, so I will increase the severity. Any idea as to when you can get to this issue? I know it's currently on your backlog due to higher-priority issues.

Had some time today to work on this. I confirmed what I wrote above. Tomorrow I'll provide a description and a solution.

I have no code yet, but I promised to post an update, so here it is: the main problem is that IPCMessageSubscriber is a singleton. It's used by event objects that can be instantiated anywhere just to wait for some event, handle it, and forget it. So we can have a number of event instances at once, all sharing one subscriber. If several instances run read_async simultaneously, they all call stream.read_bytes, which fails for every subsequent call as long as the first one isn't done yet. Moreover, to conform with the subscriber pattern, all the instances have to get the same data when something is received.

The simplest way to fix this is to make IPCMessageSubscriber a non-singleton. For that we need to make IPCClient a non-singleton and slightly rewrite IPCMessageClient to implement the singleton behavior by itself (or make it a non-singleton too). This could have a performance impact as well.

Another idea is to split this into a non-singleton handler class that strongly references the singleton instance, with the singleton weakly referencing the handlers. This would allow the singleton to know who is waiting for event data and to pass the data to all event objects. A rough sketch of that shape follows below.
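
For illustration only (this is not the actual fix, and the class and method names below are invented), that second idea could look roughly like this: a shared reader keeps only weak references to its handlers and fans every payload out to all of them, while each handler keeps the shared reader alive with a strong reference.

import weakref

class SharedSubscriber(object):
    # Single reader per IPC socket; fans payloads out to all live handlers.
    def __init__(self):
        # Weak references only: the shared reader must not keep handlers alive.
        self._handlers = weakref.WeakSet()

    def register(self, handler):
        self._handlers.add(handler)

    def dispatch(self, payload):
        # Subscriber semantics: every live handler sees every payload.
        for handler in self._handlers:
            handler.deliver(payload)

class EventHandler(object):
    # Non-singleton, per-caller object; strongly references the shared reader.
    def __init__(self, subscriber):
        self._subscriber = subscriber   # strong reference keeps the reader alive
        self._pending = []
        subscriber.register(self)

    def deliver(self, payload):
        self._pending.append(payload)

    def get_event(self):
        return self._pending.pop(0) if self._pending else None

if __name__ == "__main__":
    bus = SharedSubscriber()
    a, b = EventHandler(bus), EventHandler(bus)
    bus.dispatch({"tag": "test/ping"})
    print("%s %s" % (a.get_event(), b.get_event()))   # both handlers see the payload

A handler that goes out of scope simply disappears from the WeakSet, so the shared reader never delivers to dead clients.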

Give me some time to think about the best solution. I'll provide one soon.

@saltstack/team-core any thoughts about this ^^^ are much appreciated.

I've started to see this on a number (15-20%) of minions in my fleet as well. salt-minion doesn't properly die when restarted via systemd, requiring a SIGTERM.

This fills up my disks very rapidly, which of course causes other issues.

Seeing this recur across our entire fleet at intervals of a week or so, exactly as described. The only way to resolve it is to force-kill the minion processes, as they no longer respond to the master or to systemd trying to shut them down.

It looks like a service restart causes the issue, because the minion sometimes cannot properly die on SIGTERM, which leaves some hung processes that must be killed manually.

2018-12-05 05:10:11,000 [salt.utils.parsers:1053][WARNING ][2280007] Minion received a SIGTERM. Exiting.
2018-12-05 05:10:11,531 [salt.log.setup   :441 ][WARNING ][2280007] Temporary logging is already configured
2018-12-05 05:10:11,624 [salt.log.setup   :494 ][WARNING ][2280007] Console logging already configured
2018-12-05 05:10:11,626 [salt.log.setup   :494 ][WARNING ][2280007] Console logging already configured
2018-12-05 05:10:11,830 [salt.log.setup   :569 ][WARNING ][2280007] Logfile logging already configured
2018-12-05 05:10:11,833 [salt.transport.ipc:723 ][ERROR   ][2280007] Exception occurred while Subscriber handling stream: Already reading
2018-12-05 05:10:11,833 [salt.transport.ipc:723 ][ERROR   ][2280007] Exception occurred while Subscriber handling stream: Already reading
2018-12-05 05:10:11,834 [salt.transport.ipc:723 ][ERROR   ][2280007] Exception occurred while Subscriber handling stream: Already reading
2018-12-05 05:10:11,834 [salt.transport.ipc:723 ][ERROR   ][2280007] Exception occurred while Subscriber handling stream: Already reading
2018-12-05 05:10:11,834 [salt.transport.ipc:723 ][ERROR   ][2280007] Exception occurred while Subscriber handling stream: Already reading
2018-12-05 05:10:11,834 [salt.transport.ipc:723 ][ERROR   ][2280007] Exception occurred while Subscriber handling stream: Already reading

We have this issue on cPanel servers, and the trigger is the daily package update (upcp / rpmup), which also auto-restarts services that are using updated packages. For example, if Python is updated by this cron job, it will find all processes using the old Python version and auto-restart those services; salt-minion is one of them. In some cases the salt-minion restart just does not work properly and leaves hung processes. Maybe the Python update really does cause the restart issue, but I am not sure because ~500 packages were updated on this rpmup run.

- Processing command `/usr/local/cpanel/scripts/find_outdated_services --auto`
 [/usr/local/cpanel/scripts/find_outdated_services] Looking for outdated services …
 [/usr/local/cpanel/scripts/find_outdated_services]
 [/usr/local/cpanel/scripts/find_outdated_services] The system found 28 outdated services:
...
 [/usr/local/cpanel/scripts/find_outdated_services]     salt-minion
...
 [/usr/local/cpanel/scripts/find_outdated_services] Restarting “salt-minion” …
 [/usr/local/cpanel/scripts/find_outdated_services] The system has restarted “salt-minion”.
...

Until this salt-minion issue is resolved, a temporary solution on cPanel servers is to add salt-minion to the file /etc/cpanel/local/ignore_outdated_services to prevent auto restarts.

@deyanx I think that while cPanel does restart salt-minion, that's just a trigger for the underlying problem. We see this on cPanel servers too, but even with find_outdated_services ignoring salt-minion, any other attempt to restart salt-minion via systemctl causes the same behaviour.

@Ch3LL I have a strace of the minion process when it is restarted and this problem occurs, if that helps. Contains sensitive data so I will not be placing it here.

> @deyanx I think that while cPanel does restart salt-minion, that's just a trigger for the underlying problem. We see this on cPanel servers too, but even with find_outdated_services ignoring salt-minion, any other attempt to restart salt-minion via systemctl causes the same behaviour.

Yes, exactly. That is what I tried to say. cPanel's rpmup is just a trigger for the automatic restart and the underlying problem, not the cause of the issue, so my suggestion only prevents the problem from happening unattended: we can restart the minion manually when/if needed and then check whether the issue appears on a manual restart. That is obviously not a solution for this problem, just a suggestion to avoid random unattended issues... :)

Manually restarting salt-minion doesn't replicate the issue.

Any progress here? We had a bunch more servers start doing this today; it's really starting to impact us.

Any updates on this?

@DmitryKuzmenko I know you are working on the Tornado stuff. Are you going to be able to get to this soon? If not, we can re-assign it to someone else.

I'll take care of this.

We're using CentOS 7, and updating our minions to kernel 3.10.0-957.5.1.el7.x86_64 and rebooting seems to have solved this problem for us. The affected minions had kernels between 3.10.0-5xx and 3.10.0-8xx.

Edit: This might not be kernel-related; it might just have been the reboot. After the update to salt-minion-2019.2.0-1.el7 it seemed there were 2 minions running on some of our machines.

+1 here, we're seeing it on a selection of CentOS 7 systems as well. Interestingly, the most recent was also running cPanel.

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Oct 30 2018, 23:45:53)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.5.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

Updating the kernel to the latest release didn't solve the problem.

@DejanKolar the PR is still not merged. I'm waiting for the CI system fix to re-run the tests. I expect this to be done in a couple of days.

I am also facing the same issue after upgrading to 2018.3.3

I observed the same issue after upgrading from salt-2018.3.0-1.el7 to salt-2019.2.0-1.el7 today. Our monitoring noticed /var was almost full. Doing a "systemctl stop salt-minion.service" did not work. I had to do a "kill -9" on two remaining orphaned salt-minion daemons to get the error messages in /var/log/salt/minion to stop.

Also, for what it's worth...
Running RHEL 7 kernel 3.10.0-957.el7.x86_64. The kernel-3.10.0-957.10.1.el7.x86_64 package is installed, but the system hasn't been rebooted yet to apply it.

I'm still seeing this issue on a large number of servers. Any idea when the fix will be available?

Perhaps @DmitryKuzmenko can update everyone on which release this will get into

I've added the following on cPanel servers, which appears to have addressed this:

cpanel_local:
  file.directory:
    - name:     /etc/cpanel/local
    - makedirs: True

cpanel_ignored:
  file.managed:
    - name:     /etc/cpanel/local/ignore_outdated_services
    - contents:
      - salt-minion

Currently the fix is in all branches (2018.3, 2019.2, and develop). It will be included in the upcoming releases.

Could you reference the PR or commit by any chance? Got bitten by this today and it'd be interesting to see what the fix looks like. Thanks!

@anitakrueger it was a long road, but it was finally done with #52445 and #52564.

Awesome, much appreciated!

We also encountered this late last evening; roughly 2-4% of our 250 minions were affected.

Any idea of when the fix will hit http://repo.saltstack.com/?

@elipsion in June I think.

ZD-3869

Had the same issue running Salt 2018.3.4 on CentOS 7.

It kept writing to the minion log every millisecond and caused one of my hosts to run out of disk space.

I think the reason was multiple salt processes running at the same time; you can see PID 1114 has a strange "1-20:07.." in the TIME column, maybe a zombie?

root@testbox /var# ps -ef | grep salt
root      1114     1 30 Oct11 ?        1-20:07:34 /usr/bin/python /usr/bin/salt-minion
root     10122  1114  0 Oct15 ?        00:00:00 /usr/bin/python /usr/bin/salt-minion
root     25791     1  7 11:18 ?        00:00:14 /usr/bin/python /usr/bin/salt-minion
root     26009     1  0 11:19 ?        00:00:00 /usr/bin/python /usr/bin/salt-minion
root     26165 24719 10 11:21 pts/0    00:00:00 grep --color=auto salt

Killing all salt processes and restarting the minion fixed the issue. Luckily this was a testbox; if it had been production and this had caused a disk space issue, it would have been a much bigger problem.

Not sure how to reproduce it again

> @elipsion in June I think.

Since this ticket is still open, I would assume the fix is not released yet, right? Or can you tell us which version to update to, please?

The fix for this is in the codebase and, as far as I can see, was released with 2019.2.1.
