During an install sync from an admin workstation, tor is not coming back up correctly and I'm unable to reinitiate the connection.
0.3.12 SD instance to 0.4rc1 by switching the apt repos to point to apt-test.freedom.press
3.0.1 with persistence
release/0.4 branch is cloned into ~/Persistent/securedrop
install_files/ansible-base/group_vars/all/site-specific match the settings on your SD instance.
install_files/ansible-base accordingly.
(source .venv/bin/activate)
install_files/ansible-base, and run ./securedrop-prod.yml -t tor --diff -vv
Playbook will finish with minimal changes, and any changes that come up should not affect SD service permanently. Tor should reboot and still be accessible.
Ansible is bombing out on the "Waiting for SSH connection (slow)..." task. Before that, you can see that Sandbox 1 is getting added to the /etc/tor/torrc file and triggering a tor restart.
We need to do two things:
swap the pause logic with something smarter that takes advantage of retries and command (see https://github.com/freedomofpress/securedrop/blob/develop/devops/playbooks/aws-ci-teardown.yml#L48 and http://docs.ansible.com/ansible/playbooks_loops.html#do-until-loops).
consider ripping out the sandbox 1 option before the 0.4 release. I'm not convinced yet this has seen enough testing internally to warrant it'll work in prod without incident.
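A rough sketch of what the first item could look like, assuming the current task is a fixed pause and that the server is reachable from the Admin Workstation under its inventory name (the same way ssh app uptime works). The task name, the ssh_check variable, and the retry/delay values are all illustrative, not taken from the playbooks:

```yaml
# Hypothetical replacement for a fixed pause after the tor restart.
# Runs on the Admin Workstation and polls until a plain SSH command succeeds.
- name: Wait for SSH over Tor to come back
  local_action: command ssh -o ConnectTimeout=60 {{ inventory_hostname }} uptime
  register: ssh_check          # hypothetical variable name
  until: ssh_check.rc == 0     # stop retrying once ssh exits cleanly
  retries: 10                  # illustrative; tune to how long the ATHS takes to come back
  delay: 30                    # seconds between attempts
  changed_when: false
  become: no
```

The idea is that the play moves on as soon as the hidden service answers, rather than always waiting a fixed amount of time and then failing anyway if the wait was too short.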
Ummmmmmmm so upon getting ssh to the app server after a few reboots tor is working again.... WTH
Relevant section from tor man page says:
Can not be changed while tor is running.
Which makes sense if you think about how syscall sandboxes are implemented. I had issues with this directive a while back, so my recollections are somewhat dim, but I seem to remember that Tor freaks out if you add this directive to the config while Tor is running and then restart it. Check the Tor logs, I seem to remember them being helpful. And I vaguely recall that in order to get this to work, you need to:
Which might be problematic in the context of Ansible over SSH over Tor ATHS. It is critical to consider how this could affect production deploys: if there is a situation in which the Ansible playbooks could leave Tor in a non-operative state, that would lock admins out of their systems and be extremely difficult to recover from.
consider ripping out the sandbox 1 option before the 0.4 release. I'm not convinced yet this has seen enough testing internally to warrant it'll work in prod without incident.
That is probably going to be the best option here, given our strong desire to stay on track for the scheduled release date. More research definitely warranted. Hope this vague braindump helps!
This change came in via 8426a9f640852fa3d77a3183913d11c97b52061e in #994, but never shipped. While adding the sandbox option is appealing, we should test it thoroughly before shipping the feature. Therefore I agree with @msheiny and @garrettr that this change be backed out from 0.4, so we can proceed with QA in other areas.
The problem of severed connections is caused by interaction between Ansible's handling of ControlPersist files and the fact that the Tor service handles connections before they reach SSH on the servers.
The handler task that restarts the tor service completes successfully, meaning tor is both stopped and restarted, even when the task performing that stop/start is executed over Tor. One can confirm this proper execution by running e.g. ssh app uptime from a new terminal session after the failure has occurred: it returns output, proving that the SSH-over-ATHS connection worked. The raw SSH command does _not_ use ControlPersist files; Ansible by default (and in our explicit config at install_files/ansible-base/ansible.cfg) does.
The use of ControlPersist files means that Ansible attempts to reuse an on-disk socket file pointing to the multiplexed SSH connection going out to the remote servers over Tor. Since the Tor service was restarted, this socket is no longer active, but SSH doesn't know that, and neither does Ansible, at least not implicitly. If we manually clean up those ControlPersist files, the connection should reestablish normally.
We don't want to disable all use of ControlPersist files for two reasons:
install_files/ansible-base/ansible.cfg and set ssh_args = -o ControlMaster=no -S none -o ConnectTimeout=60, then run the prod playbook with --tags tor.)
So we're left with cleaning up the connection explicitly, which is an acceptable consequence of our nonstandard networking setup of SSH-over-ATHS. In Ansible 2.3, a meta: reset_connection action was introduced, which comes close to doing what we need, but doesn't satisfy: it passes -O stop to SSH, which closes the multiplexed session, but leaves the ControlMaster active. For the use case of "tor was just restarted and we're trying to connect over tor," we need a more comprehensive destruction of the connection. Our options, then, seem to be:
1) ~/.ansible/cp on the Admin Workstation as part of the handler run. The changes in #1707 provided a method for multiple tasks to run as part of a single handler notification.
2) ssh -O exit <hostname> on the Admin Workstation to reset the ControlPersist settings. While ostensibly more elegant than 1), this approach may be complicated due to differing defaults between Ansible and SSH in terms of connection settings such as ControlPath.
Will continue with testing and report back. Naturally we can still back this out if a stable implementation isn't found, but it's important that we understand in detail the consequences of architectural decisions such as SSH-over-ATHS, and how to maintain them confidently.
Being the one who originally advocated enabling the seccomp syscall sandbox, and having used it in a variety of environments, I just want to note that I've since seen it cause issues beyond requiring a service restart. For example, with Tor version 0.2.9.10-1 in combination with certain versions of OpenSSL, the sandbox was completely broken due to a bug disallowing getpid(), which I reported and which has since been fixed. It is still regarded as an experimental feature. When it works it works well, but changes in Tor or updates to its dependent dynamic libraries could cause breakage. Therefore, I would be comfortable if y'all want to remove it, at least until it has more time to bake in and get marked non-experimental.
While ostensibly more elegant than 1), this approach may be complicated due to differing defaults between Ansible and SSH in terms of connection settings such as ControlPath.
Turns out the above is true, so sticking with 1) for resolution. Tested on multiple Ansible versions, and the formatting of the ControlPath attribute changes as of Ansible 2.3. To avoid micromanaging those connection mappings between SSH config and Ansible's additional SSH config options, we'll simply clear out the ControlPersist files to force establishment of a fresh ControlMaster.
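For illustration, clearing the ControlPersist files could be as small as the task below. This is only a sketch: the task name is made up, it assumes Ansible's control sockets live in ~/.ansible/cp as mentioned above, and how it gets wired into the tor restart handler is not shown here.

```yaml
# Hypothetical handler step, run on the Admin Workstation after tor restarts.
# Deletes Ansible's multiplexed-connection sockets so the next task has to
# establish a fresh ControlMaster instead of reusing a dead one.
- name: Clear stale ControlPersist sockets
  local_action: shell rm -f ~/.ansible/cp/*
  become: no   # run as the admin user so ~ resolves to their home directory
```

Removing the socket files sidesteps the ControlPath formatting differences between Ansible versions, since nothing here needs to know how an individual socket is named.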
@ageis Thanks for jumping in and sharing some of your relevant experience. That only convinces me further that we should back this out of the next release, and seek to re-land once we have more time to test it.