When I use salt-cloud to create a new AWS EC2 instance, the instance is created, but salt-minion often fails to install (error below). I see this often with AWS EC2 instances: if you try to run apt/dpkg too soon after the instance is created, you get errors about the dpkg lock not being released. Basically, you need to sleep for a couple of seconds and try again, and it'll work.
I confirmed this by taking the command from the Salt error and running it again manually... it worked.
Is there a way to have salt-bootstrap sleep before trying this, or have it retry a few times before failing? On Ubuntu boxes, I often use a function like this to wait until apt is no longer running, re-checking every couple of seconds:
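```sh
# Block until no process whose name contains "apt" is running, re-checking
# every 2 seconds. There is no timeout: it loops forever if apt never exits.
isAptGetRunning(){
    while [ "$(ps -e | grep -c "apt")" -gt 0 ]
    do
        sleep 2
    done
}
```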
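The question above also asks for retrying rather than only waiting. A minimal sketch of a generic retry wrapper, assuming POSIX sh; `retry_cmd` and its argument order (max tries, delay in seconds, then the command) are hypothetical and not part of salt-bootstrap:

```sh
# Hypothetical helper (not part of salt-bootstrap): run a command up to
# MAX_TRIES times, sleeping DELAY seconds after each failure.
# Usage: retry_cmd MAX_TRIES DELAY COMMAND [ARGS...]
# Returns 0 on the first success, else the exit status of the last attempt.
retry_cmd() {
    _max=$1
    _delay=$2
    shift 2
    _try=1
    while :; do
        "$@" && return 0
        _status=$?
        if [ "$_try" -ge "$_max" ]; then
            echo "retry_cmd: giving up after $_try attempt(s): $*" >&2
            return "$_status"
        fi
        sleep "$_delay"
        _try=$((_try + 1))
    done
}
```

For example, `retry_cmd 5 2 apt-get update` would try up to five times, two seconds apart. Retrying the failing command is cruder than waiting for the lock holder to exit, but it also covers the race where an apt job starts between a "not running" check and the next apt-get call.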
Thanks!!
Error:
```
* INFO: Running version: 2017.08.17
* INFO: Executed by: /bin/sh
* INFO: Command line: '/tmp/.saltcloud-da5c774e-ffbf-42da-a79d-bb9b29307b28/deploy.sh -c /tmp/.saltcloud-da5c774e-ffbf-42da-a79d-bb9b29307b28'
* INFO: System Information:
* INFO: CPU: GenuineIntel
* INFO: CPU Arch: x86_64
* INFO: OS Name: Linux
* INFO: OS Version: 4.4.0-1038-aws
* INFO: Distribution: Ubuntu 16.04
* INFO: Installing minion
* INFO: Found function install_ubuntu_stable_deps
* INFO: Found function config_salt
* INFO: Found function preseed_master
* INFO: Found function install_ubuntu_stable
* INFO: Found function install_ubuntu_stable_post
* INFO: Found function install_ubuntu_restart_daemons
* INFO: Found function daemons_running
* INFO: Found function install_ubuntu_check_services
* INFO: Running install_ubuntu_stable_deps()
Hit:1 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu xenial InRelease
Hit:2 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu xenial-updates InRelease
Hit:3 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu xenial-backports InRelease
Hit:4 http://security.ubuntu.com/ubuntu xenial-security InRelease
Reading package lists...Connection to 10.22.29.170 closed.
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
* ERROR: Failed to run install_ubuntu_stable_deps()!!!
Error: There was a profile error: Command 'ssh -t -t -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oControlPath=none -oPasswordAuthentication=no -oChallengeResponseAuthentication=no -oPubkeyAuthentication=yes -oIdentitiesOnly=yes -oKbdInteractiveAuthentication=no -i /etc/salt/biren.pem -p 223 [email protected] 'sudo /tmp/.saltcloud-da5c774e-ffbf-42da-a79d-bb9b29307b28/deploy.sh -c '"'"'/tmp/.saltcloud-da5c774e-ffbf-42da-a79d-bb9b29307b28'"'"''' failed. Exit code: 1
```
```
Salt Version:
Salt: 2017.7.2
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.4.2
docker-py: Not Installed
gitdb: 0.6.4
gitpython: 1.0.1
ioflo: Not Installed
Jinja2: 2.8
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: 1.0.3
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 2.7.12 (default, Nov 19 2016, 06:48:10)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.2.0
RAET: Not Installed
smmap: 0.9.0
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4
System Versions:
dist: Ubuntu 16.04 xenial
locale: UTF-8
machine: x86_64
release: 4.4.0-1038-aws
system: Linux
version: Ubuntu 16.04 xenial
```
Thanks for filing this @birengoodco
@birengoodco Can you give the fix in #1186 a try? That should fix this for you.
Hi @rallytime, I'm not sure how I should try this. Where has the fix been deployed? I figured it might have been published, and tried running a new salt-cloud -p deployment to see if it would just pick up an updated salt-bootstrap script... but I'm getting the same error...
So perhaps I'm not understanding how you'd like me to test... sorry for being dense, haha. Very psyched that there's a fix coming, though!
I updated my bootstrap-salt.sh with salt-cloud -u and tried re-running... still getting the same error. How do I update bootstrap-salt.sh to a non-release version?
```
INFO: Distribution: Ubuntu 16.04
INFO: Installing minion
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
```
Hi @birengoodco - Thanks for being willing to test this! My apologies for not answering your question sooner. The fix hasn't been merged or released yet, so you'll need to test with the raw file from the PR itself. You can find the file here: https://raw.githubusercontent.com/rallytime/salt-bootstrap/7844bddffc348c6c45af101faa977d1f8f2cf80a/bootstrap-salt.sh
Let me know how it goes!
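Since salt-cloud -u appears to fetch only the released bootstrap, testing an unmerged fix means replacing bootstrap-salt.sh by hand. A minimal sketch that sanity-checks a downloaded copy before putting it in place; `install_bootstrap_copy` is a hypothetical helper, and the check only guards against obviously bad downloads such as an HTML error page saved instead of the script:

```sh
# Hypothetical helper: refuse to install an empty file or one that does not
# start with a "#!" shebang line, then copy it into place as executable.
# Usage: install_bootstrap_copy DOWNLOADED_FILE DESTINATION
install_bootstrap_copy() {
    _src=$1
    _dest=$2
    if [ ! -s "$_src" ]; then
        echo "missing or empty, refusing to install: $_src" >&2
        return 1
    fi
    case $(head -n 1 "$_src") in
        '#!'*) ;;
        *)
            # A failed download often saves an HTML error page instead.
            echo "no shebang line, refusing to install: $_src" >&2
            return 1
            ;;
    esac
    cp "$_src" "$_dest" && chmod 0755 "$_dest"
}
```

A typical flow would be to download the PR's raw file with `curl -fsSL -o /tmp/bootstrap-salt.sh <raw URL>` (`-f` makes curl fail on HTTP errors instead of saving the error page), then run `install_bootstrap_copy /tmp/bootstrap-salt.sh <path>`, where `<path>` is wherever salt-cloud picks up bootstrap-salt.sh. That is often under `/etc/salt/cloud.deploy.d/`, but treat the path as an assumption and check `deploy_scripts_search_path` for your setup. Running salt-cloud -u again would presumably overwrite the hand-installed copy.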
Hi @rallytime, great news, it appears to be working! I just ran a few test deployments, and no hiccups! I feel like the actual Salt Minion installation, and the overall time for salt-cloud to complete, is taking a lot longer now, though... I'm clocking in at about 15-20 minutes per run for salt-cloud to return the prompt. I'd much rather have it complete without the hiccups, so I'm happy enough... just mentioning it in case some extra/unnecessary waits got added somewhere along the line? Or perhaps it is what it is...
Also, on a side note... I was looking at the `__wait_for_apt()` function, and it looks like `APT_WAIT_TIMEOUT` is set to 900 (15 min) and then 1 is subtracted per iteration of the loop, but the function sleeps for 2 seconds per iteration... So I think the overall timeout ends up being 900 × 2 = 1800 seconds (30 minutes), right?
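The arithmetic checks out: a counter that starts at 900 and drops by 1 per iteration, with a 2-second sleep per iteration, allows up to 900 × 2 = 1800 seconds. A minimal sketch of a wait loop that instead subtracts the seconds it actually slept, so the timeout really is in seconds; `wait_until_idle` and its arguments are hypothetical and this is not salt-bootstrap's `__wait_for_apt()`:

```sh
# Hypothetical sketch, not salt-bootstrap's actual __wait_for_apt():
# wait while CHECK_CMD reports "busy" (exits 0), for at most roughly
# TIMEOUT seconds, polling every INTERVAL seconds.
# Usage: wait_until_idle TIMEOUT INTERVAL CHECK_CMD [ARGS...]
# Returns 0 once idle, 1 on timeout, 2 if INTERVAL is not positive.
wait_until_idle() {
    _remaining=$1
    _interval=$2
    shift 2
    [ "$_interval" -gt 0 ] || return 2
    while "$@"; do
        if [ "$_remaining" -le 0 ]; then
            echo "wait_until_idle: timed out" >&2
            return 1
        fi
        sleep "$_interval"
        # Subtract the seconds actually slept, not 1 per iteration, so a
        # TIMEOUT of 900 means about 900 seconds rather than 1800.
        _remaining=$((_remaining - _interval))
    done
    return 0
}
```

For example, `wait_until_idle 900 2 sh -c 'pgrep "^(apt|dpkg)" >/dev/null'` would poll every 2 seconds for about 15 minutes. Passing the check as a command keeps the timing logic separate from how "busy" is detected.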
@birengoodco Thanks for testing that! That definitely seems a lot slower; I didn't time it when I was testing. My PR originally used pgrep to search for running processes, and that seemed faster. I will update this to use pgrep if it's available, and fall back to the other command if it isn't.
And for your side note question - you're totally right! I changed one variable and not the other when I was testing some things. I'll fix that. Good catch!
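The pgrep-with-fallback check described above could look roughly like this. A minimal sketch assuming POSIX sh; `process_running` is a hypothetical name, and its output is discarded so nothing is printed on each poll:

```sh
# Hypothetical helper: exit 0 if a process whose name matches the extended
# regex in $1 is running, non-zero otherwise. Prints nothing.
process_running() {
    if command -v pgrep >/dev/null 2>&1; then
        # pgrep matches the pattern against process names and never
        # reports itself as a match.
        pgrep "$1" >/dev/null 2>&1
    else
        # Fallback: match against the bare command-name column of ps.
        ps -e -o comm= 2>/dev/null | grep -E "$1" >/dev/null 2>&1
    fi
}
```

For example, `while process_running '^(apt|dpkg|unattended-upgr)'; do sleep 2; done`. The exact set of process names worth waiting on is an assumption here; note that process names are truncated to 15 characters, which is why anchoring on a prefix such as `^apt` is safer than matching a full name.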
@birengoodco I have updated the PR. Try this one: https://raw.githubusercontent.com/rallytime/salt-bootstrap/6d7788da0ebe9364ccf213c1d8e7d6b33927e4a4/bootstrap-salt.sh.
Hi @rallytime, so I'm trying out the new script this morning, and it was MUCH faster this time... the first iteration completed in just under 7 minutes and the second in under 3 minutes!
This new script appears to be printing the process names/PIDs from pgrep (example below). Any chance that could be suppressed (or redirected to a log) for the release? Perhaps that was your plan to begin with, but I thought I'd ask :)
Anyway, this is awesome, thanks so much!!!!!!!! This change cuts my complete build times in half and requires no more intervention... It's a beautiful thing :)
```
1496 apt.systemd.dai
1537 apt.systemd.dai
2420 apt-get
1496 apt.systemd.dai
1537 apt.systemd.dai
2420 apt-get
1496 apt.systemd.dai
1537 apt.systemd.dai
2420 apt-get
1496 apt.systemd.dai
1537 apt.systemd.dai
2483 apt.systemd.dai
1496 apt.systemd.dai
1537 apt.systemd.dai
2496 apt.systemd.dai
2502 apt.systemd.dai
2496 apt.systemd.dai
2502 apt.systemd.dai
2496 apt.systemd.dai
2502 apt.systemd.dai
2496 apt.systemd.dai
2502 apt.systemd.dai
2496 apt.systemd.dai
2502 apt.systemd.dai
2496 apt.systemd.dai
```
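On the request to redirect that output: one option is to append the matches to a log file only, in the same "PID name" format as the listing above (which is what `pgrep -l` prints). A minimal sketch; `log_matching_processes` and the `BOOTSTRAP_WAIT_LOG` variable are hypothetical names, not salt-bootstrap options:

```sh
# Hypothetical helper: append "PID name" lines for processes whose name
# matches the extended regex in $1 to a log file instead of stdout.
# BOOTSTRAP_WAIT_LOG is an assumed variable name with an assumed default.
log_matching_processes() {
    _log=${BOOTSTRAP_WAIT_LOG:-/tmp/bootstrap-apt-wait.log}
    if command -v pgrep >/dev/null 2>&1; then
        # -l adds the process name after each PID. "|| true" because
        # finding no match is not an error for a logger.
        pgrep -l "$1" >>"$_log" 2>&1 || true
    fi
    return 0
}
```

Keeping the log out of stdout leaves the console quiet during the wait, while the file still shows which apt jobs were holding things up if a run turns out slow.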
@birengoodco Wonderful! That's great to hear. I'll look at getting that output redirected. Thanks again for testing and for your feedback. 👍
@birengoodco I had to make one more change since you last tested, but the fix looks good and has been merged into develop. It will be available in the next release of bootstrap.
Hi @rallytime, was this ever released? I had been running the version you supplied above ever since, and decided to update bootstrap-salt.sh today to make sure I wasn't missing any other updates... and saw that it failed as above. I've reverted to my other copy, but I'm wondering whether this was actually released, or if something else is up again?
Hi @birengoodco - Yeah, it should be in the latest release.
Can you post which version you're running and where it is failing?
The updated script shows __ScriptVersion="2018.04.25"
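When reporting which bootstrap is in use, the version can be read straight from the file. A minimal sketch, assuming the `__ScriptVersion="..."` assignment quoted above sits at the start of a line; `bootstrap_version` is a hypothetical helper name. The same value is also printed at run time on the `* INFO: Running version:` line at the top of the bootstrap output.

```sh
# Hypothetical helper: print the __ScriptVersion value from a
# bootstrap-salt.sh file (first match only).
# Usage: bootstrap_version /path/to/bootstrap-salt.sh
bootstrap_version() {
    sed -n 's/^[[:space:]]*__ScriptVersion="\([^"]*\)".*/\1/p' "$1" | head -n 1
}
```

Run against the updated script described above, `bootstrap_version bootstrap-salt.sh` should print `2018.04.25`.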
It's failing the same way as before:
```
Reading package lists...
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
* ERROR: Failed to run install_ubuntu_stable_deps()!!!
```
Perhaps there is a place in the script that needs the process check function that wasn't in the original fix. Can you post your debug logs please?
Sorry for being dense, haha, but what debug logs? Happy to provide whatever you need, but I wasn't sure where to find more information than what's printed to the screen (which is what I pasted above).
Can you run the bootstrap script with -D and post the output? That will help narrow down where this might be hanging.
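A minimal sketch for capturing that debug output to a file and then printing it, assuming the script is run by hand on the instance; `run_bootstrap_debug` is a hypothetical wrapper, and `-D` is the debug flag mentioned above:

```sh
# Hypothetical wrapper: run a bootstrap script with -D (debug output),
# save everything it prints (stdout and stderr) to LOGFILE, then show it.
# Usage: run_bootstrap_debug SCRIPT LOGFILE [BOOTSTRAP_ARGS...]
# Returns the bootstrap script's own exit status.
run_bootstrap_debug() {
    _script=$1
    _log=$2
    shift 2
    _status=0
    sh "$_script" -D "$@" >"$_log" 2>&1 || _status=$?
    cat "$_log"
    return "$_status"
}
```

For example, as root on the instance: `run_bootstrap_debug ./bootstrap-salt.sh /tmp/bootstrap-debug.log stable`. The output only appears once the script finishes, but the log file can be attached to the issue as-is. When bootstrap runs through salt-cloud instead, the flag would presumably have to be passed via the profile's `script_args` option; that is from memory, so check the salt-cloud documentation for your version.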