I set up a Docker install of AWX. However, once every couple of days, I can no longer connect to the web interface; it just hangs. If I reboot the server, everything is fine for a couple more days. When it hangs, nothing is running in the scheduler.
Expected results: It continues working.
Actual results: It times out or I get an "Internal Server Error".
The awx_task container log had this:
[2018-02-19 22:22:15,355: ERROR/MainProcess] Task handler raised error: <MaybeEncodingError: Error sending result: ''<ExceptionInfo: OperationalError(\'could not connect to server: Connection timed out\\n\\tIs the server running on host "postgres" (172.17.0.4) and accepting\\n\\tTCP/IP connections on port 5432?\\n\',)>''. Reason: ''TypeError("can\'t pickle traceback objects",)''.>
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/billiard/pool.py", line 362, in workloop
put((READY, (job, i, result, inqW_fd)))
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/billiard/queues.py", line 366, in put
self.send_payload(ForkingPickler.dumps(obj))
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/billiard/reduction.py", line 82, in dumps
cls(buf, protocol).dump(obj)
File "/usr/lib64/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib64/python2.7/pickle.py", line 562, in save_tuple
save(element)
File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib64/python2.7/pickle.py", line 562, in save_tuple
save(element)
File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib64/python2.7/pickle.py", line 562, in save_tuple
save(element)
File "/usr/lib64/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib64/python2.7/pickle.py", line 419, in save_reduce
save(state)
File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib64/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib64/python2.7/pickle.py", line 663, in _batch_setitems
save(v)
File "/usr/lib64/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib64/python2.7/pickle.py", line 419, in save_reduce
save(state)
File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib64/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib64/python2.7/pickle.py", line 663, in _batch_setitems
save(v)
File "/usr/lib64/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib64/python2.7/pickle.py", line 419, in save_reduce
save(state)
File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib64/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib64/python2.7/pickle.py", line 663, in _batch_setitems
save(v)
File "/usr/lib64/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/var/lib/awx/venv/awx/lib64/python2.7/copy_reg.py", line 70, in _reduce_ex
raise TypeError, "can't pickle %s objects" % base.__name__
MaybeEncodingError: Error sending result: ''<ExceptionInfo: OperationalError(\'could not connect to server: Connection timed out\\n\\tIs the server running on host "postgres" (172.17.0.4) and accepting\\n\\tTCP/IP connections on port 5432?\\n\',)>''. Reason: ''TypeError("can\'t pickle traceback objects",)''.
It's possible it's related to #677 and/or #796, but both are closed and nothing in them has helped solve my issue.
I did a reinstall twice just to verify I didn't do anything wrong.
Is something happening to the postgres container? Is it running? Are there error logs for that container, or for the rabbitmq service in the container? What are the specs of the system running AWX?
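Roughly what I'd check, as a sketch (the awx_task/postgres/rabbitmq container names assume a default docker install, and psql being available in the task image is an assumption):

$ docker ps -a                                            # are the containers up, or stuck restarting?
$ docker logs --tail 100 postgres                         # recent postgres container log
$ docker logs --tail 100 rabbitmq                         # recent rabbitmq container log
$ docker exec -it awx_task psql -h postgres -U awx -c 'SELECT 1;'   # can the task container reach the DB?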
I have a similar experience. I have a project to implement AWX, and I keep getting distracted by other priorities. Every time I come back to work on AWX, I get a 502 Bad Gateway error.
Currently, my postgres container log is filled with:
ERROR: could not open file "base/16384/2601": No such file or directory
Mixed with:
FATAL: could not open file "base/16384/2601": No such file or directory
If I reboot the machine, everything comes back up except the postgres container, which goes into a restart loop. I've tried other fixes (namely the ones in the issues linked from https://github.com/ansible/awx/issues/796), but every few weeks I have to start over.
There are no scheduled tasks, and the database is fresh and empty (because I keep getting pulled away). The logs for awx_web and awx_task only complain that they can't connect to a postgres database.
@apple4ever It looks like we talked about this in other issues as well. I think I'm just going to move away from using the dockerized postgres and set up my own. Everything else works fine.
Yeah, we did, but they are all closed, so I wanted to bring it back up. I may go your route if nothing changes; right now AWX is unusable for us.
@matburt sorry I totally missed your questions:
Is something happening to the postgres container? Is it running?
No and yes. I can psql to it from the container itself.
Are there error logs for that container, or for the rabbitmq service in the container?
I didn't find any relevant ones.
What are the specs of the system running AWX?
Linux awx 4.13.0-41-generic #46~16.04.1-Ubuntu SMP Thu May 3 10:06:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
So I dug into this a bit this morning after having some time to look at it.
It seems to be the Docker network failing to forward packets. I can ping the containers from the host just fine, but pings between containers fail. I've also tried curl against the web interface and psql against the postgres container from other containers, with no luck.
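The checks I ran were along these lines (a sketch; the container names, the postgres IP, and ping/curl/psql being present inside the images are assumptions):

$ ping -c 3 172.17.0.4                                    # postgres container IP from docker inspect; works from the host
$ docker exec -it awx_task ping -c 3 postgres             # container to container; hangs when the problem occurs
$ docker exec -it awx_task curl -m 5 http://<awx_web container IP>:8052/api/v2/ping/
$ docker exec -it awx_task psql -h postgres -U awx -c 'SELECT 1;'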
A simple reboot brings the network up. For me, this happens overnight for some reason. I'll keep digging.
I believe the problem is the volume mount for /var/lib/postgresql/data. It is mounted from /tmp, and the OS cleans up untouched files in that directory, which is probably a bad idea: it will break every X days.
Change postgres_data_dir in your inventory to something outside of /tmp (a sketch of the change follows the docker inspect output below). The installer's default inventory has:
inventory:postgres_data_dir=/tmp/pgdocker
docker inspect shows the resulting mount:
"Mounts": [
{
"Type": "bind",
"Source": "/tmp/pgdocker",
"Destination": "/var/lib/postgresql/data",
"Mode": "Z",
"RW": true,
"Propagation": "rprivate"
}
],
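A minimal sketch of the fix, assuming the standard installer layout (installer/inventory, installer/install.yml) and an example target directory of /var/lib/pgdocker:

$ mkdir -p /var/lib/pgdocker                      # any persistent path outside /tmp works
$ cp -a /tmp/pgdocker/. /var/lib/pgdocker/        # optional: keep existing data (stop the postgres container first)
$ sed -i 's|^postgres_data_dir=.*|postgres_data_dir=/var/lib/pgdocker|' installer/inventory
$ ansible-playbook -i installer/inventory installer/install.yml

Re-running the playbook should recreate the postgres container with the new bind mount.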
Yes, I already do that, and the problem still occurs.
"Mounts": [
{
"Type": "bind",
"Source": "/srv/pgdocker",
"Destination": "/var/lib/postgresql/data",
"Mode": "Z",
"RW": true,
"Propagation": "rprivate"
}
],
Looks like I've lost everything I've done... Let's start again. I'll see whether changing the directory works or not. But apparently the same thing happened for you even without using /tmp.
$ systemctl status systemd-tmpfiles-clean
● systemd-tmpfiles-clean.service - Cleanup of Temporary Directories
Loaded: loaded (/usr/lib/systemd/system/systemd-tmpfiles-clean.service; static; vendor preset: disabled)
Active: inactive (dead) since Tue 2018-05-15 19:57:23 UTC; 20h ago
Docs: man:tmpfiles.d(5)
man:systemd-tmpfiles(8)
Process: 12569 ExecStart=/usr/bin/systemd-tmpfiles --clean (code=exited, status=0/SUCCESS)
Main PID: 12569 (code=exited, status=0/SUCCESS)
May 15 19:57:23 awx.coxlab.net systemd[1]: Starting Cleanup of Temporary Directories...
May 15 19:57:23 awx.coxlab.net systemd[1]: Started Cleanup of Temporary Directories.
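For what it's worth, if the data dir had to stay under /tmp, it should also be possible to exclude it from cleanup with a tmpfiles.d drop-in (untested; the path assumes the default /tmp/pgdocker):

$ echo 'x /tmp/pgdocker' > /etc/tmpfiles.d/awx-postgres.conf    # 'x' = ignore the path (and its contents) during cleaning, see tmpfiles.d(5)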
Alright, I set up monitoring to figure out what the heck is happening. Here is the log from awx_web:
172.17.0.1 - - [22/May/2018:14:07:00 +0000] "GET /api/v2/ping/ HTTP/1.0" 200 234 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
[pid: 27|app: 0|req: 522/2331] 172.17.0.1 () {44 vars in 842 bytes} [Tue May 22 14:07:00 2018] GET /api/v2/ping/ => generated 234 bytes in 32 msecs (HTTP/1.0 200) 8 headers in 223 bytes (1 switches on core 0)
172.17.0.1 - - [22/May/2018:14:08:40 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:09:40 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
Tue May 22 14:09:41 2018 - *** HARAKIRI ON WORKER 1 (pid: 27, try: 1) ***
Tue May 22 14:09:41 2018 - HARAKIRI !!! worker 1 status !!!
Tue May 22 14:09:41 2018 - HARAKIRI [core 0] 172.17.0.1 - GET /api/v2/ping since 1526998060
Tue May 22 14:09:41 2018 - HARAKIRI !!! end of worker 1 status !!!
DAMN ! worker 1 (pid: 27) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 85)
172.17.0.1 - - [22/May/2018:14:09:46 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:10:36 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:10:41 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
Tue May 22 14:10:41 2018 - *** HARAKIRI ON WORKER 2 (pid: 28, try: 1) ***
Tue May 22 14:10:41 2018 - HARAKIRI !!! worker 2 status !!!
Tue May 22 14:10:41 2018 - HARAKIRI [core 0] 172.17.0.1 - GET /api/v2/ping since 1526998120
Tue May 22 14:10:41 2018 - HARAKIRI !!! end of worker 2 status !!!
DAMN ! worker 2 (pid: 28) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 94)
Tue May 22 14:10:47 2018 - *** HARAKIRI ON WORKER 4 (pid: 30, try: 1) ***
Tue May 22 14:10:47 2018 - HARAKIRI !!! worker 4 status !!!
Tue May 22 14:10:47 2018 - HARAKIRI [core 0] 172.17.0.1 - GET /api/v2/ping since 1526998126
Tue May 22 14:10:47 2018 - HARAKIRI !!! end of worker 4 status !!!
DAMN ! worker 4 (pid: 30) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 4 (new pid: 103)
Tue May 22 14:11:37 2018 - *** HARAKIRI ON WORKER 5 (pid: 31, try: 1) ***
Tue May 22 14:11:37 2018 - HARAKIRI !!! worker 5 status !!!
Tue May 22 14:11:37 2018 - HARAKIRI [core 0] 172.17.0.1 - GET /api/v2/ping since 1526998176
Tue May 22 14:11:37 2018 - HARAKIRI !!! end of worker 5 status !!!
DAMN ! worker 5 (pid: 31) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 5 (new pid: 112)
Tue May 22 14:11:42 2018 - *** HARAKIRI ON WORKER 3 (pid: 29, try: 1) ***
Tue May 22 14:11:42 2018 - HARAKIRI !!! worker 3 status !!!
Tue May 22 14:11:42 2018 - HARAKIRI [core 0] 172.17.0.1 - GET /api/v2/ping since 1526998181
Tue May 22 14:11:42 2018 - HARAKIRI !!! end of worker 3 status !!!
DAMN ! worker 3 (pid: 29) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 3 (new pid: 121)
172.17.0.1 - - [22/May/2018:14:11:54 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
172.17.0.1 - - [22/May/2018:14:12:53 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
172.17.0.1 - - [22/May/2018:14:13:43 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
172.17.0.1 - - [22/May/2018:14:14:33 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
172.17.0.1 - - [22/May/2018:14:16:37 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
172.17.0.1 - - [22/May/2018:14:16:55 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:18:38 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:18:42 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:19:50 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
172.17.0.1 - - [22/May/2018:14:20:52 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
172.17.0.1 - - [22/May/2018:14:22:36 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:22:52 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:23:46 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:25:23 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:25:52 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
172.17.0.1 - - [22/May/2018:14:26:37 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:28:28 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.200.74"
172.17.0.1 - - [22/May/2018:14:28:42 +0000] "GET /api/v2/ping HTTP/1.0" 499 0 "-" "PHP-curl/Villa Remote Monitoring/1.0" "10.176.196.226"
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/awx/wsgi.py", line 33, in <module>
application = get_wsgi_application()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/wsgi.py", line 13, in get_wsgi_application
django.setup(set_prefix=False)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/__init__.py", line 27, in setup
apps.populate(settings.INSTALLED_APPS)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/apps/registry.py", line 116, in populate
app_config.ready()
File "/usr/lib/python2.7/site-packages/awx/conf/apps.py", line 18, in ready
configure_external_logger(settings)
File "/usr/lib/python2.7/site-packages/awx/main/utils/handlers.py", line 342, in configure_external_logger
is_enabled = settings_module.LOG_AGGREGATOR_ENABLED
File "/usr/lib/python2.7/site-packages/awx/conf/settings.py", line 476, in __getattr_without_cache__
return getattr(self._wrapped, name)
File "/usr/lib/python2.7/site-packages/awx/conf/settings.py", line 389, in __getattr__
value = self._get_local(name)
File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python2.7/site-packages/awx/conf/settings.py", line 66, in _log_database_error
if get_tower_migration_version() < '310':
File "/usr/lib/python2.7/site-packages/awx/main/utils/db.py", line 14, in get_tower_migration_version
loader = MigrationLoader(connection, ignore_no_migrations=True)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/migrations/loader.py", line 52, in __init__
self.build_graph()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/migrations/loader.py", line 209, in build_graph
self.applied_migrations = recorder.applied_migrations()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/migrations/recorder.py", line 65, in applied_migrations
self.ensure_schema()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/migrations/recorder.py", line 52, in ensure_schema
if self.Migration._meta.db_table in self.connection.introspection.table_names(self.connection.cursor()):
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/backends/base/base.py", line 254, in cursor
return self._cursor()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/backends/base/base.py", line 229, in _cursor
self.ensure_connection()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/backends/base/base.py", line 213, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/utils.py", line 94, in __exit__
six.reraise(dj_exc_type, dj_exc_value, traceback)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/backends/base/base.py", line 213, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/backends/base/base.py", line 189, in connect
self.connection = self.get_new_connection(conn_params)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/backends/postgresql/base.py", line 176, in get_new_connection
connection = Database.connect(**conn_params)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/psycopg2/__init__.py", line 130, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: could not connect to server: Connection timed out
Is the server running on host "postgres" (172.17.0.3) and accepting
TCP/IP connections on port 5432?
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. going in full dynamic mode ***
--- no python application found, check your startup logs for errors ---
[pid: 85|app: -1|req: -1/2332] 172.17.0.1 () {44 vars in 840 bytes} [Tue May 22 14:29:27 2018] GET /api/v2/ping => generated 21 bytes in 0 msecs (HTTP/1.0 500) 2 headers in 83 bytes (0 switches on core 0)
--- no python application found, check your startup logs for errors ---
I've determined this to be a Rackspace issue. On RackConnect v2, they run an agent that messes with iptables, which breaks Docker networking. We're moving away from Rackspace eventually.
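A rough way to check for that kind of breakage (exact chains vary by Docker version):

$ iptables -S FORWARD | head                              # has the FORWARD policy or its rules been rewritten?
$ iptables -t nat -L DOCKER -n                            # Docker's NAT chain should exist with per-container rules
$ iptables -t filter -L DOCKER -n
$ systemctl restart docker                                # rebuilds Docker's chains (may restart the containers)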
Due to this issue and this bad policy, I'm not going to use AWX for the foreseeable future.
I'll close this as it's not an AWX bug.
The issue of having to restart the host is still not resolved; I am facing it right now.
The web UI is not accessible. At least provide some documentation on how to take a local backup of all configurations.
@apple4ever - thanks for your investigation on this, it has been very helpful. Any chance you could point me towards any docs on the Rackspace error?