I started a Workflow Job; it is in the running state but hanging. The first job has not been spawned yet, and cancelling the WJ does nothing.
All the following jobs sit in the queue.
The WJ has a survey; I entered some custom values and ran it.
Expected: the WJ runs to completion (success/failure).
Actual: it stays stuck in the running state and no job is spawned.
The individual jobs that compose the workflow run fine when passed the same values the survey asks for.
RAM looks OK.
There are no errors in the UI. The only error in the logs seems to be this one:
```
[2018-01-10 15:39:50,299: ERROR/ForkPoolWorker-3649] Task awx.main.scheduler.tasks.run_task_manager encountered exception.
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/awx/main/scheduler/tasks.py", line 37, in run_task_manager
TaskManager().schedule()
File "/usr/lib/python2.7/site-packages/awx/main/scheduler/task_manager.py", line 628, in schedule
finished_wfjs = self._schedule()
File "/usr/lib/python2.7/site-packages/awx/main/scheduler/task_manager.py", line 613, in _schedule
self.spawn_workflow_graph_jobs(running_workflow_tasks)
File "/usr/lib/python2.7/site-packages/awx/main/scheduler/task_manager.py", line 195, in spawn_workflow_graph_jobs
job = spawn_node.unified_job_template.create_unified_job(**kv)
File "/usr/lib/python2.7/site-packages/awx/main/models/unified_jobs.py", line 358, in create_unified_job
raise Exception('Fields {} are not allowed as overrides.'.format(unallowed_fields))
Exception: Fields set(['extra_vars']) are not allowed as overrides.
```
Could the "override" thing be related to the fact that the Workflow has a survey, but also Job Templates used to build the Worflow have surveys? That's because JT could be also executed atomically.
Since the system is unusable, I tried both restarting all the containers and rebooting the server, with no luck.
Please help me cancel that Workflow Job.
Thanks a lot!
This issue is making the environment unusable.
Unless anybody can suggest a solution (or at least a way to force the job cancellation), on Monday I'll back up the DB and rebuild/upgrade the containers, hoping for better luck (backup sketch below).
Thanks
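(For the backup itself, a minimal sketch, assuming the default containerized install where the PostgreSQL container is named `postgres` and both the database and user are `awx`; adjust the names for your setup:)

```
# Dump the AWX database to a file on the host
docker exec postgres pg_dump -U awx awx > awx_backup.sql
# Restore it later into a fresh instance
cat awx_backup.sql | docker exec -i postgres psql -U awx awx
```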
I was just about to report this bug as well. Does your workflow include an inventory refresh, by chance? I only see this issue crop up when the workflow has both a survey and an inventory refresh.
I ran the exact same workflow _including_ the inventory refresh but _without_ the survey (extra_vars), and it works fine.
I think the expected behavior should be to ignore extra_vars on an inventory refresh. Also, the workflow job should die on error instead of running forever. I too had to restore my DB from backup when it got into this state.
No, my workflow has no inventory refresh, but it starts with an SCM update (project sync).
As you said, the same workflow completed successfully without the survey; the issue appeared when I added surveys.
So is there any chance to force the cancellation (through the API or the DB — see the sketch below for what I mean), or is restoring a previous backup the only way?
AWX ver = 1.0.2.0
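(To be concrete about forcing it through the DB: something like the sketch below from a Django shell in the web container, if flipping the status by hand is safe. `ENTERJOBID` stands for the stuck WJ's ID.)

```
docker exec -it awx_web bash
awx-manage shell_plus

# Hypothetical sketch: force the stuck workflow job out of "running"
from awx.main.models import WorkflowJob
wj = WorkflowJob.objects.get(id=ENTERJOBID)
wj.status = 'canceled'
wj.save(update_fields=['status'])
```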
I have the same problem. My workflow template also has a survey. The UI doesn't cancel the job, so it's no longer possible to run new jobs; the Tower host is unusable.
The following message keeps repeating in the log:
```
[2018-01-14 19:31:33,509: ERROR/ForkPoolWorker-10314] Task awx.main.scheduler.tasks.run_task_manager[1d54d396-b4c2-4a41-a1b6-51e3b980ab0e] raised unexpected: Exception("Fields set(['extra_vars']) are not allowed as overrides.",)
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/awx/main/scheduler/tasks.py", line 37, in run_task_manager
TaskManager().schedule()
File "/usr/lib/python2.7/site-packages/awx/main/scheduler/task_manager.py", line 628, in schedule
finished_wfjs = self._schedule()
File "/usr/lib/python2.7/site-packages/awx/main/scheduler/task_manager.py", line 613, in _schedule
self.spawn_workflow_graph_jobs(running_workflow_tasks)
File "/usr/lib/python2.7/site-packages/awx/main/scheduler/task_manager.py", line 195, in spawn_workflow_graph_jobs
job = spawn_node.unified_job_template.create_unified_job(**kv)
File "/usr/lib/python2.7/site-packages/awx/main/models/unified_jobs.py", line 352, in create_unified_job
raise Exception('Fields {} are not allowed as overrides.'.format(unallowed_fields))
Exception: Fields set(['extra_vars']) are not allowed as overrides.
```
I reverted to a previous Postgres backup and everything is fine now, but this issue blocks the expected behaviour.
I'm at 1.0.2.327; I'll try to upgrade to one of the latest tags to see if anything changes.
TL;DR: fixed by deleting some entries in the awx database.
I was having the same issue in the same environment. Anything that shared an inventory with the stuck workflow was pending and would not execute (inventory syncs, scheduled jobs, workflows). Unfortunately I don't have a database backup to use. I tried upgrading to 1.0.2.372, but the migration doesn't complete, and it looks like the awx_task container keeps trying to schedule the failed job (i.e. the same extra_vars exception repeats in the awx_task logs for hours).
I was able to revert to 1.0.2.327 and the issue persisted. Given that there wasn't much choice but to start from scratch (this is non-prod/POC), I decided it was worth taking a look at the underlying database.
I tried deleting the job:
```
awx=# delete from main_workflowjob where unifiedjob_ptr_id='103';
ERROR: update or delete on table "main_workflowjob" violates foreign key constraint "main_workflowjobnode_workflow_job_id_dcd715c7_fk_main_work" on table "main_workflowjobnode"
DETAIL: Key (unifiedjob_ptr_id)=(103) is still referenced from table "main_workflowjobnode".
```
So I tried deleting the referencing rows in main_workflowjobnode:
```
awx=# delete from main_workflowjobnode where workflow_job_id='103';
ERROR: update or delete on table "main_workflowjobnode" violates foreign key constraint "main_workflowjobnode_to_workflowjobnode_i_0edcda07_fk_main_work" on table "main_workflowjobnode_always_nodes"
DETAIL: Key (id)=(12) is still referenced from table "main_workflowjobnode_always_nodes".
```
Still unsuccessful, but closer... I deleted an entry from main_workflowjobnode_always_nodes:
```
awx=# delete from main_workflowjobnode_always_nodes where to_workflowjobnode_id='12';
DELETE 1
```
Now I tried to delete from main_workflowjobnode again, but there were more references to remove; I ran the same delete as above for IDs 9 and 10. Then...
```
awx=# delete from main_workflowjobnode where workflow_job_id='103';
DELETE 4
```
I had a look at the AWX UI: the pending inventory syncs I had started were now running, and new jobs using the affected inventory start without problems.
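In hindsight, letting Django handle the cascade might have saved the manual FK ordering; a sketch of the same cleanup from `awx-manage shell_plus` (untested on my instance when I did this):

```
docker exec -it awx_web bash
awx-manage shell_plus

# The ORM walks the FK chain (workflow job -> nodes -> always_nodes) for you
from awx.main.models import WorkflowJob
WorkflowJob.objects.get(id=103).delete()
```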
I eventually ended up with the same issue again. It looks like workflows with inventory syncs cause the problem. If I clean up the same way as before, delete the inventory sync from the workflow, and re-run, the workflows complete. Inventory syncs started through the UI work. It almost looks like the workflow isn't scheduling the sync and is then waiting for the non-existent job to complete.
Thanks @ewithak, I can confirm that your SQL tips work very well.
I upgraded to 1.0.2.337 but the issue with the combo "workflow+survey (extra_vars)+project sync in first step" persists.
I cannot upgrade to following versions to test them due to #984
I was also able to unblock jobs by removing references to the stuck job in the DB, but unfortunately it caused DB integrity errors in AWX. It started generating error 500 on the workflow template that was previously stuck, and I wasn't able to modify or delete that workflow template.
Edit:
```
2018-01-16 10:55:16,155 ERROR django.request Internal Server Error: /api/v2/workflow_job_templates/9/
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/core/handlers/exception.py", line 41, in inner
response = get_response(request)
...
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/models/query.py", line 380, in get
self.model._meta.object_name
DoesNotExist: UnifiedJob matching query does not exist.
```
Deleting the workflow job template:
```
2018-01-16 10:55:21,632 ERROR django.request Internal Server Error: /api/v2/workflow_job_templates/9/
Traceback (most recent call last):
...
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/utils/decorators.py", line 185, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/transaction.py", line 223, in __exit__
connection.commit()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/backends/base/base.py", line 262, in commit
self._commit()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/backends/base/base.py", line 236, in _commit
return self.connection.commit()
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/utils.py", line 94, in __exit__
six.reraise(dj_exc_type, dj_exc_value, traceback)
File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/db/backends/base/base.py", line 236, in _commit
return self.connection.commit()
IntegrityError: update or delete on table "main_unifiedjobtemplate" violates foreign key constraint "a929973a1df6fbc68a0f04d8ead49c37" on table "main_unifiedjob"
```
Sorry to hear about your integrity issues, @savealive.
I have tried upgrading to 1.0.2.335, which still has the issue, and to 1.0.2.372, which won't upgrade (the migration hangs indefinitely, even after clearing the jobs). I will try some of the versions in between if I get a chance today.
I'm not using workflows, but I have job templates that require an inventory sync. I have a stuck inventory sync that I'm not able to cancel; I can't even find the job ID via tower-cli. Does anyone know the proper way of removing this job without orphaning any data?
I'm able to reproduce this. The simplest way is to create a workflow that contains an inventory sync, give it a survey with one question, and run it. This causes the following traceback:
```
awx_1 | 14:04:57 celeryd.1 | 2018-01-22 14:04:57,542 ERROR awx.main.scheduler Task awx.main.scheduler.tasks.run_task_manager encountered exception.
awx_1 | 14:04:57 celeryd.1 | Traceback (most recent call last):
awx_1 | 14:04:57 celeryd.1 | File "/venv/awx/lib/python2.7/site-packages/celery/app/trace.py", line 374, in trace_task
awx_1 | 14:04:57 celeryd.1 | R = retval = fun(*args, **kwargs)
awx_1 | 14:04:57 celeryd.1 | File "/venv/awx/lib/python2.7/site-packages/celery/app/trace.py", line 629, in __protected_call__
awx_1 | 14:04:57 celeryd.1 | return self.run(*args, **kwargs)
awx_1 | 14:04:57 celeryd.1 | File "/awx_devel/awx/main/scheduler/tasks.py", line 37, in run_task_manager
awx_1 | 14:04:57 celeryd.1 | TaskManager().schedule()
awx_1 | 14:04:57 celeryd.1 | File "/awx_devel/awx/main/scheduler/task_manager.py", line 628, in schedule
awx_1 | 14:04:57 celeryd.1 | finished_wfjs = self._schedule()
awx_1 | 14:04:57 celeryd.1 | File "/awx_devel/awx/main/scheduler/task_manager.py", line 613, in _schedule
awx_1 | 14:04:57 celeryd.1 | self.spawn_workflow_graph_jobs(running_workflow_tasks)
awx_1 | 14:04:57 celeryd.1 | File "/awx_devel/awx/main/scheduler/task_manager.py", line 195, in spawn_workflow_graph_jobs
awx_1 | 14:04:57 celeryd.1 | job = spawn_node.unified_job_template.create_unified_job(**kv)
awx_1 | 14:04:57 celeryd.1 | File "/awx_devel/awx/main/models/inventory.py", line 1456, in create_unified_job
awx_1 | 14:04:57 celeryd.1 | return super(InventorySource, self).create_unified_job(**kwargs)
awx_1 | 14:04:57 celeryd.1 | File "/awx_devel/awx/main/models/unified_jobs.py", line 358, in create_unified_job
awx_1 | 14:04:57 celeryd.1 | raise Exception('Fields {} are not allowed as overrides.'.format(unallowed_fields))
awx_1 | 14:04:57 celeryd.1 | Exception: Fields set(['extra_vars']) are not allowed as overrides
```
...which kills the task scheduler.
Investigating...
Sorry this one took so long to nail down - it should now be fixed in devel via https://github.com/ansible/awx/commit/ac3f7d0fac886c8c7cac82ccb3c10661b1ad8366
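(Not having diffed that commit, conceptually the traceback suggests two guards, sketched below against the `spawn_workflow_graph_jobs` context shown in the traceback: drop prompted fields the node's template type doesn't accept, and contain the exception so a bad node fails its own workflow instead of killing the task manager. The `_get_unified_job_field_names()` helper and `logger` names are assumptions, not necessarily what the actual patch does.)

```python
# Hypothetical sketch, not the literal fix:
allowed = spawn_node.unified_job_template._get_unified_job_field_names()
kv = {k: v for k, v in kv.items() if k in allowed}  # e.g. strips extra_vars for inventory sources
try:
    job = spawn_node.unified_job_template.create_unified_job(**kv)
except Exception:
    logger.exception('Failed to spawn job for workflow node %s', spawn_node.pk)
    continue  # fail this node, keep the task manager alive
```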
@ewithak did you get a chance to test the upgrade again with any of the latest tags?
This issue seems fixed, and #935 too, but I cannot upgrade from where I am without losing all my data.
I opened #984, but devel can't help; we just have to be lucky :)
I had two instances (home and work) running the same version, but neither would upgrade past 1.0.2.337. They both hung during the AWX migration process, with awx-manage showmigrations showing these two migrations uncompleted:
```
[ ] 0018_v330_add_additional_stdout_events
[ ] 0019_v330_custom_virtualenv
```
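For anyone wanting to compare, the listing came from something like this (the app label is `main`; the container name is an assumption based on the standard compose install):

```
docker exec -it awx_task bash
awx-manage showmigrations main
```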
@ewithak the same for me. I needed that fix, so in the end I installed the latest version from scratch (1.0.2.72) and reconfigured everything. I confirm that this issue and #935 are fixed.
@shortsteps Every time I have to blow away AWX and restart (much less often now that I moved to standalone PostgreSQL) I realize that it's on me to start configuring AWX with tower-cli and Ansible playbooks (if the modules ever catch up to the API) instead of doing it all through the GUI. I swear, this time I'll really do it! ;)
I have 100-150 job templates including pretty complex ones. Disappointing that I have to wipe my AWX instance to get bugs fixed.
Wow. That would suck to have to re-do. I have about 1/5 of that and it's pretty tedious to rebuild. All jokes aside, I really am going to try to build the jobs out with scripting of some form, so that I can not only rebuild quicker but also make that part of the overall process. If it's not too late already, you could gather all of the surveys as JSON files before tearing down your system, to aid in the rebuilding.
Yeah, I鈥檝e already collected setting like ldap config and the most critical jobs.
Guess infra-as-code it鈥檚 the only way to keep it alive across updates. Looking into tower-cli too. :)
@savealive I agree... but I understand that this can happen with a devel branch and community software, and I only have ten templates at the moment anyway.
I'll try to follow @ewithak's tip! Please share if you find something useful!
About the JSON, do you mean something like exporting through the API?
@shortsteps yep, some things are easy to export via the API.
For example:
/api/v2/settings/ldap/
or even
/api/v2/settings/all/
and then the configs can be put back via the API on a fresh AWX instance.
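A rough sketch of that round-trip with Python requests (hosts and credentials below are placeholders, and some keys in the GET response may be read-only and need pruning before the PATCH):

```python
import requests

OLD = 'https://old-awx.example.com'   # placeholder hosts
NEW = 'https://new-awx.example.com'
AUTH = ('admin', 'password')          # placeholder credentials

# Export the LDAP settings from the old instance...
ldap = requests.get(OLD + '/api/v2/settings/ldap/', auth=AUTH).json()

# ...and re-apply them on the fresh one.
requests.patch(NEW + '/api/v2/settings/ldap/', json=ldap, auth=AUTH)
```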
I also ran into this issue today. I was able to resolve it using a method similar to the one described in this post: https://github.com/ansible/awx/issues/888
```
docker exec -it awx_web bash
awx-manage shell_plus

# Inside the Django shell (replace ENTERJOBID with the stuck workflow job's ID):
from awx.main.models import workflow
workflow_object = workflow.WorkflowJob.objects
workflow_job = workflow_object.get(id=ENTERJOBID)
workflow_job.delete()
```
To delete stand-alone jobs, I found this worked for me:
```
docker exec -it awx_web bash
awx-manage shell_plus

# Inside the Django shell (replace ENTERJOBID with the stuck job's ID):
from awx.main.models import UnifiedJob
unified_job_obj = UnifiedJob()
unified_job_obj.id = ENTERJOBID
unified_job_obj.delete()
```
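As I understand the Django side, the otherwise-empty instance is enough because delete() works off the primary key: Django collects the related rows by pk and removes them together with the job, which is why this doesn't orphan data the way raw SQL deletes can.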
Even though it's been said by the AWX team and others, from personal experience: never upgrade AWX unless you are 100% positive you can comfortably restore the exact git commit and database schema you were running before attempting the update.