Galaxy: Deleting a history should cancel running workflow invocations

Created on 16 Oct 2020 · 6 Comments · Source: galaxyproject/galaxy

This is especially a problem when the workflow inputs come from another, non-deleted history, because then the scheduled jobs won't even be paused.

Labels: area/workflows, kind/bug

All 6 comments

We do cancel new scheduling iterations because of deleted histories (https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/workflow/run.py#L170). So this must be within a scheduling iteration I assume. I'm nervous about simply rechecking the history between each step, between each job, etc... but clearly scheduling iterations are too long right now if this is a problem. I assume the jobs don't run at least?
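A toy model may help show why the deletion slipped through. In this sketch, `History`, `Invocation`, and `schedule_iteration` are hypothetical stand-ins, not the code in `lib/galaxy/workflow/run.py`. The only assumption taken from the comment is that the deleted-history check runs once, at the start of a scheduling iteration, so a deletion that happens while a large mapped step is being expanded is not seen until the next iteration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class History:
    # Hypothetical stand-in for a Galaxy history; only deletion is modeled.
    deleted: bool = False


@dataclass
class Invocation:
    # Hypothetical invocation: its history and the jobs of one mapped step.
    history: History
    pending_jobs: int
    created_jobs: List[int] = field(default_factory=list)
    state: str = "ready"


def schedule_iteration(
    invocation: Invocation,
    on_job_created: Optional[Callable[[int], None]] = None,
) -> None:
    # The history is checked only once, before any job is created.
    if invocation.history.deleted:
        invocation.state = "cancelled"
        return
    # All pending jobs are created in one pass; a deletion that happens
    # inside this loop is not noticed until the next iteration.
    for job_index in range(invocation.pending_jobs):
        invocation.created_jobs.append(job_index)
        if on_job_created is not None:
            on_job_created(job_index)
    invocation.pending_jobs = 0


# Simulate the user deleting the history right after the third job is created.
history = History()
invocation = Invocation(history=history, pending_jobs=1000)


def delete_after_third_job(job_index: int) -> None:
    if job_index == 2:
        history.deleted = True


schedule_iteration(invocation, delete_after_third_job)
print(len(invocation.created_jobs), invocation.state)  # 1000 ready
schedule_iteration(invocation)
print(invocation.state)  # cancelled
```

In this model all 1000 jobs are created even though the history was deleted after the third one; the cancellation only takes effect on the following iteration.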

We do cancel new scheduling iterations because of deleted histories (https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/workflow/run.py#L170). So this must be within a scheduling iteration I assume.

Yes, the user deleted the history while a very large mapping step was being scheduled.

I'm nervous about simply rechecking the history between each step, between each job, etc... but clearly scheduling iterations are too long right now if this is a problem. I assume the jobs don't run at least?

I didn't check at the time whether the jobs ran (I have separate workflow and job handlers), and by now this is too buried in the logs for me to be confident in the answer, unfortunately. But, as you mention, the unnecessary workflow scheduling of such large steps is a problem for us anyway.

A possible set of solutions for this could be:

  • from galaxy.managers.histories.HistoryManager.delete() call galaxy.managers.workflows.WorkflowsManager.cancel_invocation() on each invocation running on the history to delete
  • change the default of maximum_workflow_jobs_per_scheduling_iteration from -1 to a large but positive number.
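The first bullet could be sketched roughly as below. Everything here is hypothetical: `InMemoryStore`, `delete_history`, and `cancel_invocation` only stand in for `HistoryManager.delete()` and `WorkflowsManager.cancel_invocation()`, whose real signatures and persistence are not reproduced, and the set of "active" states is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Assumed set of non-terminal invocation states (illustrative only).
ACTIVE_STATES: Set[str] = {"new", "ready"}


@dataclass
class Invocation:
    id: int
    history_id: int
    state: str = "new"


@dataclass
class InMemoryStore:
    # Hypothetical in-memory stand-in for Galaxy's history and workflow managers.
    deleted_histories: Dict[int, bool] = field(default_factory=dict)
    invocations: List[Invocation] = field(default_factory=list)

    def cancel_invocation(self, invocation: Invocation) -> None:
        # Plays the role of WorkflowsManager.cancel_invocation().
        if invocation.state in ACTIVE_STATES:
            invocation.state = "cancelled"

    def delete_history(self, history_id: int) -> List[int]:
        # Plays the role of HistoryManager.delete(): mark the history deleted,
        # then cancel every invocation still active on it.
        self.deleted_histories[history_id] = True
        cancelled: List[int] = []
        for invocation in self.invocations:
            if invocation.history_id == history_id and invocation.state in ACTIVE_STATES:
                self.cancel_invocation(invocation)
                cancelled.append(invocation.id)
        return cancelled


store = InMemoryStore(
    invocations=[
        Invocation(id=1, history_id=10),
        Invocation(id=2, history_id=10, state="ready"),
        Invocation(id=3, history_id=11),
        Invocation(id=4, history_id=10, state="failed"),
    ]
)
print(store.delete_history(10))  # [1, 2]
```

Flipping the state alone does not interrupt a scheduling iteration that is already in progress; that iteration would still need to re-check the state before creating more jobs.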

The use of maximum_workflow_jobs_per_scheduling_iteration was suggested by @mvdbeek , but changing its default is my idea (so don't blame him if the idea is stupid :D ).
As mentioned on Gitter, I am not sure the current implementation of maximum_workflow_jobs_per_scheduling_iteration is working as expected, though. Looking at how it's passed as max_num_jobs to https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/execute.py#L29 , jobs_executed seems to be initialized to 0 and never updated (possibly in part because of https://github.com/galaxyproject/galaxy/pull/7449 ).
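For reference, a job cap of this kind only works if the counter is incremented as jobs are created. The sketch below is hypothetical (`execute_jobs` and its return value are not the signature of the function in `execute.py`); it treats `None` or a negative `max_num_jobs` as "no limit", mirroring the -1 default, and reports whether work was left over for the next scheduling iteration.

```python
from typing import Iterable, List, Optional, Tuple


def execute_jobs(
    param_combinations: Iterable[dict],
    max_num_jobs: Optional[int] = None,
) -> Tuple[List[dict], bool]:
    """Create jobs until the cap is reached (hypothetical sketch).

    Returns the jobs created in this call and whether the cap stopped
    execution early, i.e. remaining work must wait for the next iteration.
    """
    unlimited = max_num_jobs is None or max_num_jobs < 0
    jobs: List[dict] = []
    jobs_executed = 0
    for params in param_combinations:
        if not unlimited and jobs_executed >= max_num_jobs:
            return jobs, True
        jobs.append(params)  # stand-in for actually creating and queueing a job
        # Without this increment the cap never triggers, which matches the
        # symptom described above (jobs_executed stuck at 0).
        jobs_executed += 1
    return jobs, False


batch = [{"input": i} for i in range(10)]
print(len(execute_jobs(batch, max_num_jobs=3)[0]))  # 3
print(len(execute_jobs(batch, max_num_jobs=-1)[0]))  # 10
```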

from galaxy.managers.histories.HistoryManager.delete() call galaxy.managers.workflows.WorkflowsManager.cancel_invocation() on each invocation running on the history to delete

This wouldn't change anything, since the invocation already cancels itself when its history has been deleted. The problem is that the deletion happens in the middle of an invocation scheduling step. I guess we could also check whether the invocation has been cancelled after each step; that might be a slight improvement over repeatedly checking the history.
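Checking the invocation state between steps could look like the following minimal sketch. `Invocation`, `schedule_steps`, and the step labels are hypothetical; in a real deployment the state would have to be refreshed from the database before each check, which this in-memory version does not model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Invocation:
    # Hypothetical invocation holding an ordered list of step labels.
    steps: List[str]
    state: str = "ready"
    scheduled_steps: List[str] = field(default_factory=list)


def schedule_steps(
    invocation: Invocation,
    schedule_step: Callable[[Invocation, str], None],
) -> None:
    for step in invocation.steps:
        # Re-check before every step, so a cancellation issued mid-iteration
        # stops the remaining steps instead of waiting for the next iteration.
        if invocation.state == "cancelled":
            return
        schedule_step(invocation, step)
        invocation.scheduled_steps.append(step)


def cancel_during_first_step(invocation: Invocation, step: str) -> None:
    # Simulates a cancellation arriving while the first step is scheduled.
    if step == "map_over_collection":
        invocation.state = "cancelled"


invocation = Invocation(steps=["map_over_collection", "filter", "report"])
schedule_steps(invocation, cancel_during_first_step)
print(invocation.scheduled_steps)  # ['map_over_collection']
```

Because the check is per step, a single very large mapped step (the case reported here) would still be scheduled in full; catching that requires a finer-grained check inside the step.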

I thought maximum_workflow_jobs_per_scheduling_iteration was working when I implemented it, but it is hard to test and may have regressed. It is worth fixing.

Hopefully @mvdbeek's recent job scheduling enhancements will reduce the scope of this problem. The faster we schedule jobs, the more resources we free up for extra checking, and the less likely conflicts like this become.

I'm working on it now. The test was not executed under pytest because the test method's name didn't start with "test". I did find a bunch of things to fix and will have a PR later today.

How do you do so much? You're amazing. Good luck, and let me know if I can help.

https://github.com/galaxyproject/galaxy/pull/10490 restores maximum_workflow_jobs_per_scheduling_iteration. We could also check at each scheduling step that the history is not deleted. That should result in at most one extra job per deleted history.
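The "at most one extra job" bound can be seen in a small hypothetical model in which the history is re-checked before each job rather than once per iteration (`History` and `schedule_jobs` are illustrative names only, not Galaxy code). A deletion that lands while a job is being created lets that one job through, and nothing after it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class History:
    # Hypothetical stand-in for a Galaxy history; only deletion is modeled.
    deleted: bool = False


def schedule_jobs(
    history: History,
    num_jobs: int,
    create_job: Callable[[int], None],
) -> List[int]:
    created: List[int] = []
    for job_index in range(num_jobs):
        # Re-check the history before each job instead of once per iteration.
        if history.deleted:
            break
        create_job(job_index)
        created.append(job_index)
    return created


history = History()


def create_job(job_index: int) -> None:
    # Simulate the user deleting the history while job 2 is being created.
    if job_index == 2:
        history.deleted = True


created = schedule_jobs(history, num_jobs=1000, create_job=create_job)
print(created)  # [0, 1, 2]
```

The cost is one history lookup per job, which is the overhead the first reply was nervous about; how expensive that is in Galaxy depends on how the history state is loaded.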
