Galaxy: Workflows not being scheduled when workflow handlers set to db-skip-locked

Created on 20 Jun 2019 · 19 Comments · Source: galaxyproject/galaxy

Running web (calling uwsgi directly) and job handlers (using scripts/galaxy-main) separately and having the following config/workflow_schedulers.xml (or not having that file at all), leads to workflow invocations never being scheduled (they remain in new state):

<?xml version="1.0"?>
<workflow_schedulers default="core">
    <core id="core" />
    <handlers assign_with="db-skip-locked" />
</workflow_schedulers>

Changing the handlers assignment method as follows triggers job scheduling.

   <handlers assign_with="db-self" />

Poring through the logs with @natefoo, everything looks like it should, including the database values in the workflow_invocation table (the handler is _default_); the missing link is somewhere deeper.

area/workflows kind/bug

afgane

All 19 comments

Ok, so, same issue I saw @natefoo. Cool, glad it's a bug and not just our weird setup.

hexylena on 21 Jun 2019

@afgane for an interim solution I just have the following bash script running which makes things work well enough.

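#!/bin/bash
# Workaround: once a second, hand any new invocation still on '_default_' to a random handler.
# Note: (random() * 10)::integer rounds, so it yields 0..10 inclusive; adjust to match your handler ids.
while true; do
    psql -c "update workflow_invocation set handler = 'handler_main_' || (random() * 10)::integer where state = 'new' and handler = '_default_';" | grep -v 'UPDATE 0'
    sleep 1
done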

hexylena on 21 Jun 2019

I've added this to gxadmin

hexylena on 28 Jun 2019

Bumping this issue again. Either this is a severe bug, or we need to pull this option from the documentation.

bgruening on 3 Aug 2019

Is it safe to use db-self for workflows while using db-skip-locked for normal job handlers in a multi-master webless setup with dynamic handlers? I came across this issue on the current tip of release_19.05. Thanks

pcm32 on 3 Nov 2019

Apparently the same happens when using db-transaction-isolation instead of db-skip-locked :-(.

pcm32 on 3 Nov 2019

Ok, I'm using the gxadmin call. However, if it is called from multiple hosts at the same time (handler prefixes are host dependent, so this is how workflows get balanced across handlers on different hosts), is this transactionally safe from the database point of view? Thanks!

pcm32 on 3 Nov 2019

I thought I did a better job documenting the deal with workflow schedulers and assignment methods but the only thing I see is what's in the sample config.

db-skip-locked and db-transaction-isolation are both supposed to work but are discouraged because they can't guarantee serial workflow execution within a single history. For that reason, the preferred options are either mules or db-preassign with statically configured <handlers>. If you can run a single static workflow scheduler with --server-name=whatever and <handlers><handler id="whatever"/></handlers> in your workflow schedulers config, that should solve the issue.
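
For illustration, the single static scheduler setup described above might look roughly like this. It is a sketch, not a tested configuration: the handler id `whatever` is the placeholder used in the comment, and combining it with assign_with="db-preassign" is an assumption based on the stated preference.

```xml
<?xml version="1.0"?>
<!-- config/workflow_schedulers.xml: sketch only, not a tested configuration -->
<workflow_schedulers default="core">
    <core id="core" />
    <!-- assign_with="db-preassign" is an assumption, following the preference stated above -->
    <handlers assign_with="db-preassign">
        <!-- the id has to match the server name the scheduler process is started with -->
        <handler id="whatever" />
    </handlers>
</workflow_schedulers>
```

The scheduler process would then be started with --server-name=whatever so that its name matches the handler id.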

That said, this bug ought to be addressed, and I'll try to find the time this week to look at it.

natefoo on 5 Nov 2019

I walked into this trap when doing the update to 20.01 and following the documentation ☹️

The preferred method depends on your deployment strategy:

    uWSGI + Mules - uWSGI Mule Messaging is preferred.
    uWSGI + Webless - Either Database SKIP LOCKED or Database Transaction Isolation is preferred.
    uWSGI + Hybrid - Either Database SKIP LOCKED or Database Transaction Isolation is preferred. If your mule and webless handlers are in non-overlapping pools (i.e. tags, or untagged), you can alternatively use both uWSGI Mule Messaging followed by either Database SKIP LOCKED or Database Transaction Isolation. If pools overlap, using uWSGI Mule Messaging would prevent any non-mule handlers in that pool from being assigned jobs.
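
As a rough illustration of the hybrid case with non-overlapping pools, the job conf handlers element could list both methods in order. This is a sketch under assumptions: the method name `uwsgi-mule-message` and the comma-separated form of assign_with are how "followed by" is understood here, and neither is confirmed in this thread. Per this issue, the same methods were reported not to work for workflow scheduling.

```xml
<!-- job conf handlers section: sketch only, not a tested configuration -->
<!-- assumed: methods are tried in the order listed, mule messaging first -->
<handlers assign_with="uwsgi-mule-message,db-skip-locked" max_grab="8" />
```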

scholtalbers on 3 Mar 2020

@natefoo

So then with a job conf like this:

        <handlers assign_with="db-skip-locked" max_grab="8">
                <handler id="handler_main_0"/>
                <handler id="handler_main_1"/>
                <handler id="handler_main_2"/>
                <handler id="handler_main_3"/>
                <handler id="handler_main_4"/>
                <handler id="handler_main_5"/>
                <handler id="handler_main_6"/>
                <handler id="handler_main_7"/>
        </handlers>

is this wrong? Should there be only a single workflow scheduler? Would it work then?

<?xml version="1.0"?>
<workflow_schedulers default="core">
    <core id="core" />
    <handlers default="schedulers">
        <handler id="workflow_scheduler_main_0" tags="schedulers"/>
        <handler id="workflow_scheduler_main_1" tags="schedulers"/>
    </handlers>
</workflow_schedulers>

still an issue for EU

hexylena on 4 May 2020

@natefoo do you have any ideas here?

Is the following a valid and recommended config?

<?xml version="1.0"?>

<workflow_schedulers default="core">
    <core id="core" />
    <handlers assign_with="db-self" default="schedulers">
        <handler id="workflow_scheduler_main_0" tags="schedulers"/>
        <handler id="workflow_scheduler_main_1" tags="schedulers"/>
    </handlers>
</workflow_schedulers>

bgruening on 23 May 2020

Use assign_with="db-preassign" rather than db-self. You can use multiple workflow schedulers (.org does).
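
Applied to the config from the question, that suggestion would presumably look like the following. It is a sketch only: the handler ids and the `schedulers` tag are copied from the question, and the only change is the assign_with value.

```xml
<?xml version="1.0"?>
<workflow_schedulers default="core">
    <core id="core" />
    <!-- db-preassign replaces db-self; everything else is unchanged from the question -->
    <handlers assign_with="db-preassign" default="schedulers">
        <handler id="workflow_scheduler_main_0" tags="schedulers"/>
        <handler id="workflow_scheduler_main_1" tags="schedulers"/>
    </handlers>
</workflow_schedulers>
```

Each handler id would need a matching scheduler process started with the corresponding --server-name value.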

@hexylena we figured out in Barcelona what the issue was but I am not sure if we recorded that revelation - do you recall? Is the issue that a db-skip-locked job conf without a workflow scheduler conf is broken?

natefoo on 26 May 2020

Here is .org's workflow scheduler conf, job conf handlers section (individual handlers are only defined here for plugin loading restrictions), and the workflow scheduler and handler supervisor configs.

natefoo on 26 May 2020
