Galaxy: Galaxy 18.05 - Job execution failing (SGE DRM) - job_conf, handlers/Mule

Created on 22 Aug 2018 · 17 Comments · Source: galaxyproject/galaxy

Hi, we are trying to upgrade our institute's Galaxy instance from v17.05 to v18.05.

I pulled the v18.05 code from GitHub and tested it on our _Galaxy-test_ instances (runs jobs _locally_) and that works fine.

However, when upgrading our _production Galaxy_ instance, it doesn't submit jobs to our SGE DRM which used to work perfectly with Galaxy 17.05 until last week.

Any ideas on what we are missing or how we can fix job_conf to submit jobs to SGE via Galaxy would be very useful.

AjitPS

All 17 comments

Our updated Galaxy-prod job_conf (have added drmaa_library_path):

<job_conf>
    <plugins workers="4">
        <!-- "workers" is the number of threads for the runner's work queue.
             The default from <plugins> is used if not defined for a <plugin>.
          -->
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner">
            <param id="invalidjobexception_state">ok</param>
            <param id="invalidjobexception_retries">0</param>
            <param id="internalexception_state">ok</param>
            <param id="internalexception_retries">0</param>
            <param id="drmaa_library_path">/usr/lib64/libdrmaa.so.1.0</param> <!-- Override the $DRMAA_LIBRARY_PATH environment variable -->
        </plugin>
    </plugins>

    <handlers default="handlers">
        <handler id="handler0" tags="handlers"/>
        <handler id="handler1" tags="handlers"/>
    </handlers>

    <destinations default="cluster">
        <destination id="local" runner="local">
            <param id="local_slots">6</param>
        </destination>
        <destination id="cluster" runner="drmaa">
            <param id="local_slots">6</param>
            <env file="/home/usern/galaxy/galaxy/setup_galaxy_venv.sh" /> <!-- will be sourced -->
        </destination>
    </destinations>

    <tools>
        <tool id="bwa" destination="cluster"/>
    </tools>
    <limits>
        <limit type="registered_user_concurrent_jobs">20</limit>
    </limits>
</job_conf>

AjitPS on 22 Aug 2018

Note: Also, we did the config. setup in the new galaxy.yml (replacing the old galaxy.ini) and noticed that our older main.log and handlers (handler0.log, handler1.log) that were created in the past no longer exist. Instead, Galaxy, on startup, creates galaxy.log and _doesn't create_ the 2 handlers I defined in job_conf.

AjitPS on 22 Aug 2018

You need to choose from one of the handler patterns at https://docs.galaxyproject.org/en/master/admin/scaling.html?highlight=mule#deployment-options

The typical and recommended scenario is https://docs.galaxyproject.org/en/master/admin/scaling.html?highlight=mule#uwsgi-for-web-serving-with-mules-as-job-handlers

So you need to follow the instructions at https://docs.galaxyproject.org/en/master/admin/scaling.html?highlight=mule#uwsgi-mule-job-handling

In your job_conf.xml you have to remove

    <handlers default="handlers">
        <handler id="handler0" tags="handlers"/>
        <handler id="handler1" tags="handlers"/>
    </handlers>

and add the corresponding farm to your galaxy.yml.
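
For reference, a minimal sketch of what that farm could look like in the uwsgi section of galaxy.yml, following the linked uWSGI mule job handling instructions. It assumes two job handlers (mirroring the two handlers removed above) and that Galaxy is started from its root directory so the relative path resolves; all other uwsgi settings are omitted and should stay as they are:

```yaml
uwsgi:
  # ... keep your existing uwsgi settings here (socket/http, processes, threads, ...) ...

  # One "mule" entry per job handler process. Two are shown only to mirror
  # the two handlers dropped from job_conf.xml (an assumption, adjust as needed).
  mule: lib/galaxy/main.py
  mule: lib/galaxy/main.py

  # Group mules 1 and 2 into the "job-handlers" farm, which Galaxy uses
  # to dispatch jobs when no <handlers> section is defined in job_conf.xml.
  farm: job-handlers:1,2
```

The repeated mule key is intentional: as far as I know, uWSGI reads this section with its own YAML parser, which accepts duplicate keys even though a strict YAML parser would not.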

(Note that you can also leave the galaxy.ini in place, and then you don't have to do anything at all -- it'll just use the old config.)

mvdbeek on 22 Aug 2018

Thanks, if I leave galaxy.ini as is, I guess we don't have to set up this then: https://docs.galaxyproject.org/en/master/admin/scaling.html?highlight=mule#uwsgi-mule-job-handling

Also, can I then leave the handlers section and job_conf as is?

AjitPS on 22 Aug 2018

Yes

mvdbeek on 22 Aug 2018

But you'll miss out on some great and very reliable job handling. If you don't have an issue, though, it's not worth it, and we'll keep on supporting this for quite some time.

mvdbeek on 22 Aug 2018

Thanks @mvdbeek , what features besides job_handling are affected by using the older galaxy.ini?

I'd be happy to use the new code, if I can manage to get it to work with our SGE. I can't see what I'm missing in our job_conf.

We are planning to move to Slurm in a few months' time, so maybe we can try the new config then.

AjitPS on 22 Aug 2018

> Thanks @mvdbeek , what features besides job_handling are affected by using the older galaxy.ini?

So far nothing

> if I can manage to get it to work with our SGE. I can't see what I'm missing in our job_conf.

Like I said, drop the handlers section in your job_conf.xml file (or follow the instructions if you have to assign handlers to specific destinations) and add the farms as described to your galaxy.yml

mvdbeek on 22 Aug 2018

Thanks I'll give that a go now

AjitPS on 22 Aug 2018

FYI, for earlier test jobs, galaxy.log showed:

galaxy.tools DEBUG 2018-08-22 15:18:33,870 [p:9278,w:1,m:0] [uWSGIWorker1Core1] Validated and populated state for tool request (51.331 ms)
galaxy.tools.actions INFO 2018-08-22 15:18:34,042 [p:9278,w:1,m:0] [uWSGIWorker1Core1] Handled output named out_file1 for tool Grouping1 (135.123 ms)
galaxy.tools.actions INFO 2018-08-22 15:18:34,057 [p:9278,w:1,m:0] [uWSGIWorker1Core1] Added output datasets to history (13.704 ms)
galaxy.tools.actions INFO 2018-08-22 15:18:34,077 [p:9278,w:1,m:0] [uWSGIWorker1Core1] Verified access to datasets for Job[unflushed,tool_id=Grouping1] (5.543 ms)
galaxy.tools.actions INFO 2018-08-22 15:18:34,080 [p:9278,w:1,m:0] [uWSGIWorker1Core1] Setup for job Job[unflushed,tool_id=Grouping1] complete, ready to flush (21.771 ms)
galaxy.tools.actions INFO 2018-08-22 15:18:34,181 [p:9278,w:1,m:0] [uWSGIWorker1Core1] Flushed transaction for job Job[id=86655,tool_id=Grouping1] (99.826 ms)
galaxy.tools.execute DEBUG 2018-08-22 15:18:34,183 [p:9278,w:1,m:0] [uWSGIWorker1Core1] Tool [Grouping1] created job [86655] (290.321 ms)
galaxy.tools.execute DEBUG 2018-08-22 15:18:34,192 [p:9278,w:1,m:0] [uWSGIWorker1Core1] Executed 1 job(s) for tool Grouping1 request: (320.768 ms)

but no job folder was created in jobs_directory/, which I'm guessing could be because the handlers failed to start?

AjitPS on 22 Aug 2018

Yes, that is the expected output when the handlers are not active

mvdbeek on 22 Aug 2018

That worked, thanks a lot :)

AjitPS on 22 Aug 2018

cool!

mvdbeek on 22 Aug 2018

It would be good to document this in galaxy.yml and in job_conf.xml. Cheers.

AjitPS on 22 Aug 2018

Also getting a pesky HTTP error for favicon.ico (404 Not Found) despite us not making any changes/additions to the <head> code for the UI.

AjitPS on 23 Aug 2018

That is interesting, if you want to follow up please open a new issue.

mvdbeek on 23 Aug 2018

Thanks, will do

AjitPS on 24 Aug 2018
