Flux-core: apply selected gpus in wreck

Created on 3 Jul 2018 · 20 comments · Source: flux-framework/flux-core

With the merging of GPU detection and scheduling support, it would be quite useful to be able to apply the selected GPUs to the tasks being run by wreck. Ideally, this would be setting CUDA_VISIBLE_DEVICES to the list of GPU ids selected by sched for each task. It is also possible that a user would want all GPUs allocated on the node to the job visible to all tasks, so we might want a switch between the two, but the most common use-case seems to be handing out a single GPU per task.

Source

trws

All 20 comments

This should be possible (for wreck system) via a new, simple, lua plugin that sets CUDA_VISIBLE_DEVICES to the list I believe sched puts in R_lite[rank].gpu.

grondo on 4 Jul 2018

That sounds like a perfect solution to me.

--
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub:
https://github.com/flux-framework/flux-core/issues/1562#issuecomment-402321908

trws on 4 Jul 2018

> This should be possible (for wreck system) via a new, simple, lua plugin that sets CUDA_VISIBLE_DEVICES to the list I believe sched puts in R_lite[rank].gpu.

Yes, the scheduler should already pass the gpu in R_lite. I will test one last time on Sierra though.

Sounds like a great solution. If this can be done soonish, this will make @koning's life far easier.

dongahn on 4 Jul 2018

> Yes, the scheduler should already pass the gpu in R_lite. I will test one last time on Sierra though.

> Ideally, this would be setting CUDA_VISIBLE_DEVICES to the list of GPU ids selected by sched for each task.

Currently R_lite only denotes resources assigned to each broker rank, not to individual tasks. A plugin will therefore have to assign GPUs to tasks from the local R_lite.rank.gpu list based on an assumption that either opts.gpus-per-task or the count of GPUs in R_lite.rank.gpu can be evenly divided between local tasks.

It would be much easier to assign the same CUDA_VISIBLE_DEVICES to all locally executed tasks, but I understand that is not quite as useful for the application.

@dongahn, what is the current use case exactly?

grondo on 5 Jul 2018

Here's a test wreck/lua plugin that does the easy thing -- it sets CUDA_VISIBLE_DEVICES variable to the list of GPUs set in R_lite for the current rank and applies it to all tasks being spawned locally.

I wasn't able to do any "real" testing. @dongahn, if you drop this script as something like cuda_devices.lua (name is irrelevant) into $sysconfdir/wreck/lua.d/ of your installation (or src/modules/wreck/lua.d/ of your working directory), wrexecd should start using it on the next invocation.

Let me know if it works as expected or, as expected, still needs work. ;-)

-- Set CUDA_VISIBLE_DEVICES for all tasks on any rank with one or
--  more "gpu" resources
--
function rexecd_init ()
    -- NB: Lua arrays are indexed starting at 1, so this rank's index
    --  into R_lite rank array is nodeid + 1:
    --
    local index = wreck.nodeid + 1

    -- Grab local resources structure from kvs for this nodeid:
    --
    local Rlocal = wreck.kvsdir.R_lite[index].children

    -- If a gpu resource list is set for this rank, then expand it and
    --  set CUDA_VISIBLE_DEVICES to the result:
    --
    local gpus = Rlocal.gpu
    if gpus then
        -- Use affinity.cpuset as a convenience to parse the GPU list, which
        --  is in nodeset form (e.g. "0-1" or "0,2-5", etc.)
        --
        local gset, err = require 'flux.affinity'.cpuset.new (gpus)
        if not gset then
            wreck:log_error ("Unable to parse GPU list [%s]: %s", gpus, err)
            return
        end

        -- Presumably CUDA_VISIBLE_DEVICES must be strictly a comma-separated
        --  list of device ids. Expand the "set" into a Lua table, then
        --  concat the result into a list:
        --
        local t = gset:expand()
        wreck.environ ["CUDA_VISIBLE_DEVICES"] = table.concat (t, ",")
    end
end
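
The plugin above relies on flux.affinity's cpuset to parse the GPU list. Purely as an illustration of what that expansion amounts to, here is a minimal plain-Lua sketch. The function name expand_idset is hypothetical and not part of flux-core, and the sketch assumes the list only ever contains non-negative integers and ascending ranges such as "0-1" or "0,2-5".

```lua
-- Hypothetical helper (not a flux-core API): expand an id list such as
-- "0,2-5" into a Lua array of numbers. Returns nil plus a message if an
-- element is neither a single id nor a "lo-hi" range.
function expand_idset (s)
    local ids = {}
    for item in s:gmatch ("[^,]+") do
        -- Try "lo-hi" first, then fall back to a single id
        local lo, hi = item:match ("^%s*(%d+)%s*%-%s*(%d+)%s*$")
        if not lo then
            lo = item:match ("^%s*(%d+)%s*$")
            hi = lo
        end
        if not lo then
            return nil, "invalid id list element: " .. item
        end
        for id = tonumber (lo), tonumber (hi) do
            table.insert (ids, id)
        end
    end
    return ids
end

-- The comma-separated form is what ends up in CUDA_VISIBLE_DEVICES
print (table.concat (expand_idset ("0,2-5"), ","))   -- 0,2,3,4,5
```

In the real plugin the cpuset object does this work; the sketch is only meant to show why the result can be passed straight to table.concat.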

grondo on 6 Jul 2018

Great! Thanks @grondo. @koning isn't quite ready yet to test Merlin with Flux, so I will test this once he is ready. Thanks.

dongahn on 6 Jul 2018

If we can get verification that the lua plugin above works as expected, we could drop this into flux-core as a default plugin for the 0.10.0 release. Otherwise, since it is a plugin, a refined version could always be distributed out-of-band after the fact, and this issue could be removed as a blocker for 0.10.0.

grondo on 13 Jul 2018

Will try to get to this soonish.

dongahn on 13 Jul 2018

@SteVwonder: I thought I was going to get to this this morning, but then meetings and other Flux issues consumed me. Now I really need to get to the STAT/LaunchMON port on Sierra while the machine can be used. Would you be interested in verifying this instead? How to launch flux on Sierra is documented in our CZ Confluence wiki.

dongahn on 13 Jul 2018

@garlick: did you create a ticket to integrate pmi compatible libraries for PMIX into flux? Somehow I can't find it.

dongahn on 13 Jul 2018

Sorry, just did: #1581

garlick on 16 Jul 2018

👍1

It looks like the case of 1 task per node works fine:

# herbein1 at sierra4368 in /nfs/tmp2/herbein1/spectrum-test [10:37:51]
→ lrun -T1 --nolbind flux start flux wreckrun -N2 -n2 -g2 bash -c 'echo $(hostname; printenv | grep CUDA)'
sierra1566 CUDA_VISIBLE_DEVICES=0,1
sierra1568 CUDA_VISIBLE_DEVICES=0,1

# herbein1 at sierra4368 in /nfs/tmp2/herbein1/spectrum-test [10:38:05]
→ lrun -T1 --nolbind flux start flux wreckrun -N2 -n2 -g4 bash -c 'echo $(hostname; printenv | grep CUDA)'
sierra1566 CUDA_VISIBLE_DEVICES=0,1,2,3
sierra1568 CUDA_VISIBLE_DEVICES=0,1,2,3

Once you oversubscribe the node, the env variable is set per rank rather than per task. Is that what we want?

# herbein1 at sierra4368 in /nfs/tmp2/herbein1/spectrum-test [10:38:22]
→ lrun -T1 --nolbind flux start flux wreckrun -N2 -n4 -g2 bash -c 'echo $(hostname; printenv | grep CUDA)'
sierra1566 CUDA_VISIBLE_DEVICES=0,1,2,3
sierra1566 CUDA_VISIBLE_DEVICES=0,1,2,3
sierra1568 CUDA_VISIBLE_DEVICES=0,1,2,3
sierra1568 CUDA_VISIBLE_DEVICES=0,1,2,3

Or do we want something like:

# herbein1 at sierra4368 in /nfs/tmp2/herbein1/spectrum-test [10:38:22]
→ lrun -T1 --nolbind flux start flux wreckrun -N2 -n4 -g2 bash -c 'echo $(hostname; printenv | grep CUDA)'
sierra1566 CUDA_VISIBLE_DEVICES=0,1
sierra1566 CUDA_VISIBLE_DEVICES=2,3
sierra1568 CUDA_VISIBLE_DEVICES=0,1
sierra1568 CUDA_VISIBLE_DEVICES=2,3

SteVwonder on 20 Jul 2018

@grondo can chime in. But my understanding was this limitation is due to the fact that we can't do per-task binding with the current wreck subsystem. Only per-job binding is possible - we bind all of the tasks on the node to the node resource. I think the situation is the same for CPU binding.

dongahn on 20 Jul 2018

I'm trying to remember, but I think this comment sums it up?

> R_lite only does per rank binding at this time. We could extend the plugin to evenly divide GPUs between tasks, but we had decided this wasn't needed for now.

grondo on 20 Jul 2018

👍1

Sorry. I should have re-read this issue in its entirety:

> Currently R_lite only denotes resources assigned to each broker rank, not to individual tasks. A plugin will therefore have to assign GPUs to tasks from the local R_lite.rank.gpu list based on an assumption that either opts.gpus-per-task or the count of GPUs in R_lite.rank.gpu can be evenly divided between local tasks.

So I guess the question then becomes, do we want to go down the rabbit hole of modifying R_lite, making assumptions in the plugin about GPU and task mappings, or leaving the plugin as is?

SteVwonder on 20 Jul 2018

It would not be terribly difficult to evenly divide GPUs among tasks if that is what we want. If the GPUs are not evenly divisible into the tasks (unlikely since the option to get GPUs is --gpus-per-task), then we could fall back to setting the same CUDA_VISIBLE_DEVICES for all tasks on each node/rank.
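
A minimal sketch of that heuristic in plain Lua, assuming the rank's GPU ids are already expanded into an array and tasks are numbered locally from 0. The name gpus_for_tasks is hypothetical and nothing here calls into wreck.

```lua
-- Hypothetical helper (not part of wreck): return a table mapping each
-- local task index (0 .. ntasks-1) to its CUDA_VISIBLE_DEVICES string.
function gpus_for_tasks (gpuids, ntasks)
    local ngpus = #gpuids
    local result = {}

    -- Fallback: if the GPUs do not divide evenly among the tasks,
    -- every task sees every GPU assigned to the rank
    if ntasks < 1 or ngpus % ntasks ~= 0 then
        local all = table.concat (gpuids, ",")
        for task = 0, ntasks - 1 do
            result [task] = all
        end
        return result
    end

    local per_task = math.floor (ngpus / ntasks)
    for task = 0, ntasks - 1 do
        local t = {}
        -- Lua arrays are 1-based, so task 0 takes gpuids[1..per_task]
        for i = 1, per_task do
            table.insert (t, gpuids [per_task * task + i])
        end
        result [task] = table.concat (t, ",")
    end
    return result
end

local even = gpus_for_tasks ({0, 1, 2, 3}, 2)
print (even [0], even [1])      -- "0,1" and "2,3"

local uneven = gpus_for_tasks ({0, 1, 2}, 2)
print (uneven [0], uneven [1])  -- "0,1,2" for both tasks
```

The slices are contiguous and taken left to right, so the result depends entirely on the order of the ids in the rank's list.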

grondo on 20 Jul 2018

> So I guess the question then becomes, do we want to go down the rabbit hole of modifying R_lite, making assumptions in the plugin about GPU and task mappings, or leaving the plugin as is?

Ultimately, the scheduler will not even see the task section of jobspec. So it will not be able to create a resource set per rank. IMO, it should be the function of the execution system to map the R into per-rank resource and bind.

For now, lacking slot support, I think we will either live with the current implementation or add some heuristics like @grondo suggests above.

dongahn on 20 Jul 2018

I think the heuristic makes sense as long as we provide a way to turn it off. If I remember correctly from a conversation with @trws, the splash app team was doing their own binding of tasks and GPUs. Maybe -o gpu=nobind?

SteVwonder on 20 Jul 2018

Here's a version that can set CUDA_VISIBLE_DEVICES per task if requested with a -o gpubind=per-task and the number of gpus assigned to the rank is evenly divisible into the number of tasks.

This requires a small change to src/bindings/lua/wreck.lua to allow the -o gpubind option, but just throwing it up here for comments:

e.g. (simulated 1 GPU per task)

grondo:~/flux-core.git $ flux wreckrun -ln 2 -c 2 printenv CUDA_VISIBLE_DEVICES
0: 0,1
1: 0,1
grondo:~/flux-core.git $ flux wreckrun -o gpubind=per-task -ln 2 -c 2 printenv CUDA_VISIBLE_DEVICES
0: 0
1: 1
grondo:~/flux-core.git $ flux wreckrun -o gpubind=off -ln 2 -c 2 printenv CUDA_VISIBLE_DEVICES
wreckrun: tasks [0-1]: exited with exit code 1

local gpubind = wreck:getopt ("gpubind")
if gpubind == "no" or gpubind == "off" then
    return
end

-- Set CUDA_VISIBLE_DEVICES for all tasks on any rank with one or
--  more "gpu" resources

local gpuinfo = {}
function gpuinfo_create (wreck, gpus)
    -- Use affinity.cpuset as a convenience to parse the GPU list, which
    --  is in nodeset form (e.g. "0-1" or "0,2-5", etc.)
    --
    local gset, err = require 'flux.affinity'.cpuset.new (gpus)
    if not gset then
        wreck:log_error ("Unable to parse GPU list [%s]: %s", gpus, err)
        return nil
    end
    local g = {
        gpuids = gset:expand (),
        ngpus  = gset:count (),
        ntasks = wreck.tasks_per_node [wreck.nodeid]
    }

    -- If per-task binding is requested, ensure ngpus is evenly divisible
    --  into ntasks:
    if gpubind == "per-task" and g.ngpus % g.ntasks == 0 then
        g.ngpus_per_task = g.ngpus/g.ntasks
    end
    return g
end

function rexecd_init ()
    -- NB: Lua arrays are indexed starting at 1, so this rank's index
    --  into R_lite rank array is nodeid + 1:
    --
    local index = wreck.nodeid + 1

    -- Grab local resources structure from kvs for this nodeid:
    --
    local Rlocal = wreck.kvsdir.R_lite[index].children

    -- If a gpu resource list is set for this rank, then expand it and
    --  set CUDA_VISIBLE_DEVICES to the result:
    --
    local gpus = Rlocal.gpu
    if not gpus then return end

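    local info = gpuinfo_create (wreck, gpus)
    -- On a parse failure the error has already been logged; leave the
    --  environment untouched rather than indexing a nil table:
    if not info then return end
    gpuinfo = info

    -- Default to the same CUDA_VISIBLE_DEVICES for all tasks. If
    --  ngpus_per_task is set, rexecd_task_init() overrides it per task:
    wreck.environ ["CUDA_VISIBLE_DEVICES"] = table.concat (gpuinfo.gpuids, ",")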
end

function rexecd_task_init ()
    -- If ngpus_per_task is set, then select that many GPUs from the gpuids
    --  list assigned to this rank for the current task:
    if not gpuinfo.ngpus_per_task then return end

    -- gpuids is a 1-based Lua array, so this task takes entries
    --  basis+1 .. basis+ngpus_per_task:
    local basis = gpuinfo.ngpus_per_task * wreck.taskid
    local t = {}
    for i = 1, gpuinfo.ngpus_per_task do
        table.insert (t, gpuinfo.gpuids [basis + i])
    end
    wreck.environ ["CUDA_VISIBLE_DEVICES"] = table.concat (t, ",")
end

grondo on 21 Jul 2018

👍2

Thanks @grondo. This looks great to me. Just so that I understand:

grondo:~/flux-core.git $ flux wreckrun -o gpubind=per-task -ln 2 -c 2 printenv CUDA_VISIBLE_DEVICES
0: 0
1: 1

In this case, we still can’t ensure each rank gets the closest GPU. Correct? And this would be the current limitation in R_lite format… We should probably add this to our R_lite to R path discussion.

In the meantime, I wonder whether we should at least try to capture the current main use case, though. AFAIK, the most common case would be to run 4 MPI processes per node, each with its own GPU. So the per-rank R_lite would be either

core: {0-39}, gpu: {0-3}

or

core: {0-1, 21-22}, gpu: {0-3} (though currently we don’t have a way to specify socket as a constraint in wreck, so the scheduler won’t be able to generate this schedule; this assumes the socket constraint can be added).

My guess is that as long as the core list and GPU list are ordered such that each can be partitioned from left to right, common cases like these can be easily handled (i.e., tasks will be assigned to the closest GPUs).

One thing, though, to make such a scheme work, don’t we also need cpubind=per-task?
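
A rough sketch of that left-to-right partitioning applied to cores and GPUs together, using the first R_lite example above (core 0-39, gpu 0-3, 4 tasks). The helper name split_evenly is hypothetical, and the sketch does not consult node topology, so whether each slice really pairs a task with its nearest GPU depends on how the ids are ordered on the machine.

```lua
-- Hypothetical helper: split an array into n contiguous chunks of equal
-- size, left to right. Returns nil if the list does not divide evenly.
function split_evenly (list, n)
    if n < 1 or #list % n ~= 0 then return nil end
    local size = math.floor (#list / n)
    local chunks = {}
    for k = 1, n do
        local chunk = {}
        for i = 1, size do
            chunk [i] = list [(k - 1) * size + i]
        end
        chunks [k] = chunk
    end
    return chunks
end

-- core: {0-39}, gpu: {0-3}, 4 tasks per node
local cores, gpus = {}, {0, 1, 2, 3}
for c = 0, 39 do cores [#cores + 1] = c end

local core_chunks = split_evenly (cores, 4)
local gpu_chunks = split_evenly (gpus, 4)
for task = 1, 4 do
    local cc = core_chunks [task]
    -- First line printed: "task 0: cores 0-9 gpu 0"
    print (string.format ("task %d: cores %d-%d gpu %s",
        task - 1, cc [1], cc [#cc], table.concat (gpu_chunks [task], ",")))
end
```

Applying the same split to both lists is what would make a cpubind=per-task option line up with gpubind=per-task.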

dongahn on 21 Jul 2018
