With the merging of GPU detection and scheduling support, it would be quite useful to be able to apply the selected GPUs to the tasks being run by wreck. Ideally, this would be setting CUDA_VISIBLE_DEVICES to the list of GPU ids selected by sched for each task. It is also possible that a user would want all GPUs allocated on the node to the job visible to all tasks, so we might want a switch between the two, but the most common use-case seems to be handing out a single GPU per task.
This should be possible (for wreck system) via a new, simple, lua plugin that sets CUDA_VISIBLE_DEVICES to the list I believe sched puts in R_lite[rank].gpu.
That sounds like a perfect solution to me.
On 3 Jul 2018, at 16:42, Mark Grondona wrote:
This should be possible (for wreck system) via a new, simple, lua plugin that sets CUDA_VISIBLE_DEVICES to the list I believe sched puts in R_lite[rank].gpu.
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub:
https://github.com/flux-framework/flux-core/issues/1562#issuecomment-402321908
This should be possible (for wreck system) via a new, simple, lua plugin that sets CUDA_VISIBLE_DEVICES to the list I believe sched puts in R_lite[rank].gpu.
Yes, the scheduler should already pass the gpu in R_lite. I will test one last time on Sierra though.
Sounds like a great solution. If this can be done soonish, it will make @koning's life far easier.
Yes, the scheduler should already pass the gpu in R_lite. I will test one last time on Sierra though.
Ideally, this would be setting CUDA_VISIBLE_DEVICES to the list of GPU ids selected by sched for each task.
Currently R_lite only denotes resources assigned to each broker rank, not to individual tasks. A plugin will therefore have to assign GPUs to tasks from the local R_lite.rank.gpu list based on an assumption that either opts.gpus-per-task or the count of GPUs in R_lite.rank.gpu can be evenly divided between local tasks.
It would be much easier to assign the same CUDA_VISIBLE_DEVICES to all locally executed tasks, but I understand that is not quite as useful for the application.
@dongahn, what is the current use case exactly?
Here's a test wreck/lua plugin that does the easy thing -- it sets CUDA_VISIBLE_DEVICES variable to the list of GPUs set in R_lite for the current rank and applies it to all tasks being spawned locally.
I wasn't able to do any "real" testing. @dongahn, if you drop this script as something like cuda_devices.lua (the name is irrelevant) into $sysconfdir/wreck/lua.d/ of your installation (or src/modules/wreck/lua.d/ of your working directory), wrexecd should start using it on the next invocation.
Let me know if it works as expected or, as expected, still needs work. ;-)
-- Set CUDA_VISIBLE_DEVICES for all tasks on any rank with one or
--  more "gpu" resources
--
function rexecd_init ()
    -- NB: Lua arrays are indexed starting at 1, so this rank's index
    --  into R_lite rank array is nodeid + 1:
    --
    local index = wreck.nodeid + 1

    -- Grab local resources structure from kvs for this nodeid:
    --
    local Rlocal = wreck.kvsdir.R_lite[index].children

    -- If a gpu resource list is set for this rank, then expand it and
    --  set CUDA_VISIBLE_DEVICES to the result:
    --
    local gpus = Rlocal.gpu
    if gpus then
        -- Use affinity.cpuset as a convenience to parse the GPU list, which
        --  is in nodeset form (e.g. "0-1" or "0,2-5", etc.)
        --
        local gset, err = require 'flux.affinity'.cpuset.new (gpus)
        if not gset then
            wreck:log_error ("Unable to parse GPU list [%s]: %s", gpus, err)
            return
        end

        -- Presumably CUDA_VISIBLE_DEVICES must be strictly a comma-separated
        --  list of device ids. Expand the "set" into a Lua table, then
        --  concat the result into a list:
        --
        local t = gset:expand()
        wreck.environ ["CUDA_VISIBLE_DEVICES"] = table.concat (t, ",")
    end
end
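The plugin above relies on flux.affinity's cpuset to parse the GPU list. For anyone who wants to check the expansion step without a Flux installation, here is a minimal pure-Lua sketch of it. The helper name expand_idset is hypothetical (it is not part of flux.affinity), and the sketch assumes ids are non-negative integers and ranges are ascending.

```lua
-- Hypothetical helper, for illustration only (not a flux.affinity API):
-- expand an idset string such as "0-1" or "0,2-5" into a Lua array of ids.
-- Returns nil plus an error message if an item cannot be parsed.
function expand_idset (s)
    local ids = {}
    for item in s:gmatch ("[^,]+") do
        -- Try "lo-hi" first, then fall back to a single id:
        local lo, hi = item:match ("^%s*(%d+)%s*%-%s*(%d+)%s*$")
        if not lo then
            lo = item:match ("^%s*(%d+)%s*$")
            hi = lo
        end
        if not lo then
            return nil, "invalid id or range: " .. item
        end
        lo, hi = tonumber (lo), tonumber (hi)
        if lo > hi then
            return nil, "descending range: " .. item
        end
        for id = lo, hi do
            table.insert (ids, id)
        end
    end
    return ids
end

-- Build a comma-separated list in the form the plugin assigns to
-- CUDA_VISIBLE_DEVICES:
print (table.concat (assert (expand_idset ("0,2-5")), ","))
```

The result is a 1-origin Lua array, which is why table.concat can be applied to it directly.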
Great! Thanks @grondo. @koning isn't quite ready to test Merlin with Flux yet, so I will test this once he is ready. Thanks.
If we can get verification that the lua plugin above works as expected, we could drop this into flux-core as a default plugin for 0.10.0 release. O/w, since it is a plugin, a refined version could always be distributed out-of-band after the fact, and this issue could be removed as a blocker for 0.10.0.
Will try to get to this soonish.
On Friday, July 13, 2018, at 7:00:04 AM, Mark Grondona wrote (Subject: Re: [flux-framework/flux-core] apply selected gpus in wreck (#1562)):
If we can get verification that the lua plugin above works as expected, we could drop this into flux-core as a default plugin for 0.10.0 release. O/w, since it is a plugin, a refined version could always be distributed out-of-band after the fact, and this issue could be removed as a blocker for 0.10.0.
@SteVwonder: I thought I was going to get to this this morning, but then meetings and other Flux issues consumed me. Now I really need to get to the STAT/LaunchMON port on Sierra while the machine can be used. Would you be interested in verifying this instead? How to launch flux on Sierra is documented in our CZ confluence wiki.
@garlick: did you create a ticket to integrate pmi compatible libraries for PMIX into flux? Somehow I can't find it.
Sorry, just did: #1581
It looks like the case of 1 task per node works fine:
# herbein1 at sierra4368 in /nfs/tmp2/herbein1/spectrum-test [10:37:51]
→ lrun -T1 --nolbind flux start flux wreckrun -N2 -n2 -g2 bash -c 'echo $(hostname; printenv | grep CUDA)'
sierra1566 CUDA_VISIBLE_DEVICES=0,1
sierra1568 CUDA_VISIBLE_DEVICES=0,1
# herbein1 at sierra4368 in /nfs/tmp2/herbein1/spectrum-test [10:38:05]
→ lrun -T1 --nolbind flux start flux wreckrun -N2 -n2 -g4 bash -c 'echo $(hostname; printenv | grep CUDA)'
sierra1566 CUDA_VISIBLE_DEVICES=0,1,2,3
sierra1568 CUDA_VISIBLE_DEVICES=0,1,2,3
Once you oversubscribe the node, the env variable is set per rank rather than per task. Is that what we want?
# herbein1 at sierra4368 in /nfs/tmp2/herbein1/spectrum-test [10:38:22]
→ lrun -T1 --nolbind flux start flux wreckrun -N2 -n4 -g2 bash -c 'echo $(hostname; printenv | grep CUDA)'
sierra1566 CUDA_VISIBLE_DEVICES=0,1,2,3
sierra1566 CUDA_VISIBLE_DEVICES=0,1,2,3
sierra1568 CUDA_VISIBLE_DEVICES=0,1,2,3
sierra1568 CUDA_VISIBLE_DEVICES=0,1,2,3
Or do we want something like:
# herbein1 at sierra4368 in /nfs/tmp2/herbein1/spectrum-test [10:38:22]
→ lrun -T1 --nolbind flux start flux wreckrun -N2 -n4 -g2 bash -c 'echo $(hostname; printenv | grep CUDA)'
sierra1566 CUDA_VISIBLE_DEVICES=0,1
sierra1566 CUDA_VISIBLE_DEVICES=2,3
sierra1568 CUDA_VISIBLE_DEVICES=0,1
sierra1568 CUDA_VISIBLE_DEVICES=2,3
@grondo can chime in. But my understanding was this limitation is due to the fact that we can't do per-task binding with the current wreck subsystem. Only per-job binding is possible - we bind all of the tasks on the node to the node resource. I think the situation is the same for CPU binding.
I'm trying to remember, but I think this comment sums it up?
R_lite only does per rank binding at this time. We could extend the plugin to evenly divide GPUs between tasks, but we had decided this wasn't needed for now.
Sorry. I should have re-read this issue in its entirety:
Currently R_lite only denotes resources assigned to each broker rank, not to individual tasks. A plugin will therefore have to assign GPUs to tasks from the local R_lite.rank.gpu list based on an assumption that either opts.gpus-per-task or the count of GPUs in R_lite.rank.gpu can be evenly divided between local tasks.
So I guess the question then becomes, do we want to go down the rabbit hole of modifying R_lite, making assumptions in the plugin about GPU and task mappings, or leaving the plugin as is?
It would not be terribly difficult to evenly divide GPUs among tasks if that is what we want. If the GPUs are not evenly divisible into the tasks (unlikely since the option to get GPUs is --gpus-per-task), then we could fall back to setting the same CUDA_VISIBLE_DEVICES for all tasks on each node/rank.
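The heuristic described above could be sketched in plain Lua roughly as follows. This is only an illustration under stated assumptions: cuda_devices_for_task is a hypothetical name, gpuids is assumed to be a 1-origin array of the GPU ids assigned to the rank, and localid is assumed to be the 0-origin index of the task among the tasks on that rank.

```lua
-- Hypothetical helper, for illustration only (not part of wreck):
-- return the CUDA_VISIBLE_DEVICES value for one task on a rank.
--   gpuids:  1-origin array of GPU ids assigned to the rank
--   ntasks:  number of tasks running on the rank
--   localid: 0-origin index of this task among the tasks on the rank
function cuda_devices_for_task (gpuids, ntasks, localid)
    local ngpus = #gpuids
    if ntasks < 1 or ngpus == 0 or ngpus % ntasks ~= 0 then
        -- Fallback: GPUs cannot be evenly divided, so every task sees
        -- every GPU assigned to the rank.
        return table.concat (gpuids, ",")
    end
    local per_task = math.floor (ngpus / ntasks)
    local first = per_task * localid
    local t = {}
    for i = 1, per_task do
        table.insert (t, gpuids [first + i])
    end
    return table.concat (t, ",")
end

-- Four GPUs, two tasks on the rank: each task gets a contiguous pair.
print (cuda_devices_for_task ({ 0, 1, 2, 3 }, 2, 0))
print (cuda_devices_for_task ({ 0, 1, 2, 3 }, 2, 1))
-- Four GPUs, three tasks: not divisible, so fall back to the full list.
print (cuda_devices_for_task ({ 0, 1, 2, 3 }, 3, 0))
```

Handing out contiguous slices keeps the mapping predictable, but it says nothing about which GPU is physically closest to a given task.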
So I guess the question then becomes, do we want to go down the rabbit hole of modifying R_lite, making assumptions in the plugin about GPU and task mappings, or leaving the plugin as is?
Ultimately, the scheduler will not even see the task section of jobspec. So it will not be able to create a resource set per rank. IMO, it should be the function of the execution system to map the R into per-rank resource and bind.
For now, lacking slot support, I think we will either live with the current implementation or add some heuristics like @grondo suggests above.
I think the heuristic makes sense as long as we provide a way to turn it off. If I remember correctly from a conversation with @trws, the splash app team was doing their own binding of tasks and GPUs. Maybe -o gpu=nobind?
Here's a version that can set CUDA_VISIBLE_DEVICES per task if requested with a -o gpubind=per-task and the number of gpus assigned to the rank is evenly divisible into the number of tasks.
This requires a small change to src/bindings/lua/wreck.lua to allow the -o gpubind option, but just throwing it up here for comments:
e.g. (simulated 1 GPU per task)
grondo:~/flux-core.git $ flux wreckrun -ln 2 -c 2 printenv CUDA_VISIBLE_DEVICES
0: 0,1
1: 0,1
grondo:~/flux-core.git $ flux wreckrun -o gpubind=per-task -ln 2 -c 2 printenv CUDA_VISIBLE_DEVICES
0: 0
1: 1
grondo:~/flux-core.git $ flux wreckrun -o gpubind=off -ln 2 -c 2 printenv CUDA_VISIBLE_DEVICES
wreckrun: tasks [0-1]: exited with exit code 1
local gpubind = wreck:getopt ("gpubind")
if gpubind == "no" or gpubind == "off" then
    return
end

-- Set CUDA_VISIBLE_DEVICES for all tasks on any rank with one or
--  more "gpu" resources

local gpuinfo = {}
function gpuinfo_create (wreck, gpus)
    -- Use affinity.cpuset as a convenience to parse the GPU list, which
    --  is in nodeset form (e.g. "0-1" or "0,2-5", etc.)
    --
    local gset, err = require 'flux.affinity'.cpuset.new (gpus)
    if not gset then
        wreck:log_error ("Unable to parse GPU list [%s]: %s", gpus, err)
        return nil
    end
    local g = {
        gpuids = gset:expand (),
        ngpus = gset:count (),
        ntasks = wreck.tasks_per_node [wreck.nodeid]
    }

    -- If per-task binding is requested, ensure ngpus is evenly divisible
    --  into ntasks:
    if gpubind == "per-task" and g.ngpus % g.ntasks == 0 then
        g.ngpus_per_task = g.ngpus/g.ntasks
    end
    return g
end

function rexecd_init ()
    -- NB: Lua arrays are indexed starting at 1, so this rank's index
    --  into R_lite rank array is nodeid + 1:
    --
    local index = wreck.nodeid + 1

    -- Grab local resources structure from kvs for this nodeid:
    --
    local Rlocal = wreck.kvsdir.R_lite[index].children

    -- If a gpu resource list is set for this rank, then expand it and
    --  set CUDA_VISIBLE_DEVICES to the result:
    --
    local gpus = Rlocal.gpu
    if not gpus then return end

    -- gpuinfo_create() returns nil (after logging) if the list cannot be
    --  parsed. Leave gpuinfo empty in that case so that rexecd_task_init()
    --  is a no-op:
    local info = gpuinfo_create (wreck, gpus)
    if not info then return end
    gpuinfo = info

    -- If ngpus_per_task is not set, then set CUDA_VISIBLE_DEVICES the same
    --  for all tasks:
    wreck.environ ["CUDA_VISIBLE_DEVICES"] = table.concat (gpuinfo.gpuids, ",")
end

function rexecd_task_init ()
    -- If ngpus_per_task is set, then select that many GPUs from the gpuids
    --  list assigned to this rank for the current task:
    if not gpuinfo.ngpus_per_task then return end

    -- NB: if wreck.taskid is the global task id, this offset is only
    --  correct on the first rank.
    local basis = gpuinfo.ngpus_per_task * wreck.taskid
    local t = {}
    -- gpuids is a 1-origin Lua array, so count from 1:
    for i = 1,gpuinfo.ngpus_per_task do
        table.insert (t, gpuinfo.gpuids [basis + i])
    end
    wreck.environ ["CUDA_VISIBLE_DEVICES"] = table.concat (t, ",")
end
Thanks @grondo. This looks great to me. Just so that I understand:
grondo:~/flux-core.git $ flux wreckrun -o gpubind=per-task -ln 2 -c 2 printenv CUDA_VISIBLE_DEVICES
0: 0
1: 1
In this case, we still can’t ensure each rank gets the closest GPU. Correct? And this would be the current limitation in R_lite format… We should probably add this to our R_lite to R path discussion.
In the meantime, I wonder whether we should at least try to capture the current main use case, though. AFAIK, the most common case would be to run 4 MPI processes per node, each with its own GPU. So the per-rank R_lite would be either
core: {0-39}, gpu: {0-3}
or
core: {0-1, 21-22}, gpu: {0-3} (though currently, we don’t have a way to specify socket as our constraint in wreck so scheduler won’t be able to generate this schedule. But assuming the socket constraint can be added.)
My guess is that, as long as the core list and gpu list are ordered such that each can be partitioned from left to right, common cases like these can be handled easily (i.e., tasks will be assigned to the closest GPUs).
One thing, though, to make such a scheme work, don’t we also need cpubind=per-task?
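To make the left-to-right partitioning idea concrete, here is a small pure-Lua sketch for the first case above (cores 0-39 and GPUs 0-3 shared by 4 tasks). The partition helper is hypothetical and only illustrates the arithmetic; whether the resulting pairing really puts each task next to its closest GPU depends on how the hardware numbers its cores and GPUs, which this sketch does not check.

```lua
-- Hypothetical helper, for illustration only: split a 1-origin array of
-- ids into nparts contiguous, equally sized chunks, left to right.
-- Returns nil plus an error message if the ids cannot be evenly divided.
function partition (ids, nparts)
    if nparts < 1 or #ids % nparts ~= 0 then
        return nil, "ids cannot be evenly divided"
    end
    local size = math.floor (#ids / nparts)
    local parts = {}
    for p = 1, nparts do
        local part = {}
        for i = 1, size do
            part [i] = ids [(p - 1) * size + i]
        end
        parts [p] = part
    end
    return parts
end

-- Per-rank resources from the first example: core 0-39, gpu 0-3
local cores, gpus = {}, {}
for id = 0, 39 do table.insert (cores, id) end
for id = 0, 3 do table.insert (gpus, id) end

local ntasks = 4
local coreparts = assert (partition (cores, ntasks))
local gpuparts = assert (partition (gpus, ntasks))
for task = 1, ntasks do
    local c = coreparts [task]
    print (string.format ("task %d: cores %d-%d gpus %s",
                          task - 1, c [1], c [#c],
                          table.concat (gpuparts [task], ",")))
end
```

Applying the same split to both lists is what would make a cpubind=per-task counterpart necessary: the GPU slice only lines up with a task if the task is also confined to the matching core slice.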