When testing the nested instances we're starting with GPUs allocated, the nested broker doesn't seem to see all of the GPU resources. I was able to confirm this by getting the local URI and logging in. In each of our flux mini run calls we ask for -g 1; with that request, the first job is the only one to start, and the rest sit in the pending (PD) state waiting on resources (if I remove the -g in our workflow, they all run). However, the broker that was started should have all 4 GPUs on the node, so I'm confused why it thinks there are fewer. I've confirmed that the highest-level Flux instance is allocating 4 GPUs via the jobspec and that the request is satisfied, since the job starts running.
In discussion with @dongahn -- I have the following information.
Flux version
flux version
commands: 0.17.0
libflux-core: 0.17.0
broker: 0.17.0
FLUX_URI: local:///var/tmp/flux-SjQDZA/0/local
build-options: +hwloc==1.11.6
Module listing
flux module list
Module Size Digest Idle S Service
job-exec 1465960 772EA46 22 S
job-manager 1530544 3AD17B6 22 S
connector-local 1240536 CAD956C 0 R 59021-shell-647365656576,59021-shell-283669168128,59021-shell-463286042624,59021-shell-99706994688
kvs-watch 1490496 57FED1B 22 S
resource 1396392 E0D79A6 11 S
barrier 1249464 D57CC8D 12 S
cron 1391416 B0B6100 0 S
job-ingest 1410344 FB08FCB 20 S
kvs 1802256 8D3FDB4 0 S
job-info 1624112 D5D9876 12 S
aggregator 1261616 CEEF1E2 12 S
content-sqlite 1253800 BFB45C1 22 S content-backing,kvs-checkpoint
sched-fluxion-qmanag 8837544 450DF83 22 S sched
sched-fluxion-resour 24476904 E90F9BF 22 S
The resources that the job sees from the master Flux instance (using flux job info <JOBID> R):
{"version": 1, "execution": {"R_lite": [{"rank": "3", "node": "lassen8", "children": {"core": "0-19", "gpu": "0-3"}}], "starttime": 1599092109, "expiration": 1599696909}}
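As a quick cross-check of what the parent granted, the R document above can be tallied per resource type. The following is only a minimal sketch: `expand_idset` and `count_resources` are hypothetical helper names, not part of the Flux Python bindings, and the idset parsing assumes only the simple comma/range forms seen in this thread.

```python
import json


def expand_idset(idset):
    """Expand a simple idset such as "0-3" or "0,2-3" into a list of ints."""
    ids = []
    for part in idset.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids


def count_resources(r):
    """Sum the per-rank children of every R_lite entry by resource type."""
    totals = {}
    for entry in r["execution"]["R_lite"]:
        nranks = len(expand_idset(entry["rank"]))
        for rtype, ids in entry["children"].items():
            totals[rtype] = totals.get(rtype, 0) + nranks * len(expand_idset(ids))
    return totals


# R exactly as printed by `flux job info <JOBID> R` in this report.
R = json.loads(
    '{"version": 1, "execution": {"R_lite": [{"rank": "3", "node": "lassen8", '
    '"children": {"core": "0-19", "gpu": "0-3"}}], '
    '"starttime": 1599092109, "expiration": 1599696909}}'
)

print(count_resources(R))
```

For the R shown here this tallies 20 cores and 4 GPUs, which is what the nested instance would be expected to discover.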
It was confirmed that the nested broker only sees a single GPU:
<object type="PCIDev" os_index="4210688" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0004:04:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="1.969231">
<info name="PCIVendor" value="NVIDIA Corporation"/>
<info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
<object type="OSDev" name="card1" osdev_type="1"/>
<object type="OSDev" name="renderD128" osdev_type="1"/>
<object type="OSDev" name="cuda0" osdev_type="5">
<info name="CoProcType" value="CUDA"/>
<info name="Backend" value="CUDA"/>
<info name="GPUVendor" value="NVIDIA Corporation"/>
<info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
<info name="CUDAGlobalMemorySize" value="16515072"/>
<info name="CUDAL2CacheSize" value="6144"/>
<info name="CUDAMultiProcessors" value="80"/>
<info name="CUDACoresPerMP" value="64"/>
<info name="CUDASharedMemorySizePerMP" value="48"/>
</object>
<object type="PCIDev" os_index="4214784" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0004:05:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="1.969231">
<info name="PCIVendor" value="NVIDIA Corporation"/>
<info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
<object type="OSDev" name="card2" osdev_type="1"/>
<object type="OSDev" name="renderD129" osdev_type="1"/>
</object>
flux hwloc topology | grep "CoProcType"
<info name="CoProcType" value="CUDA"/>
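The same check that the `grep` performs can be done structurally, which also reports which PCI device each coprocessor hangs off. This is a rough sketch over a trimmed, hand-made sample that mirrors the excerpt above (the `<topology>` wrapper and the reduced attribute set are assumptions made for illustration); `coproc_devices` is a hypothetical helper, not an hwloc or Flux API.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modeled on the excerpt above: two PCI devices, but only
# the first one carries an OSDev with a CoProcType info entry.
SAMPLE_XML = """
<topology>
  <object type="PCIDev" pci_busid="0004:04:00.0">
    <object type="OSDev" name="card1" osdev_type="1"/>
    <object type="OSDev" name="cuda0" osdev_type="5">
      <info name="CoProcType" value="CUDA"/>
    </object>
  </object>
  <object type="PCIDev" pci_busid="0004:05:00.0">
    <object type="OSDev" name="card2" osdev_type="1"/>
  </object>
</topology>
"""


def coproc_devices(xml_text):
    """Return (pci_busid, osdev_name, coproc_type) for each coprocessor OSDev."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for pci in root.iter("object"):
        if pci.get("type") != "PCIDev":
            continue
        for osdev in pci.findall("object"):
            if osdev.get("type") != "OSDev":
                continue
            for info in osdev.findall("info"):
                if info.get("name") == "CoProcType":
                    found.append(
                        (pci.get("pci_busid"), osdev.get("name"), info.get("value"))
                    )
    return found


for busid, name, kind in coproc_devices(SAMPLE_XML):
    print(busid, name, kind)
```

In a real session the input would be the output of `flux hwloc topology` rather than the sample string.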
Are you able to try reproducing with flux-core v0.18.0?
This sounds like it could be the issue @SteVwonder fixed in 9464337e764d268a1525cc0bba828f7ed13b5932.
@grondo -- An important consideration for us is whether or not the Python bindings have changed between flux-core@0.17.0 and flux-core@0.18.0. Will the JobspecV1 class work in the newer flux-core? I'll go check out the bindings on my own in a second, but figured it was worth an ask.
I don't believe there were any changes in the JobspecV1 class since v0.17, but let me check.
Appreciated -- So far, it looks like it might be a drop-in replacement from what I can tell. I know that the Maestro backend passes in attributes and other things directly, so as long as that's the case I think it might work.
$ git log --pretty='%h %s' v0.17.0..v0.18.0 -- src/bindings/python/flux/job.py
9af79437a python: add JobID class
aafa076ea bindings/python: Fix pylint ungrouped-imports
4f23f4450 bindings/python: Fix pylint import issue
8ff300357 bindings/python: Fix pylint invalid-name
f2cf610f2 bindings/python: Fix pylint unidiomatic-typecheck
cbabe5d89 bindings/python: Fix pylint no-else-raise
59ef0e694 bindings/python: Fix pylint no-else-return
2a3cffabd python: add Jobspec interface to 'mini batch'
f57ed089d python: add stdio properties to jobspec
I think we are trying not to break existing interfaces in the Python API (as much as possible with a quickly moving target)
None of the above changes look like they would break existing use of JobspecV1. The main thing is the addition of the stdin, stdout and stderr properties.
Got it -- I notice the addition of the JobID class. Prior to that, 0.17.0 just passed around integers. I'll look into that and see if I need to retrofit my existing solution.
jobids are still in essence integers. The JobID class is a convenience subclass of integer that allows encoding and decoding Flux jobids from other formats (e.g. "F58", a base58 encoding, hexadecimal, kvs path, etc.) Therefore, I don't think you have to retrofit, though it may be convenient at some point.
Also take a look at our most recent tag v0.19.0. In this version we've abstracted the JobInfo class used by flux jobs for easier access to job properties, and a JobList class for easier job listing.
@grondo and @FrankD412: I had the same trouble with our July release, but I couldn't narrow it down exactly due to other things on my plate yesterday (more testing is needed). If this can wait a few days, I can take a look, if that's okay with @FrankD412.
Yeah, if this is still reproducible in v0.18.0 then we'll need help debugging. My intuition is that it is related to what HW objects hwloc is ignoring by default when using hwloc_topology_restrict(3).
Yeah, I have a similar thought. But the fact that this is only happening with the nested instance also tells me there might be something related to how our exec system binds the flux broker process, as this might affect how the hwloc XML is generated for the nested instance on Lassen.
I need some time to do more testing.
Yes, the exec system will bind the flux-broker process to the core ids in R_lite. A simple interactive test would be something like taskset -c 0-19 flux start flux hwloc info on one of the GPU lassen nodes.
Exactly!
BTW, I assumed the presence of CUDA_VISIBLE_DEVICES won't affect the nested hwloc.
@grondo @dongahn Just a quick question -- is this an hwloc bug? Or how flux-core is interpreting what it gets. Is there a short-term fix that can unblock us? We currently are scheduled to have a DAT tomorrow and are trying to figure out if we can still make use of it.
I didn't know this was tomorrow. I will try to get to it this evening then.
One workaround might be to set -o cpu-affinity=off or the equivalent in the JobspecV1 class (can't remember how to set shell options off the top of my head). I'm assuming this would make all GPUs visible to the job. But the downside is that it will not be pinned to the cores it was allocated, so processes could be scheduled to any core by the OS.
It's one of our weekly test DATs, so it would be nice to have something by then -- however, we will have one next week. Yes, this is critical for us -- but I'm happy if there's a workaround that doesn't derail you.
I can give this a shot. I'm assuming I'd pass this to the nested instance broker? If that's the case, then this isolates the fix just to our 20 core/4 GPU job with the rest of the workflow being none the wiser.
is this an hwloc bug? Or how flux-core is interpreting what it gets.
I'm not really sure at this time. When a flux instance starts up it gathers hwloc topology information and then calls hwloc_topology_restrict(3) so that the topology is pruned of resources that are not currently accessible due to cpu affinity or other binding. This could be pruning GPUs that are somehow children of cores that are not in the job's assigned resource set, and therefore not in the current affinity of the flux-broker process.
Perhaps there is some flag we should be passing to ensure GPU or Coproc devices aren't dropped.
BTW, I assumed the presence of CUDA_VISIBLE_DEVICES won't affect the nested hwloc.
I haven't tried this, but according to @eleon CUDA_VISIBLE_DEVICES has no effect on libhwloc.
I'm assuming I'd pass this to the nested instance broker? If that's the case, then this isolates the fix just to our 20 core/4 GPU job with the rest of the workflow being none the wiser.
Yes, it would only be a required workaround for the nested instance that needs access to all GPUs.
BTW, I still held some hope that v0.18 would magically fix this issue if you haven't tried it yet. :crossed_fingers:
But the downside is it will not be pinned to the cores it was allocated, so processes could be scheduled to any core by the OS.
Might this negatively affect performance too much? He will have a long-running job occupying 20-some cores alongside the 20 CPU + 4 GPU Flux instance. This may make all 44 CPUs + 4 GPUs visible to Fluxion, and ddcMD jobs could be overscheduled onto the cores where the long-running job is running.
BTW, I still held some hope that v0.18 would magically fix this issue if you haven't tried it yet.
I was testing v0.18 and saw similar issues. Still more testing is needed though.
At this point I'm not worried about performance. We're currently just trying to make sure our workflow is operational. If it's slow, that's fine -- we just need things to run.
Shoot:
leon@pascal4:~$ lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl1d0"
CoProc(CUDA) "cuda0"
CoProc(OpenCL) "opencl1d1"
CoProc(CUDA) "cuda1"
leon@pascal4:~$ CUDA_VISIBLE_DEVICES=0 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl1d0"
CoProc(CUDA) "cuda0"
leon@pascal4:~$ CUDA_VISIBLE_DEVICES=1 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl1d0"
CoProc(CUDA) "cuda0"
leon@pascal4:~$ CUDA_VISIBLE_DEVICES=0,1 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl1d0"
CoProc(CUDA) "cuda0"
CoProc(OpenCL) "opencl1d1"
CoProc(CUDA) "cuda1"
Still something weird as I cannot select the cuda1 device.
Very interesting @eleon!
Ugh.... this may be a part of this. Interesting.
Similar issue on the ROCm side:
leon@corona107:~$ lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
CoProc(OpenCL) "opencl0d1"
CoProc(OpenCL) "opencl0d2"
CoProc(OpenCL) "opencl0d3"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=0 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=1 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=2 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=3 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=0,1 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
CoProc(OpenCL) "opencl0d1"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=2,3 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
CoProc(OpenCL) "opencl0d1"
At this point, I wouldn't use this method to restrict the hwloc GPUs. I wish that hwloc either fully followed the environment variable guidance or not at all.
At this point, I wouldn't use this method to restrict the hwloc GPUs.
Unfortunately, flux doesn't really have much control over that. (at least for now).
@eleon, do you know if there is a way to remove specific objects from an hwloc topology?
Long term, this is why we will move away from dynamic resource discovery and instead use R from the parent.
+1
@grondo , unfortunately, not that I am aware of. I thought about this and looked through the hwloc API, but did not find a reasonable way of removing vertices from the tree. The only alternative I can think of is pruning the XML topology file, but it may be an involved operation since other vertices may need to be updated in addition to removing the GPU vertex.
Still something weird as I cannot select the cuda1 device.
I wonder if you are getting the "cuda1" device, but that lstopo is printing the logical index rather than the physical one (and the GPUs always are indexed starting at 0). Does lstopo --physical behave the way you expect?
The logical indexing of course causes all sorts of issues when you start nesting. We should look into using the full UUID of the GPUs at some point.
Good thought, @SteVwonder . Here's what we have so far:
leon@corona107:~$ ROCR_VISIBLE_DEVICES=2,3 lstopo-no-graphics -p | grep CoProc
CoProc(OpenCL) "opencl0d0"
CoProc(OpenCL) "opencl0d1"
leon@pascal4:~$ CUDA_VISIBLE_DEVICES=1 lstopo-no-graphics -p | grep CoProc
CoProc(OpenCL) "opencl1d0"
CoProc(CUDA) "cuda0"
More testing needed...
I added the ability to pass options to Flux through Maestro and ran into the following error. I'm feeling like I'm not constructing something correctly, unless the whole of the -o option needs to be in quotations?
flux job attach 1610847617024
flux-job: task(s) exited with exit code 1
2020-09-04T02:18:40.753604Z broker.err[0]: rc2.0: cpu-affinity=off /p/gpfs1/fdinatal/roots/mummi_root_20200902/workspace/cganalysis-pfpatch_000000000009_pfpatch_000000000025_pfpatch_000000000031_pfpatch_000000010858.flux.sh error starting command (rc=1) 0.0s
The jobspec looks like:
"resources": [{"type": "node", "count": 1, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 20}, {"type": "gpu", "count": 4}], "label": "task"}]}], "tasks": [{"command": ["flux", "start", "-o", "cpu-affinity=off", "/p/gpfs1/fdinatal/roots/mummi_root_20200902/workspace/cganalysis-pfpatch_000000000009_pfpatch_000000000025_pfpatch_000000000031_pfpatch_000000000040.flux.sh"], "slot": "task", "count": {"per_slot": 1}}
The -o cpu-affinity=off gets passed to flux mini run. You can do the equivalent thing through the Python API with:
jobspec = JobspecV1(["hostname"])
jobspec.setattr_shell_option("cpu-affinity", "off")
EDIT: just confirmed that the above code snippet is equivalent to flux mini run -o cpu-affinity=off hostname (minus the environment variables)
Sorry, I should have clarified: cpu-affinity=off should be set as a "job shell" option in the jobspec, not as an option to flux-start.
I.e., in the JSON jobspec, attributes.system.shell.options["cpu-affinity"] should be set to "off".
There may be a convenience method in JobspecV1 to set job shell options. I can look for that tomorrow.
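To make the placement concrete, here is a minimal sketch that builds the jobspec as a plain dictionary, with the shell option under `attributes.system.shell.options` and the `flux start` command left free of any `-o` argument. The function name `make_nested_instance_jobspec`, the script path, and the `duration` value of 0 are illustrative assumptions; the resource and task sections follow the jobspec pasted earlier in this thread.

```python
import json


def make_nested_instance_jobspec(script_path, cores=20, gpus=4):
    """Build a V1-style jobspec dict for a nested Flux instance.

    The cpu-affinity setting is a job shell option, so it lives under
    attributes.system.shell.options and not in the task command line.
    """
    return {
        "version": 1,
        "resources": [
            {
                "type": "node",
                "count": 1,
                "with": [
                    {
                        "type": "slot",
                        "count": 1,
                        "label": "task",
                        "with": [
                            {"type": "core", "count": cores},
                            {"type": "gpu", "count": gpus},
                        ],
                    }
                ],
            }
        ],
        "tasks": [
            {
                # No "-o cpu-affinity=off" here: flux-start does not take it.
                "command": ["flux", "start", script_path],
                "slot": "task",
                "count": {"per_slot": 1},
            }
        ],
        "attributes": {
            "system": {
                "duration": 0,  # placeholder value, assumed for this sketch
                "shell": {"options": {"cpu-affinity": "off"}},
            }
        },
    }


# Hypothetical script path, used only for illustration.
spec = make_nested_instance_jobspec("/path/to/workflow.flux.sh")
print(json.dumps(spec["attributes"], indent=2))
```

Comparing this against the failing jobspec above, the only differences are the missing `-o cpu-affinity=off` in the command list and the added `attributes` section.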
I modified our flux mini run and the result was much the same as before. When requesting GPUs, only one would launch and even then the ddcMD job wouldn't run at all. It sits at 0% utilization and doesn't appear in nvidia-smi as utilizing a GPU.
You'll want to make sure you are disabling cpu-affinity for the subinstance and not any other jobs. However, like @dongahn mentioned, this will allow the nested instance to "discover" all the CPUs, therefore it may overschedule jobs. However, libhwloc and thus the nested scheduler should be able to see all GPUs.
Right -- I actually realized I was still scheduling with GPUs in my mini run so I'll give that a shot here in a moment.
FYI -- I can circle back to this early this afternoon.
No worries -- I have four ddcmd processes on the node through python going, but they register as zombie processes. Trying to see if that's related.
Ok. Talked to @FrankD412 by phone. I believe we have come up with a reasonable workaround for today's DAT so that he can quickly test the rest of the workflow. @FrankD412 will update us on that. If that's working, I will defer my testing to either this weekend or early next week.
Alright, the workaround did get us past the zombie process issue. I ran into a different issue where ddcmd aborts due to a pmix error, but that's likely a different issue.
Just for reference in case (this happens when using a subprocess in Python only):
2020-09-04 11:51:00,693 - mummi.online:run:118 - INFO - cmd = /usr/gapps/kras/sierra/ddcmd-gpu8/bin/ddcMD-sierra -o object.data molecule.data
2020-09-04 11:51:05,704 - mummi.online:run:122 - INFO - Process Running? 1
2020-09-04 11:51:05,705 - mummi.online:run:124 - INFO - CUDA_VISIBLE_DEVICES=0
2020-09-04 11:51:05,705 - mummi.online:run:128 - INFO - ---------------- ddcMD stdout --------------
2020-09-04 11:51:05,705 - mummi.online:run:129 - ERROR - ---------------- ddcMD stderr --------------
[lassen13:44229] mca_base_component_repository_open: unable to open mca_schizo_flux.so: File not found (ignored)
[lassen13:44229] mca_base_component_repository_open: unable to open mca_pmix_flux.so: File not found (ignored)
[lassen13:44229] PMI_Init [../../../../../../opensrc/ompi/opal/mca/pmix/flux/pmix_flux.c:386:flux_init]: Operation failed
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
pmix init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[lassen13:44229] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Did you add -o mpi=spectrum to your flux mini run?
Yeah -- here's what it looks like: flux mini run -N 1 -n 1 -c 4 -o "mpi=spectrum" sh -c "export CUDA_VISIBLE_DEVICES=$CDEV ; /usr/gapps/kras/install/bin/autobind-12 cganalysis --simname $sim ...
@grondo , @SteVwonder , @dongahn ,
CUDA_VISIBLE_DEVICES for NVIDIA GPUs is playing nicely with hwloc.
Completing the experiments started in this thread:
leon@lassen15:~$ lstopo-no-graphics | grep -B1 CoProc
PCI 0004:04:00.0 (3D)
CoProc(CUDA) "cuda0"
--
PCI 0004:05:00.0 (3D)
CoProc(CUDA) "cuda1"
--
PCI 0035:03:00.0 (3D)
CoProc(CUDA) "cuda2"
--
PCI 0035:04:00.0 (3D)
CoProc(CUDA) "cuda3"
leon@lassen15:~$ CUDA_VISIBLE_DEVICES=3 lstopo-no-graphics | grep -B1 CoProc
PCI 0035:04:00.0 (3D)
CoProc(CUDA) "cuda0"
leon@lassen15:~$ CUDA_VISIBLE_DEVICES=1,2 lstopo-no-graphics | grep -B1 CoProc
PCI 0004:05:00.0 (3D)
CoProc(CUDA) "cuda0"
--
PCI 0035:03:00.0 (3D)
CoProc(CUDA) "cuda1"
Similar, good, behavior on the AMD side:
leon@corona8:~$ lstopo-no-graphics | grep -B1 CoProc
PCI 13:00.0 (Display)
CoProc(OpenCL) "opencl0d0"
--
PCI 23:00.0 (Display)
CoProc(OpenCL) "opencl0d1"
--
PCI 53:00.0 (Display)
CoProc(OpenCL) "opencl0d2"
--
PCI 73:00.0 (Display)
CoProc(OpenCL) "opencl0d3"
leon@corona8:~$ ROCR_VISIBLE_DEVICES=2 lstopo-no-graphics | grep -B1 CoProc
PCI 53:00.0 (Display)
CoProc(OpenCL) "opencl0d0"
leon@corona8:~$ ROCR_VISIBLE_DEVICES=2,3 lstopo-no-graphics | grep -B1 CoProc
PCI 53:00.0 (Display)
CoProc(OpenCL) "opencl0d0"
--
PCI 73:00.0 (Display)
CoProc(OpenCL) "opencl0d1"
Important to note that the CoProc hwloc identifiers (e.g., cudax or openclxxx) are "relative" to the GPUs available to that process. To figure out which GPU(s) one is using in a multi-GPU node, use the PCI ID instead!
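The point about relative identifiers can be illustrated by keying on the PCI address instead of the CoProc name. This is a small sketch that parses the `lstopo-no-graphics | grep -B1 CoProc` text pasted above; `map_pci_to_coproc` is a hypothetical helper, and the parsing assumes exactly the two-line "PCI ... / CoProc ..." layout shown in this thread.

```python
def map_pci_to_coproc(lstopo_text):
    """Map each PCI address to the CoProc name listed on the following line."""
    mapping = {}
    current_pci = None
    for line in lstopo_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "PCI" and len(tokens) >= 2:
            current_pci = tokens[1]
        elif tokens[0].startswith("CoProc") and current_pci is not None:
            mapping[current_pci] = tokens[1].strip('"')
            current_pci = None
    return mapping


# Output copied from the lassen15 runs above.
ALL_GPUS = """\
PCI 0004:04:00.0 (3D)
CoProc(CUDA) "cuda0"
--
PCI 0004:05:00.0 (3D)
CoProc(CUDA) "cuda1"
--
PCI 0035:03:00.0 (3D)
CoProc(CUDA) "cuda2"
--
PCI 0035:04:00.0 (3D)
CoProc(CUDA) "cuda3"
"""

RESTRICTED = """\
PCI 0004:05:00.0 (3D)
CoProc(CUDA) "cuda0"
--
PCI 0035:03:00.0 (3D)
CoProc(CUDA) "cuda1"
"""

full = map_pci_to_coproc(ALL_GPUS)
restricted = map_pci_to_coproc(RESTRICTED)
for pci, name in restricted.items():
    print(f"{pci}: {name} when restricted, {full[pci]} on the full node")
```

The name "cuda0" points at a different PCI address in each run, while the PCI address identifies the same physical GPU in both.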
Thanks @eleon.
@eleon: Do you know if the logical id of other resources like core and socket show the same behavior with hwloc as well?
For example, if a process is pinned to two cores, such as core1 in socket0 and core18 in socket1, when the process fetches its hwloc XML, would they be remapped to core0 in socket0 and core1 in socket1?
I think after filtering hwloc the logical IDs of all objects are remapped:
grondo@fluke6:~$ lstopo-no-graphics --no-io --merge --restrict binding
Machine (15GB)
Core L#0
PU L#0 (P#0)
PU L#1 (P#4)
Core L#1
PU L#2 (P#1)
PU L#3 (P#5)
Core L#2
PU L#4 (P#2)
PU L#5 (P#6)
Core L#3
PU L#6 (P#3)
PU L#7 (P#7)
grondo@fluke6:~$ taskset -c 7 lstopo-no-graphics --no-io --merge --restrict binding
Machine (15GB) + Core L#0 + PU L#0 (P#7)
@grondo:
Very Interesting!
It seems we can use remapping logic similar to what we use for "rank" to remap the resource IDs as well when supporting a nested instance with the RV1/JGF reader (except for excluded ranks, that is). Did I get this right?
Yes, I think you are correct, we may have to remap resource IDs when inheriting R from parent.
However, I'm a bit worried that this will not work as expected for GPUs, because an environment variable (e.g. CUDA_VISIBLE_DEVICES) doesn't lend itself to being used hierarchically. E.g. if gpu[1-2] are assigned to a child instance and are remapped to gpu[0-1], and then gpu0 is assigned to a job within the child instance, the job will have CUDA_VISIBLE_DEVICES=0 set.
Yeah, this is a problem.
I am wondering if we can solve this problem generally by adding some minimal remap info to the execution section of RV1. For example, if you have dont_remap or similar:
{
"execution": {
"dont_remap": "gpu",
"R_lite": [
{
"rank": "2-3",
"children": {
"cores": "2-3",
"gpus": "1-2"
}
}
]
}
}
A nested instance can use this augmented execution key to treat GPU IDs differently from rank and core IDs.
One way to do this would be: the nested instance populates the rebased IDs for all resource types so that it keeps the IDs in its own ID reference space. But for the types listed in dont_remap, like gpu in this case, it looks up the corresponding children key ("gpus": "1-2" in this case) to emit the IDs in the parent's ID reference space.
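The lookup described above can be sketched as follows. This only illustrates the proposal and is not existing Flux behavior: `dont_remap` is the hypothetical key from this comment, `emit_ids` and `expand_idset` are made-up helper names, and the sketch matches `dont_remap` entries against the children key ("gpus") for simplicity.

```python
def expand_idset(idset):
    """Expand a simple idset such as "1-2" or "0,2-3" into a list of ints."""
    ids = []
    for part in idset.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids


def emit_ids(r_lite_entry, children_key, local_ids, dont_remap=()):
    """Return the IDs a nested instance would emit for a job.

    local_ids are rebased IDs in the nested instance's own reference space
    (always starting at 0). For types listed in dont_remap, each local ID is
    translated back to the parent's ID via the children idset; other types
    keep their rebased IDs.
    """
    if children_key in dont_remap:
        parent_ids = expand_idset(r_lite_entry["children"][children_key])
        return [parent_ids[i] for i in local_ids]
    return list(local_ids)


# R_lite entry from the example above.
entry = {"rank": "2-3", "children": {"cores": "2-3", "gpus": "1-2"}}

# The nested scheduler picks its local gpu0 and local core0 for a job.
print(emit_ids(entry, "gpus", [0], dont_remap=("gpus",)))   # parent-space GPU ID
print(emit_ids(entry, "cores", [0], dont_remap=("gpus",)))  # rebased core ID
```

With this translation, the job that is assigned the child's local gpu0 would get the parent's GPU 1 in CUDA_VISIBLE_DEVICES rather than 0, while core IDs stay rebased.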
Yes, in the near term we may have to do something like that. The noremap resource list could also be part of the scheduler or instance configuration, with a default of "gpu", so that the extra key doesn't have to be inserted into R for every job. Unless you foresee cases where some jobs will have remapped GPUs and some will not?
In Rv2 perhaps, much like hwloc, we can track both physical and logical ids, and the logic to determine which to use can be pushed down into the affinity and containment modules.
That sounds like a great solution.
FWIW, CUDA_VISIBLE_DEVICES also supports UUIDs [1]: CUDA_VISIBLE_DEVICES=GPU-8932f937-d72c-4106-c12f-20bd9faed9f6, but I don't think ROCM is quite there yet. So at least on the nvidia front, physical IDs could potentially be a short term solution too.
One concern of using UUID would be difficulty in configuring for the system instance -- unless I'm missing something.
@grondo: since nested GPU scheduling is broken (with the hwloc reader), I think our immediate action item should be to tackle this problem as the driver for our RV1/JGF reader-based work. Since this is an interim solution before Rv2, perhaps we should hardcode gpu (compile it in) as the noremap resource type?
I am working with @rcarson3 to apply nested scheduling to his ExaAM/Exaconstit workflow on rzansel and found that the way GPUs are auto-detected within a nested instance is still an issue.
The top-level instance's flux resource list reports:
flux resource list
STATE NNODES NCORES NGPUS
free 2 80 8
allocated 0 0 0
down 0 0 0
We run the following:
#! /bin/bash
echo ${FLUX_URI}
flux resource list
flux mini batch -N 1 -c 40 -g 4 -n 1 --broker-opts=-Slog-filename=batch1.out ./hip_mechanics.flux
flux mini batch -N 1 -c 40 -g 4 -n 1 --broker-opts=-Slog-filename=batch2.out ./hip_mechanics.flux
flux jobs -a
flux queue drain
flux jobs -a
One of the nested instance log files reports the following. Note that the number of GPUs is reported as only 1.
2020-11-03T19:00:27.760411Z sched-simple.info[0]: resource update: {"resources":{"0":{"NUMANode":2,"Package":2,"Core":40,"PU":160,"GPU":1,"coreids":"0-39","cpuset":"8-87,96-175"}},"up":"0"}
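For checking several nested instance logs at once, the GPU count can be pulled out of the resource update line programmatically. This is a small sketch under the assumption that the JSON payload always follows the literal text "resource update: " as it does in the line above; `gpus_in_update` is a hypothetical helper name.

```python
import json

# Log line copied from the nested instance log above.
LOG_LINE = (
    '2020-11-03T19:00:27.760411Z sched-simple.info[0]: resource update: '
    '{"resources":{"0":{"NUMANode":2,"Package":2,"Core":40,"PU":160,"GPU":1,'
    '"coreids":"0-39","cpuset":"8-87,96-175"}},"up":"0"}'
)


def gpus_in_update(line):
    """Sum the GPU counts over all ranks in a sched-simple resource update line."""
    payload = json.loads(line.split("resource update: ", 1)[1])
    return sum(rank.get("GPU", 0) for rank in payload["resources"].values())


requested = 4  # from `flux mini batch ... -g 4` in the script above
seen = gpus_in_update(LOG_LINE)
print(f"requested {requested} GPUs, nested instance discovered {seen}")
```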
I will spend some time over the next week or so since this needs to be fixed for our RV1 extension work anyway.
What does the R look like for each of these jobs?
Note: in flux-core v0.21.0 and later, resources will no longer be dynamically discovered using the by_rank object pasted above. Instead the core resource module will use the R assigned by the parent. However, the dynamically gathered hwloc XML will still be provided to Fluxion via the resource.get-xml RPC, so perhaps the same issue will be present as far as Fluxion is concerned.
Two topical problems that led to this issue were discussed with @grondo and filed against flux-core:
https://github.com/flux-framework/flux-core/issues/3315
https://github.com/flux-framework/flux-core/issues/3316
@rcarson3:
We got to the bottom of this problem and found that the fix for each issue was a simple one-line change.
See @grondo's changes posted to the above flux-core issue tickets. Until these make it into our next release cycle, you probably want to keep using the same version of flux you have been using, so I'd suggest a workaround.
Since each of your nested instances will be assigned an entire compute node, the following change should work for the time being.
3,4c3,4
< flux mini batch -N 1 -c 40 -g 4 -n 1 --broker-opts=-Slog-filename=batch1.out ./hip_mechanics.flux
< flux mini batch -N 1 -c 40 -g 4 -n 1 --broker-opts=-Slog-filename=batch2.out ./hip_mechanics.flux
---
> flux mini batch -N 1 -c 40 -n 1 --broker-opts=-Slog-filename=batch1.out ./hip_mechanics.flux
> flux mini batch -N 1 -c 40 -n 1 --broker-opts=-Slog-filename=batch2.out ./hip_mechanics.flux
Although you don't request GPUs in your nested batch job, the combination of the above bugs works out such that each nested instance's scheduler will fully discover all 4 GPUs for nested scheduling.
I confirmed that my reproducer for your workflow works well, as shown below. Once @grondo's fixes make it into a regular release, I will install it for your use if the release is good enough for production.
STATE NNODES NCORES NGPUS
free 2 80 8
allocated 0 0 0
down 0 0 0
f5eG8ZZR
f5kYqWtF
JOBID USER NAME ST NTASKS NNODES RUNTIME RANKS
f5kYqWtF dahn hip_mechan R 1 1 0.198s 1
f5eG8ZZR dahn hip_mechan R 1 1 0.448s 0
JOBID USER NAME ST NTASKS NNODES RUNTIME RANKS
f5kYqWtF dahn hip_mechan CD 1 1 6.446s 1
f5eG8ZZR dahn hip_mechan CD 1 1 5.923s 0
rzansel61{dahn}202: cat flux-176731193344.out
local:///var/tmp/flux-8wSVeK/0/local
STATE NNODES NCORES NGPUS
free 1 40 4
allocated 0 0 0
down 0 0 0
@FrankD412: you may want to pay attention to this bug as well, since it will also affect your workflow. I believe the same workaround should work for you.
@dongahn -- I intended to mention previously, in our workflow we follow a slightly different pattern. Because the highest level Flux instance is aware of the GPUs in the allocation, we were changing to (and have changed to) our simulations being unbundled and requesting only their individually required resources. We submit all of these instances via a nested instance limited to their scope of resources, which has the CUDA_VISIBLE_DEVICES environment variable set. We still adhere to dropping the -g flag from all flux mini run calls in the sub-instances as I believe those would still come back as unsatisfiable.
Thanks @FrankD412. This sounds like a reasonable workaround. This problem has been fixed recently (in flux-core master), and better support in the Fluxion scheduler is also coming, so soon you should be able to do this with no workaround. I will keep you posted once we have actual releases for them.