Looking at the flux wreck ls output below, or the corresponding kvs entries, it looks like job 8 is running at least 11 processes, one on each of ranks 1-11, overlapping with job 7. In truth, it's running one process, only on rank 11. Somehow sched is populating the kvs with both the new resource request and all of the resources from the previous job, which is still running. This happens even with the release requested resources call put back in. The result, in terms of what actually executes, still looks correct, but the output is really messed up.
splash:test_hycop_20180329-130523$ flux wreck ls
ID NTASKS STATE START RUNTIME RANKS COMMAND
1 1 complete 2018-03-29T11:15:11 0.141s 1 hostname
2 10 complete 2018-03-29T11:18:03 0.400s [1-10] flux
3 1 complete 2018-03-29T11:18:03 2.754s 1 run-ddcmd.flu
4 10 complete 2018-03-29T11:37:25 54.472s [1-10] flux
5 1 complete 2018-03-29T11:37:25 3.045s [1-11] run-ddcmd.flu
6 10 complete 2018-03-29T11:49:28 0.152s [1-10] hostname
7 10 complete 2018-03-29T12:01:16 56.422s [1-10] flux
8 1 complete 2018-03-29T12:01:16 3.095s [1-11] run-ddcmd.flu
9 10 complete 2018-03-29T12:12:52 51.523s [1-10] flux
10 1 complete 2018-03-29T12:12:52 3.154s [1-11] run-ddcmd.flu
11 10 complete 2018-03-29T12:20:03 10.520m [1-10] flux
12 1 complete 2018-03-29T12:20:03 3.103s [1-11] run-ddcmd.flu
13 10 complete 2018-03-29T12:39:59 24.765m [1-10] flux
14 1 complete 2018-03-29T12:39:59 24.885m [1-11] run-ddcmd.flu
15 5 complete 2018-03-29T13:03:38 0.192s [1-5] hostname
16 15 complete 2018-03-29T13:03:43 0.296s [1-15] hostname
17 10 running 2018-03-29T13:05:25 8.380m [1-10] flux
18 1 running 2018-03-29T13:05:25 8.375m [1-11] run-ddcmd.flu
19 10 complete 2018-03-29T13:10:05 1.003m [1-21] sleep
20 1 complete 2018-03-29T13:10:12 0.258s [1-22] hostname
I'm looking at this part of the code anyway so I can take a look at it. What's the easiest way to reproduce this?
It seems like the easiest way is to run a multi-node job that takes a
little while, and submit another job that takes one before the first
ends.
Currently flux wreck ls is just summarizing the integer keys in lwj.x.y.ranks..
A quick fix would be to ignore rank directories that don't have a cores= field. Another side effect of having extra ranks dirs is that the presence of those directories determines on which ranks wrexecd is launched, so we may have a lot of unnecessary fork/exec goings on.
This issue reminded me that on my PR branch (#1399) flux wreck ls is broken. I still have to add code to parse R_lite there to get the RANKS field...
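As a rough illustration of the quick fix described above (counting only rank directories that carry a cores entry), here is a minimal Python sketch. It is not the flux wreck ls implementation: the function names are hypothetical, the input is assumed to be text in the `flux kvs dir -R` format shown in this thread, and the bracketed output for non-contiguous ranks is a guess at the RANKS column style.

```python
import re

# Matches lines such as "lwj.0.0.8.rank.10.cores = 1"
# (format taken from the kvs dumps quoted in this thread).
CORES_LINE = re.compile(r"^lwj\.[\d.]+\.rank\.(\d+)\.cores\s*=\s*(\d+)\s*$")


def ranks_with_cores(kvs_dump):
    """Return the sorted ranks that carry a cores entry in a kvs dump."""
    ranks = set()
    for line in kvs_dump.splitlines():
        match = CORES_LINE.match(line.strip())
        if match:
            # Rank directories without a cores key never match and are skipped.
            ranks.add(int(match.group(1)))
    return sorted(ranks)


def format_ranks(ranks):
    """Compress sorted ranks into a RANKS-style string, e.g. "[1-11]" or "6"."""
    if not ranks:
        return ""
    if len(ranks) == 1:
        return str(ranks[0])
    spans = []
    start = prev = ranks[0]
    for rank in ranks[1:]:
        if rank != prev + 1:
            spans.append((start, prev))
            start = rank
        prev = rank
    spans.append((start, prev))
    parts = [str(a) if a == b else f"{a}-{b}" for a, b in spans]
    return "[" + ",".join(parts) + "]"
```

Feeding it a dump with ranks 1 through 11 gives `[1-11]`; a line such as `lwj.0.0.8.rank.12.foo = 1` would be ignored because it has no cores key.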
The disturbing part of this is that looking at one of the ones that
should be running only one process, and end up running only one, all of
the rank.N.cores files are there, and they all have the value “1”.
for example, job 8 above has this in the kvs:
splash:test_hycop_20180329-142212$ flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.1.cores = 1
lwj.0.0.8.rank.10.cores = 1
lwj.0.0.8.rank.11.cores = 1
lwj.0.0.8.rank.2.cores = 1
lwj.0.0.8.rank.3.cores = 1
lwj.0.0.8.rank.4.cores = 1
lwj.0.0.8.rank.5.cores = 1
lwj.0.0.8.rank.6.cores = 1
lwj.0.0.8.rank.7.cores = 1
lwj.0.0.8.rank.8.cores = 1
lwj.0.0.8.rank.9.cores = 1
erm, oof. That's not expected! :cry:
Oh crap... it's not just running one process, it's actually overscheduling them. We only get output from one, somehow, but it runs them all. This is a really, really bad one.
Let me look into this.
Can you dump the whole lwj.0.0.8 directory?
This should be fixed after merge of #1399, if lwj.0.0.8.ntasks = 1, fyi. (One would hope anyway)
Unfortunately the instance is dead as of about a minute ago... The lwj.0.0.8.ntasks was 1 though.
also ncores and nnodes were 1, I had that saved off in history.
It seems like the easiest way is to run a multi-node job that takes a
little while, and submit another job that takes one before the first
ends.
Ok. I can use sleep <k> to emulate this of course. Which submit options did you use: -N x -n y, or some other combination? Knowing this would be very helpful.
The quick test I'm using is this:
flux submit -N 10 sleep 60
flux submit -N 1 -O out hostname
Seems to work okay on 4 nodes on quartz. Let me try 10 nodes.
quartz20{dahn}22: srun --pty --mpi=none -N 4 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
quartz20{dahn}21: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 4 sleep 260
submit: Submitted jobid 1
Try `flux --help' for more information.
quartz20{dahn}23: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 1 -O out hostname
submit: Submitted jobid 2
quartz20{dahn}24: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
ID NTASKS STATE START RUNTIME RANKS COMMAND
1 4 running 2018-03-29T14:58:03 24.266s [0-3] sleep
2 1 complete 2018-03-29T14:58:20 0.047s 0 hostname
quartz20{dahn}25: flux kvs dir -R lwj.0.0.1.rank
lwj.0.0.1.rank.0.cores = 1
lwj.0.0.1.rank.1.cores = 1
lwj.0.0.1.rank.2.cores = 1
lwj.0.0.1.rank.3.cores = 1
quartz20{dahn}26: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.2.rank
lwj.0.0.2.rank.0.cores = 1
FYI -- user issue and then dental appointment. I will pick this up tonight.
Ok, thanks dong, trying a vanilla flux to see if it's something in the perf tweaks we were making.
@SteVwonder @dongahn, any idea where an "error fetching job and event" error would be coming from using most recent master core and sched?
Found it, current sched can't tolerate the lack of a null state yet.
@trws: would it be better if I just merge my flux-sched PR with temporary emulator breakage? @SteVwonder is busy with other stuff at the moment.
At least for the 4 node case, that branch worked okay.
That PR should also speed up the scheduling performance for high job submission rate.
If that's what you were testing on, that would be much appreciated. I'm looking at not being able to run correctly at all right now.
@dongahn, your PR needs to be rebased before we can hit the merge button.
For this bug to repro, node_exclusive has to be turned on in sched.
Also, @dongahn, the output you sent shows overscheduling a node...
@trws: probably exclusivity isn't turned on?
Actually, should have seen your comment above.
This is now a complete blocker. It means I can't run anything without overscheduling.
It appears the sched is configured to do only core-level scheduling!
I guess you changed this code and turned on the node exclusive scheduling and the error cropped up? If so, I will do the same and reproduce the misbehavior.
I did, if it’s turned on, or even if you just set the number of node
resources to request to 1 rather than 0 in the request generation, it
goes completely off the deep end. The only workaround I’ve thought of
is to use hwloc reload to load a single core resource description in for
each node, so they all pretend to only have one core.
bad!
Using the latest master for both flux-core and sched, with a one-line change in this code setting node exclusive to true, I reproduced the over-scheduling problem. This would be the right behavior for core-level scheduling, but it is definitely incorrect for exclusive node-level scheduling.
I will first see if I can fix this issue for this simple reproducer and then see if I can use a more complex case @trws posted in the beginning of this issue.
quartz1922{dahn}52: salloc -N 4 -ppdebug
salloc: Granted job allocation 535400
quartz10{dahn}21: srun --pty --mpi=none -N 4 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
quartz10{dahn}21: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 4 sleep 60
submit: Submitted jobid 1
quartz10{dahn}22: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 1 -O out hostname
submit: Submitted jobid 2
quartz10{dahn}23: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
ID NTASKS STATE START RUNTIME RANKS COMMAND
1 4 running 2018-03-29T18:53:03 16.995s [0-3] sleep
2 1 exited 2018-03-29T18:53:13 0.059s 0 hostname
quartz10{dahn}24: flux kvs dir -R lwj.0.0.1.rank
lwj.0.0.1.rank.0.cores = 1
lwj.0.0.1.rank.1.cores = 1
lwj.0.0.1.rank.2.cores = 1
lwj.0.0.1.rank.3.cores = 1
quartz10{dahn}25: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.2.rank
lwj.0.0.2.rank.0.cores = 1
An equally acceptable result for splash would be to fix it so that the ncores, or cores per task, or something can be used to say ntasks:1 ncores:<cores-per-node> to get exclusive behavior. For now, we're actually using the single-core hwloc xml solution in production to get results over the weekend. 😨
Well. I have to take it back. I ran the test again with that one line change and it seems nodes are exclusively scheduled.
quartz10{dahn}22: srun --pty --mpi=none -N 8 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
quartz10{dahn}21: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 8 sleep 60
submit: Submitted jobid 1
quartz10{dahn}22: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 7 sleep 10
submit: Submitted jobid 2
quartz10{dahn}23: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 6 sleep 10
submit: Submitted jobid 3
quartz10{dahn}24: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 5 sleep 10
submit: Submitted jobid 4
quartz10{dahn}25: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 4 sleep 10
submit: Submitted jobid 5
quartz10{dahn}26: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 3 sleep 10
submit: Submitted jobid 6
quartz10{dahn}27: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 2 sleep 10
submit: Submitted jobid 7
quartz10{dahn}28: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 1 sleep 10
submit: Submitted jobid 8
quartz10{dahn}42: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
ID NTASKS STATE START RUNTIME RANKS COMMAND
1 8 exited 2018-03-29T20:43:25 1.006m [0-7] sleep
2 7 exited 2018-03-29T20:44:26 10.100s [0-6] sleep
3 6 exited 2018-03-29T20:44:36 10.079s [0-5] sleep
4 5 exited 2018-03-29T20:44:46 10.079s [0-4] sleep
5 4 exited 2018-03-29T20:44:56 10.085s [0-3] sleep
6 3 running 2018-03-29T20:44:56 1.158m [0-6] sleep
7 2 exited 2018-03-29T20:45:06 10.079s [0-1] sleep
8 1 running 2018-03-29T20:45:06 59.338s [0-2] sleep
The problem I see is job 6 and job 8 are incorrectly marked as running. And if I do ps,
quartz10{dahn}45: ps x
PID TTY STAT TIME COMMAND
14306 pts/0 Ss 0:00 -bin/tcsh
16282 pts/0 Sl+ 0:00 srun --pty --mpi=none -N 8 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
16283 pts/0 S+ 0:00 srun --pty --mpi=none -N 8 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
16303 pts/1 Ssl 0:01 /g/g0/dahn/workspace/flux-cancel/inst/libexec/flux/cmd/flux-broker
16406 pts/1 S 0:00 -bin/tcsh
16486 ? S 0:00 /g/g0/dahn/workspace/flux-cancel/inst/libexec/flux/wrexecd --lwj-id=6 --kvs-path=lwj.0.0.6
16495 ? S 0:00 /g/g0/dahn/workspace/flux-cancel/inst/libexec/flux/wrexecd --lwj-id=8 --kvs-path=lwj.0.0.8
16509 pts/1 R+ 0:00 ps x
I see the wrexecd processes are still running... perhaps they didn't get notified that the programs exited?
I will do some more testing for scheduling, though.
Ah, actually
6 3 running 2018-03-29T20:44:56 1.158m [0-6] sleep
8 1 running 2018-03-29T20:45:06 59.338s [0-2] sleep
Those two jobs seem wrong, and that may be why these jobs are still marked as running
No wonder:
quartz10{dahn}48: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.6.rank
lwj.0.0.6.rank.0.cores = 1
lwj.0.0.6.rank.1.cores = 1
lwj.0.0.6.rank.2.cores = 1
lwj.0.0.6.rank.3.cores = 1
lwj.0.0.6.rank.4.cores = 1
lwj.0.0.6.rank.5.cores = 1
lwj.0.0.6.rank.6.cores = 1
quartz10{dahn}49: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.0.cores = 1
lwj.0.0.8.rank.1.cores = 1
lwj.0.0.8.rank.2.cores = 1
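The symptom in the listing above (job 6 has 3 tasks but 7 ranks, job 8 has 1 task but 3 ranks) can be checked mechanically. The following Python sketch is only a diagnostic heuristic written for this thread, not part of flux: the function names are hypothetical, it assumes the seven-column `flux wreck ls` layout shown here, and it assumes that a healthy job never lists more ranks than tasks.

```python
def expand_ranks(field):
    """Expand a RANKS field such as "[0-6]" or "6" into a list of ints."""
    ranks = []
    for part in field.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        # A single rank like "6" has no upper bound, so reuse the lower one.
        ranks.extend(range(int(lo), int(hi or lo) + 1))
    return ranks


def suspicious_jobs(wreck_ls_output):
    """Return (job_id, ntasks, nranks) for rows listing more ranks than tasks."""
    result = []
    for line in wreck_ls_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or not fields[0].isdigit():
            continue  # skip the header, prompts, and anything that is not a job row
        job_id, ntasks = int(fields[0]), int(fields[1])
        nranks = len(expand_ranks(fields[5]))
        if nranks > ntasks:
            result.append((job_id, ntasks, nranks))
    return result
```

Run against the eight-job listing above, this flags jobs 6 and 8 and leaves the other six alone.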
Still looking. But I have to guess the bug is within select_resources of the scheduler plugin code called from here
I looked at the log file for job 6 (flux submit -N 3 sleep 10). While the log says:
sched.debug[0]: Found 4 node(s) for job 6, required: 3
When I dumped the select_tree object, it contains 7 compute nodes selected! Then, the logic to generate lwj...rank.N.cores emits the core counts for those 7 nodes.
Need to go a bit deeper to pinpoint the bug...
A bit difficult to diagnose because of all this recursion, but I think I got it. It seems to be this code that is giving trouble for node-exclusive scheduling.
When the node request is exclusive, this code shouldn't select the node type in this else branch. The node level selection should only be done in the if branch.
For quick testing, I added the following conditional:
index 9e3764f..0dbda87 100644
--- a/sched/sched_fcfs.c
+++ b/sched/sched_fcfs.c
@@ -228,7 +228,8 @@ resrc_tree_t *select_resources (flux_t *h, resrc_api_ctx_t *rsapi,
* defined. E.g., it might only stipulate a node with 4 cores
* and omit the intervening socket.
*/
- selected_tree = resrc_tree_new (selected_parent, resrc);
+ if (strcmp (resrc_type (resrc), "node") != 0)
+ selected_tree = resrc_tree_new (selected_parent, resrc);
children = resrc_tree_children (found_tree);
child_tree = resrc_tree_list_first (children);
while (child_tree) {
W/ this, at least my reproducer behaves correctly:
quartz16{dahn}38: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
ID NTASKS STATE START RUNTIME RANKS COMMAND
1 8 exited 2018-03-29T22:53:00 1.002m [0-7] sleep
2 7 exited 2018-03-29T22:54:01 10.097s [0-6] sleep
3 6 exited 2018-03-29T22:54:11 10.077s [0-5] sleep
4 5 exited 2018-03-29T22:54:21 10.081s [0-4] sleep
5 4 exited 2018-03-29T22:54:31 10.087s [0-3] sleep
6 3 exited 2018-03-29T22:54:31 10.061s [4-6] sleep
7 2 exited 2018-03-29T22:54:41 10.067s [4-5] sleep
8 1 exited 2018-03-29T22:54:41 10.083s 6 sleep
quartz16{dahn}39: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.6.rank
lwj.0.0.6.rank.4.cores = 1
lwj.0.0.6.rank.5.cores = 1
lwj.0.0.6.rank.6.cores = 1
quartz16{dahn}40: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.6.cores = 1
@trws: I will need more validation including its effect on core-level scheduling and also make the code a bit more generic... But it seems worth a shot and maybe can be used for production run over the weekend...
Well it is kind of getting late, and thinking about this, the patch logic may not be quite complete. I will spend a bit more time tomorrow morning. But this IS the right bug site.
Wow, heroic effort @dongahn! Nice work!
Those two jobs seem wrong, and that may be why these jobs are still marked as running
There is a bug in the wreck use of Rlite (derived from ranks.N.cores) here. It is setting the nnodes of the job to the total number of ranks assigned, not the number of ranks used. However, what should the correct behavior be? We could either run successfully on the number of actually used nodes, or generate a fatal error at job startup when ntasks < nnodes.
Is that correct output? I can’t tell by when you submitted, but it
looks like jobs 1 and 2 are completely overlapped?
Yes this is correct. job 1 ran 1 min and exited. Then job 2 started right after and then ran 10 sec. I still haven't had a chance to think about a complete solution though.
Gotcha. That’s definitely an improvement.
However, what should the correct behavior be? We could either run successfully on the number of actually used nodes, or generate a fatal error at job startup when ntasks < nnodes.
Yes, this is a bug that cascades from the scheduler bug. I sort of like the first semantics, with a big warning message... But I can see why you may want the second semantics.
I understand this problem better now. This is a deficiency within the scheduler for node exclusive scheduling mode. I can see why this case wasn't covered as the sched folks have focused on core-level scheduling as our testing coverage and the initial use cases.
I think the better place to fix this problem is actually in the resrc_tree_search function.
The purpose of this else branch is so that one can select the high-level resources leading to each exclusively allocated resource. This is needed when the job request is partially specified: when only core: 1 is requested, the logic should select the cluster, node, socket that contain that particular core.
But there is a deficiency in the code. When resrc walks the resource tree and visits a node vertex that is already allocated to another job, the match test will fail and this logic puts us on the else branch.
But unfortunately, the else branch doesn't check exclusivity. So the visiting node can be selected even if it's exclusively allocated to another job. Over scheduling!
My current patch is:
diff --git a/resrc/resrc_reqst.c b/resrc/resrc_reqst.c
index 9a3244c..fff6748 100644
--- a/resrc/resrc_reqst.c
+++ b/resrc/resrc_reqst.c
@@ -551,7 +551,10 @@ int64_t resrc_tree_search (resrc_api_ctx_t *ctx,
nfound = 1;
resrc_reqst->nfound++;
}
- } else if (resrc_tree_num_children (resrc_phys_tree (resrc_in))) {
+ } else if (resrc_tree_num_children (resrc_phys_tree (resrc_in))
+ && !(resrc_reqst_exclusive (resrc_reqst)
+ && (resrc_size_allocs (resrc_in) || resrc_size_reservtns (resrc_in)))) {
+
/*
* This clause visits the children of the current resource
* searching for a match to the resource request. The found
@@ -562,6 +565,12 @@ int64_t resrc_tree_search (resrc_api_ctx_t *ctx,
* defined. E.g., it might only stipulate a node with 4 cores
* and omit the intervening socket.
*/
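To make the shape of that guard easier to see, here is a toy Python model of the search. It is emphatically not the resrc code: the `Resource` class, the `search` function, and the `guard` switch are hypothetical, and the model assumes (purely to stay small) that an exclusively held node is flagged as allocated while its cores are not individually marked.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    kind: str                 # "cluster", "node" or "core"
    name: str
    allocated: bool = False   # toy stand-in for "has allocations or reservations"
    children: List["Resource"] = field(default_factory=list)


def search(resource, want_kind, exclusive, guard, path=()):
    """Return one path (tuple of names) per selected resource of want_kind."""
    path = path + (resource.name,)
    if resource.kind == want_kind and not resource.allocated:
        return [path]  # "if" branch: the resource itself satisfies the request
    descend = bool(resource.children)
    if guard and exclusive and resource.allocated:
        descend = False  # patched "else if": never look below a busy resource
    found = []
    if descend:
        # "else" branch: this resource becomes an ancestor of whatever matches below
        for child in resource.children:
            found.extend(search(child, want_kind, exclusive, guard, path))
    return found


def make_cluster():
    """Two single-core nodes; node0 is exclusively held by another job."""
    node0 = Resource("node", "node0", allocated=True,
                     children=[Resource("core", "node0.core0")])
    node1 = Resource("node", "node1",
                     children=[Resource("core", "node1.core0")])
    return Resource("cluster", "cluster", children=[node0, node1])
```

With `guard=False`, an exclusive request for a core selects a path through node0 even though another job holds it, which is the over-scheduling described above. With `guard=True`, only the path through node1 is returned, and a non-exclusive request is unaffected by the guard.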
I can't do a PR for this yet because I don't fully understand its impact on backfill schedulers. But for FCFS, which @trws uses, I think there is a good chance this will work.
Tom, you are welcome to test this on your branch if you're interested.
I need to work on various milestone reports this afternoon. I'll see if I can circle back and understand its impact to backfill.
I'm trying this out in the splash branch, will see what happens.
Verdict?
It seems to help. We're running it right now, but the predominant mode is actually using the new ncores functionality, which helps a lot. I'm actually getting coscheduling the way they want now.
Great! Please keep this open though. Need to double check this is safe with backfill scheduling before posting a PR.
It turns out this resrc logic will require a redesign. For backfill cases, in this else branch we even need to check the future reservation state when determining whether to select the visited (high-level) resource.
One can call resrc_walltime_match once again in this branch after creating a "fake" resrc_reqst object. But patching things like this over and over will likely make the code pretty unreadable.
When I circle around, I will try to patch this as much as possible. But it seems the future really should be the new resource layer.
For @trws' purpose, this should work fine though. That is, this commit, https://github.com/flux-framework/flux-sched/pull/305/commits/2317aaabeeeaa6cab41bdb46d561ee46d75d8814 in flux-sched PR #305.
FYI -- I think I patched this enough so resrc now should work for fcfs AND backfill in the latest PR I just pushed forward: https://github.com/flux-framework/flux-sched/pull/305
OK. This has been fixed in https://github.com/flux-framework/flux-sched/pull/305.