Flux-core: disturbing but apparently innocuous bug in reporting of ranks working on a job

Created on 29 Mar 2018 · 51 Comments · Source: flux-framework/flux-core

Looking at the flux wreck ls output below, or the corresponding kvs entries, it looks like job 8 is running at least 11 processes, one on each of ranks 1-11, overlapping with job 7. In truth, it's running one process, only on rank 11. Somehow sched is populating the kvs with both the new resource request and all of the resources from the previous job, which is still running. This is even with the release requested resources call put back in. Somehow the result, in terms of executing the thing, still looks correct, but the output is really messed up.

splash:test_hycop_20180329-130523$ flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      1 complete   2018-03-29T11:15:11       0.141s        1 hostname
     2     10 complete   2018-03-29T11:18:03       0.400s   [1-10] flux
     3      1 complete   2018-03-29T11:18:03       2.754s        1 run-ddcmd.flu
     4     10 complete   2018-03-29T11:37:25      54.472s   [1-10] flux
     5      1 complete   2018-03-29T11:37:25       3.045s   [1-11] run-ddcmd.flu
     6     10 complete   2018-03-29T11:49:28       0.152s   [1-10] hostname
     7     10 complete   2018-03-29T12:01:16      56.422s   [1-10] flux
     8      1 complete   2018-03-29T12:01:16       3.095s   [1-11] run-ddcmd.flu
     9     10 complete   2018-03-29T12:12:52      51.523s   [1-10] flux
    10      1 complete   2018-03-29T12:12:52       3.154s   [1-11] run-ddcmd.flu
    11     10 complete   2018-03-29T12:20:03      10.520m   [1-10] flux
    12      1 complete   2018-03-29T12:20:03       3.103s   [1-11] run-ddcmd.flu
    13     10 complete   2018-03-29T12:39:59      24.765m   [1-10] flux
    14      1 complete   2018-03-29T12:39:59      24.885m   [1-11] run-ddcmd.flu
    15      5 complete   2018-03-29T13:03:38       0.192s    [1-5] hostname
    16     15 complete   2018-03-29T13:03:43       0.296s   [1-15] hostname
    17     10 running    2018-03-29T13:05:25       8.380m   [1-10] flux
    18      1 running    2018-03-29T13:05:25       8.375m   [1-11] run-ddcmd.flu
    19     10 complete   2018-03-29T13:10:05       1.003m   [1-21] sleep
    20      1 complete   2018-03-29T13:10:12       0.258s   [1-22] hostname

Source

trws

All 51 comments

I'm looking at this part of the code anyway so I can take a look at it. What's the easiest way to reproduce this?

dongahn on 29 Mar 2018

It seems like the easiest way is to run a multi-node job that takes a
little while, and submit another job that takes one before the first
ends.

trws on 29 Mar 2018

Currently flux wreck ls is just summarizing the integer keys in lwj.x.y.ranks..

A quick fix would be to ignore rank directories that don't have a cores= field. Another side effect of having extra ranks dirs is that the presence of those directories determines on which ranks wrexecd is launched, so we may have a lot of unnecessary fork/exec goings on.
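
The quick fix described above can be sketched in C. This is a minimal illustration only, not the actual flux wreck ls code: it assumes the input is a set of lines in the format printed by flux kvs dir -R, and the helpers rank_with_cores and summarize_ranks are hypothetical names made up for this example. Lines without a cores entry are skipped, and the remaining ranks are collapsed into a compact range.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RANKS 256

/* Hypothetical helper: pull N out of "<job>.rank.N.cores = V".
 * Returns -1 for lines without a ".rank.N.cores" entry, so rank
 * directories that lack a cores field are ignored. */
static int rank_with_cores (const char *line)
{
    const char *p = strstr (line, ".rank.");
    char *end;
    long rank;

    if (!p)
        return -1;
    p += strlen (".rank.");
    rank = strtol (p, &end, 10);
    if (end == p || rank < 0 || strncmp (end, ".cores", 6) != 0)
        return -1;
    return (int) rank;
}

static int cmp_int (const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

/* Write a compact summary such as "6" or "[1-3,7]" into buf. */
static void summarize_ranks (const char **lines, size_t n,
                             char *buf, size_t len)
{
    int ranks[MAX_RANKS];
    size_t count = 0, used = 0, i, j;

    assert (buf != NULL && len > 0);
    buf[0] = '\0';
    for (i = 0; i < n && count < MAX_RANKS; i++) {
        int r = rank_with_cores (lines[i]);
        if (r >= 0)
            ranks[count++] = r;
    }
    if (count == 0)
        return;
    qsort (ranks, count, sizeof (ranks[0]), cmp_int);
    if (count == 1) {
        snprintf (buf, len, "%d", ranks[0]);
        return;
    }
    used = (size_t) snprintf (buf, len, "[");
    for (i = 0; i < count && used < len; i = j + 1) {
        /* Extend j to the end of the run of consecutive ranks. */
        for (j = i; j + 1 < count && ranks[j + 1] <= ranks[j] + 1; j++)
            ;
        if (ranks[j] > ranks[i])
            used += (size_t) snprintf (buf + used, len - used, "%s%d-%d",
                                       i ? "," : "", ranks[i], ranks[j]);
        else
            used += (size_t) snprintf (buf + used, len - used, "%s%d",
                                       i ? "," : "", ranks[i]);
    }
    if (used < len)
        snprintf (buf + used, len - used, "]");
}
```

Note that this filter alone would not hide the extra ranks if the stale rank directories also carry a cores value; it only drops directories with no cores entry at all.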

This issue reminded me that on my PR branch (#1399) flux wreck ls is broken. I still have to add code to parse R_lite there to get the RANKS field...

grondo on 29 Mar 2018

The disturbing part of this is that looking at one of the ones that
should be running only one process, and end up running only one, all of
the rank.N.cores files are there, and they all have the value “1”.

trws on 29 Mar 2018

for example, job 8 above has this in the kvs:

splash:test_hycop_20180329-142212$ flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.1.cores = 1
lwj.0.0.8.rank.10.cores = 1
lwj.0.0.8.rank.11.cores = 1
lwj.0.0.8.rank.2.cores = 1
lwj.0.0.8.rank.3.cores = 1
lwj.0.0.8.rank.4.cores = 1
lwj.0.0.8.rank.5.cores = 1
lwj.0.0.8.rank.6.cores = 1
lwj.0.0.8.rank.7.cores = 1
lwj.0.0.8.rank.8.cores = 1
lwj.0.0.8.rank.9.cores = 1

trws on 29 Mar 2018

The disturbing part of this is that looking at one of the ones that
should be running only one process, and end up running only one, all of
the rank.N.cores files are there, and they all have the value “1”.

erm, oof. That's not expected! :cry:

grondo on 29 Mar 2018

Oh crap... it's not just running one process, it's actually overscheduling them. We only get output from one, somehow, but it runs them all. This is a really, really bad one.

trws on 29 Mar 2018

This is a really, really bad one.

Let me look into this.

dongahn on 29 Mar 2018

Can you dump the whole lwj.0.0.8 directory?

This should be fixed after merge of #1399, if lwj.0.0.8.ntasks = 1, fyi. (One would hope anyway)

grondo on 29 Mar 2018

Unfortunately the instance is dead as of about a minute ago... The lwj.0.0.8.ntasks was 1 though.

trws on 29 Mar 2018

also ncores and nnodes were 1, I had that saved off in history.

trws on 29 Mar 2018

It seems like the easiest way is to run a multi-node job that takes a
little while, and submit another job that takes one before the first
ends.

Ok. I can use sleep <k> to emulate this, of course. What submit options did you use: -N x -n y, or some other combination? Knowing this would be very helpful.

dongahn on 29 Mar 2018

The quick test I'm using is this:

flux submit -N 10 sleep 60
flux submit -N 1 -O out hostname

trws on 29 Mar 2018

Seems to work okay on 4 nodes on quartz. Let me try 10 nodes.

quartz20{dahn}22: srun --pty --mpi=none -N 4 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
quartz20{dahn}21: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 4 sleep 260
submit: Submitted jobid 1
Try `flux --help' for more information.
quartz20{dahn}23: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 1 -O out hostname
submit: Submitted jobid 2
quartz20{dahn}24: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      4 running    2018-03-29T14:58:03      24.266s    [0-3] sleep
     2      1 complete   2018-03-29T14:58:20       0.047s        0 hostname
quartz20{dahn}25: flux kvs dir -R lwj.0.0.1.rank
lwj.0.0.1.rank.0.cores = 1
lwj.0.0.1.rank.1.cores = 1
lwj.0.0.1.rank.2.cores = 1
lwj.0.0.1.rank.3.cores = 1
quartz20{dahn}26: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.2.rank
lwj.0.0.2.rank.0.cores = 1

dongahn on 30 Mar 2018

FYI -- user issue and then dental appointment. I will pick this up tonight.

dongahn on 30 Mar 2018

Ok, thanks dong, trying a vanilla flux to see if it's something in the perf tweaks we were making.

trws on 30 Mar 2018

@SteVwonder @dongahn, any idea where a "error fetching job and event" error would be coming from using most recent master core and sched?

trws on 30 Mar 2018

Found it, current sched can't tolerate the lack of a null state yet.

trws on 30 Mar 2018

@trws: would it be better if I just merge my flux-sched PR with temporary emulator breakage? @SteVwonder is busy with other stuff at the moment.

At least for the 4 node case, that branch worked okay.

dongahn on 30 Mar 2018

That PR should also speed up the scheduling performance for high job submission rate.

dongahn on 30 Mar 2018

If that's what you were testing on, that would be much appreciated. I'm looking at not being able to run correctly at all right now.

trws on 30 Mar 2018

@dongahn, your PR needs to be rebased before we can hit the merge button.

grondo on 30 Mar 2018

For this bug to repro, node_exclusive has to be turned on in sched.

trws on 30 Mar 2018

Also, @dongahn, the output you sent shows overscheduling a node...

trws on 30 Mar 2018

@trws: probably exclusivity isn't turned on?

dongahn on 30 Mar 2018

Actually, should have seen your comment above.

dongahn on 30 Mar 2018

This is now a complete blocker. It means I can't run anything without overscheduling.

trws on 30 Mar 2018

It appears the sched is configured to do only core-level scheduling!

I guess you changed this code and turned on the node exclusive scheduling and the error cropped up? If so, I will do the same and reproduce the misbehavior.

dongahn on 30 Mar 2018

I did, if it’s turned on, or even if you just set the number of node
resources to request to 1 rather than 0 in the request generation, it
goes completely off the deep end. The only workaround I’ve thought of
is to use hwloc reload to load a single core resource description in for
each node, so they all pretend to only have one core.

trws on 30 Mar 2018

bad!

dongahn on 30 Mar 2018

Using the latest master for both flux-core and sched with one line change in this code setting the node exclusive to be true, I reproduced the over-scheduling problem. This would be the right behavior if you do core-level scheduling, but def. incorrect scheduling for exclusive node-level scheduling.

I will first see if I can fix this issue for this simple reproducer and then see if I can use a more complex case @trws posted in the beginning of this issue.

quartz1922{dahn}52: salloc -N 4 -ppdebug
salloc: Granted job allocation 535400
quartz10{dahn}21: srun --pty --mpi=none -N 4 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
quartz10{dahn}21: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 4 sleep 60
submit: Submitted jobid 1
quartz10{dahn}22: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 1 -O out hostname
submit: Submitted jobid 2
quartz10{dahn}23: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      4 running    2018-03-29T18:53:03      16.995s    [0-3] sleep
     2      1 exited     2018-03-29T18:53:13       0.059s        0 hostname
quartz10{dahn}24: flux kvs dir -R lwj.0.0.1.rank
lwj.0.0.1.rank.0.cores = 1
lwj.0.0.1.rank.1.cores = 1
lwj.0.0.1.rank.2.cores = 1
lwj.0.0.1.rank.3.cores = 1
quartz10{dahn}25: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.2.rank
lwj.0.0.2.rank.0.cores = 1

dongahn on 30 Mar 2018

An equally acceptable result for splash would be to fix it so that the ncores, or cores per task, or something can be used to say ntasks:1 ncores:<cores-per-node> to get exclusive behavior. For now, we're actually using the single-core hwloc xml solution in production to get results over the weekend. 😨

trws on 30 Mar 2018

Well. I have to take it back. I ran the test again with that one line change and it seems nodes are exclusively scheduled.

quartz10{dahn}22: srun --pty --mpi=none -N 8 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
quartz10{dahn}21: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 8 sleep 60
submit: Submitted jobid 1
quartz10{dahn}22: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 7 sleep 10
submit: Submitted jobid 2
quartz10{dahn}23: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 6 sleep 10
submit: Submitted jobid 3
quartz10{dahn}24: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 5 sleep 10
submit: Submitted jobid 4
quartz10{dahn}25: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 4 sleep 10
submit: Submitted jobid 5
quartz10{dahn}26: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 3 sleep 10
submit: Submitted jobid 6
quartz10{dahn}27: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 2 sleep 10
submit: Submitted jobid 7
quartz10{dahn}28: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux submit -N 1 sleep 10
submit: Submitted jobid 8

quartz10{dahn}42: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      8 exited     2018-03-29T20:43:25       1.006m    [0-7] sleep
     2      7 exited     2018-03-29T20:44:26      10.100s    [0-6] sleep
     3      6 exited     2018-03-29T20:44:36      10.079s    [0-5] sleep
     4      5 exited     2018-03-29T20:44:46      10.079s    [0-4] sleep
     5      4 exited     2018-03-29T20:44:56      10.085s    [0-3] sleep
     6      3 running    2018-03-29T20:44:56       1.158m    [0-6] sleep
     7      2 exited     2018-03-29T20:45:06      10.079s    [0-1] sleep
     8      1 running    2018-03-29T20:45:06      59.338s    [0-2] sleep

The problem I see is job 6 and job 8 are incorrectly marked as running. And if I do ps,

quartz10{dahn}45: ps x
   PID TTY      STAT   TIME COMMAND
 14306 pts/0    Ss     0:00 -bin/tcsh
 16282 pts/0    Sl+    0:00 srun --pty --mpi=none -N 8 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
 16283 pts/0    S+     0:00 srun --pty --mpi=none -N 8 /g/g0/dahn/workspace/flux-cancel/inst/bin/flux start
 16303 pts/1    Ssl    0:01 /g/g0/dahn/workspace/flux-cancel/inst/libexec/flux/cmd/flux-broker
 16406 pts/1    S      0:00 -bin/tcsh
 16486 ?        S      0:00 /g/g0/dahn/workspace/flux-cancel/inst/libexec/flux/wrexecd --lwj-id=6 --kvs-path=lwj.0.0.6
 16495 ?        S      0:00 /g/g0/dahn/workspace/flux-cancel/inst/libexec/flux/wrexecd --lwj-id=8 --kvs-path=lwj.0.0.8
 16509 pts/1    R+     0:00 ps x

I see wrexecd processes are still running... probably didn't get the notification of the program exits?

I will do some more testing for scheduling, though.

dongahn on 30 Mar 2018

Ah, actually

     6      3 running    2018-03-29T20:44:56       1.158m    [0-6] sleep
     8      1 running    2018-03-29T20:45:06      59.338s    [0-2] sleep

Those two jobs seem wrong, and that may be why these jobs are still marked as running.

dongahn on 30 Mar 2018

No wonder:

quartz10{dahn}48: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.6.rank
lwj.0.0.6.rank.0.cores = 1
lwj.0.0.6.rank.1.cores = 1
lwj.0.0.6.rank.2.cores = 1
lwj.0.0.6.rank.3.cores = 1
lwj.0.0.6.rank.4.cores = 1
lwj.0.0.6.rank.5.cores = 1
lwj.0.0.6.rank.6.cores = 1
quartz10{dahn}49: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.0.cores = 1
lwj.0.0.8.rank.1.cores = 1
lwj.0.0.8.rank.2.cores = 1

dongahn on 30 Mar 2018

Still looking. But I have to guess the bug is within select_resources of the scheduler plugin code called from here

I looked at the log file for job 6 (flux submit -N 3 sleep 10). The log says:

sched.debug[0]: Found 4 node(s) for job 6, required: 3

When I dumped the select_tree object, it contains 7 compute nodes selected! Then, the logic to generate lwj...rank.N.cores emits the core counts for those 7 nodes.

Need to go a bit deeper to pinpoint the bug...

dongahn on 30 Mar 2018

A bit difficult to diagnose because of all this recursion, but I think I got it. It seems this code is what is giving trouble for node-exclusive scheduling.

When the node request is exclusive, this code shouldn't select the node type in this else branch. The node level selection should only be done in the if branch.

For quick testing, I added the following conditional:

index 9e3764f..0dbda87 100644
--- a/sched/sched_fcfs.c
+++ b/sched/sched_fcfs.c
@@ -228,7 +228,8 @@ resrc_tree_t *select_resources (flux_t *h, resrc_api_ctx_t *rsapi,
          * defined.  E.g., it might only stipulate a node with 4 cores
          * and omit the intervening socket.
          */
-        selected_tree = resrc_tree_new (selected_parent, resrc);
+        if (strcmp (resrc_type (resrc), "node") != 0)
+            selected_tree = resrc_tree_new (selected_parent, resrc);
         children = resrc_tree_children (found_tree);
         child_tree = resrc_tree_list_first (children);
         while (child_tree) {

W/ this, at least my reproducer behaves correctly:

quartz16{dahn}38: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      8 exited     2018-03-29T22:53:00       1.002m    [0-7] sleep
     2      7 exited     2018-03-29T22:54:01      10.097s    [0-6] sleep
     3      6 exited     2018-03-29T22:54:11      10.077s    [0-5] sleep
     4      5 exited     2018-03-29T22:54:21      10.081s    [0-4] sleep
     5      4 exited     2018-03-29T22:54:31      10.087s    [0-3] sleep
     6      3 exited     2018-03-29T22:54:31      10.061s    [4-6] sleep
     7      2 exited     2018-03-29T22:54:41      10.067s    [4-5] sleep
     8      1 exited     2018-03-29T22:54:41      10.083s        6 sleep

quartz16{dahn}39: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.6.rank
lwj.0.0.6.rank.4.cores = 1
lwj.0.0.6.rank.5.cores = 1
lwj.0.0.6.rank.6.cores = 1

quartz16{dahn}40: /g/g0/dahn/workspace/flux-cancel/inst/bin/flux kvs dir -R lwj.0.0.8.rank
lwj.0.0.8.rank.6.cores = 1

@trws: I will need more validation including its effect on core-level scheduling and also make the code a bit more generic... But it seems worth a shot and maybe can be used for production run over the weekend...

dongahn on 30 Mar 2018

Well it is kind of getting late, and thinking about this, the patch logic may not be quite complete. I will spend a bit more time tomorrow morning. But this IS the right bug site.

dongahn on 30 Mar 2018

Wow, heroic effort @dongahn! Nice work!

Those two jobs seem wrong, and that may be why these jobs are still marked as running.

There is a bug in the wreck use of Rlite (derived from ranks.N.cores) here. It is setting the nnodes of the job to the total number of ranks assigned, not the number of ranks used. However, what should the correct behavior be? We could either run successfully on the number of actually used nodes, or generate a fatal error at job startup when ntasks < nnodes.
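
The two candidate behaviors can be written down as a small policy function. This is a sketch under stated assumptions, not wreck code: the names nnodes_policy and effective_nnodes are hypothetical, and it assumes that at least one task runs on every node that is actually used, so a job with ntasks < nnodes can use at most ntasks nodes.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical policy switch for the ntasks < nnodes case. */
enum nnodes_policy {
    POLICY_SHRINK,  /* run on the nodes actually used, with a warning */
    POLICY_FATAL    /* refuse to start the job */
};

/* Returns the number of nodes to launch on, or -1 on a fatal error.
 * nranks_assigned is the number of rank.N.cores entries for the job. */
static int effective_nnodes (int ntasks, int nranks_assigned,
                             enum nnodes_policy policy)
{
    assert (ntasks > 0 && nranks_assigned > 0);
    if (ntasks >= nranks_assigned)
        return nranks_assigned;     /* every assigned rank gets a task */
    if (policy == POLICY_FATAL) {
        fprintf (stderr, "fatal: ntasks (%d) < assigned ranks (%d)\n",
                 ntasks, nranks_assigned);
        return -1;
    }
    fprintf (stderr, "warning: using %d of %d assigned ranks\n",
             ntasks, nranks_assigned);
    return ntasks;                  /* assumed: one task per used node */
}
```

For job 8 in the reproducer (ntasks 1, three assigned ranks), the first policy would launch on one node and the second would reject the job at startup.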

grondo on 30 Mar 2018

Is that correct output? I can’t tell by when you submitted, but it
looks like jobs 1 and 2 are completely overlapped?

trws on 30 Mar 2018

Yes this is correct. job 1 ran 1 min and exited. Then job 2 started right after and then ran 10 sec. I still haven't had a chance to think about a complete solution though.

dongahn on 30 Mar 2018

Gotcha. That’s definitely an improvement.

trws on 30 Mar 2018

However, what should the correct behavior be? We could either run successfully on the number of actually used nodes, or generate a fatal error at job startup when ntasks < nnodes.

Yes, this is a bug cascading from a bug in the scheduler. I sort of like the first semantics with a big warning message... But I can see why you may want the second semantics.

dongahn on 30 Mar 2018

I understand this problem better now. This is a deficiency within the scheduler for node-exclusive scheduling mode. I can see why this case wasn't covered: the sched folks have focused on core-level scheduling, both in our testing coverage and in the initial use cases.

I think the better place to fix this problem is actually in the resrc_tree_search function.

The purpose of this else branch is so that one can select the high-level resources leading to each exclusively allocated resource. This is needed when the job request is partially specified: when only core: 1 is requested, the logic should select the cluster, node, socket that contain that particular core.

But there is a deficiency in the code. When resrc walks the resource tree and visits a node vertex that is already allocated to another job, the match test will fail and this logic puts us on the else branch.

But unfortunately, the else branch doesn't check exclusivity. So the visiting node can be selected even if it's exclusively allocated to another job. Over scheduling!

My current patch is:

diff --git a/resrc/resrc_reqst.c b/resrc/resrc_reqst.c
index 9a3244c..fff6748 100644
--- a/resrc/resrc_reqst.c
+++ b/resrc/resrc_reqst.c
@@ -551,7 +551,10 @@ int64_t resrc_tree_search (resrc_api_ctx_t *ctx,
             nfound = 1;
             resrc_reqst->nfound++;
         }
-    } else if (resrc_tree_num_children (resrc_phys_tree (resrc_in))) {
+    } else if (resrc_tree_num_children (resrc_phys_tree (resrc_in))
+               && !(resrc_reqst_exclusive (resrc_reqst)
+                   && (resrc_size_allocs (resrc_in) || resrc_size_reservtns (resrc_in)))) {
+
         /*
          * This clause visits the children of the current resource
          * searching for a match to the resource request.  The found
@@ -562,6 +565,12 @@ int64_t resrc_tree_search (resrc_api_ctx_t *ctx,
          * defined.  E.g., it might only stipulate a node with 4 cores
          * and omit the intervening socket.
          */
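
The effect of the guard added to the else if condition above can be illustrated with a toy model. This is a heavily simplified sketch, not the resrc code: it flattens the resource tree to a list of nodes, and toy_node, toy_reqst and toy_select are hypothetical names. The allocs and reservtns fields stand in for what resrc_size_allocs and resrc_size_reservtns report, and the guard flag switches the extra condition on or off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for a node vertex and a node-level request. */
struct toy_node {
    int nchildren;      /* e.g. cores below this node */
    size_t allocs;      /* number of current allocations */
    size_t reservtns;   /* number of current reservations */
};

struct toy_reqst {
    int required;       /* number of nodes requested */
    bool exclusive;     /* node-exclusive request? */
};

/* Walk the nodes and record the indices that end up selected.
 * Free nodes take the match branch until the request is satisfied.
 * Busy nodes fail the match test and fall to the else branch, which
 * (without the guard) still pulls them into the selection. */
static size_t toy_select (const struct toy_node *nodes, size_t n,
                          const struct toy_reqst *req, bool guard,
                          int *selected)
{
    size_t nsel = 0, i;
    int nfound = 0;

    assert (nodes != NULL && req != NULL && selected != NULL);
    for (i = 0; i < n; i++) {
        bool busy = nodes[i].allocs > 0 || nodes[i].reservtns > 0;

        if (!busy) {
            if (nfound < req->required) {
                selected[nsel++] = (int) i;
                nfound++;
            }
        } else if (nodes[i].nchildren > 0
                   && !(guard && req->exclusive && busy)) {
            selected[nsel++] = (int) i;     /* over-scheduling */
        }
    }
    return nsel;
}
```

With eight nodes, nodes 0-3 allocated, and an exclusive request for three nodes, the unguarded walk in this model selects nodes 0-6 and the guarded walk selects nodes 4-6, which lines up with the [0-6] and [4-6] RANKS values reported for job 6 before and after the earlier quick fix.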

I can't do a PR for this yet because I don't fully understand its impact to backfill schedulers. But for FCFS that @trws uses, I think there is a good chance this will work.

Tom, you are welcome to test this on your branch if you're interested.

I need to work on various milestone reports this afternoon. I'll see if I can circle back and understand its impact to backfill.

dongahn on 30 Mar 2018

I'm trying this out in the splash branch, will see what happens.

trws on 2 Apr 2018

Verdict?

dongahn on 2 Apr 2018

It seems to help. We're running it right now, but the predominant mode is actually using the new ncores functionality, which helps a lot. I'm actually getting coscheduling the way they want now.

trws on 2 Apr 2018

Great! Please keep this open though. Need to double check this is safe with backfill scheduling before posting a PR.

dongahn on 2 Apr 2018

Need to double check this is safe with backfill scheduling before posting a PR.

It turns out this resrc logic will require a redesign. For backfill cases, in this else branch we even need to check the future reservation state when determining whether or not to select the visiting (high-level) resource.

One can call resrc_walltime_match once again in this branch after creating a "fake" resrc_reqst object. But patching things like this over and over will likely make the code pretty unreadable.
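
As a rough picture of what checking the future reservation state involves, the sketch below tests whether a requested time window overlaps any window in which a resource is already allocated or reserved. It is an assumption-laden illustration, not resrc_walltime_match itself: toy_window, windows_overlap and toy_free_during are hypothetical names, and windows are treated as half-open intervals [start, end).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open time window [start, end), e.g. in seconds. */
struct toy_window {
    int64_t start;
    int64_t end;
};

static bool windows_overlap (struct toy_window a, struct toy_window b)
{
    return a.start < b.end && b.start < a.end;
}

/* True if none of the n busy windows (current allocations or future
 * reservations) overlaps the window the new job wants. */
static bool toy_free_during (const struct toy_window *busy, size_t n,
                             struct toy_window want)
{
    size_t i;

    assert (busy != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (windows_overlap (busy[i], want))
            return false;
    }
    return true;
}
```

In a backfill setting, a high-level resource would only be eligible in the else branch when a check of this kind passes for the job's whole walltime, not just for the current instant.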

When I circle back, I will try to patch this as much as possible. But it seems the future really should be the new resource layer.

For @trws' purpose, this should work fine though. That is, this commit, https://github.com/flux-framework/flux-sched/pull/305/commits/2317aaabeeeaa6cab41bdb46d561ee46d75d8814 in flux-sched PR #305.

dongahn on 4 Apr 2018

FYI -- I think I patched this enough so resrc now should work for fcfs AND backfill in the latest PR I just pushed forward: https://github.com/flux-framework/flux-sched/pull/305

dongahn on 6 Apr 2018

OK. This has been fixed in https://github.com/flux-framework/flux-sched/pull/305.

dongahn on 10 Apr 2018
