As a starting caveat, I'm operating "out-of-bounds" currently as we prepare an AWS image for the ECP tutorial. Since the EC2 instances only have a single core, we are "faking" a cluster by running flux hwloc reload (which points to a directory with 3 XML files from an LC system), and then reloading the simple-sched. The various tweaks I made in ${PREFIX}/etc can be found here: https://github.com/SteVwonder/Tutorials/tree/ecp-2020/2020-ECP/workflow-examples/etc
When running jobs on multiple cores with affinity on, we get the following:
โ flux mini submit -N2 -n4 -c2 ./compute.py 10
65708167790592
โ flux job attach 65708167790592
0.131s: flux-shell[1]: ERROR: affinity: affinity: core1 not in topology
0.131s: flux-shell[1]: WARN: affinity: unable to get cpuset for cores 0-3. Disabling affinity
0.127s: flux-shell[0]: ERROR: affinity: affinity: core1 not in topology
0.127s: flux-shell[0]: WARN: affinity: unable to get cpuset for cores 0-3. Disabling affinity
Will compute for 10 seconds.
Will compute for 10 seconds.
Will compute for 10 seconds.
Will compute for 10 seconds.
This is expected. So we turn affinity off, and we get:
โ flux mini submit -N2 -n4 -c2 -o cpu-affinity=off ./compute.py 10
65166364377088
โ flux job attach 65166364377088
Will compute for 10 seconds.
Will compute for 10 seconds.
Will compute for 10 seconds.
Will compute for 10 seconds.
So far, so good. If I do the same thing in Python, no issues:
#!/usr/bin/env python3
# test.py
import os
import flux
from flux.job import JobspecV1
f = flux.Flux()
compute_jobreq = JobspecV1.from_command(
command=["./compute.py", "120"], num_tasks=4, num_nodes=2, cores_per_task=2
)
compute_jobreq.cwd = os.getcwd()
compute_jobreq.environment = dict(os.environ)
compute_jobreq.setattr_shell_option('cpu-affinity', 'off')
print(flux.job.submit(f, compute_jobreq))
โ python3 ./test.py
58183049543680
โ flux job attach 58183049543680
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.
Then I attempt to run example2 in the workflow examples (lightly modified to not use Slurm and to disable cpu-affinity), and I get an error:
โ python3 ./submitter.py
59534672396288
59535007940608
โ flux job attach 59534672396288
0.240s: job.exception type=exec severity=0 flux_job_event_watch_get: No such file or directory
flux-job: task(s) exited with exit code 1
0.226s: flux-shell[1]: FATAL: flux_job_event_watch_get: No such file or directory
โ flux job attach 59535007940608
Will forward I/O requests for 120 seconds.
Again, no issue when doing the same from the CLI:
โ flux mini submit -N2 -n4 -c2 -o cpu-affinity=off ./compute.py 120 && flux mini submit -N1 -n1 -c1 -o cpu-affinity=off ./io-forwarding.py 120
62240535347200
62246927466496
โ flux job attach 62240535347200
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.
โ flux job attach 62246927466496
Will forward I/O requests for 120 seconds.
I'm scratching my head why the CLI doesn't exhibit issues and why the back-to-back submission matters.
Forgot to include, if I run submitter.py with cpu-affinity on, then I get the affinity error/warning but not the job event watch one:
โ python3 ./submitter.py
72380248489984
72380617588736
โ flux job attach 72380248489984
0.162s: flux-shell[1]: ERROR: affinity: affinity: core1 not in topology
0.162s: flux-shell[1]: WARN: affinity: unable to get cpuset for cores 0-3. Disabling affinity
0.159s: flux-shell[0]: ERROR: affinity: affinity: core1 not in topology
0.159s: flux-shell[0]: WARN: affinity: unable to get cpuset for cores 0-3. Disabling affinity
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.
โ flux job attach 72380617588736
Will forward I/O requests for 120 seconds.
The patch that turns cpu-affinity off in python:
diff --git a/2020-ECP/workflow-examples/example2/submitter.py b/2020-ECP/workflow-examples/example2/submitter.py
index 51f2408..4381f8c 100755
--- a/2020-ECP/workflow-examples/example2/submitter.py
+++ b/2020-ECP/workflow-examples/example2/submitter.py
@@ -13,6 +13,7 @@ compute_jobreq = JobspecV1.from_command(
)
compute_jobreq.cwd = os.getcwd()
compute_jobreq.environment = dict(os.environ)
+compute_jobreq.setattr_shell_option('cpu-affinity', 'off')
print(flux.job.submit(f, compute_jobreq))
io_jobreq = JobspecV1.from_command(
@@ -20,4 +21,5 @@ io_jobreq = JobspecV1.from_command(
)
io_jobreq.cwd = os.getcwd()
io_jobreq.environment = dict(os.environ)
+io_jobreq.setattr_shell_option('cpu-affinity', 'off')
print(flux.job.submit(f, io_jobreq))
```
If you grab the guest.output eventlog for the failing job, we can pinpoint the line that generated the generic flux_job_event_watch_get. e.g.
$ flux job eventlog -p guest.output 59534672396288
That will help with the next step of diagnosis.
The patch that turns cpu-affinity off in python:
BTW, you can turn off cpu-affinity globally in initrc.lua:
if not shell.options['cpu-affinity'] then
shell.options['cpu-affinity'] = "off"
end
Edit: nevermind, I see you have this already in the ECP /etc directory!
Well, this is where I rage quit for the day. I had to stop the ec2 instance last night to create an AMI. After restarting the instance, I can no longer reproduce this error :man_facepalming:. I close the issue to keep the GH issue tracker clean, and re-open if it pops back up. If it does reoccur, I'll grab the full eventlog as you suggested @grondo.
Sorry for the noise and thanks for the debugging pointer.
And I spoke too soon. I spammed the system with a bunch of jobs and the error returned. So maybe it's tied to the affinity thing (the job has to be scheduled on a particular core).
Output from flux dmesg and the job eventlog below:
2020-01-28T18:01:19.139407Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:19.141992Z sched-simple.debug[0]: req: 3208726446080: spec={2,4,2}
2020-01-28T18:01:19.142068Z sched-simple.debug[0]: alloc: 3208726446080: rank[1-2]/core[5-8]
2020-01-28T18:01:19.184648Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:19.206549Z sched-simple.debug[0]: req: 3209045213184: spec={1,1,1}
2020-01-28T18:01:19.206612Z sched-simple.debug[0]: alloc: 3209045213184: rank0/core8
2020-01-28T18:01:19.249406Z barrier.debug[0]: create shell-3208726446080-init 2
2020-01-28T18:01:19.256482Z barrier.debug[0]: destroy shell-3208726446080-init 2
2020-01-28T18:01:19.271975Z barrier.debug[0]: create shell-3208726446080-start 2
2020-01-28T18:01:19.290077Z kvs.debug[0]: aggregated 2 transactions (2 ops)
2020-01-28T18:01:19.315687Z connector-local.debug[0]: subscribe shell-3209045213184.
2020-01-28T18:01:19.324100Z barrier.debug[0]: destroy shell-3208726446080-start 2
2020-01-28T18:01:19.326138Z job-exec.info[0]: 3208726446080: 2: stderr: flux-shell: FATAL: flux_job_event_watch_get: No such file or directory
2020-01-28T18:01:19.351345Z job-exec.debug[0]: exec aborted: id=3208726446080
2020-01-28T18:01:19.351428Z job-exec.debug[0]: exec_kill: 3208726446080: signal 15
2020-01-28T18:01:19.549392Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:19.550455Z sched-simple.debug[0]: req: 3215621881856: spec={2,4,2}
2020-01-28T18:01:19.550517Z sched-simple.debug[0]: alloc: 3215621881856: rank[0-1]/core[9-12]
2020-01-28T18:01:19.582044Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:19.591096Z sched-simple.debug[0]: req: 3215957426176: spec={1,1,1}
2020-01-28T18:01:19.591151Z sched-simple.debug[0]: alloc: 3215957426176: rank2/core9
2020-01-28T18:01:19.606318Z connector-local.debug[0]: subscribe shell-3215621881856.
2020-01-28T18:01:19.659111Z barrier.debug[0]: create shell-3215621881856-init 2
2020-01-28T18:01:19.663504Z barrier.debug[0]: destroy shell-3215621881856-init 2
2020-01-28T18:01:19.685123Z barrier.debug[0]: create shell-3215621881856-start 2
2020-01-28T18:01:19.731578Z barrier.debug[0]: destroy shell-3215621881856-start 2
2020-01-28T18:01:20.007743Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:20.008752Z sched-simple.debug[0]: req: 3223305846784: spec={2,4,2}
2020-01-28T18:01:20.034656Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:20.449079Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:20.475412Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:20.899949Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:20.926374Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:21.365716Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:21.392349Z job-manager.debug[0]: submit_cb: added 1 jobs
โ flux job eventlog -p guest.output 3208726446080
1580234479.247394 header version=1 encoding={"stdout":"base64","stderr":"base64"} count={"stdout":4,"stderr":4} options={}
1580234479.322726 log rank=1 level=0 message="flux_job_event_watch_get: No such file or directory" file="input.c" line=484
1580234479.359454 data stream="stderr" rank="0" eof=true
1580234479.360061 data stream="stdout" rank="0" eof=true
1580234479.367177 data stream="stderr" rank="1" eof=true
1580234479.367729 data stream="stdout" rank="1" eof=true
Looks like it is in the guest.input eventlog which is unexpectedly missing.
I think rank 0 shell is supposed to create this eventlog with a header before any other ranks attempt to watch the eventlog. I have to remind myself how this works, but there shouldn't be a race here, because the rank 0 shell would create the guest.input eventlog before the shell.init barrier, and tasks would "watch" the eventlog after the barrier.
I wonder if the cpu-affinity plugin failures change the timing somehow so there is a race though.. requires more investigation.
It would be great if we had a smaller reproducer.
Can you grab guest.input too?
flux job eventlog -p guest.input 3208726446080
cc @chu11 because he may more quickly have some insight.
Can you grab guest.input too?
โ flux job eventlog -p guest.input 9374823088128
1580234674.825106 header version=1 encoding={"stdin":"base64"} count={"stdin":1} options={}
I can reproduce this issue reliably with the following script:
#!/bin/bash
RC=0
RANKS="[0-$(($(flux getattr size)-1))]"
flux kvs put resource.hwloc.by_rank="{\"$RANKS\": {\"Package\": 1, \"Core\": 16, \"PU\": 16, \"cpuset\": \"0-15\"}}"
flux kvs get resource.hwloc.by_rank
flux module remove sched-simple
flux module load sched-simple
flux dmesg | grep 'sched-simple.*ready' | tail -1
flux mini submit --dry-run -N2 -n2 sleep 0 \
| t/ingest/submitbench -r 24 - >/tmp/jobs.list.$$
for id in $(cat /tmp/jobs.list.$$); do
if ! flux job attach $id >/dev/null 2>&1; then
printf "job %s\n" $id
flux job eventlog -p guest.output $id
RC=1
fi
done
exit $RC
E.g.:
grondo@asp:~/git/flux-core.git$ src/cmd/flux start -s 4 ./test.sh
{"[0-3]": {"Package": 1, "Core": 16, "PU": 16, "cpuset": "0-15"}}
2020-01-28T19:02:48.091555Z sched-simple.debug[0]: ready: 64 of 64 cores: rank[0-3]/core[0-15]
2020-01-28T19:02:49.216148Z broker.err[3]: kill: No such process
2020-01-28T19:02:49.230905Z job-exec.err[0]: 9210691587: exec_kill: No such process
job 9193914373
1580238168.894457 header version=1 encoding={"stdout":"base64","stderr":"base64"} count={"stdout":2,"stderr":2} options={}
1580238169.108064 log rank=1 level=0 message="flux_job_event_watch_get: No such file or directory" file="input.c" line=484
1580238169.180861 data stream="stderr" rank="1" eof=true
1580238169.180901 data stream="stdout" rank="1" eof=true
1580238169.181402 data stream="stderr" rank="0" eof=true
1580238169.181508 data stream="stdout" rank="0" eof=true
job 9210691587
1580238168.961750 header version=1 encoding={"stdout":"base64","stderr":"base64"} count={"stdout":2,"stderr":2} options={}
1580238168.964689 log rank=1 level=3 message="affinity: core5 not in topology" component="affinity" file="affinity.c" line=124
1580238168.964755 log rank=1 level=4 message="unable to get cpuset for cores 5. Disabling affinity" component="affinity" file="affinity.c" line=286
1580238169.134269 log rank=1 level=0 message="flux_job_event_watch_get: No such file or directory" file="input.c" line=484
1580238168.998769 log rank=0 level=3 message="affinity: core5 not in topology" component="affinity" file="affinity.c" line=124
1580238168.998814 log rank=0 level=4 message="unable to get cpuset for cores 5. Disabling affinity" component="affinity" file="affinity.c" line=286
1580238169.198040 data stream="stderr" rank="1" eof=true
1580238169.198080 data stream="stdout" rank="1" eof=true
1580238169.201722 data stream="stderr" rank="0" eof=true
1580238169.202409 data stream="stdout" rank="0" eof=true
job 9227468802
1580238169.095960 header version=1 encoding={"stdout":"base64","stderr":"base64"} count={"stdout":2,"stderr":2} options={}
1580238169.121105 log rank=1 level=3 message="affinity: core7 not in topology" component="affinity" file="affinity.c" line=124
1580238169.121163 log rank=1 level=4 message="unable to get cpuset for cores 7. Disabling affinity" component="affinity" file="affinity.c" line=286
1580238169.238275 log rank=1 level=0 message="flux_job_event_watch_get: No such file or directory" file="input.c" line=484
1580238169.106030 log rank=0 level=3 message="affinity: core7 not in topology" component="affinity" file="affinity.c" line=124
1580238169.106078 log rank=0 level=4 message="unable to get cpuset for cores 7. Disabling affinity" component="affinity" file="affinity.c" line=286
1580238169.260278 data stream="stderr" rank="0" eof=true
1580238169.260331 data stream="stdout" rank="0" eof=true
1580238169.265109 data stream="stderr" rank="1" eof=true
1580238169.265160 data stream="stdout" rank="1" eof=true
2020-01-28T19:02:50.449134Z broker.err[0]: Run level 2 Exited with non-zero status (rc=1) 2.5s
flux-start: 0 (pid 4564) exited with rc=1
The issue reproduces with or without cpu-affinity=off
Ah, unless I'm mistaken, shell_input_kvs_init asynchronously creates the input header and thus the guest.input eventlog. This means other ranks can race with creation of the input header. Since we don't have WAITCREATE for eventlogs afaik, the input plugin will have to be modified to wait synchronously for creation of the input eventlog.
proof of concept patch: try it out:
diff --git a/src/shell/input.c b/src/shell/input.c
index 9155ac7fd..3e2f7df13 100644
--- a/src/shell/input.c
+++ b/src/shell/input.c
@@ -259,6 +259,9 @@ static int shell_input_kvs_init (struct shell_input *in, json_t *header)
goto error;
if (!(f = flux_kvs_commit (in->shell->h, NULL, 0, txn)))
goto error;
+
+ flux_future_get (f, NULL);
+
if (flux_future_then (f, -1, shell_input_kvs_init_completion, in) < 0)
goto error;
if (flux_shell_add_completion_ref (in->shell, "input.kvs-init") < 0) {
Ran the reproducer with the above patch many times without any failures. I'll work up a real fix and PR.
Thanks for catching this issue @SteVwonder!
@grondo You beat me a bit to it :-)
Your suggested fix will definitely work.
I'm not sure if there was a miss-assumption on my part or if shell code had changed and it was an previously ok assumption. Trying to go through old commits to see.
I believe the shell input init code (and probably output code too) assumes the "completion_ref" is a barrier. That we can't continue until the "remove_ref" has completed. shell_barrier() uses flux_barrier(), so that definitely is not the case.
Thanks for catching this issue @SteVwonder!
Thank you for running this down! Nice and concise reproducer btw!
Yeah, that is an outstanding issue. The "completion_ref" keeps the reactor running so that the job isn't "complete" until all refs have been released.
There's an open issue that we also need an "initialization" reference, so that we don't enter the barrier until all init refs have been released. This will probably greatly speed up initialization since it will remove a lot of synchronous code. To do that, however, we'll need to enter the reactor before the init barrier, which isn't done now.
Sorry for the confusion!
Looking through the old commits, the completion ref was used to fix raciness on the output side (#2363), then was then it was copied onto the input side, believing it also to be sufficient. Doh!