Flux-core: `flux_job_event_watch_get` error when submitting via python bindings

Created on 28 Jan 2020 · 15Comments · Source: flux-framework/flux-core

As a starting caveat, I'm operating "out-of-bounds" currently as we prepare an AWS image for the ECP tutorial. Since the EC2 instances only have a single core, we are "faking" a cluster by running flux hwloc reload (which points to a directory with 3 XML files from an LC system), and then reloading the simple-sched. The various tweaks I made in ${PREFIX}/etc can be found here: https://github.com/SteVwonder/Tutorials/tree/ecp-2020/2020-ECP/workflow-examples/etc

When running jobs on multiple cores with affinity on, we get the following:

→ flux mini submit -N2 -n4 -c2 ./compute.py 10               
65708167790592
→ flux job attach 65708167790592
0.131s: flux-shell[1]: ERROR: affinity: affinity: core1 not in topology
0.131s: flux-shell[1]:  WARN: affinity: unable to get cpuset for cores 0-3. Disabling affinity
0.127s: flux-shell[0]: ERROR: affinity: affinity: core1 not in topology
0.127s: flux-shell[0]:  WARN: affinity: unable to get cpuset for cores 0-3. Disabling affinity
Will compute for 10 seconds.
Will compute for 10 seconds.
Will compute for 10 seconds.
Will compute for 10 seconds.

This is expected. So we turn affinity off, and we get:

→ flux mini submit -N2 -n4 -c2 -o cpu-affinity=off ./compute.py 10
65166364377088
→ flux job attach 65166364377088
Will compute for 10 seconds.
Will compute for 10 seconds.
Will compute for 10 seconds.
Will compute for 10 seconds.

So far, so good. If I do the same thing in Python, no issues:

#!/usr/bin/env python3
# test.py                                                                                                                                                                                                                                                                    

import os
import flux
from flux.job import JobspecV1

f = flux.Flux()

compute_jobreq = JobspecV1.from_command(
    command=["./compute.py", "120"], num_tasks=4, num_nodes=2, cores_per_task=2
)
compute_jobreq.cwd = os.getcwd()
compute_jobreq.environment = dict(os.environ)
compute_jobreq.setattr_shell_option('cpu-affinity', 'off')
print(flux.job.submit(f, compute_jobreq))

→ python3 ./test.py                                                    
58183049543680
→ flux job attach 58183049543680
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.

Then I attempt to run example2 in the workflow examples (lightly modified to not use Slurm and to disable cpu-affinity), and I get an error:

→ python3 ./submitter.py 
59534672396288
59535007940608

→ flux job attach 59534672396288
0.240s: job.exception type=exec severity=0 flux_job_event_watch_get: No such file or directory
flux-job: task(s) exited with exit code 1
0.226s: flux-shell[1]: FATAL: flux_job_event_watch_get: No such file or directory
→ flux job attach 59535007940608
Will forward I/O requests for 120 seconds.

Again, no issue when doing the same from the CLI:

→ flux mini submit -N2 -n4 -c2 -o cpu-affinity=off ./compute.py 120 && flux mini submit -N1 -n1 -c1 -o cpu-affinity=off ./io-forwarding.py 120  
62240535347200
62246927466496

→ flux job attach 62240535347200
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.
→ flux job attach 62246927466496
Will forward I/O requests for 120 seconds.

I'm scratching my head why the CLI doesn't exhibit issues and why the back-to-back submission matters.

Source

SteVwonder

All 15 comments

Forgot to include, if I run submitter.py with cpu-affinity on, then I get the affinity error/warning but not the job event watch one:

→ python3 ./submitter.py        
72380248489984
72380617588736

→ flux job attach 72380248489984
0.162s: flux-shell[1]: ERROR: affinity: affinity: core1 not in topology
0.162s: flux-shell[1]:  WARN: affinity: unable to get cpuset for cores 0-3. Disabling affinity
0.159s: flux-shell[0]: ERROR: affinity: affinity: core1 not in topology
0.159s: flux-shell[0]:  WARN: affinity: unable to get cpuset for cores 0-3. Disabling affinity
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.
Will compute for 120 seconds.

→ flux job attach 72380617588736
Will forward I/O requests for 120 seconds.

The patch that turns cpu-affinity off in python:

diff --git a/2020-ECP/workflow-examples/example2/submitter.py b/2020-ECP/workflow-examples/example2/submitter.py
index 51f2408..4381f8c 100755
--- a/2020-ECP/workflow-examples/example2/submitter.py
+++ b/2020-ECP/workflow-examples/example2/submitter.py
@@ -13,6 +13,7 @@ compute_jobreq = JobspecV1.from_command(
 )
 compute_jobreq.cwd = os.getcwd()
 compute_jobreq.environment = dict(os.environ)
+compute_jobreq.setattr_shell_option('cpu-affinity', 'off')
 print(flux.job.submit(f, compute_jobreq))

 io_jobreq = JobspecV1.from_command(
@@ -20,4 +21,5 @@ io_jobreq = JobspecV1.from_command(
 )
 io_jobreq.cwd = os.getcwd()
 io_jobreq.environment = dict(os.environ)
+io_jobreq.setattr_shell_option('cpu-affinity', 'off')
 print(flux.job.submit(f, io_jobreq))

```

SteVwonder on 28 Jan 2020

If you grab the guest.output eventlog for the failing job, we can pinpoint the line that generated the generic flux_job_event_watch_get. e.g.

$ flux job eventlog -p guest.output 59534672396288

That will help with the next step of diagnosis.

grondo on 28 Jan 2020

The patch that turns cpu-affinity off in python:

BTW, you can turn off cpu-affinity globally in initrc.lua:

if not shell.options['cpu-affinity'] then
    shell.options['cpu-affinity'] = "off"
end

Edit: nevermind, I see you have this already in the ECP /etc directory!

grondo on 28 Jan 2020

Well, this is where I rage quit for the day. I had to stop the ec2 instance last night to create an AMI. After restarting the instance, I can no longer reproduce this error :man_facepalming:. I close the issue to keep the GH issue tracker clean, and re-open if it pops back up. If it does reoccur, I'll grab the full eventlog as you suggested @grondo.

Sorry for the noise and thanks for the debugging pointer.

SteVwonder on 28 Jan 2020

And I spoke too soon. I spammed the system with a bunch of jobs and the error returned. So maybe it's tied to the affinity thing (the job has to be scheduled on a particular core).

Output from flux dmesg and the job eventlog below:

2020-01-28T18:01:19.139407Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:19.141992Z sched-simple.debug[0]: req: 3208726446080: spec={2,4,2}
2020-01-28T18:01:19.142068Z sched-simple.debug[0]: alloc: 3208726446080: rank[1-2]/core[5-8]
2020-01-28T18:01:19.184648Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:19.206549Z sched-simple.debug[0]: req: 3209045213184: spec={1,1,1}
2020-01-28T18:01:19.206612Z sched-simple.debug[0]: alloc: 3209045213184: rank0/core8
2020-01-28T18:01:19.249406Z barrier.debug[0]: create shell-3208726446080-init 2
2020-01-28T18:01:19.256482Z barrier.debug[0]: destroy shell-3208726446080-init 2
2020-01-28T18:01:19.271975Z barrier.debug[0]: create shell-3208726446080-start 2
2020-01-28T18:01:19.290077Z kvs.debug[0]: aggregated 2 transactions (2 ops)
2020-01-28T18:01:19.315687Z connector-local.debug[0]: subscribe shell-3209045213184.
2020-01-28T18:01:19.324100Z barrier.debug[0]: destroy shell-3208726446080-start 2
2020-01-28T18:01:19.326138Z job-exec.info[0]: 3208726446080: 2: stderr: flux-shell: FATAL: flux_job_event_watch_get: No such file or directory
2020-01-28T18:01:19.351345Z job-exec.debug[0]: exec aborted: id=3208726446080
2020-01-28T18:01:19.351428Z job-exec.debug[0]: exec_kill: 3208726446080: signal 15
2020-01-28T18:01:19.549392Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:19.550455Z sched-simple.debug[0]: req: 3215621881856: spec={2,4,2}
2020-01-28T18:01:19.550517Z sched-simple.debug[0]: alloc: 3215621881856: rank[0-1]/core[9-12]
2020-01-28T18:01:19.582044Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:19.591096Z sched-simple.debug[0]: req: 3215957426176: spec={1,1,1}
2020-01-28T18:01:19.591151Z sched-simple.debug[0]: alloc: 3215957426176: rank2/core9
2020-01-28T18:01:19.606318Z connector-local.debug[0]: subscribe shell-3215621881856.
2020-01-28T18:01:19.659111Z barrier.debug[0]: create shell-3215621881856-init 2
2020-01-28T18:01:19.663504Z barrier.debug[0]: destroy shell-3215621881856-init 2
2020-01-28T18:01:19.685123Z barrier.debug[0]: create shell-3215621881856-start 2
2020-01-28T18:01:19.731578Z barrier.debug[0]: destroy shell-3215621881856-start 2
2020-01-28T18:01:20.007743Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:20.008752Z sched-simple.debug[0]: req: 3223305846784: spec={2,4,2}
2020-01-28T18:01:20.034656Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:20.449079Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:20.475412Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:20.899949Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:20.926374Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:21.365716Z job-manager.debug[0]: submit_cb: added 1 jobs
2020-01-28T18:01:21.392349Z job-manager.debug[0]: submit_cb: added 1 jobs

→ flux job eventlog -p guest.output 3208726446080
1580234479.247394 header version=1 encoding={"stdout":"base64","stderr":"base64"} count={"stdout":4,"stderr":4} options={}
1580234479.322726 log rank=1 level=0 message="flux_job_event_watch_get: No such file or directory" file="input.c" line=484
1580234479.359454 data stream="stderr" rank="0" eof=true
1580234479.360061 data stream="stdout" rank="0" eof=true
1580234479.367177 data stream="stderr" rank="1" eof=true
1580234479.367729 data stream="stdout" rank="1" eof=true

SteVwonder on 28 Jan 2020

Looks like it is in the guest.input eventlog which is unexpectedly missing.

I think rank 0 shell is supposed to create this eventlog with a header before any other ranks attempt to watch the eventlog. I have to remind myself how this works, but there shouldn't be a race here, because the rank 0 shell would create the guest.input eventlog before the shell.init barrier, and tasks would "watch" the eventlog after the barrier.

I wonder if the cpu-affinity plugin failures change the timing somehow so there is a race though.. requires more investigation.

It would be great if we had a smaller reproducer.

Can you grab guest.input too?

flux job eventlog -p guest.input 3208726446080

grondo on 28 Jan 2020

cc @chu11 because he may more quickly have some insight.

grondo on 28 Jan 2020

👍1

Can you grab guest.input too?

→ flux job eventlog -p guest.input 9374823088128 
1580234674.825106 header version=1 encoding={"stdin":"base64"} count={"stdin":1} options={}

SteVwonder on 28 Jan 2020

I can reproduce this issue reliably with the following script:

#!/bin/bash

RC=0
RANKS="[0-$(($(flux getattr size)-1))]"

flux kvs put resource.hwloc.by_rank="{\"$RANKS\": {\"Package\": 1, \"Core\": 16, \"PU\": 16, \"cpuset\": \"0-15\"}}"

flux kvs get resource.hwloc.by_rank

flux module remove sched-simple
flux module load sched-simple

flux dmesg | grep 'sched-simple.*ready'  | tail -1

flux mini submit --dry-run -N2 -n2 sleep 0 \
    | t/ingest/submitbench -r 24 - >/tmp/jobs.list.$$

for id in $(cat /tmp/jobs.list.$$); do
    if ! flux job attach $id >/dev/null 2>&1; then
        printf "job %s\n" $id
        flux job eventlog -p guest.output $id
        RC=1
    fi
done

exit $RC

E.g.:

grondo@asp:~/git/flux-core.git$ src/cmd/flux start -s 4 ./test.sh
{"[0-3]": {"Package": 1, "Core": 16, "PU": 16, "cpuset": "0-15"}}
2020-01-28T19:02:48.091555Z sched-simple.debug[0]: ready: 64 of 64 cores: rank[0-3]/core[0-15]
2020-01-28T19:02:49.216148Z broker.err[3]: kill: No such process
2020-01-28T19:02:49.230905Z job-exec.err[0]: 9210691587: exec_kill: No such process
job 9193914373
1580238168.894457 header version=1 encoding={"stdout":"base64","stderr":"base64"} count={"stdout":2,"stderr":2} options={}
1580238169.108064 log rank=1 level=0 message="flux_job_event_watch_get: No such file or directory" file="input.c" line=484
1580238169.180861 data stream="stderr" rank="1" eof=true
1580238169.180901 data stream="stdout" rank="1" eof=true
1580238169.181402 data stream="stderr" rank="0" eof=true
1580238169.181508 data stream="stdout" rank="0" eof=true
job 9210691587
1580238168.961750 header version=1 encoding={"stdout":"base64","stderr":"base64"} count={"stdout":2,"stderr":2} options={}
1580238168.964689 log rank=1 level=3 message="affinity: core5 not in topology" component="affinity" file="affinity.c" line=124
1580238168.964755 log rank=1 level=4 message="unable to get cpuset for cores 5. Disabling affinity" component="affinity" file="affinity.c" line=286
1580238169.134269 log rank=1 level=0 message="flux_job_event_watch_get: No such file or directory" file="input.c" line=484
1580238168.998769 log rank=0 level=3 message="affinity: core5 not in topology" component="affinity" file="affinity.c" line=124
1580238168.998814 log rank=0 level=4 message="unable to get cpuset for cores 5. Disabling affinity" component="affinity" file="affinity.c" line=286
1580238169.198040 data stream="stderr" rank="1" eof=true
1580238169.198080 data stream="stdout" rank="1" eof=true
1580238169.201722 data stream="stderr" rank="0" eof=true
1580238169.202409 data stream="stdout" rank="0" eof=true
job 9227468802
1580238169.095960 header version=1 encoding={"stdout":"base64","stderr":"base64"} count={"stdout":2,"stderr":2} options={}
1580238169.121105 log rank=1 level=3 message="affinity: core7 not in topology" component="affinity" file="affinity.c" line=124
1580238169.121163 log rank=1 level=4 message="unable to get cpuset for cores 7. Disabling affinity" component="affinity" file="affinity.c" line=286
1580238169.238275 log rank=1 level=0 message="flux_job_event_watch_get: No such file or directory" file="input.c" line=484
1580238169.106030 log rank=0 level=3 message="affinity: core7 not in topology" component="affinity" file="affinity.c" line=124
1580238169.106078 log rank=0 level=4 message="unable to get cpuset for cores 7. Disabling affinity" component="affinity" file="affinity.c" line=286
1580238169.260278 data stream="stderr" rank="0" eof=true
1580238169.260331 data stream="stdout" rank="0" eof=true
1580238169.265109 data stream="stderr" rank="1" eof=true
1580238169.265160 data stream="stdout" rank="1" eof=true
2020-01-28T19:02:50.449134Z broker.err[0]: Run level 2 Exited with non-zero status (rc=1) 2.5s
flux-start: 0 (pid 4564) exited with rc=1

The issue reproduces with or without cpu-affinity=off

grondo on 28 Jan 2020

👍1

Ah, unless I'm mistaken, shell_input_kvs_init asynchronously creates the input header and thus the guest.input eventlog. This means other ranks can race with creation of the input header. Since we don't have WAITCREATE for eventlogs afaik, the input plugin will have to be modified to wait synchronously for creation of the input eventlog.

proof of concept patch: try it out:

diff --git a/src/shell/input.c b/src/shell/input.c
index 9155ac7fd..3e2f7df13 100644
--- a/src/shell/input.c
+++ b/src/shell/input.c
@@ -259,6 +259,9 @@ static int shell_input_kvs_init (struct shell_input *in, json_t *header)
         goto error;
     if (!(f = flux_kvs_commit (in->shell->h, NULL, 0, txn)))
         goto error;
+
+    flux_future_get (f, NULL);
+
     if (flux_future_then (f, -1, shell_input_kvs_init_completion, in) < 0)
         goto error;
     if (flux_shell_add_completion_ref (in->shell, "input.kvs-init") < 0) {

grondo on 28 Jan 2020

Ran the reproducer with the above patch many times without any failures. I'll work up a real fix and PR.

Thanks for catching this issue @SteVwonder!

grondo on 28 Jan 2020

🎉1

@grondo You beat me a bit to it :-)

Your suggested fix will definitely work.

I'm not sure if there was a miss-assumption on my part or if shell code had changed and it was an previously ok assumption. Trying to go through old commits to see.

I believe the shell input init code (and probably output code too) assumes the "completion_ref" is a barrier. That we can't continue until the "remove_ref" has completed. shell_barrier() uses flux_barrier(), so that definitely is not the case.

chu11 on 28 Jan 2020

👍1

Thanks for catching this issue @SteVwonder!

Thank you for running this down! Nice and concise reproducer btw!

SteVwonder on 28 Jan 2020

Yeah, that is an outstanding issue. The "completion_ref" keeps the reactor running so that the job isn't "complete" until all refs have been released.

There's an open issue that we also need an "initialization" reference, so that we don't enter the barrier until all init refs have been released. This will probably greatly speed up initialization since it will remove a lot of synchronous code. To do that, however, we'll need to enter the reactor before the init barrier, which isn't done now.

Sorry for the confusion!

grondo on 28 Jan 2020

👍1

Looking through the old commits, the completion ref was used to fix raciness on the output side (#2363), then was then it was copied onto the input side, believing it also to be sufficient. Doh!

chu11 on 28 Jan 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

wreck parity: need support for job names

SteVwonder · 7Comments

Memory leak in Jtostr

SteVwonder · 7Comments

details of Flux test system

garlick · 8Comments

flux-jobs: get subset of recently completed jobs

cmoussa1 · 6Comments

libflux: change flux_future_error_string() to return flux_strerror() if textual error was not set

chu11 · 6Comments