Flux-core: Race condition in job annotation API

Created on 26 Aug 2020  路  35Comments  路  Source: flux-framework/flux-core

I am working to use Job annotation API to annotate jobs under backfilling policies (https://github.com/dongahn/flux-sched/tree/starttime-estimate) and I found there appears to be a race condition in flux job info which can be reliably reproduced under valgrind.

```console
ahn1@573d563f74c4:/usr/src/t/trash-directory.t1003-qmanager-policy$ flux start --size=1 -o -Slog-forward-level=7 -o -Sinit.rc2_timeout=300 -o,-g,10 --wrap=/usr/bin/valgrind --wrap=--tool=memcheck --wrap=--leak-check=full --wrap=--trace-children=no --wrap=--child-silent-after-fork=yes --wrap=--leak-resolution=med --wrap=--error-exitcode=1 --wrap=--suppressions=/usr/src/t/valgrind/valgrind.supp /bin/bash
==1222== Memcheck, a memory error detector
==1222== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1222== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==1222== Command: /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-1221-r9vyQn --setattr=tbon.endpoint=ipc://%B/req -Slog-forward-level=7 -Sinit.rc2_timeout=300 -g 10 /bin/bash
==1222==
To run a command as administrator (user "root"), use "sudo ".
See "man sudo_root" for details.

ahn1@573d563f74c4:/usr/src/t/trash-directory.t1003-qmanager-policy$ flux hwloc reload ../data/hwloc-data/004N/exclusive/04-brokers
ahn1@573d563f74c4:/usr/src/t/trash-directory.t1003-qmanager-policy$ flux module reload -f sched-fluxion-resource prune-filters=ALL:core subsystems=containment policy=low load-allowlist=cluster,node,core
2020-08-25T23:05:00.729621Z sched-fluxion-qmanager.err[0]: update_on_resource_response: exiting due to sched-fluxion-resource.notify failure: Operation canceled
ahn1@573d563f74c4:/usr/src/t/trash-directory.t1003-qmanager-policy$ flux module reload -f sched-fluxion-qmanager queue-policy=easy
ahn1@573d563f74c4:/usr/src/t/trash-directory.t1003-qmanager-policy$ flux mini submit -n 8 sleep 360
茠PV41Gy5
ahn1@573d563f74c4:/usr/src/t/trash-directory.t1003-qmanager-policy$ flux mini submit -n 10 sleep 360
茠S7hmKvP
ahn1@573d563f74c4:/usr/src/t/trash-directory.t1003-qmanager-policy$ 2020-08-25T23:05:20.474341Z job-info.err[0]: update_annotations: job 956116762624 not found: Success
```console

By the time annotation request is called (https://github.com/dongahn/flux-sched/blob/starttime-estimate/qmanager/modules/qmanager_callbacks.cpp#L101), JOBID was already created by job-manager. So I suspect that this might be job info which gets this JOBID late.

All 35 comments

flux_log with LOG_ERR would also be better since errno isn't set.

By the time annotation request is called (https://github.com/dongahn/flux-sched/blob/starttime-estimate/qmanager/modules/qmanager_callbacks.cpp#L101), JOBID was already created by job-manager. So I suspect that this might be job info which gets this JOBID late.

Yeah, the job-info module is "eventually consistent" by design, but even so I don't really understand how the job-annotations event could reach job-info before the first job-state event as they seem to both be published in the same block of code.

On another note, there's an existing issue open that declares job-info should never "error out" (#3153). In this case maybe the job-info module should create a missing job in the initial state and move on, assuming the job-state event will be imminent. Though, perhaps it is better to understand how the annotation event could race with the job-state event.

I had a peek at the event batching in the job manager and I cannot see how this could happen either. Like requests, events also have an in-order delivery guarantee (with global sequence numbers). The batch processing always posts state events before annotation events, and posts batches in order (there can only be _one_ in-progress batch).

I"m a little bit stumped by how this could be happening. Maybe a bit of instrumentation would help us see something, e.g.

diff --git a/src/modules/job-info/job_state.c b/src/modules/job-info/job_state.c
index eed00e198..330f29878 100644
--- a/src/modules/job-info/job_state.c
+++ b/src/modules/job-info/job_state.c
@@ -1185,6 +1185,7 @@ static void update_jobs (struct info_ctx *ctx, json_t *transitions)
         }

         if (!(job = zhashx_lookup (jsctx->index, &id))) {
+            flux_log (jsctx->h, LOG_INFO, "XXX inserting %ju", (uintmax_t)id);
             if (!(job = job_create (ctx, id))){
                 flux_log_error (jsctx->h, "%s: job_create", __FUNCTION__);
                 return;
@@ -1287,6 +1288,7 @@ static void update_annotations (struct info_ctx *ctx, json_t *annotations)
             return;
         }

+        flux_log (ctx->h, LOG_INFO, "XXX annotate %ju", (uintmax_t)id);
         if ((job = zhashx_lookup (jsctx->index, &id))) {
             json_decref (job->annotations);
             if (json_is_null (aValue))

For example if we see the inserting message and then the annotation still fails, we would want to look for some way the job was being removed from the hash between events. Or, perhaps we will actually see the events being processed out of order which would be interesting too (but wow, I don't see how).

This is only happening under Valgrind so code slowdown could effect timing in an odd way.

Like requests, events also have an in-order delivery guarantee (with global sequence numbers).

What job event flux info waits on to create a new info block? submit? My wild guess would be:

1) job-manager sends submit events to everyone

2) job-manager sends alloc RCP to qmanager

3) qmanager calls annotationAPI

4) job info receives the annotation response

Because an event is sent asynchronously at 1), perhaps there is no ordering guarantee between step 4 and job info gets the event sent at 1? (My guess is the non-overtaking rule only applies on the messages sent from a send to a same receivers.)

In any case, I will soon post a fluxion PR with my annotation changes so I can help some code review there. And if someone else can reproduce this, this would be an extra confirmation.

The job manager plays the active role here. There is some ambiguity about how state transitions are batched since that is time based. If we assume no batching, it goes like this

  1. job-manager publishes job-state event for DEPEND transition
  2. job-manager publishes job-state event for SCHED transition
  3. job-manager sends alloc request to scheduler
  4. scheduler responds to alloc request with annotation
  5. job-manager publishes job-annotations event with annotation

job-info is subscribed to both event topics, so it should get these event messages (in order).
With batching it's possible (likely) that both state transitions would be collapsed into one message.
As mentioned, job-manager always publishes the state transition before the annotation within a batch.

Because an event is sent asynchronously at 1), perhaps there is no ordering guarantee between step 4 and job info gets the event sent at 1? (My guess is the non-overtaking rule only applies on the messages sent from a send to a same receivers.)

In this case all events are published by job-manager and are subscribed to by job-info, so there should be a precise ordering expectation.

This is indeed odd then.

put in some debugging with @dongahn's reproducer, and it seems the update annotations has beaten the job state transition to the job info module. will have to investigate

```
2020-09-01T21:02:47.860092Z job-info.err[0]: update_annotations: job 385825636352 not found: Success
2020-09-01T21:02:47.895717Z job-info.err[0]: update_jobs: zhashx_insert 385825636352: Success
````

job-info is subscribed to both event topics, so it should get these event messages (in order).

Wait a second, aren't event messages not guaranteed to be sent in order if they are published on different topics?

Have you tried emitting the event message sequence numbers? I think @garlick alluded above that events are always supposed to be delivered in sequence order, regardless of topic string.

Right: all events are globally sequenced (independent of topic).

Well we have a mystery on our hands, the annotations event is sent before the transitions one.

2020-09-03T01:12:58.868550Z job-info.err[0]: update_annotations: job 582706266112 not found: Success
2020-09-03T01:12:58.870109Z job-manager.err[0]: publish_continuation: sequence 46, annotations: Success
2020-09-03T01:12:58.905362Z job-info.err[0]: update_jobs: insert 582706266112: Success
2020-09-03T01:12:58.907401Z job-manager.err[0]: publish_continuation: sequence 48, transitions: Success

Will investigate

I have a lead on how this is happening. The key issue is that we are doing a flux_kvs_commit() before the publication of job state transitions & annotations (committing events from event_batch_commit_event())

  • job-manager wants to write job events for job (event_batch_commit_event())
  • job-manager wants to publish job-state transition for job (event_batch_pub_state())
  • timer expires, batch send occurs, first sending kvs commit of events and waiting for flux kvs response before publishing job-state transitions
  • job-manager wants to publish annotation for job (event_batch_pub_annotations())
  • timer expires, batch send occurs, there are no job events to write to KVS in this batch, there are no job state transitions to publish in this batch, there are annotations to publish, publish those annotations
  • kvs commit returns from first batch send, now we publish job state transitions from first batch

b/c there was only 1 kvs commit & 1 event publication before, the wait on the kvs commit didn't really matter. of course, now it does b/c there are two event publications.

thinking about potential solutions. The obvious ones are

  • job-manager - queue up batches and only send after prior ones complete
  • job-manager - when timer triggers, don't send new batch if prior one not yet complete
  • job-info - could be fixed here by just creating the job structures when an annotation arrives for a job not yet known, but don't like this approach. I think fix should be in job-manager

approaches that won't work once I thought about it

  • job-manager - if prior batch of job-state and/or job-annotations not sent, append new batch of events to the prior batch (race can occur where job state transition sent before events commited to KVS)
  • job-manager - put job-state and job-annotation on same topic (race will still exist due to the kvs commit wait)

I think either of the first two fixes in job-manager is fine, don't see an obvious pro vs con in either of them making one better/worse than the other. there could be a more creative solution i'm still not seeing that'll come to me later.

Ah I misread the code and assumed there was only one batch allocated at a time, but I think your analysis is correct. Hmm.

I'm not liking the idea of further complicating that event batching code in the job manager.

Idea: could we have a separate hash (by job ID) in job-info for annotations? Then if the annotation event arrives before the job state transition event, it doesn't matter? (Don't validate the job ID - the event comes from the instance owner so assume it's good?)

If that idea works maybe we could just have the scheduler publish annotations directly instead of routing it through the job manager? We might have to reexamine how annotations are eventually written to the KVS job record then.

I'm not liking the idea of further complicating that event batching code in the job manager.

I don't think it'll be too bad. I can give the "if prior batch not completed wait" a shot.

Idea: could we have a separate hash (by job ID) in job-info for annotations? Then if the annotation event arrives before the job state transition event, it doesn't matter? (Don't validate the job ID - the event comes from the instance owner so assume it's good?)

Yeah, this is just a slight variant from my fix in job-info idea from above. I think its easily doable. I just wasn't sure if we should fix in job-info. If in the future there are other listeners of job-state & job-annotations, everyone has to be knowledgeable of this race possibility.

I was kind of hoping we could get on a plan that eventually takes the job manager out of the loop, e.g. my last sentence above. Any thoughts on that possibility?

I really like the idea of using this opportunity to rethink the design of annotations.

If the use cases are really just a way to append extra data for job listing or other queries, it makes a lot of sense to send the annotations directly to where they are needed (job-info).

How would users annotate their own jobs though? I guess you could reject any non-instance-owner annotations when the job doesn't exist yet, but accept user annotations once the user has been associated with the jobid? (Is that just another race though?)

We might have to reexamine how annotations are eventually written to the KVS job record then.

Do the annotations even need to be saved in the eventlog? Would it be enough to potentially save them in the job-archive?

I initially mistakenly thought the annotation goes directly to job info from qmanager: thttps://github.com/flux-framework/flux-core/issues/3165#issuecomment-681007976.

For the rework, we need to think carefully about message ordering though since there is no ordering guarantee for messages/events sent from different sources.

Just tagging #3081 so we can link these issues together

Ah, yes I had forgotten about persistence across restarts.

Ah, yes I had forgotten about persistence across restarts.

If there's a separate hash of annotations in job-info, perhaps checkpoint/restart the KVS at module load/unload (not in the job namespace) would cover that?

I'm not remembering if there's a driver for persisting annotations as part of the job record? What if we define them as specifically NOT that?

How would users annotate their own jobs though? I guess you could reject any non-instance-owner annotations when the job doesn't exist yet, but accept user annotations once the user has been associated with the jobid? (Is that just another race though?)

the flux job annotate command sends annotations to the job manager, which does a security check, then eventually the publishing. so the user annotations is the one area i wonder about if the scheduler publishes annotations directly. Do we want the flux job annotate command to send out events instead? the security check done in job-manager would have to be moved to job-info, which I think is doable (may involve an extra step, as job-info has to determine guest userids that are ok).

The scheduler's running as the instance owner so validating that its events were published by the instance owner should be sufficient there.

The scheduler's running as the instance owner so validating that its events were published by the instance owner should be sufficient there.

understood, i was speaking more about validating guest user annotations, which would be bit trickier.

understood, i was speaking more about validating guest user annotations, which would be bit trickier.

Oh crap, actually this might be even more tricky than I thought. If annotations can arrive before the job-state transitions, that means that events for the job may not be in the KVS yet, which means I cannot even determine if the guest userid is a valid guest userid until later. hmmmm

Wondering if passing user annotation _through_ the job manager for job ID validation, but not capturing the annotation, makes sense? E.g. job manager could publish the event on user's behalf?

Note: event could be an RPC.

Alternatively, job manager could provide a jobid validation RPC that job-info uses? That could be quite a useful general purpose thing...

Edit: could be an authorization call that checks a proposed userid and role against job ID?

Edit: could be an authorization call that checks a proposed userid and role against job ID?

Yeah this might be a nicer interface for other services that need to do the same kind of check, as they wouldn't have to fetch userid and do the comparison themselves.

Hmm, maybe that could save job-info from fetching and parsing the eventlog?

Edit: never mind, doesn't work on jobs that are inactive.

Edit: although, it could if job manager was willing to do the fetch if jobid not found in memory. Then it could distinguish between invalid jobid and inactive jobid.

@dongahn before I forget, is this race affecting things for you right now? do you need a short-term fix?

This can wait. This only manifests itself under Valgrind so shouldn't affect production work.

I was going to add one more CI testing with FLUX_TEST_VALGRIND=t but couldn't because of this bug. But since I can alway run FLUX_TEST_VALGRIND=t manually, this can wait.

Hmm, maybe that could save job-info from fetching and parsing the eventlog?
Edit: although, it could if job manager was willing to do the fetch if jobid not found in memory. Then it could distinguish between invalid jobid and inactive jobid.

job-info already looksup / parses the eventlog and parses the userid for a guest userid check on job-info lookups & watches, so generalizing it into a service that the job-manager provides would be useful. that way I could avoid doing unnecessary lookups.

job-info already looksup / parses the eventlog and parses the userid for a guest userid check on job-info lookups & watches, so generalizing it into a service that the job-manager provides would be useful. that way I could avoid doing unnecessary lookups.

OK I'll open a separate issue on that idea since it seems generally useful.

shall we create an additional issue about the annotations moving into the schedulers? or we can lump that discussion into other issues?

This can wait. This only manifests itself under Valgrind so shouldn't affect production work.

Actually, thinking about it, i'll go ahead and fix. b/c regardless of the long term strategy (e.g. with the jobid checker or annotations sent from scheduler), the basic architecture needs to be fixed anyways, might as well do now while the code is fresh in my mind.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

garlick picture garlick  路  8Comments

garlick picture garlick  路  3Comments

SteVwonder picture SteVwonder  路  7Comments

grondo picture grondo  路  7Comments

SteVwonder picture SteVwonder  路  7Comments