Flux-core: job-manager: add general annotation service

Created on 8 Jan 2020 · 30 Comments · Source: flux-framework/flux-core

There was an initial design in the job-manager <=> scheduler protocol, not fully implemented, for the scheduler to respond to an alloc request with a human readable job annotation response each time it updated its schedule. The "note" was just a human readable string intended to be passed through to a job listing command. It was not intended to be persistent as part of the job record since it only applies to pending jobs and might be updated many times before the job runs.

Per discussion today

  • It should be structured, allowing at minimum the scheduler to set projected start time and reason under separate keys.
  • There might be a use case for jobs to annotate themselves, or for workflow logic to annotate jobs, implying something like a job-manager.setattr RPC.
  • DoS could be avoided by restricting the size of the object for guest users (or not allowing guest users to use it).
  • The job-info module could request annotations for a job or list of jobs, e.g. through a job-manager.getattr RPC.
  • The job listing tool could fetch updated annotations on demand and supply them to the user upon request.
  • Consider namespacing with rules, like putting sched annotations under a sched. namespace and not allowing setattr to touch those.
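As a rough illustration of the namespacing and size-limit points above, here is a minimal Python sketch of the checks a setattr-style request might go through. Everything in it is an assumption made for illustration: the function name `check_setattr`, the nested object layout, the 1024-byte guest limit, and the set of reserved namespaces are hypothetical and not part of flux-core.

```python
import json

# Assumption: only the scheduler may write under these top-level keys.
RESERVED_NAMESPACES = {"sched"}
# Assumption: an arbitrary cap on the serialized size for guest users.
GUEST_MAX_BYTES = 1024


def check_setattr(annotations, is_guest, from_scheduler=False):
    """Return None if the request is acceptable, else an error string."""
    if not isinstance(annotations, dict):
        return "annotations must be an object"
    if not from_scheduler:
        # Non-scheduler callers may not touch reserved namespaces.
        for key in annotations:
            if key in RESERVED_NAMESPACES:
                return f"namespace '{key}' is reserved"
    # Limit the serialized size for guests to avoid a trivial DoS.
    if is_guest and len(json.dumps(annotations)) > GUEST_MAX_BYTES:
        return "annotation object too large for guest user"
    return None
```

For example, `check_setattr({"user": {"note": "hi"}}, is_guest=True)` passes, while the same call with a top-level `sched` key is rejected unless `from_scheduler=True`.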

All 30 comments

Thanks @garlick. We also agreed that the initial set of scheduler information that can go here is 1) estimated start time and 2) reason for pending. I will think about if there is other scheduler information that can make use of this.

I'm writing up an RFC for the protocol between job manager and scheduler. Will try to have a draft by noon. Perhaps we can discuss some of the outstanding issues regarding annotation there when it's up.

In my draft so far I changed the note (string) key in the alloc response to annotations (object) in the ANNOTATE response only, leaving note (string) as is for SUCCESS and DENIED responses. Only ANNOTATE would update the job's hash of scheduler annotations.

I will propose that these updates be volatile, e.g. not present in any job eventlog entry so they would disappear when the instance restarts, which is consistent with the projected start time + reason use case articulated above. Open for discussion of course.

So following up PR #2960, I was beginning to look into how to get data from the job-manager into flux-jobs. In PR #2960 having flux-jobs read most of its info from job-info and annotations from job-manager was considered.

It occurred to me that reading data from both job-info and job-manager will probably always be racy. For example, job-info may still believe a job is in the SCHED state when job-manager believes it's in RUNNING, so annotations may not match up with expectations.

Should we accept annotations as "best effort" or "most recent we have", and possibly out of sync? It could confuse users if they get an annotation that isn't what they expect.

To avoid the raciness, I think we would need an alternate approach (send annotations with job state transitions, write to a KVS log of some sort, etc.).

Annotations by the scheduler during SCHED state are already kind of best effort:
1) the scheduler may not have considered a job in SCHED state yet, so annotations unavailable
2) the scheduler might not implement start time estimates or reason pending (they are optional)
3) (race you were thinking of?) the job has left SCHED state and start time / reason have been cleared by scheduler

It seems fine to me to just silently blank those fields in the job listing if unavailable.

I wonder if it would be a good idea to add a flux job annotate command and corresponding job manager service method now to facilitate testing? Maybe an annotation request could take a flag to indicate whether it is volatile or not, and if non-volatile, have it generate an event for the eventlog so that the job manager can re-add it to the hash during eventlog replay (along with those in the alloc event).

Incidentally those "non-volatile" annotations could be picked up by job-info via the eventlog if we want. I'm not sure if that just gets confusing though, having annotations available from two different places?

BTW how will flux jobs --format know that some arbitrary field name is an annotation? We defined a couple in the RFC but would the scheduler- or user-defined ones need some sort of special prefix?

BTW how will flux jobs --format know that some arbitrary field name is an annotation? We defined a couple in the RFC but would the scheduler- or user-defined ones need some sort of special prefix?

I would think flux-jobs would be able to fetch the entire annotation as a dotdict and use it as annotation.sched.t_estimate or note.sched.t_estimate.

Also, it comes to mind that flux-jobs isn't going to be the only consumer of this data down the road. Another use case might be something like flux top or a dashboard which dynamically updates job information. It would be nice if these utilities could subscribe to updates for job-info, refreshing info only for jobs that have changed. For most job data the utility could subscribe to job events, but not for annotations, which silently change in the job manager between states. Utilities will have to poll for this data for all pending jobs, if they wish to display it. Not sure what my point is exactly, I guess just that we should consider the access mechanism in general, not just how to get the data into flux-jobs.

Incidentally those "non-volatile" annotations could be picked up by job-info via the eventlog if we want. I'm not sure if that just gets confusing though, having annotations available from two different places?

That sounds nice, then flux jobs and other utilities could filter on these non-volatile annotations, which I think would be one of the eventual use cases. However, I do think it is a bit confusing. Maybe the non-volatile annotations are not annotations but something else? Perhaps we leave off non-volatile annotations for now until there is a use case?

I wonder if it would be a good idea to add a flux job annotate command and corresponding job manager service method now to facilitate testing? Maybe an annotation request could take a flag to indicate whether it is volatile or not, and if non-volatile, have it generate an event for the eventlog so that the job manager can re-add it to the hash during eventlog replay (along with those in the alloc event).

That seems like a good idea. Better than the modification of sched-simple, which was the path I was pondering :P

Incidentally those "non-volatile" annotations could be picked up by job-info via the eventlog if we want. I'm not sure if that just gets confusing though, having annotations available from two different places?

I like the idea, but as @grondo says, best for when we have a use case.

I would think flux-jobs would be able to fetch the entire annotation as a dotdict and use it as annotation.sched.t_estimate or note.sched.t_estimate.

This was my idea, and if a field doesn't happen to exist, just output empty string. Not sure how to handle the header though, but that can be dealt with later.
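The dotted-name lookup discussed here, with a missing field rendered as an empty string, could be sketched in Python roughly as follows. The helper name `lookup` and the nested dictionary layout are assumptions for illustration only, not the flux-jobs implementation.

```python
def lookup(annotations, dotted_key, default=""):
    """Resolve a key such as 'sched.t_estimate' against a nested dict.

    Returns `default` (an empty string here) when any path component
    is missing, so a listing tool can silently blank the field.
    """
    node = annotations
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

With `{"sched": {"t_estimate": 1578500000.0}}` as input, looking up `sched.t_estimate` returns the number and looking up `sched.reason_pending` returns an empty string.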

Also, it comes to mind that flux-jobs isn't going to be the only consumer of this data down the road. Another use case might be something like flux top or a dashboard which dynamically updates job information. It would be nice if these utilities could subscribe to updates for job-info, refreshing info only for jobs that have changed. For most job data the utility could subscribe to job events, but not for annotations, which silently change in the job manager between states. Utilities will have to poll for this data for all pending jobs, if they wish to display it. Not sure what my point is exactly, I guess just that we should consider the access mechanism in general, not just how to get the data into flux-jobs.

Having the annotations in job-info is my preferred goal, b/c of just these reasons. It centralizes everything in one interface.

If we accept annotations as "best effort" and possibly "out of sync" (e.g. job-info may be in the SCHED state for a job but has the RUNNING annotation), I think having everything in job-info will be pretty easy. We can have a streaming service send annotations every time they get updated. Because there can be raciness between job-info and job-manager on when job IDs are visible (e.g. a streamed annotation could beat the event state transition to the job-info module, so job-info isn't aware of the job ID yet), there should be a one-time lookup of the current annotation whenever we first get into the SCHED state.
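A minimal Python sketch of that idea, assuming a hypothetical `fetch_current` callable that stands in for a one-time lookup RPC to the job manager: streamed updates for job IDs that job-info has not seen yet are dropped, and the lookup on entering SCHED catches up. The class and method names are invented for illustration.

```python
class JobInfoAnnotations:
    """Toy model of job-info's view of annotations (hypothetical API)."""

    def __init__(self, fetch_current):
        # fetch_current: callable mapping jobid -> current annotations dict.
        self.fetch_current = fetch_current
        self.jobs = {}

    def on_sched_state(self, jobid):
        # First time the job is seen in SCHED: do the one-time lookup.
        self.jobs[jobid] = dict(self.fetch_current(jobid))

    def on_streamed_annotation(self, jobid, annotations):
        # A streamed update may beat the state transition event.
        if jobid not in self.jobs:
            return False  # unknown job; the one-time lookup will catch up
        self.jobs[jobid] = dict(annotations)
        return True
```

The trade-off in this sketch is that an early update is simply discarded rather than queued, which is only safe because the later lookup returns the most recent annotations.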

I hate to see us spend too much time on machinery to replicate this data across the two modules.

Maybe as a stopgap we could just publish an event? If we batched it up with the state transition events, it seems like we could close that race. (published under a separate topic string, I was thinking).

I know I balked at that before but it might be the lesser evil at this point.

Maybe as a stopgap we could just publish an event? If we batched it up with the state transition events, it seems like we could close that race. (published under a separate topic string, I was thinking).

I thought the problem was that annotations are updated at a different rate than job state events? Or did you mean "batched up like the job state transitions" (sorry if I've misread that)

I hate to even ask it, but why are we sending annotations to job-manager? They seem to be superfluous to the management of job state transitions, which is the primary role of job-manager. Everything would seem to be easier if annotations were sent directly to job-info by the annotator? I've probably missed something obvious, so sorry for bringing it up.

Edit: I also meant to say that I like the idea of the batched event update. Dashboard-like utilities could use this along with job state events to refresh most useful data without hammering services with RPCs, streaming or otherwise.

Maybe as a stopgap we could just publish an event? If we batched it up with the state transition events, it seems like we could close that race. (published under a separate topic string, I was thinking).

Yeah, it would definitely close the race if we combined it with the state transition events.

Why publish in a separate topic string? Do you see other subscribers to the job state transitions? B/c I think job-info is the only subscriber right now.

I know I balked at that before but it might be the lesser evil at this point.

Agreed.

Do you see other subscribers to the job state transitions? B/c I think job-info is the only subscriber right now.

One of the workflow examples uses the job state transition events, so I don't think you can assume job-info is the only subscriber.

I thought the problem was that annotations are updated at a different rate than job state events? Or did you mean "batched up like the job state transitions" (sorry if I've misread that)

Yeah, I meant batched using the same mechanism but sent out under a different topic string.

I hate to even ask it, but why are we sending annotations to job-manager? They seem to be superfluous to the management of job state transitions, which is the primary role of job-manager. Everything would seem to be easier if annotations were sent directly to job-info by the annotator? I've probably missed something obvious, so sorry for bringing it up.

Valid point IMHO.

If we had the job manager publish this as an event, we could proceed as planned with #2960 and consider the following optimizations later:
1) not storing the annotations in the job manager (publish and forget)
2) moving annotation event publication to the scheduler (or libschedutil) and eliminating completely from the job manager, although I guess that would re-introduce the race, if we care about that

I think that is a good strategy. It probably isn't a good idea to attempt to optimize for unknown use cases at this point.
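To make the "same batching mechanism, different topic string" idea concrete, here is a small Python sketch. The topic names and the `EventBatcher` class are hypothetical placeholders, not the event names or code used by the job manager.

```python
# Hypothetical topic strings, for illustration only.
STATE_TOPIC = "example.job-state"
ANNOTATION_TOPIC = "example.job-annotations"


class EventBatcher:
    """Collect entries per topic and publish each topic's batch on flush."""

    def __init__(self, publish):
        # publish: callable taking (topic, list_of_entries).
        self.publish = publish
        self.pending = {}

    def add(self, topic, entry):
        self.pending.setdefault(topic, []).append(entry)

    def flush(self):
        # Topics are published in the order they were first added to.
        for topic, entries in self.pending.items():
            self.publish(topic, entries)
        self.pending = {}
```

Because both kinds of entries go through one batcher and are flushed together, a subscriber sees the state transition batch and the annotation batch from the same flush, while tools that only care about state transitions can ignore the annotation topic.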

So one subtlety I've hit in some testing. If a job is pending/running and has an annotation in it, should annotations be available after it is no longer running?

I can see arguments for both. For jobs that ran, sched.resource_summary can be useful after the fact. But it can also be confusing to see things like sched.reason_pending in jobs that got canceled.

Been going back and forth on this. What are thoughts from others?

Users will probably want to see certain annotations, such as which queue the job ran in. Given that RFC 27 says:

Upon Alloc success:

If present, the job manager SHALL update the job's annotation dictionary as described in the next section. The scheduler MAY delete annotations such as sched.t_estimate that are not relevant now that the allocation request has been satisfied.

Annotation response:

Annotations SHALL be considered volatile until a SUCCESS response is received to the sched.alloc request, as described in Alloc Success above.

So perhaps all the annotations that the scheduler didn't delete on alloc success should still be available after the job completes? A sane scheduler should probably delete a key like sched.t_estimate and sched.reason_pending on alloc success but availability should really be the scheduler's choice?

I haven't had hands-on with the new annotations protocol so take it with a grain of salt, though.

That was my understanding @dongahn. I think we may still need to add support for replaying annotations from the eventlog when the job manager / instance restarts, though.

A sane scheduler should probably delete a key like sched.t_estimate and sched.reason_pending on alloc success but availability should really be the scheduler's choice?

Agreed. That is what I've been doing thus far.

Perhaps the case that is a little more iffy: Would we generally want sched.reason_pending to be available if a job was pending and it got canceled?

Ah I guess we didn't cover that case in the RFC. It's ambiguous because when the alloc doesn't end in success the scheduler doesn't have the opportunity to clear the annotations, but since they aren't part of any eventlog entry, they will go away if the instance restarts or job manager is reloaded.

Should we clear sched.* when the alloc ends in cancellation or error?

Should we clear sched.* when the alloc ends in cancellation or error?

Right now I'm clearing them via an ANNOTATE response before I respond with CANCEL. Would this be the general pattern to use? Or would we want a flag or something else in the CANCEL response?

Should we clear sched.* when the alloc ends in cancellation or error?

The RFC says

Annotations MAY be discarded by the job manager if the allocation fails.

Are you suggesting that be changed to a SHALL to avoid requirement of a separate RPC? I think that makes sense so that we don't have a requirement for extra, unnecessary RPCs.

or would we want a flag or something else in the CANCEL response?

Avoiding extra RPCs seems like a good idea to me, though I haven't been working on this problem so my opinion shouldn't count for much. I actually think the annotation should be cleared on failure by default, with a future opportunity for a scheduler to add extra annotations in the CANCEL response, since that is likely to be the uncommon case.

I hadn't thought about amending the RFC since MAY allows the job manager to clear those annotations. But a SHALL would make it clear that the scheduler doesn't need to do it so maybe a good idea.

But a SHALL would make it clear that the scheduler doesn't need to do it so maybe a good idea.

You mean "does need to do it"?

No - and hopefully I'm understanding the issue - if the job manager SHALL automatically discard annotations on alloc failure, then the scheduler _does not_ need to issue an extra annotate response to set reason/estimate to NULL.

No - and hopefully I'm understanding the issue - if the job manager SHALL automatically discard annotations on alloc failure, then the scheduler does not need to issue an extra annotate response to set reason/estimate to NULL.

Ahh agreed. I misread earlier. "_scheduler_ doesn't need to do it", not "_job-manager_ doesn't need to do it"
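The behavior being agreed on here can be sketched in Python as below. This is a toy model under stated assumptions: the response types are plain strings standing in for the protocol's response types, a null (`None`) value is assumed to delete a key, and `apply_alloc_response` is a hypothetical name rather than a job manager function.

```python
def apply_alloc_response(job_annotations, response_type, annotations=None):
    """Update a job's annotation dict for one alloc response (toy model)."""
    if response_type in ("ANNOTATE", "SUCCESS"):
        sched = job_annotations.setdefault("sched", {})
        for key, value in (annotations or {}).get("sched", {}).items():
            if value is None:
                sched.pop(key, None)  # assumption: null deletes the key
            else:
                sched[key] = value
        if not sched:
            job_annotations.pop("sched", None)
    elif response_type in ("DENIED", "CANCEL"):
        # Job manager discards scheduler annotations on alloc failure,
        # so the scheduler need not send an extra ANNOTATE first.
        job_annotations.pop("sched", None)
    else:
        raise ValueError(f"unknown response type: {response_type}")
    return job_annotations
```

In this sketch a scheduler that wants `sched.resource_summary` to outlive the pending phase sets it in the SUCCESS response and nulls out `sched.t_estimate` and `sched.reason_pending` at the same time.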

Hmmm, we log annotations on a SUCCESS response in case of a module restart. But user annotations could occur before or after an annotation has been written out. So any annotations after the job has started running wouldn't be re-loaded.

Should we not allow user annotations after a job is running? Or should we consider user annotations after it started running as volatile?

Started to look at annotations support in flux-jobs and noticed a subtlety.

Right now, if an annotation such as sched.reason_pending is set, it is listed in the annotations object like this:

{"sched.reason_pending": "no cores"}

Perhaps this is not what we want? Do we actually want:

{"sched":{"reason_pending": "no cores"}}

In RFC 27 we talk about namespaces with annotations, and I assumed the former, but now think we want the latter? The latter (when it is ultimately absorbed into an object in flux-jobs) can be formatted by the user as sched.reason_pending, sched.t_estimate, sched.foobar, etc.

@chu11: Seems the latter makes more sense. Do we need a namespace other than sched in the sched annotation API, though? Are you thinking to extend this to support annotations from other subsystems?

I think the sched key/namespace is more than enough for sched. I was mostly thinking of a user namespace for annotations created by the user (see PR #3044).

To support the above, I would have to update how the internal annotations object is updated. Just calling json_object_set() wouldn't work anymore.
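The reason a plain top-level set no longer suffices can be shown with a short Python sketch: replacing the top-level `sched` key discards sibling keys, so the update has to recurse into nested objects. The function name `merge_annotations` is hypothetical, and the sketch uses Python dicts rather than the jansson objects the job manager works with.

```python
def merge_annotations(current, update):
    """Recursively merge `update` into `current`, in place."""
    for key, value in update.items():
        if isinstance(value, dict):
            child = current.get(key)
            if not isinstance(child, dict):
                child = current[key] = {}
            # Descend so that sibling keys under `key` are preserved.
            merge_annotations(child, value)
        else:
            current[key] = value
    return current


# A top-level set (dict.update here) replaces the whole "sched" object.
naive = {"sched": {"reason_pending": "no cores"}}
naive.update({"sched": {"t_estimate": 100.0}})

# The recursive merge keeps both keys under "sched".
merged = merge_annotations(
    {"sched": {"reason_pending": "no cores"}},
    {"sched": {"t_estimate": 100.0}},
)
```

After running this, `naive` has lost `reason_pending`, while `merged` holds both `reason_pending` and `t_estimate` under `sched`.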
