The queue maintained by job-manager as submitted in PR #1707 is ordered by jobid, which roughly tracks submission time.
Since the scheduler may only see a subset of the queued jobs (if it so chooses) and the jobs are presented to the scheduler in queue order, there needs to be a priority factor involved in ordering jobs so that administrative expedite can work, and a user can request job priority at submission time.
We had discussed adding a single integer priority to each job, then ordering jobs first by priority, then by submission time. There would be a default priority which the user could override at submission time, as well as a way to adjust the priority of already submitted jobs (which would cause them to be moved within the queue).
The instance owner would be allowed to set priority of a job to any value.
A guest would be allowed to set priority of their jobs to a value less than the default value.
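As a minimal sketch of the two permission rules above (not an actual job-manager interface): the function name, the default value of 0, and the choice to let a guest keep the default as well as lower it are all assumptions made for illustration.

```python
# Hypothetical default; the discussion does not fix a concrete value.
DEFAULT_PRIORITY = 0


def may_set_priority(is_owner: bool, requested: int,
                     default: int = DEFAULT_PRIORITY) -> bool:
    """Return True if the requester may set a job to the requested priority."""
    if is_owner:
        # The instance owner may set any value.
        return True
    # A guest may only lower the priority of their own jobs.
    # Assumption: requesting exactly the default is also accepted.
    return requested <= default
```

With these assumptions a guest asking for +1 is refused, while the instance owner can set any value.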
How this impacts the scheduler interface was discussed somewhat in #1697.
We had discussed adding a single integer priority to each job
I believe we also mentioned that this priority could (should?) be constrained to a limited set of values (e.g., -20 -> 20) to maximize the potential for aggregating jobs based on priority (important when requesting credits).
Offline, we also noted the need for additional priority "knobs" in the case of a fair-share scheduler. This is to prevent a high priority job (according to the fair-share prioritization) from being stalled by the jobs already in the job manager's and scheduler's queues. It is worth noting that historically the priority set by the scheduler has been either a floating point value or an integer with a large range of possible values.
I propose three potential solutions:
The first solution is to add an additional numerical priority value to the job that is set by the external prioritization module. If we add this extra priority value, the question then becomes, which priority takes precedence in the job manager module, the integer priority with the limited range or the broader external priority? If the latter, we will probably lose most of our ability to aggregate jobs based on priority.
Another solution is to retain the single, limited range integer priority value by having the external prioritization map its priority down to the smaller range. This will make implementation at the job manager level easier in exchange for a significant amount of complexity at the priority module.
The third solution is to retain a single value but allow it to have a larger range when there is an external prioritization module. In this scenario, the user would still submit their job with the limited range priority, and the external prioritization module would then take that value into account when calculating the newer, larger range priority.
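To make the second solution more concrete, here is a rough sketch of how an external priority could be mapped down to the limited range. The linear scaling, the function name, and the assumption that the external module knows its own minimum and maximum are illustrative only; the -20 to 20 bounds are the example range mentioned earlier.

```python
def to_limited_range(external: float, ext_min: float, ext_max: float,
                     lo: int = -20, hi: int = 20) -> int:
    """Linearly map an external priority onto the integers lo..hi."""
    if ext_max <= ext_min:
        raise ValueError("ext_max must be greater than ext_min")
    # Clamp first so out-of-range inputs land on the nearest bound.
    clamped = min(max(external, ext_min), ext_max)
    fraction = (clamped - ext_min) / (ext_max - ext_min)
    return lo + round(fraction * (hi - lo))
```

The mapping is lossy by design: many distinct external values collapse into the same bucket, which keeps aggregation possible but pushes the choice of bounds and scaling into the priority module.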
Thanks for summarizing that discussion @SteVwonder. I was hoping you would jump in and do that.
The first solution is to add an additional numerical priority value to the job that is set by the external prioritization module. If we add this extra priority value, the question then becomes, which priority takes precedence in the job manager module, the integer priority with the limited range or the broader external priority? If the latter, we will probably lose most of our ability to aggregate jobs based on priority.
It seems to me like it would work to sort the queue by 1) integer priority, 2) fair-share priority, 3) submit time. Since the user (e.g. non instance owner) can only lower the integer priority, it could not be used to usurp the position of a job with a higher fair-share priority unless the instance owner demands it, and then it probably _should_ usurp.
It seems to me like it would work to sort the queue by 1) integer priority, 2) fair-share priority, 3) submit time.
I agree. That order makes the most sense to me.
If we go with multiple priority values, we will need to think carefully about how to handle the situation where the LC Hotline intervenes and boosts a job's priority. Would they have to boost all the priorities, or just one? If they have to boost all of the priorities, how many modules will their tool have to communicate with? Can it be accomplished with just a single RPC?
When I proposed the above sorting order, I meant that the integer priority would have the heaviest weight, so all +1 jobs would run before 0 jobs regardless of fair share priority. So it would seem like the hotline could just +1 (or +whatever) to the integer priority and be done.
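A minimal sketch of that ordering, assuming a larger value means "runs first" for both priorities and that jobid increases with submission time. The `Job` fields and `queue_key` name are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    jobid: int         # assumed to increase with submission time
    priority: int      # limited-range integer priority, larger runs first
    fair_share: float  # external priority, larger runs first (assumption)


def queue_key(job: Job):
    """Sort key: integer priority, then fair-share priority, then submit order."""
    return (-job.priority, -job.fair_share, job.jobid)


jobs = [Job(1, 0, 0.9), Job(2, 1, 0.1), Job(3, 0, 0.9), Job(4, 0, 0.95)]
order = [job.jobid for job in sorted(jobs, key=queue_key)]
```

Job 2 sorts first despite having the lowest fair-share value, because its integer priority is +1. That is the behavior the hotline case relies on.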
Did you guys also discuss how the scheduler should get the fair share priority of each job and apply that to its own queue?
What happens if the fair share priority of a job (A) significantly decreases while it is in the scheduler's queue, but in the meantime another job (B) with a higher fair share priority hasn't yet made an allocate request? This could cause a lower priority job to be allocated first, since the scheduler hasn't seen job B.
Maybe the proposed solution is safe only when we can assume that the priorities (both the integer priority and an externally calculated priority like fair share) of a job don't change once the job manager module has attached them to the job at the beginning?
Did you guys also discuss how the scheduler should get the fair share priority of each job and apply that to its own queue?
We had talked about fair share accounting "service" being outside of the scheduler - something that runs periodically to adjust fair share priorities for each job. The two priority numbers would need to be tracked by the job manager to order its internal queue.
Presumably the priorities (integer and fair share) could be part of the allocate requests sent to the scheduler? If they change, it seems like we would want to abort the allocate request and resubmit with new values (if sufficient credits are available after submitting the highest priority jobs)?
We discussed the credit scheme in #1697, which could provide a basis for aborting low priority requests already "in" the scheduler when higher priority requests are waiting for credits. More thought required and working out some specific use cases IMHO.
Presumably the priorities (integer and fair share) could be part of the allocate requests sent to the scheduler? If they change, it seems like we would want to abort the allocate request and resubmit with new values (if sufficient credits are available after submitting the highest priority jobs)?
Yes, generally, unless there is "eviction" logic and a robust coordination/consistency protocol between the job manager's and scheduler's queues, this will be hard to get right IMHO.
From Lipari, I was under the impression the fair share priorities would keep changing. Presumably we shouldn't assume the fair share priority (or other external priority schemes in general) to be a quantity only set in the beginning. Just my .02.
unless there will be an "eviction" logic and a robust coordination/consistency protocol between the job manager and scheduler's queues, this will be hard to get right IMHO.
Maybe the job manager can maintain the window of jobs with pending allocate requests and then periodically reevaluate all of the jobs with respect to the fair share priorities to see if any of the pending jobs should be evicted. If so, it can use cancel requests to evict them from the scheduler's queue.
If our consistency protocol is enforced at the job manager level, the scheduler doesn't even have to look at the fair share priorities, and this can make the scheduler design simpler (at the expense of a bit higher complexity at the job manager level).
I have to think that this scheme should in general work not only for fair share but also for any other types of priority that come from outside...
Presumably we shouldn't assume the fair share priority (or other external priority schemes in general) to be a quantity only set in the beginning.
Agreed. At least at LLNL, job priority is a function of time, so the priority continually changes throughout the lifetime of the job. Although the priorities are constantly changing, one important property is that once a set of jobs are in the queue, their order with respect to one another won't frequently change. Jobs will only change order when a user has a large job complete, causing the user's historical usage to increase, thus decreasing the priority of the user's currently queued jobs. The other way that the scheduler priority queue will change is the submission of a new job, but that case is more easily handled (as @garlick mentioned, #1697 outlines one potential solution).
Given that job order won't change often (outside of new jobs), I think we should optimize for the common case, and then ensure that the system is eventually consistent w.r.t. the previously mentioned scenario that occurs after job completions. One potential implementation: re-prioritization would run for new jobs when they are submitted, for all jobs as a cron job (i.e., configurable, fixed period of time between runs), and for all of a user's jobs whenever one of their jobs completes (to optimize for ensembles, there could be a size threshold, below which, no action occurs).
If our consistency protocol is enforced at the job manager level, the scheduler doesn't even have to look at the fair share priorities, and this can make the scheduler design simpler (at the expense of a bit higher complexity at the job manager level).
Unfortunately, I don't believe the scheduler can be entirely priority agnostic. Even if the job manager does the heavy lifting to ensure only the N highest priority jobs are in the scheduler's queue, the scheduler will still have to look at the priorities to determine the order of the jobs it does know about. This would also result in significant complexity if the job manager is decentralized (assuming zero cooperation from the scheduler).
Unfortunately, I don't believe the scheduler can be entirely priority agnostic. Even if the job manager does the heavy lifting to ensure only the N highest priority jobs are in the scheduler's queue, the scheduler will still have to look at the priorities to determine the order of the jobs it does know about. This would also result in significant complexity if the job manager is decentralized (assuming zero cooperation from the scheduler).
Good point!
As long as the job manager guarantees that the scheduler sees the N highest priority jobs, the scheduler can look at the priorities of those jobs from the KVS job schema for its own job ordering? This means that the fair share priority module will have to periodically update the priorities of the jobs and store them along with the jobs in the KVS. And perhaps this can be done when the job manager does its reevaluation.
This way, the scheduler worries about one less dependency and can be easily layered with any type of priority module beyond fair share.
Given that job order won't change often (outside of new jobs), I think we should optimize for the common case, and then ensure that the system is eventually consistent w.r.t. the previously mentioned scenario that occurs after job completions.
Good point. But I'm not yet convinced that "eventually consistent" is the best we can do. If the N highest priority jobs won't change much, why can't we provide "strongly consistent" like I proposed? It seems it would be a matter of the job manager recruiting the fair share priority module as part of its control loop?
The fair share priority module will have to be driven one way or the other to periodically factor in the job priorities. It seems we have two options: both the job-manager and the scheduler drive it; or a single module, the job-manager, drives it and stores the calculated output to the KVS. The latter seems cleaner and more tractable IMHO.
If the N highest priority jobs won't change much, why can't we provide "strongly consistent" like I proposed?
Sorry, I probably should not have used "consistency". Maybe a better way to express what I meant is that job re-prioritization should not be on the critical path of the scheduler. Specifically, when a job completes, the scheduler should not have to wait for job priorities to update before scheduling the next job. It should continue scheduling jobs, assuming that the priorities stayed the same. In this case, the scheduler will be working with out-of-date priorities, but the probability that they change enough to make a difference in scheduling is small, and the performance cost of blocking is high. Maybe the right phrase here is "best effort".
I also meant that we shouldn't worry too much about "drift" between the "real" priority of a job and its "current" calculated priority. The half-life decay for job usage at LLNL is two weeks, so I think having a priority that is 10s of minutes (maybe even an hour) old seems reasonable. We also probably should not be too worried about recalculating priority at a uniform frequency for all jobs. Specifically, it should be OK for Job A to have had its priority recalculated 20 times in the last hour while Job B only had its priority recalculated once in the last hour.
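To put a rough number on that drift: assuming historical usage decays exponentially with the two-week half-life (the exact fair-share formula is not spelled out in this thread, so treat this as an illustration only), the decay factor over a given age is 0.5 ** (age / half_life).

```python
HALF_LIFE_HOURS = 14 * 24  # two weeks, as quoted above


def decay_factor(age_hours: float,
                 half_life_hours: float = HALF_LIFE_HOURS) -> float:
    """Fraction of historical usage still counted after age_hours."""
    return 0.5 ** (age_hours / half_life_hours)


# Relative change in the decayed usage term if the priority is one hour stale.
drift_after_one_hour = 1.0 - decay_factor(1.0)
```

Under this assumption, an hour of staleness changes the decayed usage term by roughly 0.2%, which supports treating hour-old priorities as good enough.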
By "strongly consistent", do you mean that as soon as the job priority is recalculated, the job manager module immediately begins working to update the job's position/presence in the scheduler's queue? If so, I agree. If not, could you elaborate on what you mean?
The fair share priority module will have to be driven one way or the other to periodically factor in the job priorities. It seems our options are two: both job-manager and scheduler drive it; or a single module, job-manager drives it and The latter seems cleaner and more tractable IMHO.
Good points, and I agree. I don't see a need for the scheduler to directly interact with the priority module. I imagine that the priority module will occasionally "wake up" (via cron or a timer cb) and re-calculate (every?) job priority on its own, but in other scenarios, it will be calculating priority for specific jobs at the request of the job manager module.
This way, the scheduler worries about one less dependency and can be easily layered with any type of priority module beyond fair share.
Sorry, I am not sure I follow. What dependency does the scheduler no longer have to worry about?
store the calculated output to KVS.
the scheduler can look at the priorities of those jobs from the KVS job scheme for its own job ordering?
I agree that the priority should be saved out to the KVS for resiliency, but we probably want to avoid the scheduler watching/polling the KVS for job priority changes (would add complexity to the scheduler code and overhead in the KVS). Maybe we need to add a new sched/job manager module interface to #1697 specifically for re-prioritizing a job that already has an outstanding allocate request. Alternatively, the job manager module could cancel the outstanding allocate request and resubmit the allocate with an updated priority.
I imagine that the priority module will occasionally "wake up" (via cron or a timer cb) and re-calculate (every?) job priority on its own, but in other scenarios, it will be calculating priority for specific jobs at the request of the job manager module.
Why can't the job manager periodically trigger the re-calculation?
Sorry, I am not sure I follow. What dependency does the scheduler no longer have to worry about?
Like you said, I don't see a need for the scheduler to directly interact with the priority module.
I agree that the priority should be saved out to the KVS for resiliency, but we probably want to avoid the scheduler watching/polling the KVS for job priority changes (would add complexity to the scheduler code and overhead in the KVS).
I don't think the scheduler needs to watch or poll at all. As part of its scheduling loop, it will look at the fair share priorities of its N highest priority jobs and order them properly. I think the best way would be to make it so that how frequently the fair share priorities are calculated and stored into the KVS won't be a concern of the scheduler. The scheduler doesn't have to know about the priority changes until it gets a job or resource event.
Maybe we need to add a new sched/job manager module interface to #1697 specifically for re-prioritizing a job that already has an outstanding allocate request.
Hmmm. Maybe I'm missing something.
I thought that you already have most of the interfaces. If the job manager has logic to keep track of the N highest priority jobs, it can cancel the allocate requests of the jobs whose priorities are no longer high enough. My guess is this should be enough to guarantee queue consistency.
Why can't the job manager periodically trigger the re-calculation?
Great point. This actually makes more sense than the bulk cron job I proposed earlier given that the job manager will "know" (have in memory) what jobs are active and the priority module will not.
As part of its scheduling loop, it will look at the fair share priorities of its N highest priority jobs and order them properly.
Just to be sure, w.r.t. "look at the fair share priorities", these are the priorities provided by the job manager module when it makes the allocate requests, correct?
I think the best way would be to make it so that how frequently the fair share priorities are calculated and stored into the KVS won't be a concern of the scheduler.
Agreed!
If the job manager has logic to keep track of the N highest priority jobs, it can cancel the allocate requests of the jobs whose priorities are no longer high enough. My guess is this should be enough to guarantee queue consistency.
Good point. That should be sufficient. I think I was making the mistake of trying to prematurely optimize and replace the two RPCs required to update a job's priority with a single RPC.
Just to be sure, w.r.t. "look at the fair share priorities", these are the priorities provided by the job manager module when it makes the allocate requests, correct?
Sorry, I meant the fair share priorities that are being continuously updated in the KVS. To make it clear, the proposal is:
1. The job manager tracks those jobs that have pending allocate requests. The size would be determined by the credit handshaking with the scheduler.
2. The job manager periodically evaluates the jobs to see if there are other submitted jobs whose fair-share priorities are actually higher than any of the jobs being tracked in step 1.
3. If a higher-priority job is found, the job manager causes the lower priority job to be evicted from the scheduler using a cancel request.
This protocol would guarantee the consistency between the scheduler and job manager as to what the highest-priority jobs are. Maybe, call it the queue-consistency protocol or similar...
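A minimal sketch of one reevaluation pass of such a protocol, using plain Python containers rather than any real job-manager data structures. The function name, the `cancel` callback, and the "larger value is higher priority" convention are assumptions for illustration.

```python
def reevaluate(pending, waiting, priority, cancel):
    """Evict pending jobs that no longer belong in the top-N window.

    pending  -- set of jobids with outstanding allocate requests (size N)
    waiting  -- set of jobids queued in the job manager but not yet sent
    priority -- dict mapping jobid to its current fair-share priority
    cancel   -- callable invoked with each jobid whose request is cancelled
    Returns the list of evicted jobids.
    """
    window = len(pending)
    # Rank every known job; ties fall back to jobid (submission order).
    ranked = sorted(pending | waiting, key=lambda j: (-priority[j], j))
    should_be_pending = set(ranked[:window])
    evicted = []
    for jobid in sorted(pending - should_be_pending):
        cancel(jobid)  # stands in for the cancel request to the scheduler
        evicted.append(jobid)
    return evicted


cancelled = []
evicted = reevaluate({1, 2}, {3}, {1: 5.0, 2: 1.0, 3: 3.0}, cancelled.append)
```

Here job 2 is evicted because waiting job 3 now outranks it. The credits freed by the cancel would then let the job manager send job 3's allocate request.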
It seems like we are getting to a consensus on managing priorities with the job manager, in a scheduler independent way. Good discussion - thanks everybody!
Let's refocus this issue on the details of the job manager's notion of priorities. Is this a reasonable approach based on the discussion?
primary priority
KVS Job schema: set priority_primary key, and log change to eventlog
secondary priority
KVS Job schema: set priority_secondary key, and log change to eventlog
The job manager queue would be ordered by 1) primary priority, 2) secondary priority, 3) submission order.
Priority changes should be announced by event message so that a distributed job manager can keep the queue order synchronized. For bulk update such as for fair-share adjustment, some sort of batching should be used to avoid generating a huge number of event messages.
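One possible shape for that batching, sketched with plain JSON payloads. The payload layout, the chunk size, and the function name are all hypothetical, and no real event topic or messaging API is implied.

```python
import json


def batch_priority_events(updates, max_per_event=1000):
    """Yield JSON payloads, each carrying up to max_per_event updates.

    updates -- list of (jobid, priority_secondary) pairs
    """
    for start in range(0, len(updates), max_per_event):
        chunk = updates[start:start + max_per_event]
        yield json.dumps({"jobs": [[jobid, prio] for jobid, prio in chunk]})


# A fair-share sweep touching 2500 jobs becomes 3 messages instead of 2500.
updates = [(jobid, float(jobid)) for jobid in range(2500)]
payloads = list(batch_priority_events(updates))
```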
Loose ends:
KVS Job schema: set priority_secondary key, and log change to eventlog
It isn't clear if you mean to log priority changes to the main KVS eventlog for the job, or if this proposal is for a priority-specific event log. Depending on implementation, I could imagine the secondary priority being updated quite often, which would generate a lot of entries in the eventlog for every job (in fact, I'd imagine for a job queued for any amount of time, this would be the bulk of entries in the eventlog)
Even if you propose a priority-specific eventlog, what is the purpose of saving the history of all priority changes for jobs? It sounds like there is already a broadcast event for every priority update, so nothing would be monitoring these eventlogs (of course, I'm sure I've missed something obvious!)
Edit: also, priorities only make sense relative to other priorities, so the values might not be useful as historical data.
However, logging an event when a primary priority update is requested indeed might be interesting for job provenance.
Good point! Maybe just log primary priority changes to the main job eventlog. That seems more useful to have in the history.
Sending out an event message when the secondary priority changes might also be a bad idea then. Hmm...
For fair share, it seems like there needs to be a way to set secondary priority at submission time. What sort of mechanism is needed for that?
We already need a hook for one or more validators, would it make sense to use the same mechanism for an optional secondary priority calculation?
In fact, a stack of "plugins" for the job manager that may validate, verify, and possibly annotate jobs would be useful. The plugin interface could return a future instead of being a synchronous callback, which would allow plugins to be implemented by services instead of code. Unless otherwise configured, the job manager could actually call all plugins in parallel...
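A rough sketch of what such a plugin stack could look like, using Python's asyncio as a stand-in for futures. The plugin names, the job dictionary layout, and the convention that a plugin returns a dict of annotations or raises to reject the job are all assumptions for illustration, not an existing interface.

```python
import asyncio


async def validate_resources(job):
    """Hypothetical validator: reject jobs that request no nodes."""
    if job.get("nnodes", 0) < 1:
        raise ValueError("job must request at least one node")
    return {}


async def annotate_fair_share(job):
    """Hypothetical annotator standing in for an RPC to a priority service."""
    await asyncio.sleep(0)  # placeholder for waiting on the service reply
    return {"priority_secondary": 0.5}


async def run_plugins(job, plugins):
    """Run all plugins concurrently and merge their annotations."""
    results = await asyncio.gather(*(plugin(job) for plugin in plugins))
    annotations = {}
    for result in results:
        annotations.update(result)
    return annotations


plugins = [validate_resources, annotate_fair_share]
annotations = asyncio.run(run_plugins({"nnodes": 2}, plugins))
```

Because each plugin is awaited rather than called synchronously, it makes no difference to the caller whether a plugin is local code or a remote service. Any plugin that raises causes the whole submission to be rejected.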
KVS Job schema: set priority_primary key, and log change to eventlog
Who would watch this eventlog? Will the job manager be distributed, and will the distributed components use this key to enforce queue consistency?
KVS Job schema: set priority_secondary key, and log change to eventlog
Same question.
And I have to agree with @grondo, though: this can lead to lots of updates and notifications. Given the characteristics of the secondary priority, I have to think that a periodic checking scheme at the job manager's own pace would be more scalable? With that, of course, the job manager won't be able to re-order the queue immediately when the changes come. But if the job manager also triggers the external priority calculation, this may be okay?
Closing as I think we've moved past most of the issues discussed here.