The front end tool for listing jobs (#2416) needs support for querying more info than job-manager can currently provide, such as job name and a resource summary only found in jobspec, inactive jobs, etc. Rather than augmenting the job info currently tracked by the job-manager, e.g. #2411 (track inactive jobs in job-manager), one idea discussed today was having job-info handle these queries.
job-info could add a hash by jobid, similar to the job manager's, which it populates by tracking job state change events. Additional state would be fetched from the KVS as needed to keep enough info in memory to directly answer most job listing queries. It would be "eventually consistent" with respect to the job manager's queue and job state, which should be tolerable for job listing purposes.
In the initial implementation, this hash would just grow without bound. All jobs (active and inactive) that are represented in the KVS would also be represented in job-info.
If job-info were restarted, it would use a technique similar to job manager to recover its state from the KVS.
When we implement a database that tracks inactive jobs (like sqlog), we can implement a purging policy that purges info both from the KVS and from job-info in concert.
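As a rough illustration of the proposal above, here is a minimal Python sketch of a jobid-keyed index updated from state change events. The class names and the shape of the event payload (a list of `(jobid, state)` pairs) are assumptions made for this example, not the actual job-info or job-manager interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    jobid: int
    state: str
    # Extra fields (userid, job name, ...) would be filled in from the KVS as needed.
    extra: dict = field(default_factory=dict)


class JobIndex:
    """Hypothetical in-memory index keyed by jobid; entries are never purged."""

    def __init__(self):
        self.jobs = {}

    def on_state_event(self, transitions):
        # transitions: iterable of (jobid, state) pairs (assumed payload shape)
        for jobid, state in transitions:
            record = self.jobs.get(jobid)
            if record is None:
                self.jobs[jobid] = JobRecord(jobid, state)
            else:
                record.state = state


index = JobIndex()
index.on_state_event([(1, "DEPEND"), (2, "DEPEND")])
index.on_state_event([(1, "SCHED")])
index.on_state_event([(1, "RUN"), (2, "INACTIVE")])
```

Because inactive jobs stay in the dictionary, this mirrors the "grow without bound" behavior of the initial implementation; a purge policy would delete entries here and in the KVS together.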
Began prototyping, and one thing I noticed is that for new job submissions, the job-info module would have to do a job eventlog lookup to determine userid, t_submit, and similar things that are inherently known to the job manager.
Problem is, job-state transition events and commits to the job's eventlog are sent out without coordination. So lookup of a job's eventlog can be racy. Obviously can use WAITCREATE on the kvs lookup or an eventlog-watch, but this seems trickier than is necessary.
Would it be that big of a deal to add extra info to the job-state transition events? It wouldn't have to be for all transition events, just the first one?
Actually the state pub happens after the eventlog's KVS commit completes. It's safe to fetch the eventlog (for submit event data) and jobspec after the first state (DEPEND).
Also safe to fetch R after RUN.
Edit: this doesn't change the "safety" statements above, but in point of fact, job-ingest commits jobspec and first eventlog entry _then_ passes it to job manager. But it's generally true that the events leading to a state transition are always available in the eventlog before the transition is published.
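The ordering guarantee described above (data is committed before the corresponding state transition is published) suggests a simple fetch-on-transition rule. The sketch below uses a plain dictionary as a stand-in for the KVS; the `(jobid, name)` key layout and the function name are hypothetical and do not reflect the real KVS paths or API.

```python
def on_state_transition(kvs, record, state):
    """Fetch data that is guaranteed to exist once `state` has been published.

    kvs: dict standing in for the KVS, keyed by (jobid, name) (assumed layout)
    record: dict holding what job-info knows about one job
    """
    jobid = record["id"]
    if state == "DEPEND":
        # The submit event and jobspec are committed before DEPEND is published.
        record["eventlog"] = kvs[(jobid, "eventlog")]
        record["jobspec"] = kvs[(jobid, "jobspec")]
    elif state == "RUN":
        # R is available once the job has reached RUN.
        record["R"] = kvs[(jobid, "R")]
    record["state"] = state


kvs = {
    (7, "eventlog"): ["submit userid=1000 priority=16"],
    (7, "jobspec"): {"tasks": 1},
}
job = {"id": 7}
on_state_transition(kvs, job, "DEPEND")
kvs[(7, "R")] = {"ranks": "0"}
on_state_transition(kvs, job, "RUN")
```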
Ahh, I see how that happens in the job-manager now. Ok, then it's not as tricky as I thought it would be.
I have a simple prototype in which the job-info module watches for transitions in the job-manager, grabs info from the job eventlog to get some extra info (userid, priority, etc.), and stores job history forever. I figure I would add the recovery of state from the KVS later.
I initially (foolishly) tried to transition flux job list to use job-info instead of job-manager, only to be hit by a variety of unit test failures. A number of tests assume that flux job list only lists active jobs, that its output is in submission order or priority order, etc.
Obviously flux job list or the job-info module could sort / filter in any number of ways. But before going down that hole, I wanted to ask for thoughts on longer-term plans for such a service.
- Should flux job list have numerous options for filtering / sorting of jobs? (TBD if sorting / filtering would occur in job-info or flux job)
- Or should flux job list stay as the active job lister, communicating with the job-manager, while some other tool / command (flux job history?) would be the one to connect to job-info? Sort of like how squeue and sacct get info from different places.

We need a new porcelain command for listing all jobs as described in #2416.
Maybe for the low level flux job list tool you could add an option for now to query all jobs (or the last N jobs by default), via the new job-info interface to avoid breaking existing tests.
What about the fact that flux job list has things sorted in a particular order? There's a part of me that feels like those tests should be removed? The sorting isn't important?
Maybe move the existing flux job list to a test program in t/job-manager? The tests that check the queue order in the job manager should not be dropped.
Ahh, that's a good idea.
It feels like it is important to preserve the priority order by default. If we can't do that then what will be listed when you query for the "top" 25 jobs? (I'm probably just not understanding something here)
I think it's simply the nature of the implementation. By also storing the inactive jobs, keeping a sorted list/queue of jobs makes less sense. So everything is simply stored in a hash. So when queried, you get the random hash order that's returned.
Thinking about it a bit, I think having the front end tool sort the job listing for what we want users to see is probably the best thing to do, b/c we can change the default filtering/ordering in the future if we want. Right now, the sorted order would be state first (i.e. active first), then priority, then job submission time.
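For illustration, the ordering described above (active jobs first, then priority, then submission time) could be expressed as a single sort key in the front end tool. The field names below are assumptions for this sketch, and higher priority values are assumed to sort first.

```python
def sort_key(job):
    # False sorts before True, so active jobs come first;
    # priority is negated so that higher priority sorts earlier.
    return (job["state"] == "INACTIVE", -job["priority"], job["t_submit"])


jobs = [
    {"id": 1, "state": "INACTIVE", "priority": 31, "t_submit": 1.0},
    {"id": 2, "state": "RUN", "priority": 16, "t_submit": 2.0},
    {"id": 3, "state": "SCHED", "priority": 16, "t_submit": 1.5},
    {"id": 4, "state": "SCHED", "priority": 20, "t_submit": 3.0},
]
ordered = sorted(jobs, key=sort_key)
```

Keeping the key in the tool rather than in the service means the default ordering can change later without touching job-info, at the cost of fetching every job that might sort near the top.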
@grondo - when I responded I was just thinking of helping @chu11 avoid a distraction, since those tests are somewhat too closely tied to the output format of that command. They are confirming expected job manager behavior so I would not like to drop them. Also we said job-info would be eventually consistent with respect to the job-manager and I'm not sure how that will affect these tests.
@chu11, I think we're going to need some way to filter down the quantity of data returned in a query for a long running system instance, where there may be millions of jobs in the hash.
> Thinking about it a bit, I think having the front end tool sort the job listing for what we want users to see is probably the best thing to do,
But then each query to job info must fetch all jobs in order to get the correct sorting if the tool wants to play nice and just display the "latest" or "top" jobs (e.g. a good default might be 25).
We should design for this interface to handle 1M jobs with users querying at least 10x/sec.
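I was idly thinking about that. It seems like 3 lists are needed:
- inactive jobs in completion time order
- running jobs in start time order
- not-yet-running jobs in submit time, then priority order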
It should be possible to answer a lot of common queries with just that, with filters by user (my job), filters on these three categories, and a cap on the number of jobs returned.
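A minimal sketch of such a query, assuming three lists (pending, running, inactive), an optional userid filter, a selection of categories, and a cap on the number of jobs returned. The class, method, and field names are hypothetical.

```python
from itertools import chain, islice


class JobLists:
    """Hypothetical container for the three job lists."""

    CATEGORIES = ("pending", "running", "inactive")

    def __init__(self):
        self.lists = {name: [] for name in self.CATEGORIES}

    def query(self, categories=CATEGORIES, userid=None, max_entries=25):
        # Walk the selected lists in order and stop as soon as the cap is reached.
        selected = chain.from_iterable(self.lists[name] for name in categories)
        matches = (
            job for job in selected
            if userid is None or job["userid"] == userid
        )
        return list(islice(matches, max_entries))


store = JobLists()
store.lists["pending"].append({"id": 3, "userid": 1000})
store.lists["running"].append({"id": 2, "userid": 1001})
store.lists["inactive"].append({"id": 1, "userid": 1000})

mine = store.query(userid=1000)
first_active = store.query(categories=("pending", "running"), max_entries=1)
```

Since the lists are already kept in their display order, a capped query only touches as many entries as it returns plus those it filters out, which matters if the hash holds a very large number of jobs.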
> Also we said job-info would be eventually consistent with respect to the job-manager and I'm not sure how that will affect these tests.
Ok, that makes sense. However, I was mainly concerned about throwing out our only job listing tool before we had a replacement that gives meaningful output. (queue order is the first requirement for a job list utility IMO)
Agreed on not throwing out that tool until the new thing can do at least that.
Incidentally one gotcha: if the job priority changes after the job is submitted, the only way to find out about that is by watching the eventlog. Maybe to keep this simple we can have the job manager also generate an event for priority changes?
> Maybe to keep this simple we can have the job manager also generate an event for priority changes?
Sounds reasonable to me. Would that be useful for a scheduler as well?
> I was idly thinking about that. It seems like 3 lists are needed
> - inactive jobs in completion time order
> - running jobs in start time order
> - not-yet-running jobs in submit time, then priority order
This seems like a good idea, w/ flags to determine which list(s) to get info from. And would keep the sorting to a minimum (I think only necessary in the last, as the first two we can just append to as it happens).
> I think only necessary in the last,
Don't forget to steal code/ideas from job-manager/queue.c. There is a zhashx there that uses jobids as keys (without turning them into strings), and a zlistx that is always kept in a sorted order by submit time + priority, with code to reinsert items when their priority changes. I was happy with how that worked out, compared to how it would have gone with the older zhash/zlist.
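The combination described above (a hash keyed by jobid plus a list kept sorted, with reinsertion on priority change) can be sketched in Python with a dictionary and `bisect`. This is only an analogy to the zhashx/zlistx approach, not a port of job-manager/queue.c, and it assumes ordering by priority first (higher first) and then submit time.

```python
import bisect


class PendingQueue:
    """Hypothetical pending-job queue kept sorted by priority, then submit time."""

    def __init__(self):
        self.by_id = {}   # jobid -> (priority, t_submit)
        self.order = []   # sorted list of (-priority, t_submit, jobid)

    def _key(self, jobid):
        priority, t_submit = self.by_id[jobid]
        return (-priority, t_submit, jobid)

    def insert(self, jobid, priority, t_submit):
        self.by_id[jobid] = (priority, t_submit)
        bisect.insort(self.order, self._key(jobid))

    def set_priority(self, jobid, priority):
        # Remove the entry at its old position, then reinsert with the new key.
        old_key = self._key(jobid)
        del self.order[bisect.bisect_left(self.order, old_key)]
        _, t_submit = self.by_id[jobid]
        self.by_id[jobid] = (priority, t_submit)
        bisect.insort(self.order, self._key(jobid))

    def jobids(self):
        return [jobid for _, _, jobid in self.order]


queue = PendingQueue()
queue.insert(1, priority=16, t_submit=1.0)
queue.insert(2, priority=16, t_submit=2.0)
queue.insert(3, priority=31, t_submit=3.0)
before = queue.jobids()
queue.set_priority(2, 31)
after = queue.jobids()
```

Including the jobid in each tuple makes keys unique, so `bisect_left` locates the exact entry to remove.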
> Would that be useful for a scheduler as well?
I think that's already covered in the scheduler's protocol when "unlimited" mode is selected, but if it isn't I think it would be best to handle it there with a request rather than with an event to avoid races like the one we have yet to fix with exception events.
> not-yet-running jobs in submit time, then priority order
Just double checking, you meant priority order first, then submit time? B/c I think that's how the job manager is right now.
> running jobs in start time order
> not-yet-running jobs in submit time, then priority order
Would the most common output (casually looking at squeue's output) be to grab these two queues and list all pending jobs in priority order, then running jobs in submit time order? B/c that doesn't seem to be how the job-manager queue works.
JOBID STATE USERID PRI T_SUBMIT
134704267264 R 1000 31 2019-10-17T13:52:04Z
265096790016 S 1000 31 2019-10-17T13:52:11Z
221291479040 R 1000 15 2019-10-17T13:52:09Z
402502189056 S 1000 15 2019-10-17T13:52:20Z
166731972608 R 1000 10 2019-10-17T13:52:06Z
170993385472 R 1000 10 2019-10-17T13:52:06Z
297678143488 S 1000 10 2019-10-17T13:52:13Z
368058564608 S 1000 10 2019-10-17T13:52:18Z
327642251264 S 1000 0 2019-10-17T13:52:15Z
I suspect it probably doesn't matter for the job manager internally, but just wanted to double check. Dunno if the job-manager should be doing two queues internally as well.
> Just double checking, you meant priority order first, then submit time?
Right, sorry.
> Would the most common output (casually looking at squeue's output) be to grab these two queues and list all pending jobs in priority order, then running jobs in submit time order?
Should it be the jobs from all three lists (not-yet-running, running, inactive) but truncated at some max number of jobs? In any case, the service should probably accept a mask of states and filter accordingly. We can fine tune the options and defaults for the porcelain command later.
> I suspect it probably doesn't matter for the job manager internally, but just wanted to double check. Dunno if the job-manager should be doing two queues internally as well.
I don't think it needs to, especially now that we're outsourcing the queue listing function.
> Should it be the jobs from all three lists (not-yet-running, running, inactive) but truncated at some max number of jobs?
Yeah, that sounds good to me. For even more flexibility, you may want to offer an optional per-list/per-state maximum, with a flag to ignore the outer maximum. This would let a user fetch all running jobs, plus pending jobs up to some maximum count (or none if outer max already exceeded). This is the default behavior of squeue (though again, a big dump of jobs will not be the best interface for users on large systems with possibly thousands of running jobs anyway)
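One possible reading of the per-list maximum idea, sketched in Python. The function and parameter names are hypothetical; `ignore_outer` names the lists that are exempt from the outer maximum.

```python
def collect(lists, outer_max, per_list_max=None, ignore_outer=()):
    """Gather jobs from each list in order, honoring both kinds of limits.

    lists: mapping of list name -> list of jobs, in the order to be visited
    outer_max: overall cap on the number of jobs returned
    per_list_max: optional mapping of list name -> cap for that list
    ignore_outer: names of lists that are not subject to outer_max
    """
    per_list_max = per_list_max or {}
    result = []
    for name, jobs in lists.items():
        limit = per_list_max.get(name, len(jobs))
        if name not in ignore_outer:
            # Only the room left under the outer maximum is available.
            limit = min(limit, max(outer_max - len(result), 0))
        result.extend(jobs[:limit])
    return result


lists = {
    "running": ["r1", "r2", "r3", "r4", "r5"],
    "pending": ["p1", "p2", "p3", "p4", "p5"],
}
some_pending = collect(lists, outer_max=7, ignore_outer={"running"})
no_pending = collect(lists, outer_max=3, ignore_outer={"running"})
capped_pending = collect(lists, outer_max=10, per_list_max={"pending": 1})
```

With `outer_max=3` and running jobs exempt, all five running jobs are returned and no pending jobs, matching the "or none if outer max already exceeded" case.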
Perhaps the "plumbing" version of this tool could just be a thin wrapper around the rpc and emit a json stream or array of responses, which could be processed by jq for testing?
One other thought I had is that this service might also have a stats interface, where a set of summary statistics could be returned. I imagine that would be pretty easy to add, and coupled with a resource usage summary could be useful for a higher level tool than a job queue listing util.
Oh yeah, that reminds me, we were tossing around ideas at one point, and it was suggested that only "my jobs" might be the default output (like ps), and that we might combine that with system wide summary data?
> Perhaps the "plumbing" version of this tool could just be a thin wrapper around the rpc and emit a json stream or array of responses, which could be processed by jq for testing?
That's a good idea. Leave the pretty output to python.
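As a sketch of what a thin "plumbing" wrapper might emit, the snippet below writes one JSON object per line. The job fields and the function name are assumptions for illustration; the real RPC response format is not shown in this thread.

```python
import io
import json
import sys


def emit_jobs(jobs, out=sys.stdout):
    """Write each job as one JSON object per line (a JSON stream)."""
    for job in jobs:
        out.write(json.dumps(job, sort_keys=True) + "\n")


# Demonstration: write to an in-memory buffer instead of stdout.
buffer = io.StringIO()
emit_jobs(
    [
        {"id": 134704267264, "state": "R", "userid": 1000, "priority": 31},
        {"id": 265096790016, "state": "S", "userid": 1000, "priority": 31},
    ],
    out=buffer,
)
lines = buffer.getvalue().splitlines()
```

A test could then pipe such output through something like `jq -c 'select(.state == "R")'` to check for specific jobs without depending on a human-readable format.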
> only "my jobs" might be the default output (like ps), and that we might combine that with system wide summary data?
Oh, thanks for remembering that. Does that mean we also need per-user lists in job-info? :laughing:
> Should it be the jobs from all three lists (not-yet-running, running, inactive) but truncated at some max number of jobs? In any case, the service should probably accept a mask of states and filter accordingly. We can fine tune the options and defaults for the porcelain command later.
Sounds good. I forgot that this is the plumbing tool, not the porcelain one.
> inactive jobs in completion time order
> running jobs in start time order
> not-yet-running jobs in submit time, then priority order
Mapping job states to these three lists, I assume DEPEND & SCHED go to the last list, RUN & CLEANUP go to the middle one, and INACTIVE to the first.
But I began thinking, squeue lists completing and running jobs separately. Should we consider splitting RUN & CLEANUP into separate lists?
> Mapping job states to these three lists, I assume DEPEND & SCHED go to the last list, RUN & CLEANUP go to the middle one, and INACTIVE to the first.
Right.
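The state-to-list mapping confirmed above could be captured in a small table, with jobs moved between lists only when a transition crosses a list boundary. The list names and the helper below are hypothetical; only the five states mentioned in this thread are covered.

```python
STATE_TO_LIST = {
    "DEPEND": "pending",
    "SCHED": "pending",
    "RUN": "running",
    "CLEANUP": "running",
    "INACTIVE": "inactive",
}


def apply_transition(lists, jobid, old_state, new_state):
    """Move jobid between lists if the transition changes which list it is in.

    old_state is None for a job that has not been seen before.
    """
    src = STATE_TO_LIST.get(old_state)
    dst = STATE_TO_LIST[new_state]
    if src == dst:
        return
    if src is not None:
        lists[src].remove(jobid)
    # Appending keeps running jobs in start order and inactive jobs in completion order.
    lists[dst].append(jobid)


lists = {"pending": [], "running": [], "inactive": []}
apply_transition(lists, 42, None, "DEPEND")
apply_transition(lists, 42, "DEPEND", "SCHED")
apply_transition(lists, 42, "SCHED", "RUN")
apply_transition(lists, 42, "RUN", "CLEANUP")
```

In this sketch a new pending job is simply appended; keeping the pending list in priority order would need a sorted insert instead.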
> But I began thinking, squeue lists completing and running jobs separately. Should we consider splitting RUN & CLEANUP into separate lists?
Hmm, doesn't seem too compelling to me. Maybe later? (What does @grondo say?)
closing, as majority of work done in #2471. Splitting off remaining work into #2499