Flux-core: job-info: support job listing queries

Created on 11 Oct 2019 · 29 Comments · Source: flux-framework/flux-core

The front end tool for listing jobs (#2416) needs support for querying more info than job-manager can currently provide, such as job name and a resource summary only found in jobspec, inactive jobs, etc. Rather than augmenting the job info currently tracked by the job-manager, e.g. #2411 (track inactive jobs in job-manager), one idea discussed today was having job-info handle these queries.

job-info could add a hash by jobid similar to the job manager's, which it populates by tracking job state change events. Additional state would be fetched from the KVS as needed to keep enough info in memory to directly answer most job listing queries. It would be "eventually consistent" with respect to the job manager's queue and job state, which should be tolerable for job listing purposes.

In the initial implementation, this hash would just grow without bound. All jobs (active and inactive) that are represented in the KVS would also be represented in job-info.
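
A minimal sketch of the hash described above, written in Python for brevity (flux-core itself is C). The `JobRecord` and `JobInfoCache` names, their fields, and the event arguments are hypothetical and only illustrate a jobid-keyed table updated from state change events. Nothing is ever removed, which matches the unbounded initial implementation.

```python
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    """Hypothetical in-memory record for one job."""
    jobid: int
    state: str
    extra: dict = field(default_factory=dict)  # data fetched later from the KVS


class JobInfoCache:
    """Sketch of a jobid-keyed table that grows without bound."""

    def __init__(self):
        self.jobs = {}  # jobid -> JobRecord

    def on_state_event(self, jobid, state):
        # Create the record on first sight, otherwise just update the state.
        rec = self.jobs.get(jobid)
        if rec is None:
            rec = self.jobs[jobid] = JobRecord(jobid, state)
        else:
            rec.state = state
        return rec


cache = JobInfoCache()
cache.on_state_event(134704267264, "DEPEND")
cache.on_state_event(134704267264, "RUN")
cache.on_state_event(134704267264, "INACTIVE")  # inactive jobs stay in the table
```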

If job-info were restarted, it would use a technique similar to job manager to recover its state from the KVS.

When we implement a database that tracks inactive jobs (like sqlog), we can implement a purging policy that purges info both from the KVS and from job-info in concert.

garlick

All 29 comments

> job-info could add a hash by jobid similar to the job manager's, which it populates by tracking job state change events. Additional state would be fetched from the KVS as needed to keep enough info in memory to directly answer most job listing queries.

I began prototyping, and one thing I noticed is that for new job submissions, the job-info module would have to do a job eventlog lookup to determine userid, t_submit, and similar things that are inherently known to the job manager.

The problem is that job-state transition events and commits to the job's eventlog are sent out without coordination, so a lookup of a job's eventlog can be racy. Obviously we could use WAITCREATE on the KVS lookup or an eventlog-watch, but this seems trickier than necessary.

Would it be that big of a deal to add extra info to the job-state transition events? It wouldn't have to be for all transition events, just the first one.

chu11 on 16 Oct 2019

Actually the state pub happens after the eventlog's KVS commit completes. It's safe to fetch eventlog (for submit event data) and jobspec after first state (DEPEND).

Also safe to fetch R after RUN.

Edit: this doesn't change the "safety" statements above, but in point of fact, job-ingest commits jobspec and first eventlog entry _then_ passes it to job manager. But it's generally true that the events leading to a state transition are always available in the eventlog before the transition is published.

garlick on 16 Oct 2019
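
The ordering guarantee in the comment above can be sketched as a small transition handler. This is illustrative Python, not flux-core code: `SAFE_AFTER`, `on_transition`, and the `kvs_lookup` callback are hypothetical names, and the fake KVS contents are placeholders.

```python
# Which KVS data is safe to read once a given state has been published.
SAFE_AFTER = {
    "DEPEND": ("eventlog", "jobspec"),
    "RUN": ("R",),
}


def on_transition(job, state, kvs_lookup):
    """Record the new state and fetch whatever just became safe to read."""
    job["state"] = state
    for key in SAFE_AFTER.get(state, ()):
        job[key] = kvs_lookup(job["id"], key)
    return job


# Stand-in for the KVS; the values are made-up placeholders.
fake_kvs = {
    (1, "eventlog"): "submit event data",
    (1, "jobspec"): "jobspec data",
    (1, "R"): "resource set",
}


def lookup(jobid, key):
    return fake_kvs[(jobid, key)]


job = {"id": 1}
on_transition(job, "DEPEND", lookup)
on_transition(job, "SCHED", lookup)  # nothing new to fetch
on_transition(job, "RUN", lookup)
```

Because the fetch happens only after the state is published, no WAITCREATE or eventlog-watch is needed in this sketch.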

> Actually the state pub happens after eventlogs kvs commit completes. It's safe to fetch eventlog (for submit event data) and jobspec after first state (DEPEND).

Ahh, I see how that happens in the job-manager now. Ok, then it's not as tricky as I thought it would be.

chu11 on 16 Oct 2019

I have a simple prototype in which the job-info module watches for transitions in the job-manager, grabs info from the job eventlog to get some extra info (userid, priority, etc.), and stores job history forever. I figure I would add the recovery of state from the KVS later.

I initially (foolishly) tried to transition flux job list to use job-info instead of job-manager, only to be hit by a variety of unit test failures. A number of tests assume flux job list only lists active jobs, there is a submission order to flux job list, priority order, etc.

Obviously flux job list or the job-info module could sort / filter in any number of ways. But before going down that hole, I wanted to ask for thoughts on the longer-term plans for such a service.

Should flux job list have numerous options for filtering / sorting of jobs? (TBD whether sorting / filtering would occur in job-info or flux job.)
OR
Should flux job list stay as the active job lister, communicating with the job-manager, while some other tool / command (flux job history?) connects to job-info? Sort of like how squeue and sacct get their info from different places.

chu11 on 17 Oct 2019

We need a new porcelain command for listing all jobs as described in #2416.

Maybe for the low level flux job list tool you could add an option for now to query all jobs (or the last N jobs by default), via the new job-info interface to avoid breaking existing tests.

grondo on 17 Oct 2019

> Maybe for the low level flux job list tool you could add an option for now to query all jobs (or the last N jobs by default), via the new job-info interface to avoid breaking existing tests.

What about the fact that flux job list has things sorted in a particular order? There's a part of me that feels like those tests should be removed? The sorting isn't important?

chu11 on 17 Oct 2019

Maybe move the existing flux job list to a test program in t/job-manager? Those tests check the queue order in the job manager and should not be dropped.

garlick on 17 Oct 2019

> Maybe move the existing flux job list to a test program in t/job-manager? Those tests check the queue order in the job manager and should not be dropped.

Ahh, that's a good idea.

chu11 on 17 Oct 2019

It feels like it is important to preserve the priority order by default. If we can't do that then what will be listed when you query for the "top" 25 jobs? (I'm probably just not understanding something here)

grondo on 17 Oct 2019

> It feels like it is important to preserve the priority order by default. If we can't do that then what will be listed when you query for the "top" 25 jobs? (I'm probably just not understanding something here)

I think it's simply the nature of the implementation. With inactive jobs stored as well, keeping a sorted list/queue of jobs makes less sense, so everything is simply stored in a hash. When queried, you get the jobs back in arbitrary hash order.

Thinking about it a bit, I think having the front end tool sort the job listing for what we want users to see is probably the best thing to do, b/c we can change the default filtering/ordering in the future if we want. Right now, the sorted order would be state first (i.e. active first), then priority, then job submission time.

chu11 on 17 Oct 2019
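
The front-end ordering proposed in the comment above (active first, then priority, then submission time) can be expressed as a single sort key. This is a hedged Python sketch: the `display_key` name and the job field names are hypothetical, and "higher priority first" is an assumption based on the job listing shown later in the thread.

```python
ACTIVE_STATES = {"DEPEND", "SCHED", "RUN", "CLEANUP"}


def display_key(job):
    # False sorts before True, so active jobs come first.
    # Priority is negated so that higher priority sorts first (assumption).
    return (job["state"] not in ACTIVE_STATES, -job["priority"], job["t_submit"])


jobs = [
    {"id": 1, "state": "INACTIVE", "priority": 31, "t_submit": 1.0},
    {"id": 2, "state": "SCHED", "priority": 16, "t_submit": 3.0},
    {"id": 3, "state": "RUN", "priority": 16, "t_submit": 2.0},
    {"id": 4, "state": "SCHED", "priority": 31, "t_submit": 4.0},
]
ordered = [job["id"] for job in sorted(jobs, key=display_key)]
```

Note that sorting in the front end requires the tool to receive every job before it can pick the "top" N.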

@grondo - when I responded I was just thinking of helping @chu11 avoid a distraction, since those tests are somewhat too closely tied to the output format of that command. They are confirming expected job manager behavior so I would not like to drop them. Also we said job-info would be eventually consistent with respect to the job-manager and I'm not sure how that will affect these tests.

@chu11, I think we're going to need some way to filter down the quantity of data returned in a query for a long running system instance, where there may be millions of jobs in the hash.

garlick on 17 Oct 2019

> Thinking about it a bit, I think having the front end tool sort the job listing for what we want users to see is probably the best thing to do,

But then each query to job info must fetch all jobs in order to get the correct sorting if the tool wants to play nice and just display the "latest" or "top" jobs (e.g. a good default might be 25).

We should design for this interface to handle 1M jobs with users querying at least 10x/sec.

grondo on 17 Oct 2019

👍1

I was idly thinking about that. It seems like 3 lists are needed

inactive jobs in completion time order
running jobs in start time order
not-yet-running jobs in submit time, then priority order

It should be possible to answer a lot of common queries with just that, with filters by user (my job), filters on these three categories, and a cap on the number of jobs returned.

garlick on 17 Oct 2019

👍2
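
A rough sketch of how a query over the three pre-sorted lists could work, with a user filter, a category filter, and a cap. This is illustrative Python only: `list_jobs`, `LIST_ORDER`, the category names, and the job fields are hypothetical, and the order in which the lists are walked is an assumption.

```python
from itertools import islice

LIST_ORDER = ("pending", "running", "inactive")  # assumed output order


def list_jobs(lists, userid=None, categories=LIST_ORDER, max_entries=25):
    """Walk the pre-sorted lists, filter, and stop at max_entries."""
    selected = (
        job
        for name in LIST_ORDER
        if name in categories
        for job in lists[name]
        if userid is None or job["userid"] == userid
    )
    # islice stops early, so a capped query never scans every job.
    return list(islice(selected, max_entries))


lists = {
    "pending": [{"id": 4, "userid": 1000}, {"id": 5, "userid": 1001}],
    "running": [{"id": 2, "userid": 1000}, {"id": 3, "userid": 1001}],
    "inactive": [{"id": 1, "userid": 1000}],
}
mine = list_jobs(lists, userid=1000, max_entries=2)
active = list_jobs(lists, categories=("running", "pending"))
```

Since each list is already kept in order, the service does not need to sort at query time.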

> Also we said job-info would be eventually consistent with respect to the job-manager and I'm not sure how that will affect these tests.

Ok, that makes sense. However, I was mainly concerned about throwing out our only job listing tool before we had a replacement that gives meaningful output. (queue order is the first requirement for a job list utility IMO)

grondo on 17 Oct 2019

Agreed on not throwing out that tool until the new thing can do at least that.

Incidentally one gotcha: if the job priority changes after the job is submitted, the only way to find out about that is by watching the eventlog. Maybe to keep this simple we can have the job manager also generate an event for priority changes?

garlick on 17 Oct 2019

> Maybe to keep this simple we can have the job manager also generate an event for priority changes?

Sounds reasonable to me. Would that be useful for a scheduler as well?

grondo on 17 Oct 2019

> I was idly thinking about that. It seems like 3 lists are needed
> inactive jobs in completion time order
> running jobs in start time order
> not-yet-running jobs in submit time, then priority order

This seems like a good idea, w/ flags to determine which list(s) to get info from. And would keep the sorting to a minimum (I think only necessary in the last, as the first two we can just append to as it happens).

chu11 on 17 Oct 2019

> I think only necessary in the last,

Don't forget to steal code/ideas from job-manager/queue.c. There is a zhashx there that uses jobids as keys (without turning them into strings), and a zlistx that is always kept in a sorted order by submit time + priority, with code to reinsert items when their priority changes. I was happy with how that worked out, compared to how it would have gone with the older zhash/zlist.

garlick on 17 Oct 2019
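
The hash-plus-sorted-list idea can be sketched in Python as below. This does not use zhashx/zlistx and is not the code in job-manager/queue.c; it swaps in a dict and a `bisect`-maintained list to show the same technique. The `PendingQueue` name and its methods are hypothetical, and the ordering (higher priority first, then earlier submit time) follows the clarification later in the thread.

```python
import bisect


class PendingQueue:
    """Jobs indexed by id and kept sorted by priority, then submit time."""

    def __init__(self):
        self.by_id = {}  # jobid -> sort key
        self.order = []  # sorted list of (-priority, t_submit, jobid)

    def insert(self, jobid, priority, t_submit):
        key = (-priority, t_submit, jobid)
        self.by_id[jobid] = key
        bisect.insort(self.order, key)

    def set_priority(self, jobid, priority):
        # Remove the old entry, then reinsert at the new sorted position.
        old = self.by_id[jobid]
        del self.order[bisect.bisect_left(self.order, old)]
        self.insert(jobid, priority, old[1])

    def jobids(self):
        return [key[2] for key in self.order]


queue = PendingQueue()
queue.insert(1, priority=16, t_submit=10.0)
queue.insert(2, priority=16, t_submit=11.0)
queue.insert(3, priority=31, t_submit=12.0)
before = queue.jobids()
queue.set_priority(2, 31)
after = queue.jobids()
```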

> Would that be useful for a scheduler as well?

I think that's already covered in the scheduler's protocol when "unlimited" mode is selected, but if it isn't I think it would be best to handle it there with a request rather than with an event to avoid races like the one we have yet to fix with exception events.

garlick on 17 Oct 2019

👍1

> not-yet-running jobs in submit time, then priority order

Just double checking, you meant priority order first, then submit time? B/c I think that's how the job manager is right now.

> running jobs in start time order
> not-yet-running jobs in submit time, then priority order

Would the most common output (casually looking at squeue's output) be to grab these two queues and list all pending jobs in priority order, then running jobs in submit time order? B/c that doesn't seem to be how the job-manager queue works.

JOBID         STATE  USERID  PRI  T_SUBMIT
134704267264  R      1000    31   2019-10-17T13:52:04Z
265096790016  S      1000    31   2019-10-17T13:52:11Z
221291479040  R      1000    15   2019-10-17T13:52:09Z
402502189056  S      1000    15   2019-10-17T13:52:20Z
166731972608  R      1000    10   2019-10-17T13:52:06Z
170993385472  R      1000    10   2019-10-17T13:52:06Z
297678143488  S      1000    10   2019-10-17T13:52:13Z
368058564608  S      1000    10   2019-10-17T13:52:18Z
327642251264  S      1000    0    2019-10-17T13:52:15Z

I suspect it probably doesn't matter for the job manager internally, but just wanted to double check. Dunno if the job-manager should be doing two queues internally as well.

chu11 on 17 Oct 2019

> Just double checking, you meant priority order first, then submit time?

Right, sorry.

> Would the most common output (casually looking at squeue's output) be to grab these two queues and list all pending jobs in priority order, then running jobs in submit time order?

Should it be the jobs from all three lists (not-yet-running, running, inactive) but truncated at some max number of jobs? In any case, the service should probably accept a mask of states and filter accordingly. We can fine tune the options and defaults for the porcelain command later.

> I suspect it probably doesn't matter for the job manager internally, but just wanted to double check. Dunno if the job-manager should be doing two queues internally as well.

I don't think it needs to, especially now that we're outsourcing the queue listing function.

garlick on 17 Oct 2019

> Should it be the jobs from all three lists (not-yet-running, running, inactive) but truncated at some max number of jobs?

Yeah, that sounds good to me. For even more flexibility, you may want to offer an optional per-list/per-state maximum, with a flag to ignore the outer maximum. This would let a user fetch all running jobs, plus pending jobs up to some maximum count (or none if the outer max is already exceeded). This is the default behavior of squeue (though again, a big dump of jobs will not be the best interface for users on large systems with possibly thousands of running jobs anyway).

Perhaps the "plumbing" version of this tool could just be a thin wrapper around the rpc and emit a json stream or array of responses, which could be processed by jq for testing?

grondo on 17 Oct 2019
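
As a sketch of the "thin wrapper" idea, the plumbing tool could print one JSON object per job. The Python below is illustrative only: `to_json_stream` and the field names are hypothetical, and the job values are borrowed from the listing earlier in the thread.

```python
import json


def to_json_stream(jobs):
    """One compact JSON object per line, suitable for piping into jq."""
    return "".join(json.dumps(job, sort_keys=True) + "\n" for job in jobs)


jobs = [
    {"id": 134704267264, "state": "R", "userid": 1000, "priority": 31},
    {"id": 265096790016, "state": "S", "userid": 1000, "priority": 31},
]
stream = to_json_stream(jobs)
```

A test could then filter the stream with something like `jq -c 'select(.state == "R")'`, assuming the response uses a `state` field.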

One other thought I had is that this service might also have a stats interface, where a set of summary statistics could be returned. I imagine that would be pretty easy to add, and coupled with a resource usage summary could be useful for a higher level tool than a job queue listing util.

grondo on 17 Oct 2019
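
A stats interface of this kind might amount to little more than counting jobs per state. The sketch below is hypothetical Python (`job_stats` and the response layout are assumptions), shown only to suggest how small such an addition could be.

```python
from collections import Counter


def job_stats(jobs):
    """Summarize a collection of jobs as counts per state."""
    by_state = Counter(job["state"] for job in jobs)
    return {"total": sum(by_state.values()), "by_state": dict(by_state)}


stats = job_stats([
    {"id": 1, "state": "RUN"},
    {"id": 2, "state": "RUN"},
    {"id": 3, "state": "SCHED"},
    {"id": 4, "state": "INACTIVE"},
])
```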

Oh yeah, that reminds me, we were tossing around ideas at one point, and it was suggested that only "my jobs" might be the default output (like ps), and that we might combine that with system wide summary data?

> Perhaps the "plumbing" version of this tool could just be a thin wrapper around the rpc and emit a json stream or array of responses, which could be processed by jq for testing?

That's a good idea. Leave the pretty output to python.

garlick on 17 Oct 2019

> only "my jobs" might be the default output (like ps), and that we might combine that with system wide summary data?

Oh, thanks for remembering that. Does that mean we also need per-user lists in job-info? :laughing:

grondo on 17 Oct 2019

> Should it be the jobs from all three lists (not-yet-running, running, inactive) but truncated at some max number of jobs? In any case, the service should probably accept a mask of states and filter accordingly. We can fine tune the options and defaults for the porcelain command later.

Sounds good. I forgot that this is the plumbing tool, not the porcelain one.

chu11 on 17 Oct 2019

> inactive jobs in completion time order
> running jobs in start time order
> not-yet-running jobs in submit time, then priority order

Mapping job states to these three lists, I assume DEPEND & SCHED go to the last list, RUN & CLEANUP go to the middle one, and INACTIVE to the first.

But I began thinking, squeue lists completing and running jobs separately. Should we consider splitting RUN & CLEANUP into separate lists?

chu11 on 18 Oct 2019

> Mapping job states to these three lists, I assume DEPEND & SCHED go to the last list, RUN & CLEANUP go to the middle one, and INACTIVE to the first.

Right.

> But I began thinking, squeue lists completing and running jobs separately. Should we consider splitting RUN & CLEANUP into separate lists?

Hmm, doesn't seem too compelling to me. Maybe later? (What does @grondo say?)

garlick on 18 Oct 2019
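
The state-to-list mapping agreed on above can be written down directly. This is a hypothetical Python sketch: `STATE_TO_LIST`, `move_job`, and the list names are illustrative, and appending to the destination list is a simplification (the pending list would need a sorted insert).

```python
STATE_TO_LIST = {
    "DEPEND": "pending",
    "SCHED": "pending",
    "RUN": "running",
    "CLEANUP": "running",
    "INACTIVE": "inactive",
}


def move_job(lists, jobid, old_state, new_state):
    """Move a job between lists only when the transition crosses lists."""
    src = STATE_TO_LIST.get(old_state)  # None for a brand new job
    dst = STATE_TO_LIST[new_state]
    if src == dst:
        return
    if src is not None:
        lists[src].remove(jobid)
    lists[dst].append(jobid)


lists = {"pending": [], "running": [], "inactive": []}
move_job(lists, 7, None, "DEPEND")
move_job(lists, 7, "DEPEND", "SCHED")  # same list, nothing to do
move_job(lists, 7, "SCHED", "RUN")
move_job(lists, 7, "RUN", "CLEANUP")  # same list, nothing to do
move_job(lists, 7, "CLEANUP", "INACTIVE")
```

Splitting RUN and CLEANUP later would only require changing the mapping and adding a fourth list.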

closing, as majority of work done in #2471. Splitting off remaining work into #2499

chu11 on 31 Oct 2019
