Flux-core: system instance: need a way to see hostnames of execution targets

Created on 3 Sep 2020 · 12 Comments · Source: flux-framework/flux-core

(This issue arises from testing PR #3168 in a system instance config)

Problem: while it's possible to see which broker ranks are online using flux comms up, there is no easy way to see the hostnames of the down targets.

Provide some sort of tool interface that maps execution target id to hostname so sys admins can diagnose problems.
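To make the request concrete, here is a minimal sketch of the kind of lookup being asked for. It is not an existing Flux tool: the `RANK_TO_HOST` table, the `down_targets` helper, and the set of online ranks are all hypothetical stand-ins for data a real system instance would supply.

```python
# Hypothetical rank -> hostname table. In a real system instance this would
# come from the instance's configuration or resource data, not a literal.
RANK_TO_HOST = {0: "node0", 1: "node1", 2: "node2", 3: "node3"}


def down_targets(rank_to_host, online_ranks):
    """Return (rank, hostname) pairs for targets that are not online."""
    return [
        (rank, host)
        for rank, host in sorted(rank_to_host.items())
        if rank not in online_ranks
    ]


# Assumed input: the set of ranks currently reported as online.
for rank, host in down_targets(RANK_TO_HOST, online_ranks={0, 2}):
    print(f"rank {rank}: {host} is down")
```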

All 12 comments

flux comms is gone, so we need a replacement interface. Maybe flux resource status? Perhaps do ranks by default and add a --hostnames option?

flux resource drain|undrain probably are good candidates for accepting hostname(s) as alternative to broker rank.

Maybe flux resource status? Perhaps do ranks by default and add a --hostnames option?

We already have flux resource list. Though I realize this queries the scheduler and not the resource module, from a UX viewpoint it might be nice if we could unify the two commands. What do you want to get out of flux resource status that would not be available from flux resource list? (Keep in mind that the current flux resource list interface is just a placeholder until we figure out what we really want to do there)

Well, two things are missing. One, I think we should be able to list the offline/online status of an execution target (broker rank). Perhaps it would also be useful to see which targets are drained or excluded. Two, instead of just integer ranks, I think the sys admins will want hostnames in the report.

I'm not opposed if we want to turn flux resource list into a more flexible command. I was kind of steering clear because it queries the scheduler (as you mentioned).

Well, two things are missing. One, I think we should be able to list the offline/online status of an execution target (broker rank). Perhaps it would also be useful to see which targets are drained or excluded. Two, instead of just integer ranks, I think the sys admins will want hostnames in the report.

Those two things are missing because we didn't have a way to support them before.
flux resource list already supports flux jobs style output format, could we just add format fields for hosts and ranks?
In fact, ranks are already emitted with flux resource list -v...

That being said, if you really need a different command, it could share most of the code with flux resource list and just be the same as running flux resource list --states=up,down --format={state}{hostname}.
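As a rough illustration of how a format string with `{state}` and `{hostname}` fields could be rendered, here is a sketch that uses Python's `str.format` as a stand-in for the flux jobs style formatter. The `render` helper and the row data are hypothetical, and whether flux resource list accepts a `{hostname}` field is exactly what is being proposed here, not something this sketch confirms.

```python
def render(fmt, rows):
    """Apply a format string with named fields to each row of a report."""
    return [fmt.format(**row) for row in rows]


# Hypothetical report rows; a real tool would build these from the
# scheduler's and/or resource module's responses.
rows = [
    {"state": "up", "hostname": "node0"},
    {"state": "down", "hostname": "node1"},
]

# Right-align the state in a 6-character column, then print the hostname.
for line in render("{state:>6} {hostname}", rows):
    print(line)
```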

One thought, since execution targets are not really themselves resources, does monitoring and status of them really belong under flux-resource, or do we need a new tool specific to status and control of execution targets/ranks?

Another thought, won't users and sysadmins want to see hostnames associated with allocated and free resources as well? If so, we might have to add support for hostnames to flux resource list (or whatever replacement) anyway.

Oh I do apologize, I did not notice that flux resource list had those capabilities!

Assuming that resource obtains the rank->hostname mapping some way, what's the best way to make --format {hostname} work in the command? Should that mapping be acquired by the scheduler and then supplied in the resource-status response, or would we have the command separately obtain the mapping from resource and leave the scheduler out of it?

Another thought, won't users and sysadmins want to see hostnames associated with allocated and free resources as well?

Yeah, I think you're right.

It is slightly inconvenient that resource state is spread across multiple services. The scheduler knows up/down and allocated/free, but it doesn't know about offline (down?) vs drained vs excluded. It probably isn't too big of a deal to have the flux resource tool query both the scheduler and the resource module and combine the information, though it would be nice to have everything in one "view", especially if we are designing for what sysadmins are "used to": sinfo.
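Combining the two sources in the tool might look roughly like the sketch below. Everything in it is an assumption for illustration: the `combined_view` helper, the shape of the inputs (plain sets of ranks), and the precedence order among excluded, drained, and offline are not taken from any existing Flux interface.

```python
def combined_view(all_ranks, sched_down, drained, excluded, offline):
    """Label each rank with a single state, most specific reason first.

    sched_down is the scheduler's view (it only knows up/down); the other
    three sets are the extra detail assumed to come from the resource module.
    """
    view = {}
    for rank in sorted(all_ranks):
        if rank in excluded:
            view[rank] = "excluded"
        elif rank in drained:
            view[rank] = "drained"
        elif rank in offline:
            view[rank] = "offline"
        elif rank in sched_down:
            # The scheduler reports down, but no more specific reason is known.
            view[rank] = "down"
        else:
            view[rank] = "up"
    return view


print(combined_view({0, 1, 2, 3}, sched_down={1, 2, 3},
                    drained={1}, excluded={3}, offline={2}))
```

The point of the precedence chain is only that a single "down" from the scheduler can hide three different causes, which is why one merged view would be more useful to an admin than two separate queries.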

Assuming that resource obtains the rank->hostname mapping some way, what's the best way to make --format {hostname} work in the command?

The command and associated sched.resource-status currently work off the principle of returning an R fragment for each known state of jobs. The tool can then report on the set of resources that are known to be in each state. I can see how that will be inconvenient when combining information from multiple sources :frowning_face: It might have worked to add hostname key to R_lite, but it sounds like the design went a different direction. In that case maybe a separate command for now is the right approach.

BTW, an ad-hoc Python implementation of libidset had to be added to flux-resource to build the compressed resource "list". For hostnames we might want a python wrapper for the libidset equivalent of hostlist.
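For reference, the kind of compression being described can be sketched in pure Python as below. This is not the code in flux-resource, nor libidset or a hostlist library: `compress_ids` and `compress_hosts` are hypothetical helpers, and the hostname version is deliberately simplified (it drops zero padding and only handles a single trailing number).

```python
import re
from itertools import groupby


def compress_ids(ids):
    """Compress integers into an idset-style string, e.g. [0,1,2,3,5] -> "0-3,5"."""
    ids = sorted(set(ids))
    parts = []
    # Consecutive ids share the same (value - index) key, so each group is a run.
    for _, group in groupby(enumerate(ids), key=lambda pair: pair[1] - pair[0]):
        run = [value for _, value in group]
        parts.append(str(run[0]) if len(run) == 1 else f"{run[0]}-{run[-1]}")
    return ",".join(parts)


def compress_hosts(hosts):
    """Compress hostnames with a numeric suffix into bracketed ranges.

    Simplification: zero padding is not preserved ("node01" becomes "node1").
    """
    by_prefix, plain = {}, []
    for host in hosts:
        match = re.fullmatch(r"(.*?)(\d+)", host)
        if match:
            by_prefix.setdefault(match.group(1), []).append(int(match.group(2)))
        else:
            plain.append(host)  # no numeric suffix, keep as-is
    out = [
        f"{prefix}[{compress_ids(nums)}]" if len(set(nums)) > 1 else f"{prefix}{nums[0]}"
        for prefix, nums in by_prefix.items()
    ]
    return ",".join(out + plain)


print(compress_ids([0, 1, 2, 3, 5]))
print(compress_hosts(["node0", "node1", "node2", "login"]))
```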

Instead why don't we

  • amend the resource.acquire protocol to distinguish between excluded, drained, and offline. This is needed anyway to handle the corner case @dongahn identified of exclusion changes showing up as "down", and breaking satisfiability checks
  • put hostname in R_lite since that also allows us to get the rank->host mapping "for free" in a flux instance spawned by flux, and fluxion already does it anyway

The real need I'm positing here is that sys admins probably will want a way to map from broker ranks, which appear in logs and tools, to hostnames. @stevwonder pointed out on the call that maybe it would be a good stopgap to just provide a lookup tool. If that takes the pressure off for 0.20.0 while we think this through more carefully, I'm fine with that approach.

Well, we've come a ways on this issue: adding hostnames to Rv1 and adding libhostlist to flux-core. I'd suggest we close this issue and open more detailed issues for the remaining work.
