Flux-core: Need another resource state beyond "up"/"down"?

Created on 12 Jun 2020 · 22 Comments · Source: flux-framework/flux-core

@garlick and @grondo:

I have begun to look at the resource.acquire interface and came across this comment.

It seems we need another state, such as "excluded", beyond "up" and "down" to help schedulers perform satisfiability checks more accurately. A scheduler may want to check overall scheduling feasibility against all resources whose state is up or down. But if a resource becomes excluded as part of a resource configuration reload, shouldn't the scheduler leave it out of the satisfiability check?

Edit: I revised the original post and fixed multiple typos.

Yeah, I think we've made it so that the exclusion list is not _really_ dynamically reconfigurable. It would not be that hard to address, although I wonder if maybe we should leave this bug open for a bit and then circle back once we have big ticket system instance items filled in? After you've had a chance to implement the protocol in fluxion, maybe you will have other changes you feel we should make, and then we could do an RFC to solidify a "version 1" of the protocol?

Also, what should the scheduler semantics be for handling a newly excluded resource? Should schedulers treat this case the same as a "drained" event, so that a running job can continue? Or should they kill the job?

> Yeah, I think we've made it so that the exclusion list is not really dynamically reconfigurable. It would not be that hard to address, although I wonder if maybe we should leave this bug open for a bit and then circle back once we have big ticket system instance items filled in? After you've had a chance to implement the protocol in fluxion, maybe you will have other changes you feel we should make, and then we could do an RFC to solidify a "version 1" of the protocol?

This works for me.

> Also, what should the scheduler semantics be for handling a newly excluded resource? Should schedulers treat this case the same as a "drained" event, so that a running job can continue? Or should they kill the job?

Drained, newly excluded, and broker offline are all represented to the scheduler as "down".

The thought was that the scheduler should only stop considering "down" nodes for allocation, not take any other action. If we need some other action, such as raising a job exception, maybe exec or the job manager would do that.

In any case, I believe expected "drain" semantics are to let the current job finish but not start anything new.
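
As a rough illustration of the behavior described above, here is a minimal Python sketch. The function names `scheduler_view` and `allocatable` and the plain-set representation of execution targets are assumptions made for this example; they are not part of flux-core or fluxion.

```python
def scheduler_view(targets, drained, excluded, offline):
    """Collapse drained, excluded, and offline targets into one "down" set."""
    down = (drained | excluded | offline) & targets
    return {"up": targets - down, "down": down}


def allocatable(view, running):
    """Return the targets eligible for a new allocation.

    `running` maps a (hypothetical) job id to the set of targets it occupies.
    Down targets are skipped for new work; nothing here cancels or modifies
    running jobs, which matches the "let the current job finish" semantics.
    """
    busy = set().union(*running.values())
    return view["up"] - busy


view = scheduler_view({0, 1, 2, 3}, drained={1}, excluded={2}, offline=set())
running = {"job-a": {1}}  # job-a keeps running on the drained target 1
print(sorted(view["down"]), sorted(allocatable(view, running)))
```

In this sketch the scheduler only narrows the candidate set; any job exception would have to come from another component, as suggested above.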

> After you've had a chance to implement the protocol in fluxion, maybe you will have other changes you feel we should make, and then we could do an RFC to solidify a "version 1" of the protocol?

Just one more suggestion based on my observation: it seems it would be a bit more future-proof if the initial response of resource.acquire had an idset for the discovered resources (in addition to the "up" idset). This may help schedulers transition through different revisions of RV more smoothly. For now, though, I can fetch this info directly from resource.hwloc.by_rank.

Actually for now you need to take the union of the idset keys in the resources object to know which of the resource.hwloc.by_rank execution targets is actually included in the resource set, since the exclusions have been removed.

Maybe the version one protocol would provide all defined resources in the resources object, then also provide an exclude idset analogous to down, and follow that with exclude and include delta responses like down and up as exclusions change.
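
To make the proposal concrete, here is a minimal Python sketch of how a scheduler might track such responses. The response layout (keys `up`, `down`, `exclude`, `include` carrying lists of target IDs) and the `TargetState` class are hypothetical; the thread only proposes the idea, and a real protocol would presumably carry idset strings rather than lists.

```python
class TargetState:
    """Tracks up/down and include/exclude status per execution target."""

    def __init__(self, defined):
        self.defined = set(defined)  # all targets defined in the resources object
        self.up = set()
        self.excluded = set()

    def apply(self, response):
        """Apply an initial response or a later delta response."""
        self.up |= set(response.get("up", ()))
        self.up -= set(response.get("down", ()))
        self.excluded |= set(response.get("exclude", ()))
        self.excluded -= set(response.get("include", ()))

    def schedulable(self):
        """Targets that are defined, up, and not excluded."""
        return (self.up - self.excluded) & self.defined


state = TargetState(defined=range(4))
state.apply({"up": [0, 1, 2], "down": [3], "exclude": [2]})  # first response
state.apply({"include": [2], "down": [1]})                   # later delta
print(sorted(state.schedulable()))
```

Keeping `excluded` separate from `up` lets the instance continue to track liveness of excluded targets while the scheduler ignores them.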

Perhaps an easy first step is to include both "up" and "down" in the first response?

> Perhaps an easy first step is to include both "up" and "down" in the first response?

Where "up" is, for now, the union of the idset keys in R? We could do that. That would let fluxion ignore the resources object. Would it make more sense to call it "include"? (Then later "include" might be a subset of the union?)

> Where "up" is, for now, the union of the idset keys in R?

Sorry, I was having a less useful thought. At present, I guess "up" _is_ all the "included" targets. I was just thinking that presently the "down" idset is inferred by setting all targets down when initializing from by_rank, then marking available only those in the "up" idset. If the "down" idset was also present, the by_rank object could be ignored.
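
The inference described here can be sketched as follows. This is a hedged Python illustration, not fluxion code: `infer_down` is a hypothetical helper, and targets are modeled as plain integer sets rather than idsets.

```python
def infer_down(by_rank_targets, up):
    """Infer the "down" set from the by_rank targets and the "up" set."""
    # Start with every target from by_rank marked down ...
    state = {rank: "down" for rank in by_rank_targets}
    # ... then mark available only those listed in "up".
    for rank in up:
        if rank in state:
            state[rank] = "up"
    return {rank for rank, status in state.items() if status == "down"}


print(sorted(infer_down({0, 1, 2, 3}, up={0, 1, 3})))
```

Note that with this approach an excluded target and an offline target both end up in the inferred "down" set; the scheduler cannot tell them apart.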

https://github.com/flux-framework/flux-core/issues/3000#issuecomment-643036184

This sounds pretty comprehensive, @garlick. It should also enable a dynamically changing exclusion set and let schedulers leave excluded resources out of scheduling satisfiability. There are some implementation details we need to work out for such satisfiability checks (at least for fluxion), but this should be doable.

I will make near term progress with the union approach for now though.

OK, let's revisit this once we've synced up flux-core and flux-sched and do a version 1 rfc.

It sounds like we would allow a version 1 to be limited in these areas:

  • up/down/exclude/unexclude only work at the granularity of an execution target,
  • resources not associated with an execution target cannot be acquired,
  • there is no way to "grow" other than by "unexcluding" resources that were defined but excluded.

I guess those things would come later as driven by new features.

Now that I think about it, this is also related to future elasticity support. @milroy and I were wondering what satisfiability should really mean under an elasticity model. My observation is that satisfiability semantics seem to map pretty well to the inclusion/exclusion model: shrink semantics would be the same as exclusion semantics, and grow would be the same as inclusion.

Since, from the perspective of schedulers, it doesn't really matter whether resources are administratively excluded or elastically shrunk, could we consider using keys that are more agnostic to the internal implementation, like 'grow' or 'shrink'?

> there is no way to "grow" other than by "unexcluding" resources that were defined but excluded.

I think you were having a similar thought...

Just for reference, there was related discussion about "grow" semantics in #2908.

> shrink semantics would be same as exclusion semantics and grow is inclusion.

The semantics are the same, but from a management perspective there is a difference. The flux instance may still want to monitor the up/down status (liveness) of "excluded" targets (e.g. for service recovery). Though, I guess to the scheduler it doesn't make much difference.
Long term, though, a separate grow semantic may make more sense (so why not keep them separate now, is my question).

Side note: in the case where child instances are grown by unexcluding resources, would the resource module discover using the _R_ of the parent, then exclude everything not in the _R_ assigned to the job?
This would also require disabling "liveness" monitoring for those nonexistent ids.

> Though, I guess to the scheduler it doesn't make much difference.

I don't have a strong opinion about what key names are used in the version 1 protocol.

But in terms of writing somewhat future-proof fluxion code, this observation has so far led me to believe that grow(), shrink() and mark() are the three main idioms fluxion should use to support the current resource interface and to position us for future elasticity.
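
A minimal Python sketch of what these three idioms could look like is shown below. The `ResourcePool` class and its method signatures are purely hypothetical and do not reflect fluxion's actual interface; the sketch only illustrates how grow/shrink affect satisfiability while mark affects availability.

```python
class ResourcePool:
    """Toy pool keyed by execution target id."""

    def __init__(self):
        self.status = {}  # target id -> "up" or "down"

    def grow(self, targets):
        """Add targets to the pool (inclusion); new targets start down."""
        for t in targets:
            self.status.setdefault(t, "down")

    def shrink(self, targets):
        """Remove targets from the pool (exclusion)."""
        for t in targets:
            self.status.pop(t, None)

    def mark(self, targets, status):
        """Set the up/down status of targets already in the pool."""
        if status not in ("up", "down"):
            raise ValueError(f"unknown status: {status}")
        for t in targets:
            if t in self.status:
                self.status[t] = status

    def satisfiable(self, nnodes):
        """Could the request ever run? Counts up and down targets alike."""
        return nnodes <= len(self.status)

    def available(self):
        """Targets that could be allocated right now."""
        return {t for t, s in self.status.items() if s == "up"}


pool = ResourcePool()
pool.grow({0, 1, 2})
pool.mark({0, 1}, "up")
pool.shrink({2})
print(pool.satisfiable(3), sorted(pool.available()))
```

Under this model, administratively excluding a target and elastically shrinking the instance both reduce what `satisfiable()` counts, whereas a down target still counts.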

> Long term, though, a separate grow semantic may make more sense (so why not keep them separate now is my question)

Yes, this makes sense. Just to be clear, though, I wasn't talking about the general semantics, just the scheduling semantics with respect to satisfiability.

I brought this up in another issue, but we should pick a set of terminology used by the flux-core resource module and use it consistently across other services. Admins are going to be unhappy, for example, if the scheduler declares resources down that are drained or offline, when in another context "down" means unreachable.

This is just a practical consideration and unrelated to the semantics of satisfiability checks etc, so possibly a minor point at this time.

> Actually for now you need to take the union of the idset keys in the resources object to know which of the resource.hwloc.by_rank execution targets is actually included in the resource set

A question: How do we take the union scalably?

I am looking at the idset interface, and it seems the most straightforward way to do this is to go over each key (idset) in by_rank and add it to the overall set. But there is no merge operation in idset that takes an idset object and merges it into the current idset.

The input idset can be iterated one ID at a time using idset_first and _next. But wouldn't iterating an idset one ID at a time be inefficient over a large range?

That should be fine. I wouldn't worry too much about the scalability of this operation. It's only done once at startup, and in most cases it's a small set.

See also: rutil_idset_from_resobj() in src/modules/resource/rutil.c .
You could just pull that function in if you want.
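
For illustration, the union can be sketched in a few lines of Python. This is not the flux-core implementation (see `rutil_idset_from_resobj()` for that); `parse_idset` and `included_targets` are hypothetical helpers, the parser handles only simple forms such as `0-3,7`, and the resource values in the example are made up.

```python
def parse_idset(text):
    """Parse a simple idset string such as "0-3,7" into a set of ints."""
    ids = set()
    for part in text.strip("[]").split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            ids.update(range(int(lo), int(hi) + 1))
        else:
            ids.add(int(part))
    return ids


def included_targets(resobj):
    """Union the idset keys of a resources object (keys assumed to be idsets)."""
    included = set()
    for key in resobj:
        included |= parse_idset(key)
    return included


resobj = {"0-1": {"Core": 4}, "3": {"Core": 8}}  # illustrative values only
print(sorted(included_targets(resobj)))
```

Since this runs once at startup over a typically small number of keys, a straightforward per-key union like this should be adequate.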

@garlick: Do we still need this ticket? I believe we discussed this enough as part of RFC 27.

Yeah, let's close this one.
