In flux-framework/flux-sched#754 it was proposed that schedulers bootstrap from Rv1. Assuming this works out, then we should change resource.acquire to return Rv1 instead of the current by_rank object.
The resource module thus needs to obtain an Rv1. There are the following cases:
1) bootstrap from config file (system instance).
2) bootstrap from R assigned by parent flux instance (batch job and others)
3) dynamic discovery (start from foreign resource manager, or flux start for testing).
For 1, we could add a [resource] TOML stanza with a path key that points to a file containing Rv1. This file will probably need to be generated for each cluster from a tool in the fluxion project, since it would need to include JGF.
For 2, the rank 0 resource module can connect to parent broker to read R from job-info.
For 3, we could leave the current mechanism in place and just change flux hwloc reload to place Rv1 in the KVS in place of by_rank.
The big benefits of making these changes in the resource module, schedulers, and flux hwloc are:
1) System admins won't have to bring an entire cluster up once in order to "discover" its config.
2) Every batch job won't have to be delayed waiting for the hwloc reduction - silly when R has already been defined!
(@grondo and I discussed an alternate mechanism where a distributed resource module would gather and reduce dynamically discovered Rv1 fragments, or when R is statically defined, verify on each rank that configured R matches what hwloc thinks, but I didn't want to get this issue too off track for now)
There's a new draft of RFC 28 (resource acquisition protocol) up as flux-framework/rfc#253.
In the proposed draft, the resource object is converted to _R version 1_ and a number of shortcomings in the current implementation are addressed, including the ability to grow and shrink the resource set based on changes in the exclusions, making the first response to resource.acquire a bit less special, etc.
I'd like to propose that as part of updating the protocol, we modify the flux-core resource module to be able to handle the above 3 use cases. To accomplish that, we can leverage the existing resource.join RPC to reduce _R_ when discovered dynamically, and add a new resource.get RPC that would allow actual vs configured resources to be verified.
TL;DR
The resource module must know the full _R version 1_ assigned to the instance, so it can return it in a grow response to the scheduler resource.acquire request. To recap, there are three cases for bootstrapping R:
The following handshake is proposed for the resource module loaded on all ranks:
Resource module on rank > 0 makes resource.get RPC upstream. Parent responds with R in case 1,2 (configured, inherited) below. Parent responds with error in case 3 (discovered).
case 1,2: if resource module receives R, it should extract _R_local_, then verify that the expected resources are visible using the libhwloc API. If verification succeeds, send parent join request with online=<rank>. If verification fails, send parent join request with offline=<rank>. Enter reactor.
case 3: if resource module receives error, it should use the libhwloc API to generate _R_local_. Send parent join request with rlocal=object, online=<rank>. Enter reactor.
Join RPC is aggregated on path to rank 0. In case 1,2, _R_ is already known on rank 0. In case 3, _R_ is built up from _R_local_ fragments in the reduction. As soon as _R_ is completely known, the initial grow response may be sent to the scheduler resource.acquire request. It may not be sent any earlier per RFC 28.
The scheduler must not block 'flux module load' on resource.acquire. In case 3 that would deadlock, because downstream ranks will not begin loading the resource module until upstream ranks complete rc1, and R cannot be fully known until all ranks complete rc1.
Ranks that send resource.join and place themselves in the offline idset have failed resource verification (case 1,2). Rank 0 should drain these ranks with reason "resource verification failed". Ranks that place themselves in the online idset should be included in an 'online' response to the scheduler resource.acquire request.
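To make the per-rank side of the handshake concrete, here is a minimal Python sketch of how a rank > 0 might choose its join payload. It is an illustration only: `build_join_request`, the dict-of-counts shape of _R_local_, and the "at least as many resources as expected" verification rule are all assumptions, not part of the proposal or of flux-core.

```python
def build_join_request(rank, r_local_expected, r_local_discovered):
    """Return the join payload a rank > 0 would send to its parent.

    r_local_expected: this rank's slice of R as returned by resource.get,
        or None if the parent answered with an error (case 3, discovered).
    r_local_discovered: what a local hwloc probe found, modeled here as a
        hypothetical dict of counts such as {"core": 4, "gpu": 1}.
    """
    if r_local_expected is None:
        # Case 3: nothing to verify against, so contribute the local fragment.
        return {"rlocal": r_local_discovered, "online": str(rank)}

    # Cases 1 and 2: every expected resource must be visible locally.
    # (Assumed rule: discovered count >= expected count for each type.)
    verified = all(
        r_local_discovered.get(kind, 0) >= count
        for kind, count in r_local_expected.items()
    )
    return {"online" if verified else "offline": str(rank)}
```

A rank that was promised a GPU but cannot see one would send `{"offline": "<rank>"}`, which rank 0 could then turn into a drain with the "resource verification failed" reason.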
Multiple brokers per rank
In case 2, the resource module converts ranks in _R_ generated for the enclosing instance to local ranks (execution targets). If launching one broker per rank, this is trivial: renumber ranks sequentially starting at zero.
If launching multiple brokers per rank, the easiest approach may be to fall back to case 3, dynamic discovery. With the addition of hostnames to _R version 1_, it should be simple for rank 0 to detect that _R_ contains ambiguous rank to host mappings. It could then toss it away and answer NULL to resource.get requests to trigger discovery, like case 3.
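One possible reading of "ambiguous rank to host mappings" is that the same hostname appears for more than one rank in the inherited _R_. Under that assumption, the check on rank 0 could look like the sketch below. The function names and the flat list of hostnames are hypothetical; the proposal does not specify how hostnames are stored in _R version 1_.

```python
from collections import Counter


def has_ambiguous_hosts(hostnames):
    """True if any hostname is listed for more than one rank."""
    return any(count > 1 for count in Counter(hostnames).values())


def answer_resource_get(r, hostnames):
    """Return R for a resource.get request, or None to trigger discovery.

    hostnames: one hostname per rank in the inherited R (assumed layout).
    """
    if has_ambiguous_hosts(hostnames):
        # Toss the inherited R and fall back to case 3 (dynamic discovery).
        return None
    return r
```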
Impact on schedulers
Both schedulers must handle new grow/shrink/online/offline responses as outlined in RFC 28.
sched-simple must handle a resource object in _R version 1_ format, instead of _by_rank_.
Since the resource module will no longer call flux hwloc reload, fluxion must also have a reader for _R version 1_.
Fluxion _could_ provide a generator for _R version 1_ that includes JGF in the opaque scheduler section. In case 1 where _R_ is configured by sys admins, Fluxion would then have a higher fidelity resource description.
Similarly, Fluxion _could_ generate JGF fragments in the opaque scheduler portion of _R_ objects allocated to jobs. This would improve resource fidelity in case 2. N.B. since the scheduler section is opaque outside of the scheduler implementation, Fluxion would need to handle conversion of enclosing instance ranks to local ranks in this JGF.
For case 3, the opaque scheduler section would not be generated, so either _R version 1_ alone would have to suffice, or we would need to find a way to acquire hwloc XML as before.
Since _R version 2_ is planned to have JGF-like graph support, Fluxion could just support the plain _R version 1_ for all the above cases until version 2 is developed. Effort might be better spent working on version 2, compared to reworking XML collection.
Another thought @grondo had was adding a resource RPC for retrieving the hwloc XML for the local broker rank. After pondering a bit (and chatting a bit with @SteVwonder on the coffee call), I think this could work as a fallback from JGF for Fluxion in case 3.
case 1: JGF is added to the static _R version 1_ configuration. resource.acquire responds immediately with a grow response (nodes may not all be online), and Fluxion bootstraps from embedded JGF.
case 2: JGF is read with R from the enclosing instance KVS. resource.acquire responds immediately with a grow response (nodes may not all be online), and Fluxion bootstraps from embedded JGF. It has to remap ranks.
case 3: The resource module has to dynamically build _R version 1_ as ranks join. resource.acquire responds with grow only after the full R is known. There is no JGF in R. Fluxion detects the missing JGF and falls back to making hwloc RPCs to all ranks. This is safe because the nodes have to be online, or the response wouldn't have been sent.
As long as we don't try to go fetch hwloc in case 1 and 2, the RPCs should be safe.
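The three cases reduce to one decision on the scheduler side: use embedded JGF when it is present, otherwise fall back to hwloc. A minimal sketch, assuming the opaque scheduler data lives under a `scheduling` key of the R object (the key name follows the `.scheduling` key mentioned later in this thread) and with `pick_bootstrap_source` as a hypothetical helper:

```python
def pick_bootstrap_source(r):
    """Return "jgf" when R carries opaque scheduler data, else "hwloc"."""
    if r.get("scheduling"):
        # Cases 1 and 2: JGF is embedded, so no hwloc RPCs are needed.
        return "jgf"
    # Case 3: no JGF in R. The grow response is only sent once all ranks
    # have joined, so fetching hwloc XML from each rank is safe here.
    return "hwloc"
```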
Thanks @garlick for this. Generally the proposed interface makes sense to me.
For case 3, the opaque scheduler section would not be generated, so either R version 1 alone would have to suffice, or we would need to find a way to acquire hwloc XML as before.
The use case that I mentioned on the coffee call is this one: https://github.com/flux-framework/flux-core/discussions/3152. That use case requires that the socket resource and its relationship to the cores and gpus be included in the JGF.
Fluxion detects the missing JGF and falls back to making hwloc RPCs to all ranks. This is safe because the nodes have to be online, or the response wouldn't have been sent.
Just a thought, but could the hwloc XML be collected and aggregated during the Join RPC? I guess the downside here would be that this collection and aggregation would be unnecessary when Fluxion is not loaded, right?
N.B. since the scheduler section is opaque outside of the scheduler implementation, Fluxion would need to handle conversion of enclosing instance ranks to local ranks in this JGF.
I wonder if there is an easy way for us to do this ahead of time. We may run into the same multiple brokers per rank/node issue that you describe above unless there is a way to easily correlate execution targets to child instance ranks. I don't think I've fully grok'd this particular issue though, so no good ideas are coming to mind.
Just a thought, but could the hwloc XML be collected and aggregated during the Join RPC? I guess the downside here would be that this collection and aggregation would be unnecessary when Fluxion is not loaded, right?
Yep! The downside is not so much that it would only be used by Fluxion, as that it would only be used in case 3 (launch by foreign RM or flux start in test) if JGF can be passed through in cases 1 and 2. It bothers me a little that it would be done for every batch job launch in a system instance, then not be used.
Code change wise, I think it's pretty easy to drop in a resource.hwloc-dump RPC (or similar) to each rank in sequence in place of the KVS lookups of resource.hwloc.xml.<rank> that are there now.
Fetching the XML would be an interim solution until we have a more JGF-like Rv2.
@garlick and @SteVwonder: great discussions above and sorry I'm coming at this late.
With the above scheme, you would require not only an RV1/JGF generator (instance RV1) but also a library that allows you to build an RV1/JSON subset from it for each grow response of resource. This shouldn't be too onerous to develop if we agree on how it should be used (plugin?).
How did you envision what the input for the RV1/JGF generator would be? Would the input consist of the pre-created hwloc xml files sorted in rank order as well as some minimum metadata including queue info or a resource configuration language to generate the RV1?
Can we flesh out the details for rank remapping a bit?
In case 2, the resource module converts ranks in R generated for the enclosing instance to local ranks (execution targets). If launching one broker per rank, this is trivial: renumber ranks sequentially starting at zero.
This will produce the old rank to new rank map. That can be passed to the JGF remapper too. (maybe a call from resource model into the JGF library plugin above).
@grondo: just to make sure, there will be no need for remapping resource IDs, correct? The core IDs and GPU IDs in the parent instance are guaranteed to be correct in a nested instance?
If launching multiple brokers per rank, the easiest approach may be to fall back to case 3, dynamic discovery. With the addition of hostnames to R version 1, it should be simple for rank 0 to detect that R contains ambiguous rank to host mappings. It could then toss it away and answer NULL to resource.get requests to trigger discovery, like case 3.
Sorry, I don't fully understand this scheme. There are two cases: 1) multiple ranks on a node manage the same node resources in a shared fashion; and 2) node-local resources are chunked and managed by multiple ranks. If by dynamic discovery you mean that hwloc resources will be used with awareness of 1) and 2), this should be okay. But then,
For 3, we could leave the current mechanism in place and just change flux hwloc reload to place Rv1 in the KVS in place of by_rank.
without knowing the inner workings of the resource module, I am not sure exactly how this will be done.
@grondo may have more intelligent comments on this but I'll give you my quick take, including backing up and re-explaining where we currently are. It helps me if nothing else :-)
In PR #3265, the internal protocol of the resource module evolved a bit from what was discussed here:
In addition, case 3 collects hwloc XML and stores it in the KVS exactly as flux hwloc reload did, as was suggested above by @SteVwonder. This means Fluxion can continue to handle case 3 as it does now: wait for first resource.acquire response, extract grow ranks (small change: extract ranks from Rv1 instead of by_rank), then fetch XML from KVS.
To enable Fluxion to work with case 1 (important since the main use case is the system instance), we _could_ embed JGF in Rv1 in the opaque scheduler section of the configured file. The Rv1 returned by the first resource.acquire response would then contain this, and Fluxion could use it to bootstrap instead of looking for XML (which would not be there in the KVS in this case). We probably would need Fluxion to provide a sys admin tool to generate the Rv1 + JGF.
Case 2 is the tricky one IMHO. It seems like we have the following options (or what am I missing):
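- Fluxion adds JGF (+metadata?) to the opaque Rv1 scheduler section of a new job, then in the job, receives it back unmodified from resource.acquire inside the re-ranked Rv1. I'm not sure if that can be sufficient info to bootstrap Fluxion (I think not?)
- Fluxion adds nothing to the opaque Rv1 scheduler section of new job, then in the job, bootstraps from Rv1 only. Scheduling in the job would then be limited by Rv1.
- Add some way for Fluxion to hint to the core resource module that it should collect hwloc XML in case 2 also, so that Fluxion can bootstrap from that.
- Fluxion provides a plugin to core resource module to post-process (re-rank) JGF in the opaque scheduler section.
Although that last option provides the most capability, it may be too much trouble for Rv1 case 2. We may want to apply that effort towards developing Rv2.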
Fluxion provides a plugin to core resource module to post-process (re-rank) JGF in the opaque scheduler section.
Why does the re-ranking of opaque data have to be done in the core resource module? Why can't the reader of the opaque data re-map ranks? (Assuming that ranks are not reordered). The one case that is tricky is when there are multiple child ranks per parent rank, and we've already discussed falling back to case 3 in that scenario.
Fluxion adds JGF (+metadata?) to the opaque Rv1 scheduler section of a new job, then in the job, receives it back unmodified from resource.acquire inside the re-ranked Rv1. I'm not sure if that can be sufficient info to bootstrap Fluxion (I think not?)
What do you believe is missing here? Can't the Rv1+JGF in this case be treated the same as bootstrap from configuration?
(I apologize if I'm missing something obvious)
just to make sure, there will be no need for remapping resource IDs correct? The core IDs and GPU IDs in the parent instance are guaranteed to be correct in a nested instance?
We will need to study this and make sure we are referencing IDs consistently. Currently I think a nested instance references logical IDs, since the hwloc topology is filtered then the resources are gathered by logical ID (at least in flux-core). When using Rv1 from a parent instance this will change (assigned IDs will be relative to the parent), so things could break in unexpected ways (e.g. affinity).
Why does the re-ranking of opaque data have to be done in the core resource module? Why can't the reader of the opaque data re-map ranks? (Assuming that ranks are not reordered). The one case that is tricky is when there are multiple child ranks per parent rank, and we've already discussed falling back to case 3 in that scenario.
OK, if that works then I stand corrected! (I did not have a technical reason in mind - I may have misremembered or misinterpreted earlier discussion).
What do you believe is missing here? Can't the Rv1+JGF in this case be treated the same as bootstrap from configuration?
JGF rank remapping magic, but it sounds like that may not be a thing.
To enable Fluxion to work with case 1 (important since the main use case is the system instance), we could embed JGF in Rv1 in the opaque scheduler section of the configured file. The Rv1 returned by the first resource.acquire response would then contain this, and Fluxion could use it to bootstrap instead of looking for XML (which would not be there in the KVS in this case).
This can be done and will provide good interim experience towards Rv2. But unless resource.acquire can send the subset of JGF corresponding to grow, this will not be that useful.
Maybe I'm missing something here though.
Do you expect that the initial JGF should only contain the resources that are not excluded by the resource configuration file? If so "exclusion" set needs to be coordinated between the initial Rv1 + JGF generator and resource configuration.
We probably would need Fluxion to provide a sys admin tool to generate the Rv1 + JGF.
Is this essentially the resource generation front end or are you expecting an extension of some sort of the basic resource generation tool?
Do you expect that the initial JGF should only contain the resources that are not excluded by the resource configuration file? If so "exclusion" set needs to be coordinated between the initial Rv1 + JGF generator and resource configuration.
This is a good question. I would think you would want or need to include all resources in the configuration file. The scheduler must handle exclusion of configured resources anyway in case a sysadmin decides to exclude resources after startup. However, for the system instance, maybe it doesn't make sense to add resource configuration for nodes, such as management nodes, that are known never to be "included"?
JGF rank remapping magic, but it sounds like that may not be a thing.
Ah, well I could be oversimplifying. But when child ranks map 1:1 to parent ranks, isn't rank remapping simply sorting existing ranks and re-indexing them from 0? Or, in the case of a scheduler that needs to re-rank its "opaque" resource representation, it could perhaps infer something from the order of ranks which have already been remapped in Rv1?
Case 2 is the tricky one IMHO. It seems like we have the following options (or what am I missing):
- Fluxion adds JGF (+metadata?) to the opaque Rv1 scheduler section of a new job, then in the job, receives it back unmodified from resource.acquire inside the re-ranked Rv1. I'm not sure if that can be sufficient info to bootstrap Fluxion (I think not?)
What is already there in JGF is sufficient. In fact, that's what Fluxion uses when reloaded to reconstruct the scheduler state. How to "remap" JGF to the nested instances requires some more discussion.
Related to https://github.com/flux-framework/flux-core/issues/3228#issuecomment-712360439, will users be able to set the core resource configuration to set "exclusion"? If so, there is the same implication here.
- Fluxion adds nothing to the opaque Rv1 scheduler section of new job, then in the job, bootstraps from Rv1 only. Scheduling in the job would then be limited by Rv1.
At first glance, this will seem to require lots of changes in Fluxion. Currently if the opaque Rv1 scheduler section is missing, you can't even reconstruct the state of Fluxion on reload.
- Add some way for Fluxion to hint to the core resource module that it should collect hwloc XML in case 2 also, so that Fluxion can bootstrap from that.
Well, I think this is the time to learn how to bootstrap Fluxion using the R from the parent.
Fluxion provides a plugin to core resource module to post-process (re-rank) JGF in the opaque scheduler section.
For case 2, I have to think having some discussions on allowance for resource "exclusion" and rank remapping protocol would allow for quick convergence...
Why does the re-ranking of opaque data have to be done in the core resource module? Why can't the reader of the opaque data re-map ranks? (Assuming that ranks are not reordered).
Are you suggesting the user of Rv1 (with full JGF) will use the execution section to remap JGF?
Yeah, this seems to work for the one-broker-per-node case (and I see your comment about falling back for the multi-broker case, which sounds good to me). Even the initial pre-generated JGF shouldn't set the rank fields.
@garlick and @grondo: if you are available for today's 2PM, I would like to discuss this a bit. It would be helpful for me if we could discuss the following:
1) Implications of resource exclusion
2) Rank re-mapping (including the case where some resources are excluded -- if this should be supported).
3) Initial Rv1 generation
4) Resource IDs consistency for nested instances
Are you suggesting the user of Rv1 (with full JGF) will use the execution section to remap JGF?
Yes the user of Rv1 JGF (or any opaque data from .scheduling key) would be responsible for remapping JGF. It should be straightforward if we make a rule that Rv1 inherited from parent will always have 1:1 mapping of ranks, and order will be preserved.
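Under the "1:1 mapping, order preserved" rule, the reader of the opaque data can derive the remapping on its own by sorting the parent ranks found in the JGF and re-indexing them from zero. The sketch below assumes a simplified JGF layout in which each vertex carries `metadata.rank` (with -1 for vertices not tied to a rank); `remap_jgf_ranks` is a hypothetical helper, not Fluxion code.

```python
import copy


def remap_jgf_ranks(jgf):
    """Return a copy of jgf with parent ranks renumbered 0..N-1 in order."""
    nodes = jgf["graph"]["nodes"]
    # Collect the distinct parent ranks, ignoring rank-less vertices (-1).
    parent_ranks = sorted(
        {n["metadata"]["rank"] for n in nodes if n["metadata"].get("rank", -1) >= 0}
    )
    # Order-preserving map: the lowest parent rank becomes local rank 0.
    rank_map = {old: new for new, old in enumerate(parent_ranks)}

    remapped = copy.deepcopy(jgf)  # leave the caller's object untouched
    for node in remapped["graph"]["nodes"]:
        old = node["metadata"].get("rank", -1)
        if old in rank_map:
            node["metadata"]["rank"] = rank_map[old]
    return remapped
```

For example, vertices on parent ranks 3 and 7 end up on local ranks 0 and 1. The same map could be cross-checked against the already re-ranked execution section of Rv1.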
Implications of resource exclusion
I'm a bit confused why resource exclusion is a topic here. Exclusion should be treated the same in all 3 cases (even case 3 which is status quo). The excluded resources are configured but are not used for jobs and are not used for determining satisfiability.
Rank re-mapping (including the case where some resources are excluded -- if this should be supported).
Resources are excluded by rank (a.k.a. execution target). Therefore re-mapping ranks has to be done before exclusion is applied. I'm not sure why exclusion is a special case, as it should be handled very similarly to "drain".
I'm a bit confused why resource exclusion is a topic here. Exclusion should be treated the same in all 3 cases (even case 3 which is status quo). The excluded resources are configured but are not used for jobs and are not used for determining satisfiability.
The case I'm thinking about is:
Say, you have an R in the parent instance. But for a nested instance, you used the resource configuration to exclude one node from there.
As I understand from https://github.com/flux-framework/rfc/pull/253/commits/e13b40b45809a6199ad247f2af40b86a25c6cca2, the grow request should send the resource object that doesn't contain the excluded resources. So passing the JGF directly won't work for this case? That is unless we make some other protocol changes?
Ah I see. You are correct I had forgotten that exclude/include is indeed different from drain/undrain. However, given that JGF or anything in .scheduling key is opaque, we can state that the JGF will always be passed as-is in the resource object, exclusions notwithstanding. The user of JGF will have to then read Rv1 to determine the included ranks and apply that to their internal resource configuration after JGF has been processed.
Edit: (and I now understand why @garlick proposed a scheduler-provided plugin for the core resource module to manipulate the .scheduling key. I apologize for not understanding that at first. However, it might be simpler to state that the opaque data in Rv1 is never modified by the resource module, since this results in less coupling of components)
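If the JGF is passed through unmodified, the scheduler-side step of applying exclusions could look roughly like this: decode the ranks that Rv1 says are included, then keep only the JGF vertices on those ranks. This is a sketch under assumptions: the idset decoder handles only the plain "0-2,5" form, vertices are assumed to carry `metadata.rank`, and both function names are made up for illustration.

```python
def decode_idset(idset):
    """Decode a simple idset string such as "0-2,5" into a set of ints."""
    ids = set()
    for part in idset.split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A bare id like "5" has no upper bound, so it covers only itself.
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids


def usable_vertices(jgf_nodes, included_ranks):
    """Keep JGF vertices whose rank appears in the included idset."""
    keep = decode_idset(included_ranks)
    return [n for n in jgf_nodes if n["metadata"]["rank"] in keep]
```

Note that, per the ordering constraint discussed above, any rank remapping would need to happen before this filter is applied.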
Resource IDs consistency for nested instances
One idea to keep in mind here: once we start using cgroups to contain resources, using logical IDs may be the only thing that makes sense. Therefore, some kind of "remapping" of IDs may end up being necessary. I'm not totally sure about that; it may require us to run some experiments.
Per our discussion today at coffee hour:
0) Start to scope the effort required for Fluxion to complete this work (path of least resistance)
1) Have a simple plan of attack to learn about resource ID consistency
2) Use what we learn from step 1) to decide what to do with the initial RV1 generator for now. From our discussion, the generator will likely produce the initial RV1 from a set of pre-collected hwloc XML files, but lessons from 1) could change that, so we postponed the decision to a later time.
How much of this issue was resolved with the merge of #3265?
I think the issue as described is resolved (and then some), and I think the embedded JGF is an issue for fluxion.
We can open new issues for any core problems that come up when trying to get that done in fluxion.
Sounds good to me.
Resource IDs consistency for nested instances
This is still an issue for GPUs but this should be tracked in fluxion as well.