Problem: hwloc data is currently collected and reduced to by_rank in the KVS using a method that assumes all broker ranks are available. Furthermore, it is initiated in rc1 which (as proposed in #2614) will run on all ranks, not only rank 0, and will run on rank 0 before other ranks are available.
We will need a new way for hwloc data to be collected from any ranks whose resources are not statically configured (e.g. by a config file in a system instance, or by R in an instance started by Flux).
The reduced hwloc by_rank data is only used by simple-sched, and only as a convenient method of auto-configuration for instances that come up all at once (which is the only thing we currently support)
I like the idea of configuring instead from R for a subinstance, and static configuration for a system instance, perhaps falling back to auto-config for instances started via Slurm or flux start --size. In the first two cases, maybe we can fall back to populating "full" hwloc data in the KVS on demand. In most cases I would assume the full hwloc would never be fetched so we could get away without populating it, and possibly get a slight speedup in instance startup.
One way we could implement this quickly would be to turn resource-hwloc back into a service, perhaps only running on rank 0. Current users of the KVS hwloc data could be moved to this service instead. Schedulers would first check for R or configuration to populate their list of resources, falling back to resource-hwloc if neither exists. On fallback to resource-hwloc, hwloc data is gathered as it is now, and scheduler proceeds once data is available.
Other users of hwloc, e.g. flux-hwloc(1) would also request full hwloc data on demand from the service.
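To make the fallback order concrete, here is a minimal sketch of how a scheduler might pick its resource source. All names (`select_resources`, `r_from_parent`, `static_config`, `discover_hwloc`) are hypothetical and not part of any existing Flux API; the point is only that the hwloc path is invoked lazily, when neither R nor configuration exists.

```python
from typing import Callable, Optional, Tuple


def select_resources(
    r_from_parent: Optional[dict],
    static_config: Optional[dict],
    discover_hwloc: Callable[[], dict],
) -> Tuple[str, dict]:
    """Return (source, resources), preferring R, then static config.

    discover_hwloc stands in for a request to the proposed on-demand
    hwloc service; it is only called when nothing else is available.
    """
    if r_from_parent is not None:
        return "R", r_from_parent
    if static_config is not None:
        return "config", static_config
    # Neither R nor configuration exists: fall back to hwloc discovery.
    return "hwloc", discover_hwloc()
```

With this shape, an instance that has R or a static config never pays for hwloc collection unless some other user explicitly asks for it.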
:point_up: Just one idea of, I'm sure, many more to come.
Hmm, maybe we could have a resource module that just runs flux-hwloc if the eventlog says a new execution target came online and it doesn't have statically configured resources? Otherwise do nothing?
As I recall keeping hwloc out of the broker's address space was a win, though I don't recall the details.
Edit: or have rc1 run flux-hwloc to push data to a service on rank 0 that reduces/ignores it?
As I recall keeping hwloc out of the broker's address space was a win, though I don't recall the details.
Yeah, I was thinking the "service" would just execute flux-hwloc on demand.
Hmm, maybe we could have a resource module that just runs flux-hwloc if the eventlog says a new execution target came online and it doesn't have statically configured resources? Otherwise do nothing?
Why not just do it on demand? The scheduler could determine if it needs hwloc and request it only if necessary. The execution target doesn't come online until it gets the answer. Then you don't have to muck about in a script to determine if there is configuration or not. This would keep the hwloc provider service as simple as possible (you don't have to pollute it with knowledge of what resource configuration is, or if there are multiple ways it might be populated, R, static config, etc.)
Additionally, if some other user wants the full hwloc info for an instance, even when there is configuration, it can easily be provided on demand, instead of unnecessarily for every instance.
I like the idea of configuring instead from R for a subinstance, and static configuration for a system instance, perhaps falling back to auto-config for instances started via Slurm or flux start --size
Yeah, I agree this is where we should be headed.
Maybe one thing we should do now is to firm up our canonical representation of R that all the resource generators can target. To support generators of different complexities, this needs to be highly expressive and needs to be a graph. I and @milroy have been pleased with JSON Graph Format (JGF) for that purpose. Maybe we should promote JGF portion buried within scheduling key of R to the top level key and make use of it.
One advantage of doing it this way would be that we can kick the can down the road to design and implement our resource generation language for sys-admins later while making progress with the current work. All we need would be just one way to generate this R at the top level and we can use a simple approach for this.
For flux-sched, I can use GRUG to generate JGFs with different complexity to test the scheduler complexity. GRUG or JSONified version of GRUG can be used as a stand in for sys-admin configuration language for the time being. If we have a pre-populated set of hwlocs on a target system, we can use some existing codes in sched or core to convert that into JGF for the system instance to test.
I think the core issue to solve at this point is to come up with a good R to execution target map and we need to start our discussion.
Why not just do it on demand? The scheduler could determine if it needs hwloc and request it only if necessary. The execution target doesn't come online until it gets the answer. Then you don't have to muck about in a script to determine if there is configuration or not. This would keep the hwloc provider service as simple as possible (you don't have to pollute it with knowledge of what resource configuration is, or if there are multiple ways it might be populated, R, static config, etc.)
flux-sched can delay its resource graph population until runtime instead of load time. However, it will need an external trigger telling it "now go and fetch hwloc data" from this online service. My guess is there will be either global or local events signaling the availability of an execution target, and hence of its hwloc data?
I and @milroy have been pleased with JSON Graph Format (JGF) for that purpose.
The other merit of JGF is that future investigations into it, such as compression techniques, would benefit sched and core equally because both share the common format.
Finally, now that I think about this, @milroy and I are actively working on effective techniques to dynamically grow and shrink resources using JGF as the input format. Leveraging this will also position us to be "elastic" in the future. Just $0.02.
My guess is there will be either global or local events signaling the availability of an execution target, and hence of its hwloc data?
Yes, this is the eventlog @garlick mentioned above. Also mentioned in #2511, but I'm not sure if there is another, specific, issue describing the execution target eventlog.
Yes, this is the eventlog @garlick mentioned above. Also mentioned in #2511, but I'm not sure if there is another, specific, issue describing the execution target eventlog.
Thanks.
I am unsure if it would be best for a scheduler to fetch hwloc every time a local event is emitted.
Seems some sort of bulk fetching with some timeout filter would improve performance. This is an optimization that can be worked on later though.
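As a rough illustration of the "bulk fetching with some timeout filter" idea, the sketch below groups per-rank "online" events into batches so that one fetch could cover each batch. The function name, the `(timestamp, rank)` event shape, and the window policy are all assumptions made for illustration, not an existing interface.

```python
from typing import Iterable, List, Tuple


def coalesce(events: Iterable[Tuple[float, int]], window: float) -> List[List[int]]:
    """Group (timestamp, rank) events into batches of ranks.

    A batch is closed when an event arrives more than `window` seconds
    after the first event of that batch.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    start = None
    for ts, rank in sorted(events):
        if start is not None and ts - start > window:
            batches.append(current)
            current = []
            start = None
        if start is None:
            start = ts
        current.append(rank)
    if current:
        batches.append(current)
    return batches
```

A scheduler could then issue one hwloc request per batch rather than one per event.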
Maybe a new resource module could be the sole writer of this eventlog, as well as act as an intermediary to provide resource data to the scheduler. It could effectively implement "grow" in the case where an execution target comes online without resources pre-loaded for it. This would consist of executing flux-hwloc on the node. Only when resources are known would the target be marked "up" in the eventlog?
To support generators of different complexities, this needs to be highly expressive and needs to be a graph.
Simple schedulers do not need a graph though, and this is why we have the "split" representation of R that allows scheduler specific annotation (e.g. JGF) under the scheduling key.
Not to completely dismiss the idea though, but just to note that this is how we support "schedulers of varying complexity".
Simple schedulers do not need a graph though, and this is why we have the "split" representation of R that allows scheduler specific annotation (e.g. JGF) under the scheduling key.
True.
But it still requires a group of hwloc xml files either directly or by relying on the info that comes from parsing the group of hwloc xml files from another module? As you know, hwloc represents its resources as a graph as well.
If you were to do this with static resource data, I think we are essentially talking about graph data lowered to what the simpler scheduler requires...
Not to completely dismiss the idea though, but just to note that this is how we support "schedulers of varying complexity".
Yes. I think the point is to have a common format that can address most (if not all) consumers...
Maybe a new resource module could be the sole writer of this eventlog,
Well this makes a lot of sense in that administrative up|down|drain could be implemented via this module. It could also be the provider of the requirement in #2898.
Makes me wonder though, if this module holds the current status and configuration of resources/execution targets, is there a need for an eventlog (besides just a log of what happened when)?
Maybe a new resource module could be the sole writer of this eventlog,
A minor point, but there is already a module called resource in flux-sched. It will be renamed in the future, but I mention it because of the naming collision.
is there a need for an eventlog (besides just a log of what happened when)?
Could certainly be done with streaming RPC or published events, but I think we want to record what happened when don't we?
I think we want to record what happened when don't we?
Just to close the loop, I did say
(besides just a log of what happened when)?
I meant to ask: is there a benefit to the eventlog for a user of the resource service? Would users (e.g. the scheduler) still watch the eventlog, then query the resource service to get the configuration of targets going up/down? Or would they just subscribe to a resource service for events?
Or would they just subscribe to a resource service for events?
For me, it depends. If this new service implements higher-level events like "bulk up" etc., I would rely on those instead of individual, raw, execution-target-specific events. If not, I don't see the benefit.
From the perspective of separation of concerns, I can see the benefit of the new service only providing info for resource queries and the resource to execution target mapping.
Oh yeah, I guess it would be a simpler interface for the scheduler to just deal with an RPC.
Also: we have libschedutil if we want to add a level of abstraction between scheduler and interface (if it can further simplify life for schedulers).
I've gotten a bit lost, sorry.
Sounds like we're gravitating toward addition of a new resource service which performs the following:
flux-hwloc would go away, since hwloc wouldn't be available from resource when it is using a static config (alternately, it could offer a service to populate hwloc data for "up" execution targets on demand.)
A frontend tool described in #2898 could talk directly to resource to get a summary of currently configured vs up/down execution targets, and thus resources. Aside: the tool may also need to somehow query the scheduler or list of running jobs to determine idle vs busy resources. Or should resource service monitor jobs to capture this information as well?
This issue has gone significantly beyond "hwloc cannot assume available of all broker ranks". Shall we rename the issue or open a new one?
Reads resource configuration from static config, R, or dynamic config (hwloc).
My proposal was to unify static config and R into the canonical R representation. The static config would generate an R, and this service can rely on that format alone. Further, dynamic config (hwloc) can be stored in this R format so that we only have to worry about "one format", if that makes sense.
resource
Let's use a different name to avoid confusion with the same named module in flux-sched.
We also need this service to map R to execution target?
I vote we rename this issue.
If we could back up a little bit and come up with a minimum step here that could go in before my rc changes, that would be useful.
Let's use a different name to avoid confusion with the same named module in flux-sched.
It makes more sense to allow core modules to have the generic names and rename 3rd party modules. I was thinking it might make sense to require external modules to be prefixed with the name of their project somehow. e.g. sched-resource. Not sure how to enforce this however.
It makes more sense to allow core modules to have the generic names and rename 3rd party modules. I was thinking it might make sense to require external modules to be prefixed with the name of their project somehow. e.g. sched-resource. Not sure how to enforce this however.
OK. We will go with fluxion-resource at some point anyway. It is just that this will be confusing for the time being.
If we could back up a little bit and come up with a minimum step here that could go in before my rc changes, that would be useful.
I know this was already dismissed above, but I was thinking of a baby step forward when proposing on-demand population of hwloc data. This would require minimal change to existing modules, and we could keep flux-hwloc for now.
We also need this service to map R to execution target?
Yes, this service would be doing a lot.
I know this was already dismissed above, but I was thinking of a baby step forward when proposing on-demand population of hwloc data. This would require minimal change to existing modules, and we could keep flux-hwloc for now.
If this is combined with the external event generation, flux-sched can make use of it as a baby step, I think.
My proposal was to unify static config and R into the canonical R representation. The static config would generate an R, and this service can rely on that format alone.
This seems like a lofty goal, but I always worry about attempting to design "do everything" formats, in that it typically ends up making things more difficult for the simple cases, and inevitably impossible for some complex case you didn't at first consider.
The idea of parsing global XML to translate it to JGF just to read "You have 4 cores" seems like a lot of churn for a high throughput case as an example. However, maybe I'm making too much out of it. I'm definitely willing to try making a canonical R and see how it goes.
This seems like a lofty goal,
Agreed.
The idea of parsing global XML to translate it to JGF just to read "You have 4 cores" seems like a lot of churn for a high throughput case as an example. However, maybe I'm making too much out of it. I'm definitely willing to try making a canonical R and see how it goes.
Fair point. I can see your concern and thank you for your willingness.
FWIW, my reasoning was:
As I see it, given where we are headed at the high end, more complex cases (e.g. multi-tiered storage support) will come our way much quicker than you would think.
Ways to statically configure a system will also have to change (towards higher complexity), and there will likely be multiple ways.
Very likely, we will also have to deal with different ways to populate R (hwloc now, but later vendor-specific external services to discover global storage resources...).
Yet we have to advance not only flux-core but also other components to keep abreast of these changes.
This seemed like too much complexity to deal with in an ad hoc fashion.
Now, having the canonical jobspec was very helpful for making progress at different paces in flux-core and flux-sched, and it feels like we can benefit from a similar arrangement: have a full-blown target representation first, and slowly build up partial implementations.
Also we have lots of experience with JGF with multiple efforts around it. It felt like it makes sense to leverage them as well.
I know this was already dismissed above, but I was thinking of a baby step forward when proposing on-demand population of hwloc data. This would require minimal change to existing modules, and we could keep flux-hwloc for now.
Just to verify that we're on the same page - scheduler would send resource something like a resource.discover request that would have a streaming response containing an execution target idset + resource definition. The request could specify a flag indicating a format for resource definition? (e.g. hwloc XML vs reduced by_rank or advanced formats as passed through in R or config?).
Then maybe there could be a separate resource.monitor request with streaming response containing idset: up|down?
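To pin down what those two streams might carry, here is a small sketch of the payloads. The field names (`idset`, `format`, `resources`, `state`), the grouping of ranks with identical resources into one response, and the range-string encoding are illustrative assumptions only, since the actual message layout is not settled in this discussion.

```python
import json
from typing import Dict, Iterable, Iterator, List


def encode_idset(ranks: Iterable[int]) -> str:
    """Encode ranks as a compact range string, e.g. [0, 1, 2, 5] -> "0-2,5"."""
    ids = sorted(set(ranks))
    parts: List[str] = []
    i = 0
    while i < len(ids):
        j = i
        # Extend j to the end of the run of consecutive ids starting at i.
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        parts.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(parts)


def discover_responses(by_rank: Dict[int, dict], fmt: str = "by_rank") -> Iterator[str]:
    """Yield one hypothetical resource.discover response per distinct
    resource definition, with the ranks sharing it folded into an idset."""
    groups: Dict[str, List[int]] = {}
    for rank, res in by_rank.items():
        groups.setdefault(json.dumps(res, sort_keys=True), []).append(rank)
    for key, ranks in groups.items():
        yield json.dumps(
            {"idset": encode_idset(ranks), "format": fmt, "resources": json.loads(key)}
        )


def monitor_response(ranks: Iterable[int], state: str) -> str:
    """Build a hypothetical resource.monitor response: idset plus up|down."""
    if state not in ("up", "down"):
        raise ValueError("state must be 'up' or 'down'")
    return json.dumps({"idset": encode_idset(ranks), "state": state})
```

Keeping discover (what a target has) separate from monitor (whether it is up) means a static config could answer discover once, up front, while monitor keeps streaming as ranks join.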
For current system (before rc changes), resource would assume all ranks are up. As part of rc changes, we could modify it to "watch" the hello protocol as ranks join (details TBD).
Did I misconstrue your suggestion or is this kind of what you were thinking?
If this is combined with the external event generation, flux-sched can make use of it as a baby step, I think.
Here's how the first iteration could work:
resource-hwloc (or just resource?) service to fetch by_rank blob or global XML data.
In first iteration, resource-hwloc simply waits until all ranks are up and executes flux-hwloc reload or equivalent as is done now.
Thus a scheduler loaded early on rank0 would block until all resources are up.
Next step would be a mode where resources are marked up as they are available and the scheduler would add them on demand.
Finally, a mode for the system instance could be added where resource knows all resources available, tells the scheduler about all of them and which are up/down over time.
Sorry @garlick, I didn't see your response until just after I posted.
Yes, I think we're on the same page. I like your method names of .discover and .monitor.
My goal was to split the work into stages that could easily be accomplished without too much interdependency.
e.g. the first step requires very little modification to schedulers.
As part of rc changes, we could modify it to "watch" the hello protocol as ranks join (details TBD).
Yeah, it seems like an item of work for the rc changes is a method to easily determine when an instance is "fully" up. I assume you have this already for starting initial program for non-system instances?
@dongahn, good comments above! It feels like these should be pasted into a different issue to continue the discussion. Do we have a general open issue already? (Perhaps in the RFC project?)
In first iteration, resource-hwloc simply waits until all ranks are up and executes flux-hwloc reload or equivalent as is done now.
Thus a scheduler loaded early on rank0 would block until all resources are up.
So with the first iteration, a scheduler still populates its resource data at load time. The change is just to move from KVS reads to an RPC to the new service. Did I get it right?
Yeah, it seems like an item of work for the rc changes is a method to easily determine when an instance is "fully" up. I assume you have this already for starting initial program for non-system instances?
No, my thought was that the initial program would start immediately on rank 0 when it completes its rc1. That assumption was predicated on the scheduler knowing resources in advance so that it would not reject jobs as unsatisfiable, and would simply queue them until targets are up.
So it seems like scheduler should not reject jobs until it gets a discover response, and I guess in the on-demand hwloc case we would want to not send this until all the ranks are up.
Edit: but yes, the "hello" protocol does this. We just need to export that information from the broker somehow.
@dongahn, good comments above! It feels like these should be pasted into a different issue to continue the discussion. Do we have a general open issue already? (Perhaps in the RFC project?)
I don't think we have it. Will create one.
No, my thought was that the initial program would start immediately on rank 0 when it completes its rc1. That assumption was predicated on the scheduler knowing resources in advance so that it would not reject jobs as unsatisfiable, and would simply queue them until targets are up.
Seems like in the common case of subinstance and/or running under an existing RM, it would be surprising to users to have their batch script start before all ranks are up?
That being said, I guess it is better to start the user's script as soon as practical, as long as any operations that might assume the batch job is fully "up" block.
It also would support "grow" better in the long term.
So it seems like scheduler should not reject jobs until it gets a discover response, and I guess in the on-demand hwloc case we would want to not send this until all the ranks are up.
Since the proposal is that the scheduler would block until it gets the first discover response, I think the scheduler can't reject any jobs (it doesn't complete hello protocol with job manager until it has some idea of resource config).
Now that I think of it, there is no rc2 script in the system instance case, so we _could_ make it the default to delay the rc2 script until all ranks are up.
Now that I think of it, there is no rc2 script in the system instance case, so we could make it the default to delay the rc2 script until all ranks are up.
That might be more tractable. Since the rc2 script could run anything, it would be difficult to make sure all services work as expected while the instance is still booting...
Edit: And later on we could make this an option, e.g. for hyyuuuge instances running a lot of jobs it might make sense to start scheduling before the instance is full grown. (the adolescent instance problem)
Good points @grondo!
So it seems like scheduler should not reject jobs until it gets a discover response, and I guess in the on-demand hwloc case we would want to not send this until all the ranks are up.
This shouldn't be a problem for the first iteration if the scheduler still blocks at load time.
This will create some interesting problems for the true on-demand case with respect to queuing policies. For example, if the satisfiability of the first job cannot be determined, should we move to the next job at all? Many of the queuing policies won't make sense.
Edit: And later on we could make this an option, e.g. for hyyuuuge instances running a lot of jobs it might make sense to start scheduling before the instance is full grown. (the adolescent instance problem)
Yes this will only make sense with specific types of queuing policies.
Yes this will only make sense with specific types of queuing policies.
We could configure whether or not initial program starts before ranks are all online, while still having the scheduler wait to say hello until it knows its base set of resources. The advantage of starting rc2 early is maybe it needs to submit a million jobs to the queue, and it can be doing that in parallel with startup.
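The two decisions here are independent, which a toy model makes easier to see. In the sketch below, `StartupGates`, `early_initial_program`, and `resources_known` are hypothetical names, and "rank 0 finished rc1" is approximated by rank 0 being marked online; none of this reflects an existing broker option.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class StartupGates:
    size: int                      # total broker ranks expected
    early_initial_program: bool    # hypothetical config option
    online: Set[int] = field(default_factory=set)
    resources_known: bool = False  # set once the base resource set arrives

    def rank_online(self, rank: int) -> None:
        self.online.add(rank)

    @property
    def all_online(self) -> bool:
        return len(self.online) == self.size

    def may_start_initial_program(self) -> bool:
        # Gate 1: rc2 may start early if configured, otherwise only
        # once every rank is online.
        return 0 in self.online and (self.early_initial_program or self.all_online)

    def may_send_sched_hello(self) -> bool:
        # Gate 2: the scheduler says hello only once it knows its base
        # resources, regardless of when the initial program started.
        return self.resources_known
```

With `early_initial_program=True`, a script could begin submitting jobs as soon as rank 0 is ready, while jobs simply queue until the scheduler's gate opens.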
while still having the scheduler wait to say hello until it knows its base set of resources.
Probably also a good idea to give an option to the scheduler to go eager too. If the scheduler policy is set to be "order-agnostic high throughput" (we don't have it yet), it might as well start scheduling jobs under partially discovered resources...
BTW, this got me to think. The idea of rejecting a job when grow is supported is just odd... (various limits aside).