As described in RFC 12, the Flux design calls for a "system instance" that nodes join when they boot up, in contrast to instances that are launched on demand as parallel programs by SLURM or Flux.
Requirements:
Design/prototype activities - Est: 6 FTE months:
Development activities - Est: 12 FTE months:
Missing above is an alternative (etcd-based?) bootstrap mechanism for sub-instances of a system instance.
If a requirement of any system instance is the ability to run arbitrary parallel programs and not just child instances, then this alternative bootstrap mechanism will also need to support a PMI interface I suppose.
Can we assume that while the system instances need a special startup mechanism, programs launched by the system instance (including flux instances) use PMI to get bootstrapped?
Can we assume that while the system instances need a special startup mechanism, programs launched by the system instance (including flux instances) use PMI to get bootstrapped?
If we just say PMI is available that would be sufficient.
Notes from todays meeting:
Requirements to consider when weighing the system instance architectures under discussion:
Notes from yesterdays (the three options under consideration for system instance architecture)
@garlick: how does scheduling domains fit into this picture? E.g., if a facility wants to use power-aware scheduling at the top level and then wants to have multiple sub-instances underneath it at which they want to run "traditional nodes and cores" scheduling, which one of the three choices can support this? We can jam all of the complex scheduling logic into the top-level instance but this can have scalability limits for large scale systems?
Just to be sure we're on the same page: we've been considering the "system instance" to be a per-cluster entity. A system instance starts up standalone and then offers its resources to a facility instance (perhaps under the control of the system instance owner e.g. sys admins), and then continues to monitor and manage the resources at the behest of the facility instance. The details of this relationship are TBD.
Of the three system instance architecture options, the first two have only one system instance thus would have at most one scheduling domain. The third option has multiple system instances thus _could_ have more than one scheduling domain, although in our discussions I think we were considering those instances to be pretty minimal, existing mainly to isolate the "top" system instance from compute node failures.
We may not need to tackle the details of how a system instance relates to the facility instance as part of the system instance architecture. However, it seems like if we choose the third bullet, we'll need some parts so that compute node instances can "give" their resources to the system instance. I'm not sure if that's a pro or a con of the third option - it would be good to get this design done sooner rather than later, but it adds complexity to this particular milestone.
Of the three system instance architecture options, the first two have only one system instance thus would have at most one scheduling domain.
However, nothing in the architecture of these options prevents that single instance from creating sub-instances with new scheduling domains. I think the scheduling domain question is out of scope for this issue. The system instance is a building block that allows a resilient instance to start on some resources without an initial parent, and at that point the rest of the hierarchical design of Flux applies as we've discussed before.
Here's a slide representing one possible way to implement the first bullet. It's a hybrid overlay topology: tree with a full mesh replacing the last level of the tree, service "masters" confined to the mesh (ignore the tree node numbering - it begins with _m_, not necessarily _2_).

I'll go ahead and post my slide for the second option even though it's lame. Envision large number of compute nodes at the bottom passively waiting be told to execute/contain something, kill something, provide stdio, etc.
I realize that there's a large continuum between the first slide's architecture and this one. What I was leaning towards here was the simplest possible design where conceptually at least, a "thin" process on the full broker nodes represents each remote one, and perhaps is even managed as though it were the real thing.

On the third option, "multiple instance per fault domain (hierarchically related)". Does anybody have some insight into how this would work? This option is going to be more complicated than the other two, but has the big win that (if done right) it mirrors the cluster:site and workstation:workgroup instance relationships. That problem has been put off thus far, and is likely to be put off for another epoch of Flux development if we don't figure it out now.
Here's a straw man (probably recycling discussion from meetings): Assume management instance M and two compute instances C1 and C2. We want M to be able to transparently run stuff on the resources comprising C1 and C2. Could we just have both C1 and C2 each start a parallel program (J1 and J2) which emulates a "fat compute node" to M? In other words wire up to M like it was a single broker consuming one rank in M, but registering all the resources of the compute instance behind that single rank? As a "leaf" of M, failures in C1 and C2 could be decoupled from M. On the other hand, if J1 or J2 abort (because of failure it's not resilient to), the fat node goes "down" taking with it a big chunk of workload.
@garlick, good proposals here! Some inline questions real quick:
a "thin" process on the full broker nodes represents each remote one, and perhaps is even managed as though it were the real thing.
This is an interesting idea. It might help if you explain the reasoning why you think the thin process necessary? My initial reaction is that the thin process described just adds an extra hop during remote execution, but I realize there must be some other benefits or requirements for the comms layer that are involved here (Is there an underlying assumption of 1 message broker or handler per real or virtual "node"?)
Could we just have both C1 and C2 each start a parallel program (J1 and J2) which emulates a "fat compute node" to M?
This is a very interesting idea, and as you intimated would also handle a "workgroup" case very nicely, because the jobs J1, J2 could be used to present only a subset of real resources to a new parent. I do wonder, though, if the sub-jobs are necessary when donating the entire set of resources to the new parent, though now that I've written that down I wonder if you mean J1 and J2 are not instances but special purpose parallel programs whose sole purpose it is to emulate the fat node?
I actually find the idea of that fascinating (and not necessarily exclusive of the other options presented here). I'm assuming this "fat compute node" parallel program (actually I'm thinking of it more like a "distributed container") would actually be a special purpose instance (built with brokers), and thus could be launched under any supported RM, which would be really neat!
This could also be an interesting way to look at monitoring, since the distributed container would monitor its own resources, using the same mechanisms that a broker monitors local resources.
I guess there are still a few open questions, but to me still an interesting way to encapsulate the problem of how parent/child instances relate.
@garlick: thank for you the slides! The visuals really help. I can sort of guess the pro and cons of each option but perhaps we can more explicitly map the pro and cons of each option to see what we are trading off among different options?
My initial reaction is that the thin process described just adds an extra hop during remote execution, but I realize there must be some other benefits or requirements for the comms layer that are involved here (Is there an underlying assumption of 1 message broker or handler per real or virtual "node"?)
Mainly I was trying to emphasize that this design need not stray too far from the current model. E.g. a program launch service (equivalent of wreck) could spawn remote containers/processes in a manner similar if not identical to the way they are spawned locally, just adding a level of indirection. It wouldn't actually have to be implemented that way if it's not efficient. I was channeling BProc a little bit here too. (I'm not entirely sure I understand your parenthetical question. Could you restate?)
I wonder if you mean J1 and J2 are not instances but special purpose parallel programs whose sole purpose it is to emulate the fat node?
That's what I had in mind. I had not really thought about how it would be implemented (whether brokers or something purpose-built). It might be too many layers for efficient compute node use. Anyway that was truly a straw man - a really rough idea that I thought might help us think about that area of the design space.
perhaps we can more explicitly map the pro and cons of each option to see what we are trading off among different options?
Thanks for chiming in @dongahn!
I'll try to regurgitate some of the pros and cons we had discussed earlier in meetings (others can add to this). I wonder actually if we can't go forward with all three of these options in the architecture rather than downselecting, choose one for S4 in the near term, and eventually reach the point where sys admins can use whatever options work best for their environment?
Pros of option 1:
Cons of option 1:
Pros of option 2:
Cons of option 2:
Pros of option 3:
Cons of option 3:
That was just off the top of my head, please feel free to disagree/add to this list.
(I'm not entirely sure I understand your parenthetical question. Could you restate?)
I just meant that I couldn't think of any assumptions in the current model that required the extra thin process per real "node". However, now I think I understand that you didn't mean a persistent process, but one process per remote process executed. That makes a bit more sense.
now I think I understand that you didn't mean a persistent process, but one process per remote process executed.
That's what I meant, but more as a concept. Maybe the privileged helper shell (we really need a name for that thing!) could have a client program that gets executed in its place, or maybe the service that would usually start processes would grow a remote mode internally?
That was just off the top of my head, please feel free to disagree/add to this list.
We might want to also include the specific impact on each of the design criteria (from comment above), e.g. expected noise impact (in order maybe 3, 1, 2), ease with which large loss of compute nodes could be handled, etc.
@garlick: btw, w/ option 2, are we expecting to instantiate another full multi-user flux instance(s) on top of the system instance as the most popular configuration or will the system instance likely be the main instance users will interact with?
We might want to also include the specific impact on each of the design criteria (from comment above), e.g. expected noise impact (in order maybe 3, 1, 2), ease with which large loss of compute nodes could be handled, etc.
Great plan, did you want to start a document somewhere or should I? (I'm not too sure how we want to do that - maybe google doc we can share for now?)
Maybe the privileged helper shell (we really need a name for that thing!)
We were calling it the Flux Security Singularity (flux-singularity), but I'm not sure that will work long term since there's already a singularity project for hpc out of LBL.
could have a client program that gets executed in its place, or maybe the service that would usually start processes would grow a remote mode internally?
Yes, seems like both of these will be required no matter what, though if the shell itself uses services of the instance (e.g. to grab kvs or content), the first requirement is that the processes executed on the remote need to have a valid FLUX_URI for the instance..
Great plan, did you want to start a document somewhere or should I? (I'm not too sure how we want to do that - maybe google doc we can share for now?)
Can we add it to the slides you've started?
w/ option 2, are we expecting to instantiate another full multi-user flux instance(s) on top of the system instance as the most popular configuration or will the system instance likely be the main instance users will interact with?
Great question. I don't know! The system instance should be capable of running a flat workload like slurm, but it does seem like we may want sub-instances started "temporarily" at least for high throughput stuff. Or maybe a long running one implementing policy for debug partition.
Can we add it to the slides you've started?
Yes if that form works. I think I shared with you and @morrone already?
It seemed like a good idea for design slides to accumulate in that deck in random order, then fork mini-presentations from it later, so feel free to add security related stuff there too.
Yes if that form works. I think I shared with you and @morrone already?
Yes let me take a stab at it.
flux-singularity
Nice.
first requirement is that the processes executed on the remote need to have a valid FLUX_URI for the instance..
Sounds like you guys have made some progress thinking this through :-)
Sounds like you guys have made some progress thinking this through :-)
Heh, don't read too much into that statement, it is simply a conclusion based on the assumption that parallel programs will be managed even slightly in the same way they are now (i.e. the 'job shell' will itself use facilities of an instance to manage its processes as a distributed service, e.g. kvs or content store, local or global barriers, fence, etc.)
I added a list of design criteria to the slide deck, and for lack of a better idea a decision matrix that ranks each of the solutions here on the design criteria. I made the rankings on the spot so they might need adjustment, and the criteria are not weighted. My guess is that solutions 1&2 are kind of tied if criteria here are weighted, but overall I like @garlick's proposal to pursue all 3 in the long run anyway...

Could we have that made into a radial table that looks more like a dart board please? Other than that one criticism, looks awesome, thanks!
It is a nice idea that we would some day do all three (or more) approaches, but I think it is pretty unlikely that we'll ever do that. I think especially if we choose the hybrid approach, we'll never go back and do the code refactoring needed to abstract all of the necessary concepts to make other approaches reasonable to implement. I suspect that the odds are lowest of doing multiple approaches if we choose the hybrid approach because it seems to me to have the highest level of intra-instance complexity (due to non-uniformity) in its communication context. Right now the broker doesn't abstract away much of the networking details. For instance, while there is a separate "overlay.[ch]" files, really the overlay files don't know much about the structure of the overlay network. All of the knowledge about network structure lives at the top level in broker.c, and alot of the information is passed between various parts in a giant ctx_t global with little enscapulation between code components.
In other words, I suspect that the same thing that makes Hybrid seem like the easy approach to you guys is also the thing that will make us unlikely to ever do anything else once we go that route. And we also kick the can down the road on figuring out how to make Flux work center-wide, which could bite in the future as well.
Now one could obviously argue that refactoring for the abstractions that support other solutions is a bit orthogonal of an activity from implementing Hybrid, so there is no need to lump them together. But it seems to me that Hybrid only has a simplicity advantage if we avoid that refactoring. And if we avoid the refactoring, we may very well have boxed ourselves into a corner from which we'll never escape.
I think it is a strength of the Multiple Instances solution that each instance can be much more uniform in internal construction than in the Hybrid case. By going with Hybrid we institutionalize the idea that some nodes are necessarily different than other nodes, not only temporally (who is a _current_ service provider) but also by class (who can _ever_ be promoted to a service provider, recovery methodology, routing, etc.) In the Multiple Instances solution all nodes in the instance can be of the same class, and share an overlay network of uniform design with little in the way of code special casing within one instance for services, recovery, etc.
I am not sure why Multiple Instances would necessarily be worse for noise than Hybrid. Unless you subscribe to the idea that "routers" steal resources from the cluster to avoid noise, in which case we need a column to account for downside of lost cluster resources in the Hybrid solution.
Is "HB" heartbeat? If so, is there any reason to list Monitoring separately? Are there any qualities of monitoring that differentiate it from global heartbeat?
I'd say that you're making too much out of that broker refactoring, and I would hate to think that we would use the state of that code as a reason for choosing one architecture over another. I can fix! I can fix! :-)
I think it is a strength of the Multiple Instances solution that each instance can be much more uniform in internal construction than in the Hybrid case. By going with Hybrid we institutionalize the idea that some nodes are necessarily different than other nodes, not only temporally (who is a current service provider) but also by class (who can ever be promoted to a service provider, recovery methodology, routing, etc.) In the Multiple Instances solution all nodes in the instance can be of the same class, and share an overlay network of uniform design with little in the way of code special casing within one instance for services, recovery, etc.
Are you positing a DHT overhaul here? I don't think that's required OR ruled out for any of the proposed architectures. The first two architectures effectively move the part of the system eligible for rearchitecting like that off to the side in a "management domain", leaving the compute nodes relatively out of it, to decouple services from compute node crashes induced by workload, and to give as much of the compute node resources to the users as possible. That non-uniformity seems like a benefit not a liability.
I would have argued that the _latter_ two separate the parts into different domains, while the first one (Hybrid) tries to keep all of the classes of components as tightly coupled into one communication context (domain?) as possible. Granted, that domain has different network arrangements for different nodes, but that seems like added complexity to me.
If we aren't concerned about code complexity, then what really makes Hybrid simpler? I would say that what makes it simpler that we don't need to change current assumptions like tightly coupled node names and addressing (currently a "rank" fixed at configuration/bootstrap time), and it allows us to avoid for a little bit longer making architectural decisions about things we'll need to decide soon enough anyway (e.g. how long-lived instances will communicate and exchange work/resources/etc.).
Are you positing a DHT overhaul here?
Not really, I am looking at things from a step higher in the architecture.
To restate the three choices from the top-level architectural view:
1) One long-lived instance spanning all nodes
2) One long-lived instance spanning only managment nodes (shorter-lived instances launched on compute nodes on-demand)
3) Multiple long-lived instances, one spanning the managment nodes, and one or move spanning the compute nodes
In option 1, all of the puzzle pieces are fully encapsulated into a single instance. With options 2 and 3, instances themselves become puzzle pieces that can be individually configured fairly consistently internally, and then joined together with other instances to make a whole working enterprise.
Well I completely agree with you that we can't keep pushing off the design of how instances fit together in the center context, which is why I think the third option should be in the system architecture. How we prioritize the work would be the question, and I think one could argue that the priority ought to not go too low on that.
I don't necessarily agree that any of the three options _require_ us to rethink fundamental assumptions in the design thus far. In fact, haven't we done some work to show this with the point designs above? (Less so with the 3rd admittedly). I don't want to rule out any design change that has merit, but obviously backtracking takes time so the case has to be pretty good.
In all three cases we have to make instances resilient, and that likely requires us to work on the overlay network (I don't see how it couldn't). So that refactoring and extending is happening regardless.
I would say option 1 is the easiest because it doesn't require the new design of a remote launcher and possible out of band services needed in option 2, and it can get done without the interface design work needed in option 3. Option 3 also has the strongest coupling of compute node (user workload induced) faults and resource manager services of the three options, so it makes the resiliency work harder.
Great question. I don't know! The system instance should be capable of running a flat workload like slurm, but it does seem like we may want sub-instances started "temporarily" at least for high throughput stuff. Or maybe a long running one implementing policy for debug partition.
@garlick: @grondo captured this in his matrix, but my biggest worry w/ the second option then is still performance/scalability. Unless there is a systematic approach to scale up the management domain along with the compute volume, I don't see how we can scale this scheme. To scale, we may need some architectural support like the set of IONs of BGQ to limit the fanout between system domain and compute domain. Of course, this shouldn't dilute the merits of this for smaller systems and special architecture though.
I also wonder if the noise of the option 1 will be really that bad. In particular, if the management traffic will be set to go through management network... There is of course local-node noise this scheme creates but did we ever measure the noise level using something like FTQ/FWQ?
Yeah I think the BG ION comparison is apt for option 2. That type of environment and support for coprocessor style computation were brought up as examples of why option 2 style architecture might be necessary regardless of what we do on regular linux clusters. I think we all share your concerns about the fanout in that option and the challenge would be how to mitigate that while keeping the special purpose code to a minimum.
I also agree the noise in option 1 may not be too bad and that measuring is a good idea (though as I recall the results can be pretty subjective)
I also agree the noise in option 1 may not be too bad and that measuring is a good idea (though as I recall the results can be pretty subjective)
I don't remember much about FTQ/FWQ, but if we could design a quantitative experiment to measure noise (or some sort of noise score based on comparison to a quiescent system), this would be valuable in creating an actual design criteria or requirement. (I think typically engineering requirements need testable specifications!) Similar tests should be designed for "resiliency" and launch scalability, though these tests are much easier to posit.
Yes! Maybe @dongahn could consult with Edgar Leon to see if he has ideas on how to track this? He may already have baselines for slurm-based systems that will be useful and have ideas about how to set pass/fail criteria for doing no worse, say.
Good idea. I will talk with Edgar on this.
@morrone, you make some very valid arguments. I honestly don't have a preference at this point and find the discussion very valuable.
With options 2 and 3, instances themselves become puzzle pieces that can be individually configured fairly consistently internally, and then joined together with other instances to make a whole working enterprise.
Agreed, though conceptually I'm having trouble with your use of the term "enterprise" here. Flux has only instances, so I'm assuming you mean "join together smaller instances to make another instance". As long as this doesn't break the core parent/child instance relationship, this kind of instance composition seems like an axiom of Flux (please correct if I am wrong?).
It seems like no matter how you connect instance to instance to create other instances, you still need some way to make instances that have reliability, scalability, and low noise. If you compose an instance of parts that are not reliable, scalable, and themselves do not have low noise, how is the whole (which itself uses the building blocks of the parts) imbued with properties lacking in its parts?
Sorry if I'm over-simplifying or missed your point here.
Think it might be useful for planning purposes to identify areas of work that would be common to all three approaches?
That certainly makes sense to me if the work is obvious. That work could go on a plan which might give us some development targets on the way to a full system instance implementation? (perhaps allowing us further design time on how the parts fit together?)
Hmm, when I try to make a list it comes out like the bullets in the description of this issue. I guess the main point is to acknowledge that none of the arch options really alter the core of the work that needs to be done. Some add to it. Might be easier to identify the differences? Time to shovel more snow.
In option 1, all of the CNs are bundled into a single domain. In option 3, the CNs can be split accross multiple domains as the administrative needs of job size/rate and fault tolerance demand.
Specific changes to overlay network topology can be applied to any of the solutions, and therefore may be somewhat orthogonal to the higher level architecture...
To me the main characteristic of option 1 is that the CNs are moved to the periphery of a TBON where they can go up and down without disrupting RM services running on non-CNs. They are still part of the system instance and are in one comms session. My use of "domain" above must be different than yours - I was referring to "fault domains" , drawing a distinction between nodes that run user code and thus have higher probability of failure, and nodes dedicated to a management role that should be more stable.
I think we agreed at one point that it was helpful to decouple CNs from instance services, since it creates fewer fault/recovery situations that may cause services to pause.
In option 3, a CN instance would be running RM services (since all Flux instances do that), so this coupling is greater and it may be a challenge to keep the overall system responsive. I like option 3 for cluster-center and workstation-workgroup; it seems trickier for CN-cluster.
If there's a way to make all instances as strongly resilient as a CN instance might need to be, that is great, but it's more challenging and it makes me nervous to propose it with no idea of how to get there.
Thanks, @morrone, your statement does make sense, and the same argument is made for why we'd run an instance per-cluster, say, instead of one across the center. However, there are a few things that still trouble me about this blanket assertion. It would help to understand your take it them.
First, it feels like (I know, we should ignore feelings in software design) there must be some point at which pushing off resilience "down the hierarchy of instances" has to stop. Is your argument that any time a user wants a resilient instance they should split into multiple domains? At some point we'll have build resiliency into a single instance, and it seemed like option 1 was at least a step toward that?
Second, it is still unclear to me what these extra instances actually do. Do they exist only to bootstrap or run real user work? After they are started do they aggregate and report at least resource up/down information to the "parent"? I think you gave a hint when you said they will be created "as the administrative needs of job size/rate and fault tolerance demand", but I fear I wasn't clever enough to follow how these extra instances relate to job size and rate.
Finally, I'm not sure I know what you mean by 'domain' here. You are talking about message routing domain, fault domain, or something else? This might help understand why having all compute nodes bundled into a single domain is a bad thing. Certainly we want them to be in a single scheduling domain?
​
To me the main characteristic of option 1 is that the CNs are moved to the
periphery of a TBON where they can go up and down without disrupting RM
services running on non-CNs. They are still part of the system instance
and are in one comms session. My use of "domain" above must be different
than yours - I was referring to "fault domains" , drawing a distinction
between nodes that run user code and thus have higher probability of
failure, and nodes dedicated to a management role that should be more
stable.
I think we agreed at one point that it was helpful to decouple CNs from
instance services, since it creates fewer fault/recovery situations that
may cause services to pause.
In option 3, a CN instance would be running RM services (since all Flux
instances do that), so this coupling is greater and it may be a challenge
to keep the overall system responsive. I like option 3 for cluster-center
and workstation-workgroup; it seems trickier for CN-cluster.
If there's a way to make all instances as strongly resilient as a CN
instance might need to be, that is great, but it's more challenging and it
makes me nervous to propose it with no idea of how to get there.
@garlick: btw, one of the reasons I like option 1 is having an RM-provided persistent network like this can help innovate the entire HPC ecosystem beyond RM services. This is of course when we can keep noise at bay and making the network(s) resilience. It seems you had good correspondences with an ORTE developer about making k-ary tree resilient, though?
Also, one nice property of this is, two distinct trees can be joined to a new root which then becomes a new larger tree (all of the tree properties still there). It is true, at some point we should learn how to stitch two distinct (independent) instances (two trees) into a larger one which then serve as a building block towards being able to a center-wise instance. This is different than our tradition model of an larger instance creating smaller sub-instances (push vs. pull in a way)... Given where we are, perhaps we can do some zil amount of work to see if we can do this at least at the user level under slurm.
It seems you had good correspondences with an ORTE developer about making k-ary tree resilient, though?
In #953 I quoted Ralph Castain's comments on that topic. My understanding is that OpenMPI's radix tree is not fault tolerant, but Ralph described an algorithm he used in other (unnamed) projects, which sounds very similar to the technique used by the MrNet guys(*). I had based the earlier resiliency in our TBON roughly on that paper, but didn't get it quite right and we pulled it out.
So we don't have a demonstrably resilient TBON in the wild, at least on in the form of ORTE, and the algorithm that seems so easy when Ralph describes it did trip me up before, so I tend to recalibrate his comments in that context.
(*) A Scalable Failure Recovery Model for Tree-based Overlay Networks, Arnold et al, 2009.
Ah... I can talk w/ Dorian to see if his algorithm is being used in production by MRNet and how effective it has been (plus tips+tricks). Will this be helpful at least to some extent?
Ah... I can talk w/ Dorian to see if his algorithm is being used in production by MRNet and how effective it has been (plus tips+tricks). Will this be helpful at least to some extent?
Yes, thanks!
Yes! Maybe @dongahn could consult with Edgar Leon to see if he has ideas on how to track this? He may already have baselines for slurm-based systems that will be useful and have ideas about how to set pass/fail criteria for doing no worse, say.
OK. I talked with Edgar (@eleon), and it appears we should be able to quantify the noise level of these options pretty straightforwardly with the benchmarks he has been using.
His suggestion was to use FWQ to quantify single-node noise level and to use all-reduce or barrier benchmarks to quantify multinode noise where random noise can be more harmful than coordinated ones.
I think one can perform this task immediately, if we want, by measuring the noise level under SLURM vs. SLURM+Flux using our single-user flux instance.
These benchmarks don't generate a single number figure of merit, but one can plot the data and visually inspecting the plots should suggest the merit. If a single number or numbers is/are desired, I believe we can come up with reasonable ones as well. We might need to create some scripts to generate those plots automatically, though.
I'm attaching the slides we produced for the 2/15 milestone. The main deliverable was to get a handle on the system instance architecture. Here's the content of the "Decision and Ratonale" slide:
Option 1 includes the basis for resilient services and system instance bootstrap.
Those aspects need to be pursued. The hybrid overlay portion is less clear, so in
that area, broker refactoring to support multiple overlay plugins is proposed.Option 2 is placed on the back burner for now.
The inter-instance communication implied by option 3 is needed for cluster:center
and workstation:workgroup, therefore we prioritize the _design_ for this option highly.In summary, we propose to do some work to advance options 1 and 3, preparing
the way for a detailed design to replace slurm in 2017 which may have elements
of both approaches.
More detailed work is broken out for next steps. As indicated, this work feeds critical info into the slurm replacement design milestone on 3/15. There will also need to be some discussion with sched team about what jobspec and scheduler pieces will need to be in the 3/15 plan
Did I miss anything @grondo, @morrone, @trws, @chu11?
@garlick: sounds great to me.
The inter-instance communication implied by option 3 is needed for cluster:center
and workstation:workgroup, therefore we prioritize the design for this option highly.
I don't know what you have in mind with this thrust exactly, but it seemed to me figuring out ways to pull independent instances into a common root to establish a larger instance could be a key technology?
This would be different than the current push model where a larger instance instantiates smaller sub-instances in the subsets of its resources.
In the latter case, sub-instances will have their own brokers. But for the former case, the new larger instance may want to reuse the existing brokers of the instances...
There will also need to be some discussion with sched team about what jobspec and scheduler pieces will need to be in the 3/15 plan
Please propose how we can best proceed with this interlock.
@garlick: BTW, do we want to consider having a summer student to work on integrating benchmarks and fault-injectors into Flux to help quantify the criteria you have in your table?
In the latter case, sub-instances will have their own brokers. But for the former case, the new larger instance may want to reuse the existing brokers of the instances...
@dongahn, thanks for the good question. We are indeed talking about an existing instance connecting to a new parent and offering its resources. This mechanism and protocol has to be designed as part of work on option 3. However, I don't see why the same method couldn't be used for a parent to communicate to an instance it spawned via a subjob (the main difference is how resource ownership moves between instances).
Just to clarify, in both cases (push and pull as you termed them), the child instances will have their own brokers and communications overlay, etc..
@grondo, thank you for the quick response! I guess a difficulty I had in my mind with the pull model is:
Does the new parent will use the existing instances to spawn its own brokers to all of the nodes?
@dongahn, In these cases I think the parent only would have brokers on the resources of the child if it spawned the child itself, i.e. if the child instance is running as a program within the parent. The other case would be an adoption, where an existing instance is sadly orphaned and connects to a new parent and offers all of its resources. In this case, the parent may not need (in fact it may be undesirable) any of its own brokers co-located with the child, as there would be some protocol for the parent to forward work to child instances. Part of the argument for option 3 approach I think is that programs launched in such a child would take a lot of load specifically off the parent's brokers, which may need to be a more reliable instance.
That is how I understand the current direction.
One thing that comes to mind is that in your "push" model, the resources are handed off to a new instance and no longer "available" in the parent. In the "pull" model, the resources are taken back from an existing child, but the child still exists and remains responsible for other aspects of resource management, specifically execution and monitoring.
The other case would be an adoption, where an existing instance is sadly orphaned and connects to a new parent and offers all of its resources. In this case, the parent may not need (in fact it may be undesirable) any of its own brokers co-located with the child, as there would be some protocol for the parent to forward work to child instances. Part of the argument for option 3 approach I think is that programs launched in such a child would take a lot of load specifically off the parent's brokers, which may need to be a more reliable instance.
OK. I think we are on the same page then! There will be certain RM services that the new parent wants to do without explicit having its own brokers on the resources where the previous orphaned instance was running. And I thought having this into the design at this point is the right way to go as this appears to me as a main challenge. I appreciate you fleshing out some details for this.
One thing that comes to mind is that in your "push" model, the resources are handed off to a new instance and no longer "available" in the parent. In the "pull" model, the resources are taken back from an existing child, but the child still exists and remains responsible for other aspects of resource management, specifically execution and monitoring.
Yeah. I think this is one of the things we will have to figure out how to solve sooner rather than later (at least in our design :-). There is also a scheduler implication to this as well. In this case, where users submit their jobs to? -- probably we should allow both and how do we coordinate this...
@dongahn, I'd suggest we raise the other two topics (how to work together to determine sched requirements for slurm replacement, possibility of student to assist with test development) at monday's project meeting. A little group discussion is required there I think.
In #953 I quoted Ralph Castain's comments on that topic. My understanding is that OpenMPI's radix tree is not fault tolerant, but Ralph described an algorithm he used in other (unnamed) projects, which sounds very similar to the technique used by the MrNet guys(*). I had based the earlier resiliency in our TBON roughly on that paper, but didn't get it quite right and we pulled it out.
So we don't have a demonstrably resilient TBON in the wild, at least on in the form of ORTE, and the algorithm that seems so easy when Ralph describes it did trip me up before, so I tend to recalibrate his comments in that context.
(*) A Scalable Failure Recovery Model for Tree-based Overlay Networks, Arnold et al, 2009.
@garlick: I just talked to Dorian Arnold (@darnold5845) by phone. His FT implementation should be a part of MRNet upstream. But he will further check if his algorithm is used in production by default and also can send some pointers about what MRNet files have his logic etc. I will create a new issue ticket so that we can have more detailed discussions on this topic.
There was a CEA/NNSA meeting a month or so ago in which @morrone and @hautreux discussed the system instance design in the context of the minimum viable slurm replacement.
My understanding based on @morrone's verbal report was that they discussed a variant of option 1 above (first slide) where the "management domain" is confined to rank 0 as it is now, but it is implemented as a migratable service within an HA cluster to improve availability. E.g. where state is shared somehow between nodes and an IP representing the service migrates so ranks > 0 simply see a dropped TCP connection and reconnect, something zeromq hides.
This would avoid having to distribute service location information, and the redundant overlay network connections proposed in option 1. Lost messages would still be a possibility of course (rank 0 accepts a request then crashes; new rank 0 doesn't respond).
It was a good thought and I realized we hadn't put it down anywhere. Feel free to correct/clarify if I didn't accurately convey the idea.