Allow StatefulSet to use the other two restart policies: "OnFailure" and "Never".
This enables deploying "run-once + strong identity" clustered apps with StatefulSet.
In machine learning, to train a model in a distributed way, we define a set of ordered stateful container workers, e.g. "worker-0", "worker-1", ..., "worker-n". (They use numbers as suffixes, and StatefulSet can manage these consecutive ordinals for us.)
When training begins, these containers talk to each other via their DNS records. (StatefulSet creates the DNS records for us.)
When the training is done, all the workers save their results to their persistent volumes and exit 0. (StatefulSet manages their PVs for us.)
Then all the stateful Pods stay stopped in the Succeeded state, and we can still check their logs.
This is a natural fit for StatefulSet, and it would be really helpful to allow the "OnFailure" and "Never" RestartPolicy values in StatefulSet.
The original stateful-apps proposal says:
"StatefulSets: Running pods which need strong identity and storage".
The word "running" here is a verb meaning to run something, not to keep something always in a running state.
BTW, it also says,
"StatefulSets are considered easy to use for deploying clustered software for common cases."
A run-once stateful cluster is also a common case.
The original design of StatefulSet only allows the RestartPolicy "Always".
If we allow the RestartPolicy values "OnFailure" and "Never", the Pods in a StatefulSet can stay in their final states once their tasks are done, instead of restarting over and over again.
After implementing this proposal, the default restart policy is still "Always", so existing StatefulSets are unaffected. If users want RestartPolicy "OnFailure" or "Never" in a StatefulSet, they have to set the value explicitly, which means they clearly know what they want for their apps.
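To make the opt-in concrete, here is a minimal sketch of what a spec under this proposal could look like, built with the Go API types. The names, labels, and image are illustrative, and today's API validation still rejects anything other than "Always" in a StatefulSet's Pod template:

package main

import (
	appsv1 "k8s.io/api/apps/v1"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// workerSet builds a hypothetical StatefulSet whose Pod template explicitly
// opts in to RestartPolicy "OnFailure"; under this proposal, leaving the
// field unset would still default to "Always".
func workerSet(replicas int32) *appsv1.StatefulSet {
	labels := map[string]string{"app": "worker"}
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "worker"},
		Spec: appsv1.StatefulSetSpec{
			Replicas:    &replicas,
			ServiceName: "worker", // headless Service that provides the DNS records
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: v1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: v1.PodSpec{
					// The explicit opt-in this proposal would allow.
					RestartPolicy: v1.RestartPolicyOnFailure,
					Containers: []v1.Container{{
						Name:  "trainer",
						Image: "example/trainer:latest",
					}},
				},
			},
		},
	}
}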
Modify the isRunningAndReady method in the StatefulSet controller by adding something like this:
// Determine whether the Pod counts as alive based on its restart policy.
phase := pod.Status.Phase
switch pod.Spec.RestartPolicy {
case v1.RestartPolicyNever:
	// Never: every phase counts as alive; the Pod may terminate for good.
case v1.RestartPolicyOnFailure:
	// OnFailure: both Running and Succeeded count as alive.
	if phase != v1.PodRunning && phase != v1.PodSucceeded {
		return false
	}
default:
	// Always (the current behavior): only Running counts as alive.
	if phase != v1.PodRunning {
		return false
	}
}
I made a PR implementing this at kubernetes/kubernetes#42892, and @kow3ns suggested having a discussion in the community before supplying a PR.
This seems to be an enhancement to stateful-apps. Should I make a PR directly against "stateful-apps", or open a new proposal, e.g. "stateful-apps-restartpolicy"? What do you think?
/cc @smarterclayton @bprashanth @kargakis @kow3ns @thockin
I don't feel comfortable adding run-once semantics in a controller that is designed to run long-running processes. @soltysh @erictune is this a use case for indexed Jobs?
https://github.com/kubernetes/kubernetes/issues/14188
cc: @kubernetes/sig-apps-feature-requests
@soltysh @erictune is this a use case for indexed Jobs?
That's what it looks like to me. IndexedJobs will be limited in comparison with StatefulSet, especially on the storage side, but stable identity is a key requirement for this use case.
@kargakis @soltysh
I'm glad this has been designed before; this is the first time I've heard of Indexed Job.
The --restart=OnFailure flag and the $INDEX environment variable are fantastic and useful!
I found that kubernetes/kubernetes#14188 has been shelved for over a year. Is anyone working on or planning to implement this? I would like to help implement it.
Are the workers (1) cooperating to solve a single problem instance, e.g. train a single model? Or are they (2) doing parallel work, e.g. train and evaluate a model 20 times in parallel, each with different values of a hyperparameter?
If it is (1), then I'd suggest you just use a StatefulSet, and have an outer control loop delete the StatefulSet once the problem is done.
If it is (2), then the best current option is to generate lots of Jobs using a script.
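For case (1), a minimal sketch of such an outer control loop, using today's client-go, might look like the following. The namespace, StatefulSet name, and the trainingDone() completion check are all assumptions for illustration; note that under RestartPolicy "Always" the Pods never reach the Succeeded phase, so completion has to be detected externally:

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// trainingDone is a hypothetical completion check, e.g. testing for a
// sentinel file that worker-0 writes to shared storage when training ends.
func trainingDone() bool {
	// ... completion detection goes here ...
	return false
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Poll until the training reports completion.
	for !trainingDone() {
		time.Sleep(30 * time.Second)
	}
	// Tear down the StatefulSet; PVCs created from volumeClaimTemplates
	// are left behind, so the saved results survive.
	err = cs.AppsV1().StatefulSets("default").Delete(
		context.TODO(), "worker", metav1.DeleteOptions{})
	if err != nil {
		panic(err)
	}
}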
@erictune
It is (1). The workers cooperate to train a single model.
The workers mainly need two parameters: the whole cluster structure and their own index.
The whole cluster structure: this parameter is the same for all the workers.
The index: this parameter is different for each worker. Each worker needs the index number to know which part to take, e.g. --index=$INDEX.
For a distributed TensorFlow case, the commands look like this (copied from the docs at https://www.tensorflow.org/deploy/distributed).
There are two types of Indexed Job or StatefulSet, "ps" and "worker":
# On ps0.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
--job_name=worker --task_index=1
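With StatefulSet's stable hostnames, each worker could derive its own --task_index from its ordinal suffix instead of being handed an $INDEX variable. A minimal sketch (the "worker" name is illustrative):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Inside a StatefulSet Pod, the hostname is e.g. "worker-3".
	host, err := os.Hostname()
	if err != nil {
		panic(err)
	}
	// The ordinal after the last "-" is this worker's index.
	idx, err := strconv.Atoi(host[strings.LastIndex(host, "-")+1:])
	if err != nil {
		panic(err)
	}
	fmt.Printf("--job_name=worker --task_index=%d\n", idx)
}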
Your advice works, except for one thing: the logs.
The problem is that when the containers exit 0, they are restarted immediately and their logs are lost. If the container images are provided by end users, I cannot easily change their entrypoints or commands to redirect stdout to a file, so I have to build a logging system that collects container logs live.
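Such live log collection is possible with client-go, though it is exactly the kind of extra machinery this proposal would make unnecessary. A minimal sketch, with the namespace and Pod name made up:

package main

import (
	"context"
	"io"
	"os"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Follow one worker's logs and copy them somewhere durable (stdout
	// here) before the container restarts and the logs disappear.
	req := cs.CoreV1().Pods("default").GetLogs("worker-0",
		&v1.PodLogOptions{Follow: true})
	stream, err := req.Stream(context.TODO())
	if err != nil {
		panic(err)
	}
	defer stream.Close()
	if _, err := io.Copy(os.Stdout, stream); err != nil {
		panic(err)
	}
}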
I know there are many workarounds for building the above cluster, and I have actually already built one.
Through this proposal, I want to discuss with the community whether we can make this easier. :)
I do want to reiterate from the previous discussion that

- ReplicaSets are meant to make sure a specific number of Pods are running at a particular time (https://kubernetes.io/docs/user-guide/replicasets/#when-to-use-a-replicaset)
- DaemonSets are meant to make sure that all (or some) Nodes run a copy of a Pod (https://kubernetes.io/docs/admin/daemons/#what-is-a-daemonset)
- StatefulSets are meant to make sure that a specific number of unique copies of a Pod are running (https://github.com/kubernetes/community/blob/master/contributors/design-proposals/stateful-apps.md)

Right now the restart policy of all three controllers reflects the design intention and is consistent. If we change the semantics of StatefulSet, we make it inconsistent, and, as we evolve the controller to better suit its design intentions, we have to always consider the implications with respect to the "StatefulJob" use case of dealing with completing processes. In the 1.7 time frame, this will make status tracking and container updates more difficult for users to reason about and for developers to implement.
StatefulSet was never intended to cover all use cases, and IMO, if there is a use case for ordered StatefulJobs that cannot be met with StatefulSet, this should be implemented as an extension whose implementation and evolution is specifically focused on supporting that use case.
Yes, probably TPR is what you want to use :) Although TPRs are not ideal today.
Yeah, "StatefulJobs" is how we would bucket this. Our controller types (Deployment, RS, StatefulSet) are not for "jobs".
Agree. After the discussion here, I think the "Indexed Job" or "StatefulJob" may be the best solution.
@ohmystack with the Indexed Job being tracked in https://github.com/kubernetes/kubernetes/issues/14188, are you ok with closing this in favor of it?
@soltysh OK. I'll close this. Let's track the Indexed Job.