In a resource-constrained environment like a namespace with resource limits imposed (or just an insufficiently provisioned cluster), creating a TaskRun (Pod) that exceeds those limits should not fail the TaskRun, but should instead continually try to create the Pod until it either succeeds or times out.
Pods fail to start and the TaskRun is failed ~immediately.
hello world or something simpleThis is similar to how Jobs can handle Pod scheduling failures by retrying until they are successful.
It's unclear whether users would expect TaskRuns waiting for sufficient resources to queue in order of the time they were created, or whether they'd expect the Kubernetes scheduler to do whatever it needs to do to schedule the Pods. As an initial implementation it's probably fine to have Kubernetes schedule Pods, and not have to worry about enforcing FIFO.
/assign @sbwsg
Been working through some of the implementation details in a POC but want to drop current working notes here since I likely won't be able to work on it more until tomorrow.
Catching a pod failure is relatively straightforward; checking the error message produced by the createPod() func in pkg/reconciler/v1alpha1/taskrun/taskrun.go reveals the reason. From here it's quick to parse out the error message and look for e.g. "exceeded quota" in the string. This relies on a somewhat brittle contract though. I'll also need to check for the different messages generated both by LimitRanges as well as ResourceQuotas since it looks like they both enforce resource limits on a pod. I'm currently looking around to see if there's a less brittle approach to this error checking.
Once the resource constraint error is detected the pod then needs to be restarted. In my POC implementation this works by simply Enqueue()ing the TR to be re-assessed on the next reconcile loop. This results in many rapid reruns however when really it would be nicer to see an exponential backoff strategy similar to that used by k8s' job controller. The job controller uses a particular kind of workqueue to implement this ("ExponentialFailureRateLimiter") but TaskRun's controller Impl uses the "RateLimitedQueue", which is set up via knative's controller.NewImpl() func. So I'm looking at other alternatives to implement this.
I'm going to move this into a design doc. There are enough variables here to seed some discussion and it'd be good to get broader input before committing to one approach.
I've started the design doc here including use cases, a draft implementation, some open questions and possible alternative implementations that I'm still working through.
Most helpful comment
I'm going to move this into a design doc. There are enough variables here to seed some discussion and it'd be good to get broader input before committing to one approach.