Hi,
I have a problem with OpenFaaS when setting the min and max replicas to 1, to avoid scaling of the function.
labels:
  com.openfaas.scale.min: "1"
  com.openfaas.scale.max: "1"
However, the serving pod is scaled down to zero about once every minute.
The serving pod of the function shouldn't be scaled down to zero.
Instead, the serving pod of the function is scaled down to zero every minute.
Output of kubectl get deployments -n openfaas-fn -w:
NAME     DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
myfunc   0         1         1            1           116m
myfunc   0         1         1            1           116m
myfunc   0         0         0            0           116m
myfunc   1         0         0            0           116m
myfunc   1         0         0            0           116m
myfunc   1         0         0            0           116m
myfunc   1         1         1            0           116m
myfunc   1         1         1            1           116m
myfunc   0         1         1            1           117m
myfunc   0         1         1            1           117m
myfunc   0         0         0            0           117m
myfunc   1         0         0            0           118m
myfunc   1         0         0            0           118m
myfunc   1         0         0            0           118m
myfunc   1         1         1            0           118m
myfunc   1         1         1            1           118m
I have investigated the code of OpenFaaS and found the following lines in faas/gateway/handlers/alerthandler.go:
// CalculateReplicas decides what replica count to set depending on current/desired amount
func CalculateReplicas(status string, currentReplicas uint64, maxReplicas uint64, minReplicas uint64, scalingFactor uint64) uint64 {
	newReplicas := currentReplicas
	step := uint64((float64(maxReplicas) / 100) * float64(scalingFactor))
	if status == "firing" {
		if currentReplicas == 1 {
			newReplicas = step
		} else {
			if currentReplicas+step > maxReplicas {
				newReplicas = maxReplicas
			} else {
				newReplicas = currentReplicas + step
			}
		}
	} else { // Resolved event.
		newReplicas = minReplicas
	}
	return newReplicas
}
When the step is calculated with maxReplicas set to 1 and no scaling factor set, the factor defaults to 20% (according to https://docs.openfaas.com/architecture/autoscaling/), so the result is 0.2, which is truncated to 0 when the whole calculation is converted to uint64.
According to the following lines, newReplicas is then set to 0 in my case:
	if currentReplicas == 1 {
		newReplicas = step
A solution is to set newReplicas to 1 when currentReplicas equals 1, maxReplicas is 1 and step is 0.
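To make the truncation concrete, here is a minimal sketch that isolates the step expression from CalculateReplicas. The helper name stepFor is made up for illustration and does not exist in the gateway code.

```go
package main

import "fmt"

// stepFor is a hypothetical helper that mirrors the step expression
// from CalculateReplicas in alerthandler.go.
func stepFor(maxReplicas, scalingFactor uint64) uint64 {
	return uint64((float64(maxReplicas) / 100) * float64(scalingFactor))
}

func main() {
	// maxReplicas=1 with the default 20% factor: 0.2 is truncated to 0.
	fmt.Println(stepFor(1, 20))
	// maxReplicas=5 with 20%: the first case where the step reaches 1.
	fmt.Println(stepFor(5, 20))
}
```

With currentReplicas equal to 1 and a "firing" alert, the function assigns this step directly to newReplicas, which is how the replica count ends up at 0.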
Append the following lines to the function's yml file and redeploy the function.
labels:
  com.openfaas.scale.min: "1"
  com.openfaas.scale.max: "1"
After starting to call the function with curl, ab or a similar tool, the serving pod is scaled down to zero.
I wanted to try how OpenFaaS works without scaling.
FaaS-CLI version ( Full output from: faas-cli version ): 0.7.7
Docker version docker version (e.g. Docker 17.0.05 ): 18.06.1-ce
Are you using Docker Swarm or Kubernetes (FaaS-netes)? FaaS-netes
Operating System and version (e.g. Linux, Windows, MacOS): Ubuntu 16.04, Kernel 4.4.0-134-generic
Hello @szefoka. The faas-idler is responsible for this behaviour. In the values.yaml file there is a faasIdler: section, which should have dryRun set to true so that it does not actually scale your functions down to 0 when they are not invoked. Set it to true if it isn't.
Derek add label: question
Derek add label: support
Hi let me ping @Templum, Simon PTAL?
Martin please can you see if you can reproduce this issue?
Alex
Hi @martindekov! Actually the value of dryRun was originally set to true, so it might be some other issue I think.
I will try to reproduce it. However, I would like to know whether faas-idler is involved in the setup, as this could also play a role.
I have deployed the faas-idler with dryRun set to true. The behavior is the same for me with or without it.
I watched the logs of faas-idler and got this:
{{200 myfunc} [1.540401506699e+09 2.301694915254237]}
{{502 myfunc} [1.540401506699e+09 0]}
2018/10/24 17:18:26 Skip: myfunc due to missing label
{{200 myfunc} [1.540401536704e+09 2.301694915254237]}
{{502 myfunc} [1.540401536704e+09 0]}
2018/10/24 17:18:56 Skip: myfunc due to missing label
{{200 myfunc} [1.540401566709e+09 3.7016949152542367]}
{{502 myfunc} [1.540401566709e+09 0.003389830508474576]}
2018/10/24 17:19:26 Skip: myfunc due to missing label
{{200 myfunc} [1.540401596717e+09 4.098305084745762]}
{{502 myfunc} [1.540401596717e+09 0.003389830508474576]}
2018/10/24 17:19:56 Skip: myfunc due to missing label
I looked into the code of the faas-idler (https://github.com/openfaas-incubator/faas-idler/blob/master/main.go). As I understand it, it checks whether the "com.openfaas.scale.zero" label is set to true. If the value is false or not 1, the above log lines are printed.
I appended 'com.openfaas.scale.zero: "false"' to the function yml just to be sure, though scaling to zero also happened when running faas-idler with dryRun set to true.
Thank you for providing the debug info. Do you also have the logs from the gateway?
I cut out the part of the log that relates to my function scaling down to zero and then back up to one.
2018/10/24 20:58:13 Forwarded [POST] to /function/myfunc - [200] - 0.046548 seconds
2018/10/24 20:58:13 Forwarded [POST] to /function/myfunc - [200] - 0.047080 seconds
2018/10/24 20:58:13 Forwarded [POST] to /function/myfunc - [200] - 0.044991 seconds
2018/10/24 20:58:13 Alert received.
2018/10/24 20:58:13 {"receiver":"scale-up","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"APIHighInvocationRate","function_name":"myfunc","monitor":"faas-monitor","service":"gateway","severity":"major","value":"16.4"},"annotations":{"description":"High invocation total on ","summary":"High invocation total on "},"startsAt":"2018-10-24T20:58:08.324404523Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus-6c7d64d46b-hl5nx:9090/graph?g0.expr=sum+by%28function_name%29+%28rate%28gateway_function_invocation_total%7Bcode%3D%22200%22%7D%5B10s%5D%29%29+%3E+5\u0026g0.tab=1"}],"groupLabels":{"alertname":"APIHighInvocationRate","service":"gateway"},"commonLabels":{"alertname":"APIHighInvocationRate","function_name":"myfunc","monitor":"faas-monitor","service":"gateway","severity":"major","value":"16.4"},"commonAnnotations":{"description":"High invocation total on ","summary":"High invocation total on "},"externalURL":"http://alertmanager-ccd8559-hvpdn:9093","version":"4","groupKey":"{}:{alertname=\"APIHighInvocationRate\", service=\"gateway\"}"}
2018/10/24 20:58:13 [Scale] function=myfunc 1 => 0.
2018/10/24 20:58:13 Forwarded [POST] to /function/myfunc - [200] - 0.058671 seconds
...
2018/10/24 20:58:13 Forwarded [POST] to /function/myfunc - [200] - 0.066349 seconds
2018/10/24 20:58:13 Forwarded [GET] to /system/function/myfunc - [200] - 0.002305 seconds
2018/10/24 20:58:13 Forwarded [POST] to /function/myfunc - [200] - 0.051610 seconds
...
2018/10/24 20:58:16 Forwarded [POST] to /function/myfunc - [200] - 0.053221 seconds
2018/10/24 20:58:16 Forwarded [GET] to /system/function/myfunc - [200] - 0.002099 seconds
2018/10/24 20:58:16 Forwarded [POST] to /function/myfunc - [200] - 0.048294 seconds
...
2018/10/24 20:58:18 Forwarded [POST] to /function/myfunc - [200] - 0.047385 seconds
2018/10/24 20:58:18 Forwarded [GET] to /system/function/myfunc - [200] - 0.001952 seconds
2018/10/24 20:58:18 Forwarded [GET] to /system/functions - [200] - 0.005494 seconds
2018/10/24 20:58:21 Forwarded [GET] to /healthz - [200] - 0.000476 seconds
2018/10/24 20:58:21 Forwarded [GET] to /system/function/myfunc - [200] - 0.002725 seconds
2018/10/24 20:58:21 Forwarded [GET] to /system/functions - [200] - 0.002115 seconds
2018/10/24 20:58:23 Forwarded [GET] to /system/function/myfunc - [200] - 0.002897 seconds
2018/10/24 20:58:25 Forwarded [GET] to /system/functions - [200] - 0.002473 seconds
2018/10/24 20:58:26 Forwarded [GET] to /system/function/myfunc - [200] - 0.002498 seconds
2018/10/24 20:58:28 Forwarded [GET] to /system/functions - [200] - 0.006393 seconds
2018/10/24 20:58:28 Forwarded [GET] to /system/function/myfunc - [200] - 0.006327 seconds
2018/10/24 20:58:31 Forwarded [GET] to /healthz - [200] - 0.000640 seconds
2018/10/24 20:58:31 Forwarded [GET] to /system/function/myfunc - [200] - 0.002785 seconds
2018/10/24 20:58:32 Forwarded [GET] to /system/functions - [200] - 0.003320 seconds
2018/10/24 20:58:33 Forwarded [GET] to /system/function/myfunc - [200] - 0.003518 seconds
2018/10/24 20:58:35 Forwarded [GET] to /system/functions - [200] - 0.003398 seconds
2018/10/24 20:58:35 Forwarded [GET] to /system/functions - [200] - 0.002135 seconds
2018/10/24 20:58:36 Forwarded [GET] to /system/function/myfunc - [200] - 0.002754 seconds
2018/10/24 20:58:38 Forwarded [GET] to /system/function/myfunc - [200] - 0.004848 seconds
2018/10/24 20:58:39 Forwarded [GET] to /system/functions - [200] - 0.009307 seconds
2018/10/24 20:58:41 Forwarded [GET] to /healthz - [200] - 0.000443 seconds
2018/10/24 20:58:41 Forwarded [GET] to /system/function/myfunc - [200] - 0.002679 seconds
2018/10/24 20:58:42 Forwarded [GET] to /system/functions - [200] - 0.003117 seconds
2018/10/24 20:58:43 Alert received.
2018/10/24 20:58:43 {"receiver":"scale-up","status":"resolved","alerts":[{"status":"resolved","labels":{"alertname":"APIHighInvocationRate","function_name":"myfunc","monitor":"faas-monitor","service":"gateway","severity":"major","value":"16.4"},"annotations":{"description":"High invocation total on ","summary":"High invocation total on "},"startsAt":"2018-10-24T20:58:08.324404523Z","endsAt":"2018-10-24T20:58:38.324404523Z","generatorURL":"http://prometheus-6c7d64d46b-hl5nx:9090/graph?g0.expr=sum+by%28function_name%29+%28rate%28gateway_function_invocation_total%7Bcode%3D%22200%22%7D%5B10s%5D%29%29+%3E+5\u0026g0.tab=1"}],"groupLabels":{"alertname":"APIHighInvocationRate","service":"gateway"},"commonLabels":{"alertname":"APIHighInvocationRate","function_name":"myfunc","monitor":"faas-monitor","service":"gateway","severity":"major","value":"16.4"},"commonAnnotations":{"description":"High invocation total on ","summary":"High invocation total on "},"externalURL":"http://alertmanager-ccd8559-hvpdn:9093","version":"4","groupKey":"{}:{alertname=\"APIHighInvocationRate\", service=\"gateway\"}"}
2018/10/24 20:58:43 [Scale] function=myfunc 0 => 1.
2018/10/24 20:58:43 Forwarded [GET] to /system/function/myfunc - [200] - 0.002687 seconds
2018/10/24 20:58:46 Forwarded [GET] to /system/functions - [200] - 0.003025 seconds
2018/10/24 20:58:46 Forwarded [GET] to /system/function/myfunc - [200] - 0.003477 seconds
2018/10/24 20:58:48 Forwarded [GET] to /system/function/myfunc - [200] - 0.002289 seconds
2018/10/24 20:58:49 Forwarded [GET] to /system/functions - [200] - 0.003028 seconds
2018/10/24 20:58:51 Forwarded [GET] to /healthz - [200] - 0.000638 seconds
2018/10/24 20:58:51 Forwarded [GET] to /system/function/myfunc - [200] - 0.002641 seconds
2018/10/24 20:58:53 Forwarded [GET] to /system/function/myfunc - [200] - 0.002131 seconds
2018/10/24 20:58:53 Forwarded [GET] to /system/functions - [200] - 0.002839 seconds
2018/10/24 20:58:56 Forwarded [GET] to /system/functions - [200] - 0.002223 seconds
2018/10/24 20:58:56 Forwarded [GET] to /system/function/myfunc - [200] - 0.002321 seconds
2018/10/24 20:58:58 Forwarded [GET] to /system/function/myfunc - [200] - 0.003331 seconds
2018/10/24 20:59:00 Forwarded [GET] to /system/functions - [200] - 0.005705 seconds
2018/10/24 20:59:01 Forwarded [GET] to /healthz - [200] - 0.000642 seconds
2018/10/24 20:59:01 Forwarded [GET] to /system/function/myfunc - [200] - 0.003134 seconds
2018/10/24 20:59:03 Forwarded [GET] to /system/functions - [200] - 0.003607 seconds
2018/10/24 20:59:03 Forwarded [GET] to /system/function/myfunc - [200] - 0.002101 seconds
2018/10/24 20:59:05 Forwarded [GET] to /system/functions - [200] - 0.002820 seconds
2018/10/24 20:59:06 Forwarded [GET] to /system/function/myfunc - [200] - 0.002785 seconds
2018/10/24 20:59:07 Forwarded [GET] to /system/functions - [200] - 0.003209 seconds
2018/10/24 20:59:08 Forwarded [GET] to /system/function/myfunc - [200] - 0.002848 seconds
2018/10/24 20:59:10 Forwarded [GET] to /system/functions - [200] - 0.002861 seconds
2018/10/24 20:59:11 Forwarded [GET] to /system/function/myfunc - [200] - 0.005427 seconds
2018/10/24 20:59:11 Forwarded [GET] to /healthz - [200] - 0.000305 seconds
2018/10/24 20:59:13 Forwarded [GET] to /system/function/myfunc - [200] - 0.002615 seconds
2018/10/24 20:59:14 Forwarded [GET] to /system/functions - [200] - 0.003211 seconds
2018/10/24 20:59:16 Forwarded [GET] to /system/function/myfunc - [200] - 0.002498 seconds
2018/10/24 20:59:17 Forwarded [GET] to /system/functions - [200] - 0.002666 seconds
2018/10/24 20:59:18 error with upstream request to: , Post http://myfunc.openfaas-fn.svc.cluster.local.:8080: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/10/24 20:59:18 Forwarded [POST] to /function/myfunc - [502] - 60.000223 seconds
2018/10/24 20:59:18 Forwarded [POST] to /function/myfunc - [200] - 0.050617 seconds
2018/10/24 20:59:18 Forwarded [POST] to /function/myfunc - [200] - 0.050752 seconds
2018/10/24 20:59:18 Forwarded [POST] to /function/myfunc - [200] - 0.048583 seconds
2018/10/24 20:59:18 Forwarded [POST] to /function/myfunc - [200] - 0.046774 seconds
2018/10/24 20:59:18 Forwarded [POST] to /function/myfunc - [200] - 0.052220 seconds
2018/10/24 20:59:18 Forwarded [GET] to /system/function/myfunc - [200] - 0.002421 seconds
2018/10/24 20:59:18 Forwarded [POST] to /function/myfunc - [200] - 0.050028 seconds
@Martind were you able to reproduce this?
I am on it
@szefoka thank you for this report. You can disable scaling by scaling Alertmanager to 0 replicas. In the meantime we're going to look into this as a priority.
Alex
Thanks @alexellis this solved my problem temporarily.
I'm not sure it would do the trick, but I think modifying the CalculateReplicas function in /gateway/handlers/alerthandler.go from this
...
	if status == "firing" {
		if currentReplicas == 1 {
			newReplicas = step
		}
...
to something like this
...
	if status == "firing" {
		if currentReplicas == 1 {
			if step == 0 {
				newReplicas = 1
			} else {
				newReplicas = step
			}
		}
...
would solve this issue.
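For reference, the proposed guard can be sketched as a complete function. The name calculateReplicasGuarded is hypothetical, and this is only an illustration of the suggestion above, not necessarily the change that was merged upstream.

```go
package main

import "fmt"

// calculateReplicasGuarded is a hypothetical copy of CalculateReplicas with
// the proposed guard added; it is not the code merged upstream.
func calculateReplicasGuarded(status string, currentReplicas, maxReplicas, minReplicas, scalingFactor uint64) uint64 {
	newReplicas := currentReplicas
	step := uint64((float64(maxReplicas) / 100) * float64(scalingFactor))

	if status == "firing" {
		if currentReplicas == 1 {
			if step == 0 {
				// Truncated step: stay at one replica instead of dropping to zero.
				newReplicas = 1
			} else {
				newReplicas = step
			}
		} else if currentReplicas+step > maxReplicas {
			newReplicas = maxReplicas
		} else {
			newReplicas = currentReplicas + step
		}
	} else { // Resolved event.
		newReplicas = minReplicas
	}
	return newReplicas
}

func main() {
	// min=max=1 with the default 20% factor: the count stays at 1.
	fmt.Println(calculateReplicasGuarded("firing", 1, 1, 1, 20))
	// max=4 with a 50% factor: the step is 2.
	fmt.Println(calculateReplicasGuarded("firing", 1, 4, 1, 50))
	// A resolved alert returns to minReplicas.
	fmt.Println(calculateReplicasGuarded("resolved", 3, 4, 1, 50))
}
```

In this sketch the guard only prevents the drop to zero; whenever the step truncates to 0, a function at one replica stays at one replica.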
Thanks,
David
Yeah, I was able to replicate the issue with 2500+ requests. Thanks for sharing this with us @szefoka 🙂
Thanks for the quick fix! @alexellis :)
This was released in the 0.9.8 image available on the Docker Hub.
Derek close: fixed in this commit
@szefoka @Templum @ivanayov this introduced a regression into OpenFaaS Cloud, where we have a minimum replica count of 1 and a maximum of 4 with proportional scaling of 20%. I believe it used to work before this patch was introduced.
Reproduce error with:
func TestScaling_1Min_4Max_Step1(t *testing.T) {
	minReplicas := uint64(1)
	maxReplicas := uint64(4)
	scalingFactor := uint64(50)
	current := uint64(1)
	want := uint64(2)
	newReplicas := CalculateReplicas("firing", current, maxReplicas, minReplicas, scalingFactor)
	if newReplicas != want {
		t.Logf("Replicas - want: %d, got: %d", want, newReplicas)
		t.Fail()
	}
}

func TestScaling_1Min_4Max_Step2(t *testing.T) {
	minReplicas := uint64(1)
	maxReplicas := uint64(4)
	scalingFactor := uint64(50)
	current := uint64(2)
	want := uint64(4)
	newReplicas := CalculateReplicas("firing", current, maxReplicas, minReplicas, scalingFactor)
	if newReplicas != want {
		t.Logf("Replicas - want: %d, got: %d", want, newReplicas)
		t.Fail()
	}
}
Do we believe it is correct for no autoscaling to take place between the range of 1 - 4 replicas? Clearly 20% of 1 is less than 1 so would round down.
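As a rough way to explore that question, the sketch below searches for the smallest maxReplicas at which the truncated step expression from CalculateReplicas becomes non-zero for a given scaling factor. The helper smallestScalableMax and its limit parameter are made up for illustration.

```go
package main

import "fmt"

// smallestScalableMax is a hypothetical helper. It returns the smallest
// maxReplicas (searched up to limit) for which the step expression from
// CalculateReplicas is non-zero, or 0 if no such value is found.
func smallestScalableMax(scalingFactor, limit uint64) uint64 {
	for maxReplicas := uint64(1); maxReplicas <= limit; maxReplicas++ {
		step := uint64((float64(maxReplicas) / 100) * float64(scalingFactor))
		if step > 0 {
			return maxReplicas
		}
	}
	return 0
}

func main() {
	// With a 20% factor the step is 0 for every maxReplicas below 5.
	fmt.Println(smallestScalableMax(20, 100))
	// With a 50% factor the step is already 1 at maxReplicas=2.
	fmt.Println(smallestScalableMax(50, 100))
}
```

Under this expression, a 20% factor yields a zero step for any maximum of 4 or fewer, because the step is derived from maxReplicas and truncated.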