I recently added a scheduler-benchmark job which basically runs make test-integration ... inside a dockerized environment. The job runs this script using the kubernetes_verify scenario. That script is quite similar to the test-dockerized.sh script used to run our normal integration tests, while this one runs the benchmark integration tests.
When I run the make command locally, I get the desired output (with some lines of benchmark results in between):
$ make test-integration WHAT="./test/integration/scheduler_perf" KUBE_TEST_ARGS="-run='XXX' -bench=. -benchmem"
+++ [0123 20:46:51] Checking etcd is on PATH
/usr/local/google/home/shyamjvs/Downloads/etcd-v3.1.11-linux-amd64/etcd
+++ [0123 20:46:51] Starting etcd instance
etcd --advertise-client-urls http://127.0.0.1:2379 --data-dir /tmp/tmp.SzP8ZqrfTO --listen-client-urls http://127.0.0.1:2379 --debug > "/dev/null" 2>/dev/null
Waiting for etcd to come up.
+++ [0123 20:46:52] On try 2, etcd: : http://127.0.0.1:2379
{"action":"set","node":{"key":"/_test","value":"","modifiedIndex":4,"createdIndex":4}}
+++ [0123 20:46:52] Running integration test cases
Running tests for APIVersion: v1
+++ [0123 20:46:54] Running tests without code coverage
W0123 20:47:11.417751 90561 admission.go:61] AlwaysAdmit admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0123 20:47:11.418413 90561 services.go:35] Network range for service cluster IPs is unspecified. Defaulting to 10.0.0.0/24.
I0123 20:47:11.418442 90561 master.go:269] Node port range unspecified. Defaulting to 30000-32767.
...
...
BenchmarkScheduling/100Nodes/0Pods-12 500 2393175 ns/op 431003 B/op 4825 allocs/op
...
...
BenchmarkScheduling/100Nodes/1000Pods-12 500 2164436 ns/op 459811 B/op 5366 allocs/op
...
...
All those lines are also present in the CI job's build-log.txt (e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/ci-benchmark-scheduler-master/71/build-log.txt), except the lines starting with Benchmark, which are the go benchmark results.
cc @krzyzacy @kubernetes/test-infra-maintainers - Does someone know what's happening with those lines?
cc @krzyzacy
also FWIW test-dockerized.sh is a legacy hack :(
I suspect something goes wonky during our fancy d-in-d-in-d transition? :joy:
cc @ixdy if Jeff has seen something like this before
From your log output, I suspect this is mucking with godep and other things at the beginning for no reason. This should almost definitely just be a normal test job running on the kubekins-e2e image without dind.
I suspect something goes wonky during our fancy d-in-d-in-d transition?
The question is what is so special about these benchmark lines that it happens only to them, when there are other lines (e.g. Waiting for etcd to come up.) that are also printed to stdout but do appear in the logs.
this should almost definitely just be a normal test job running on the kubekins-e2e image without dind
I'm not sure kubekins-e2e would be the right image, as this job doesn't run any e2e tests. It just runs some integration tests, like e.g. pull-kubernetes-verify (which also uses the kubernetes_verify scenario).
Even otherwise, I don't think kubernetes_e2e is the right scenario to use, as it calls the kubetest binary, which is meant for the e2e setup (i.e. bring up a cluster, run e2e tests, etc.). Functionally, this job is quite different. Here we're just running integration tests (i.e. everything runs locally within the container as some binary; there's nothing to interact with externally).
I'm not sure kubekins-e2e would be the right image, as this job doesn't run any e2e tests. It just runs some integration tests, like e.g. pull-kubernetes-verify (which also uses the kubernetes_verify scenario).
The kubekins-e2e image is the catch-all image for tests that need versioned go, bazel, kubetest, gcloud etc.
pull-kubernetes-verify is the worst presubmit we have currently (very slow, buggy, complex to maintain) and only runs the way it does because this job is blocking and still needs to work identically post-Jenkins. We have a TODO somewhere to delete the current kubernetes_verify.py after we figure out how to best solve this :(
Running this way does docker-on-docker in docker-in-docker just to run in a certain image and be able to do docker builds. There's no doubt that this is buggy. I suspect we see some issues with the layers of log buffering going on here.
Even otherwise, I don't think kubernetes_e2e is the right scenario to use, as it calls the kubetest binary, which is meant for the e2e setup (i.e. bring up a cluster, run e2e tests, etc.). Functionally, this job is quite different. Here we're just running integration tests (i.e. everything runs locally within the container as some binary; there's nothing to interact with externally).
Integration tests probably just need the execute scenario; images and scenarios aren't actually very related. kubekins-e2e is a bad name; it is just an image that has all the dependencies for e2es and building. The image doesn't run e2es itself; that's handled by bootstrap / the scenario.
I see. Then my question reduces to the following: how do I pass a --script parameter like this to kubernetes_e2e?
Use execute.py instead and then use an image like gcloud-in-go or kubekins-e2e.
That is, unless somehow this is actually using docker. I suspect we just need to add versioned etcd @krzyzacy
We're discussing how to solve this today, we've been putting off solving the verify/unit jobs but I don't want those work arounds to become permanent.
I can try to re-configure the job and see if it will work better
Ok, let's try that.
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-benchmark-scheduler-master/72
hummmmm @shyamjvs can you tell where the failure is? I saw
ok k8s.io/kubernetes/test/integration/scheduler_perf 187.646s though
I believe it's this:
./hack/jenkins/benchmark-dockerized.sh: line 54: /workspace/artifacts/BenchmarkResults.txt: No such file or directory
which is coming from this line of code in the script: https://github.com/kubernetes/kubernetes/blob/master/hack/jenkins/benchmark-dockerized.sh#L54
This didn't happen before, and it's quite surprising that it just didn't create that file to write the output to.
I've sent out the above PR to fix the non-existent file issue. Hope it helps.
W0124 17:25:37.647] + touch /workspace/artifacts/BenchmarkResults.txt
W0124 17:25:37.648] touch: cannot touch '/workspace/artifacts/BenchmarkResults.txt': No such file or directory
hummm maybe `/workspace/artifacts` did not exist yet?
Yup, that seems to be the issue. Will send out a fix asap.
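The shape of the fix is simple; here is a minimal sketch (the actual change lives in hack/jenkins/benchmark-dockerized.sh, and the variable names below are stand-ins):

```shell
# Sketch of the fix: create the artifacts directory before touching
# files inside it. In the real job the path is /workspace/artifacts;
# a temp dir stands in for it here.
WORKSPACE="$(mktemp -d)"                    # stand-in for /workspace
ARTIFACTS="${WORKSPACE}/artifacts"
mkdir -p "${ARTIFACTS}"                     # the step that was missing
touch "${ARTIFACTS}/BenchmarkResults.txt"   # previously failed with "No such file or directory"
```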
Ok, the above PR fixed those failures. However, the original issue (i.e. the missing lines of benchmark output on stdout) still remains.
And in fact those are the lines most useful from the job.
@krzyzacy @BenTheElder - Any leads? Maybe something is going wrong with the stdout stream in the container?
@shyamjvs can you point us to where those lines get printed during the integration test?
They are printed by Go when it runs the test, i.e. as part of go test, which is called inside the make test-integration ... target in the benchmark script.
Maybe there's a way to reproduce this locally? Any leads on how to run the job in the kubekins-e2e container locally?
the container looks like
containers:
- image: gcr.io/k8s-testimages/kubekins-e2e:v20180123-1260ba9fc-master
  args:
  - --repo=k8s.io/kubernetes=master
  - --timeout=40
  - --root=/go/src
  env:
  - name: GOOGLE_APPLICATION_CREDENTIALS
    value: /etc/service-account/service-account.json
  - name: GOPATH
    value: /go
  volumeMounts:
  - name: service
    mountPath: /etc/service-account
you'll need to mount a service account
@shyamjvs what is printing out the BenchmarkScheduling/ lines? are they being printed to stderr instead of stdout?
No, they're being printed to stdout (checked it by redirecting stderr > /dev/null).
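That check can be illustrated with a stand-in command (`emit` is hypothetical; in the real case it was the make test-integration invocation):

```shell
# A line on stdout survives `2>/dev/null`; a line on stderr does not.
emit() {  # hypothetical stand-in for the benchmark run
  echo 'BenchmarkScheduling/100Nodes/0Pods-12  500  2393175 ns/op'  # stdout
  echo 'some diagnostic noise' >&2                                  # stderr
}
emit 2>/dev/null  # the benchmark line still appears, so it is on stdout
```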
do you know what code is printing them out? maybe there's some verbosity setting somewhere that's being set to something different?
I spent some time yesterday trying to reproduce this issue locally (within a container) and finally managed to do it with:
docker run -it \
  -e JOB_NAME=ci-benchmark-scheduler-master \
  -e GOOGLE_APPLICATION_CREDENTIALS=/etc/service-account/service-account.json \
  -v /etc/service-account:/etc/service-account \
  gcr.io/k8s-testimages/kubekins-e2e:v20180123-1260ba9fc-master \
  --repo=k8s.io/kubernetes=master \
  --timeout=40 \
  --root=/go/src
Before running the above, you need to ensure that you have the kubekins GCP service-account stored locally on your machine at /etc/service-account/service-account.json.
It turns out that those benchmark lines are not printed as part of the docker run command's output, but if I exec into the running container and run the make test-integration ... command in a bash shell, those lines do get printed.
While we were digging into it, @porridge identified the reason this was happening: the stdout of the go test command, which was being invoked under the hood by the make command, was being filtered through a regexp. This filtering was enabled because our script sets the KUBE_JUNIT_REPORT_DIR variable.
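To illustrate the failure mode, here is a sketch (not the actual regexp or plumbing in the kube scripts): with junit reporting enabled, go test output passes through a filter whose pattern matches test-status lines but not benchmark-result lines, so the latter are silently dropped.

```shell
# Hypothetical filter resembling the junit post-processing: it keeps
# go test status/summary lines and drops everything else.
filter_for_junit() {
  grep -E '^(=== RUN|--- (PASS|FAIL)|ok |FAIL|PASS$)'
}

printf '%s\n' \
  '=== RUN   TestFoo' \
  '--- PASS: TestFoo (0.01s)' \
  'BenchmarkScheduling/100Nodes/0Pods-12  500  2393175 ns/op' \
  'ok   k8s.io/kubernetes/test/integration/scheduler_perf 187.646s' \
  | filter_for_junit
# The Benchmark line is dropped, matching what the job's logs showed.
```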
I'll send out a fix soon to disable it.
I also managed to verify that the fix works by running it against my PR using:
docker run -it \
  -e JOB_NAME=ci-benchmark-scheduler-master \
  -e GOOGLE_APPLICATION_CREDENTIALS=/etc/service-account/service-account.json \
  -v /etc/service-account:/etc/service-account \
  gcr.io/k8s-testimages/kubekins-e2e:v20180123-1260ba9fc-master \
  --repo=k8s.io/kubernetes=master:65c0d55bb,59068:b12a6e367 \
  --timeout=40 \
  --root=/go/src
and the output had:
I0131 13:22:56.550] goos: linux
I0131 13:22:56.550] goarch: amd64
I0131 13:22:56.550] pkg: k8s.io/kubernetes/test/integration/scheduler_perf
I0131 13:22:56.550] BenchmarkScheduling/100Nodes/0Pods-12 500 2038534 ns/op 431378 B/op 4826 allocs/op
I0131 13:22:56.551] BenchmarkScheduling/100Nodes/1000Pods-12 500 2279950 ns/op 466114 B/op 5451 allocs/op
I0131 13:22:56.551] BenchmarkScheduling/1000Nodes/0Pods-12 100 26056228 ns/op 2390752 B/op 27403 allocs/op
I0131 13:22:56.551] BenchmarkScheduling/1000Nodes/1000Pods-12 100 27349647 ns/op 2676577 B/op 32578 allocs/op
I0131 13:22:56.551] BenchmarkSchedulingAntiAffinity/500Nodes/250Pods-12 250 57442820 ns/op 3572412 B/op 28158 allocs/op
I0131 13:22:56.551] BenchmarkSchedulingAntiAffinity/500Nodes/5000Pods-12 250 173754062 ns/op 35009722 B/op 205868 allocs/op
I0131 13:22:56.551] PASS
I0131 13:22:56.551] ok k8s.io/kubernetes/test/integration/scheduler_perf 140.174s
@krzyzacy @BenTheElder While the above would fix the issue with those missing lines, there's one more piece that we need here.
The pretty-printed results for the benchmark are written to WORKSPACE/artifacts/, but that dir is not being uploaded to GCS (e.g. http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-benchmark-scheduler-master/249/).
Could you fix the execute scenario (or maybe bootstrap.py) to do that?
it might work if you rename the artifacts directory to _artifacts. I think only the unit/verify scenarios use artifacts, and only inside the test container (it's mapped to _artifacts outside).
I included that change in my above PR. Will see if that works, thanks.
@ixdy good catch :joy: one day we will maybe want to unify everything :man_shrugging:
Just checked, and my fix indeed seems to work: http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-benchmark-scheduler-master/297/
Thanks guys!