We're attempting to integrate the jaeger all-in-one pod in our development environments and are experiencing shutdowns during our test suite, with the jaeger-all-in-one pod reporting the following during the test:
all-in-one logs:
{"level":"info","ts":1530582648.1951566,"caller":"healthcheck/handler.go:99","msg":"Health Check server started","http-port":14269,"status":"unavailable"}
{"level":"info","ts":1530582648.1964855,"caller":"tchannel/builder.go:89","msg":"Enabling service discovery","service":"jaeger-collector"}
{"level":"info","ts":1530582648.196542,"caller":"peerlistmgr/peer_list_mgr.go:111","msg":"Registering active peer","peer":"127.0.0.1:14267"}
{"level":"info","ts":1530582648.1970081,"caller":"standalone/main.go:179","msg":"Starting agent"}
{"level":"info","ts":1530582648.1976883,"caller":"standalone/main.go:219","msg":"Starting jaeger-collector TChannel server","port":14267}
{"level":"info","ts":1530582648.1977384,"caller":"standalone/main.go:229","msg":"Starting jaeger-collector HTTP server","http-port":14268}
{"level":"info","ts":1530582648.214731,"caller":"standalone/main.go:290","msg":"Registering metrics handler with jaeger-query HTTP server","route":"/metrics"}
{"level":"info","ts":1530582648.2148683,"caller":"standalone/main.go:296","msg":"Starting jaeger-query HTTP server","port":16686}
{"level":"info","ts":1530582648.2148962,"caller":"healthcheck/handler.go:133","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1530582649.1985323,"caller":"peerlistmgr/peer_list_mgr.go:157","msg":"Not enough connected peers","connected":0,"required":1}
{"level":"info","ts":1530582649.198617,"caller":"peerlistmgr/peer_list_mgr.go:166","msg":"Trying to connect to peer","host:port":"127.0.0.1:14267"}
{"level":"info","ts":1530582649.199323,"caller":"peerlistmgr/peer_list_mgr.go:176","msg":"Connected to peer","host:port":"[::]:14267"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x8bd634]

goroutine 122 [running]:
github.com/jaegertracing/jaeger/cmd/agent/app/reporter/tchannel.(*Reporter).EmitBatch(0xc420200180, 0x0, 0x0, 0x0)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/reporter/tchannel/reporter.go:119 +0x84
github.com/jaegertracing/jaeger/thrift-gen/jaeger.(*agentProcessorEmitBatch).Process(0xc420326170, 0xc400000000, 0xe863c0, 0xc4201b17c0, 0xe863c0, 0xc4201b17c0, 0xc42065ff28, 0xc420432f80, 0xc4203261f0)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/thrift-gen/jaeger/agent.go:137 +0xc9
github.com/jaegertracing/jaeger/thrift-gen/jaeger.(*AgentProcessor).Process(0xc4202a2040, 0xe863c0, 0xc4201b17c0, 0xe863c0, 0xc4201b17c0, 0x0, 0x0, 0x0)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/thrift-gen/jaeger/agent.go:111 +0x2fd
github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer(0xc4200812c0)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:110 +0x179
created by github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).Serve
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:82 +0x62
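For anyone reading the trace: the second word in the EmitBatch frame is 0x0, which would be consistent with the reporter being handed a nil batch pointer and then reading a field from it. That is only a reading of the trace, not a confirmed root cause. Below is a minimal Go sketch of that failure mode and of a defensive nil check. The Span and Batch types and the countSpans / unsafeCountSpans helpers are hypothetical stand-ins for illustration, not Jaeger's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Span and Batch are simplified, hypothetical stand-ins for the
// Thrift-generated types; they are NOT the real Jaeger definitions.
type Span struct{ OperationName string }

type Batch struct {
	Spans []*Span
}

// errNilBatch is returned instead of panicking when no batch is supplied.
var errNilBatch = errors.New("nil batch")

// countSpans guards against a nil batch before touching its fields.
func countSpans(b *Batch) (int, error) {
	if b == nil {
		return 0, errNilBatch
	}
	return len(b.Spans), nil
}

// unsafeCountSpans reads a field through the pointer without checking it.
// With a nil pointer this triggers a runtime panic; the deferred recover
// turns the panic into an error so the demo keeps running.
func unsafeCountSpans(b *Batch) (n int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return len(b.Spans), nil
}

func main() {
	_, err := unsafeCountSpans(nil)
	fmt.Println("unchecked:", err)

	_, err = countSpans(nil)
	fmt.Println("checked:", err)
}
```

Without the recover, the unchecked variant would take the whole process down, which matches the pod restarts described above. Whether a nil check alone is the right fix depends on why a nil batch reaches the reporter in the first place.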
The pod is implemented via helm using a slightly modified version of the all-in-one chart template found at https://github.com/jaegertracing/jaeger-kubernetes/blob/master/all-in-one/jaeger-all-in-one-template.yml
Following are our deployment and service configurations.
all-in-one deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: 2018-07-02T21:00:27Z
  generation: 1
  labels:
    app: jaeger-all-in-one
    chart: jaeger-all-in-one-0.1.00
    component: all-in-one-deployment
    heritage: Tiller
    jaeger-infra: jaeger-all-in-one
    release: tjb-jgr-allinone-stability-v2-1
  name: tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one
  namespace: default
  resourceVersion: "73847"
  selfLink: /apis/extensions/v1beta1/namespaces/default/deployments/tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one
  uid: f2662b0a-7e3a-11e8-ac4d-42010a8e0fe3
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger-all-in-one
      chart: jaeger-all-in-one-0.1.00
      component: all-in-one-deployment
      jaeger-infra: jaeger-all-in-one
      release: tjb-jgr-allinone-stability-v2-1
  strategy:
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: jaeger-all-in-one
        chart: jaeger-all-in-one-0.1.00
        component: all-in-one-deployment
        jaeger-infra: jaeger-all-in-one
        release: tjb-jgr-allinone-stability-v2-1
    spec:
      containers:
      - env:
        - name: COLLECTOR_ZIPKIN_HTTP_PORT
        image: jaegertracing/all-in-one:1.5.0
        imagePullPolicy: IfNotPresent
        name: tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one
        ports:
        - containerPort: 5775
          protocol: UDP
        - containerPort: 6831
          protocol: UDP
        - containerPort: 6832
          protocol: UDP
        - containerPort: 5778
          protocol: TCP
        - containerPort: 16686
          protocol: TCP
        - containerPort: 9411
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 16686
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 750m
            memory: 1Gi
          requests:
            cpu: 250m
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2018-07-03T11:53:59Z
    lastUpdateTime: 2018-07-03T11:53:59Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
all-in-one services:
---
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2018-07-02T21:00:27Z
  labels:
    app: jaeger-all-in-one
    chart: jaeger-all-in-one-0.1.00
    component: all-in-one-agent
    heritage: Tiller
    jaeger-infra: jaeger-all-in-one
    release: tjb-jgr-allinone-stability-v2-1
  name: tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one-agent
  namespace: default
  resourceVersion: "834"
  selfLink: /api/v1/namespaces/default/services/tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one-agent
  uid: f2815108-7e3a-11e8-ac4d-42010a8e0fe3
spec:
  clusterIP: None
  ports:
  - name: agent-zipkin-thrift
    port: 5775
    protocol: UDP
    targetPort: 5775
  - name: agent-compact
    port: 6831
    protocol: UDP
    targetPort: 6831
  - name: agent-binary
    port: 6832
    protocol: UDP
    targetPort: 6832
  - name: agent-configs
    port: 5778
    protocol: TCP
    targetPort: 5778
  selector:
    app: jaeger-all-in-one
    component: all-in-one-deployment
    jaeger-infra: jaeger-all-in-one
    release: tjb-jgr-allinone-stability-v2-1
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2018-07-02T21:00:27Z
  labels:
    app: jaeger-all-in-one
    chart: jaeger-all-in-one-0.1.00
    component: all-in-one-collector
    heritage: Tiller
    jaeger-infra: jaeger-all-in-one
    release: tjb-jgr-allinone-stability-v2-1
  name: tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one-collector
  namespace: default
  resourceVersion: "832"
  selfLink: /api/v1/namespaces/default/services/tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one-collector
  uid: f27c80fd-7e3a-11e8-ac4d-42010a8e0fe3
spec:
  clusterIP: 10.113.30.20
  ports:
  - name: jaeger-collector-tchannel
    port: 14267
    protocol: TCP
    targetPort: 14267
  - name: jaeger-collector-http
    port: 14268
    protocol: TCP
    targetPort: 14268
  - name: jaeger-collector-zipkin
    port: 9411
    protocol: TCP
    targetPort: 9411
  selector:
    app: jaeger-all-in-one
    component: all-in-one-deployment
    jaeger-infra: jaeger-all-in-one
    release: tjb-jgr-allinone-stability-v2-1
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2018-07-02T21:00:27Z
  labels:
    app: jaeger-all-in-one
    chart: jaeger-all-in-one-0.1.00
    component: all-in-one-query
    heritage: Tiller
    jaeger-infra: jaeger-all-in-one
    release: tjb-jgr-allinone-stability-v2-1
  name: tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one-query
  namespace: default
  resourceVersion: "827"
  selfLink: /api/v1/namespaces/default/services/tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one-query
  uid: f270e4b3-7e3a-11e8-ac4d-42010a8e0fe3
spec:
  clusterIP: 10.113.20.47
  ports:
  - name: query-http
    port: 80
    protocol: TCP
    targetPort: 16686
  selector:
    app: jaeger-all-in-one
    component: all-in-one-deployment
    jaeger-infra: jaeger-all-in-one
    release: tjb-jgr-allinone-stability-v2-1
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: 2018-07-02T21:00:27Z
  labels:
    app: jaeger-all-in-one
    chart: jaeger-all-in-one-0.1.00
    component: all-in-one-zipkin
    heritage: Tiller
    jaeger-infra: jaeger-all-in-one
    release: tjb-jgr-allinone-stability-v2-1
  name: tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one-zipkin
  namespace: default
  resourceVersion: "846"
  selfLink: /api/v1/namespaces/default/services/tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one-zipkin
  uid: f2927bd1-7e3a-11e8-ac4d-42010a8e0fe3
spec:
  clusterIP: None
  ports:
  - name: jaeger-collector-zipkin
    port: 9411
    protocol: TCP
    targetPort: 9411
  selector:
    app: jaeger-all-in-one
    component: all-in-one-deployment
    jaeger-infra: jaeger-all-in-one
    release: tjb-jgr-allinone-stability-v2-1
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
Could you just confirm what you see when you run the version command for the backing container image? Like:
$ kubectl exec -it POD_NAME -- /go/bin/standalone-linux version
{"gitCommit":"c8d514c4b45e32fb037497ae2ef4042b3005f902","GitVersion":"v1.5.0","BuildDate":"2018-06-30T22:11:21Z"}
For me, POD_NAME looks like this when I run kubectl create -f ....../jaeger-all-in-one-template.yml: jaeger-deployment-8648cb84d5-cg82n
@jpkrohling standalone-linux version on my pod reports the following:
$ k exec -it tjb-jgr-allinone-stability-v2-1-jaeger-all-in-one-6786d849hr2ml -- /go/bin/standalone-linux version
{"gitCommit":"ab77ac7efdabbcdce9e53628681ec45421f05e58","GitVersion":"v1.5.0","BuildDate":"2018-05-28T15:24:54Z"}
Thanks, I'll try to reproduce it on my side. Is the test failure happening consistently, or is it intermittent? If it's consistent, are you able to share code that would trigger this condition?
We're unable to share code. The failure occurs consistently during the pytest suite for our smart contract modules, although at varying points in the run. I've had it get nearly through the entire suite before failing, and other times it fails within the first 50 or so tests out of a total of 200+.
We have a total of four modules, written in Python, Go, and Haskell, reporting traces to the all-in-one pod, if that's helpful at all.
That's helpful, thanks! It sounds like a concurrency issue to me, which will make it a bit harder to reproduce. In any case, we'll need your help again during the validation of the fix!