I want to be able to use an ingress in --cluster.peers to discover all peers inside another cluster. Is this possible? Or do I need to create an ingress for each Prometheus and Thanos store in my cluster?
I know I can use the service like --cluster.peers=thanos-peers.monitoring.svc.cluster.local:10900
I'm referencing slide 80 of your SlideShare deck.
Also, an example of how you do this would be sweet!
Sure, we can try to explain that (:
There are a few options:
- Global gossip
- Static --store option
- Static --store + gossip option
So the answer to your title is: yes, and we are doing that right now.
[Thanos query -> (static) -> Thanos query per each environment] -> (static + proxy) -> [ Thanos-query -> (gossip) local Thanos sidecars]
Where things inside [ ] are in the same network.
--cluster.peers=thanos-peers.monitoring.svc.cluster.local:10900 works only if you can resolve this DNS entry and the resolved IPs are reachable.
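For the "(static)" hop above, this is roughly what the flags on the top-level querier look like; a minimal sketch, where the per-environment hostnames are hypothetical placeholders, not our real endpoints:

- args:
    - query
    - --query.replica-label=replica
    # Each per-environment query layer is exposed as a static StoreAPI endpoint:
    - --store=thanos-query.env-a.example.org:10901
    - --store=thanos-query.env-b.example.org:10901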
@Bplotka Thanks so much. So this is my setup...
Is this right? What would you change? I'm using different buckets for each cluster, but was thinking I could just use the same one. I'm not able to resolve datapoints cross-cluster, but I can get the series names.
values.yaml
ingress:
  host: k8clusterA   # or k8clusterB / k8clusterC; this is a --set value
query:
  peers:
    - "cluster.monitoring.k8clusterA:80"
    - "cluster.monitoring.k8clusterB:80"
    - "cluster.monitoring.k8clusterC:80"
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: thanos-cluster
spec:
  rules:
    - host: cluster.monitoring.{{ .Values.ingress.host }}
      http:
        paths:
          - path: /
            backend:
              serviceName: thanos-peers
              servicePort: 10900
---
apiVersion: v1
kind: Service
metadata:
  name: thanos-peers
  namespace: monitoring
  labels:
    app: {{ template "thanos.name" . }}
    chart: {{ template "thanos.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: cluster
      port: 10900
      targetPort: cluster
  selector:
    thanos-peer: "true"
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: thanos-query
spec:
  rules:
    - host: thanos.monitoring.{{ .Values.ingress.host }}
      http:
        paths:
          - path: /
            backend:
              serviceName: thanos-query
              servicePort: 9090
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-query
    chart: {{ template "thanos.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
  name: thanos-query
spec:
  externalTrafficPolicy: Cluster
  ports:
    - port: 9090
      protocol: TCP
      targetPort: http
      name: http-query
    - port: 10900
      targetPort: cluster
      name: cluster
  selector:
    component: thanos-query
  sessionAffinity: None
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-query
  labels:
    chart: {{ template "thanos.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  serviceName: "thanos-query"
  replicas: 2
  selector:
    matchLabels:
      app: thanos
      component: thanos-query
      thanos-peer: "true"
  template:
    metadata:
      labels:
        app: thanos
        component: thanos-query
        thanos-peer: "true"
    spec:
      containers:
        - name: thanos-query
          image: improbable/thanos:v0.1.0-rc.2
          args:
            - "query"
            - "--log.level=debug"
            - "--cluster.peers=thanos-peers.monitoring.svc.cluster.local:10900"
            - "--query.replica-label=replica"
            {{- range .Values.query.peers }}
            - "--cluster.peers={{ . }}"
            {{- end }}
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
            - name: cluster
              containerPort: 10900
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store
  labels:
    chart: {{ template "thanos.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  serviceName: "thanos-store"
  replicas: 2
  selector:
    matchLabels:
      component: thanos-store
      thanos-peer: "true"
  template:
    metadata:
      labels:
        app: thanos
        component: thanos-store
        thanos-peer: "true"
    spec:
      containers:
        - name: thanos-store
          image: improbable/thanos:v0.1.0-rc.2
          env:
            - name: S3_SECRET_KEY
              value: <key>
          args:
            - "store"
            - "--log.level=debug"
            - "--tsdb.path=/prometheus"
            - "--cluster.peers=thanos-peers.monitoring.svc.cluster.local:10900"
            - "--s3.endpoint=s3.us-west-2.amazonaws.com"
            - "--s3.bucket=<bucket>"
            - "--s3.access-key=<key>"
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
            - name: cluster
              containerPort: 10900
          volumeMounts:
            - name: data
              mountPath: /prometheus
      volumes:
        - name: data
          emptyDir: {}
So I've been messing with this for the last few days, and the major blocker I'm running into is that cluster.advertise-address relies on an ip:port combination rather than host:port. Unless you can set a static IP to advertise, so the other components can find their way to the peer, this looks impossible.
In your example you're using an ingress for the gossip port, but nginx ingress is by default an HTTP (layer 7) proxy, not TCP. You can piggyback TCP on your ingress controller using the following method:
https://github.com/kubernetes/ingress-nginx/blob/master/docs/user-guide/exposing-tcp-udp-services.md
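Roughly, that means mapping the gossip port in the controller's tcp-services ConfigMap; a minimal sketch, assuming the controller runs in the ingress-nginx namespace and the peers Service lives in monitoring:

apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # <external port>: "<namespace>/<service>:<port>"
  "10900": "monitoring/thanos-peers:10900"

The controller also has to be started with --tcp-services-configmap=ingress-nginx/tcp-services, and port 10900 has to be exposed on the controller's own Service.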
@Bplotka How is your proxy configured? Is (static + proxy) just a store that can resolve IPs in both clusters, with those as its peers?
[Thanos query -> (static) -> Thanos query per each environment] -> (static + proxy) -> [ Thanos-query -> (gossip) local Thanos sidecars]
So basically it's networked like this?
Cluster A:
Proxy:
Cluster B:
Peers Configs:
Then the query in clusters A and B can point at their own cluster's store using --store?
So right now I've dropped the store component. I'm on AWS and the S3 work isn't finalized; if my next hunch is right, it can be added back.
For initial testing I've been removing all Kubernetes-based service discovery and going through my ingress controller or external services, in hopes it will scale to my multi-cluster environment.
I've tried configuring a TCP redirect on my ingress for 30900 -> the thanos-peers service. This would catch every component in the cluster matching the label selector, and it looked like it worked in a single-peer setup. The problem is that the sidecars and the query API register the pod IP; I'm using Calico on AWS, so that IP is not accessible. That's probably not a problem with GKE or kubenet, where the traffic can be routed.
Then I tried using an AWS Network Load Balancer, which advertises a static IP. My problem though is that, as far as configuration goes, this is a nightmare: I would need to provision, then check the IP, then go back and update the app with the IPs. It really isn't ideal. This might work if followed through, but I don't like it.
Currently I'm looking at the hostPort option for the cluster port. I'll have external-dns post records with the component's node IP and bind to the port. That way I control the node, and DNS is provisioned on resource creation. The final piece is telling the sidecar and query their advertise address, which I think can be done via an env var mapped from status.hostIP (see the container-spec sketch after the outline below).
In the end it will look like this:

Cluster 1:

  Node 1 (192.168.1.10): Prom / Sidecar, binding to the node's 10900
    --cluster.peers="thanos-peers.cluster1.example.org"
    --cluster.peers="thanos-peers.cluster2.example.org"
    --cluster.advertise-address=$(NODE_IP):10900
    env:
      - name: NODE_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
    DNS record thanos-peers.cluster1.example.org -> 192.168.1.10

  Node 2 (192.168.1.20): Query 1, binding to the node's 10900
    --cluster.peers="thanos-peers.cluster1.example.org"
    --cluster.peers="thanos-peers.cluster2.example.org"
    --cluster.advertise-address=$(NODE_IP):10900
    env:
      - name: NODE_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
    DNS record thanos-peers.cluster1.example.org -> 192.168.1.20

  Node 3 (192.168.1.30): Query 2, binding to the node's 10900
    --cluster.peers="thanos-peers.cluster1.example.org"
    --cluster.peers="thanos-peers.cluster2.example.org"
    --cluster.advertise-address=$(NODE_IP):10900
    env:
      - name: NODE_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
    DNS record thanos-peers.cluster1.example.org -> 192.168.1.30

Cluster 2:

  Node 1 (192.168.2.10): Prom / Sidecar, binding to the node's 10900
    --cluster.peers="thanos-peers.cluster1.example.org"
    --cluster.peers="thanos-peers.cluster2.example.org"
    --cluster.advertise-address=$(NODE_IP):10900
    env:
      - name: NODE_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
    DNS record thanos-peers.cluster2.example.org -> 192.168.2.10
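To make the hostPort piece concrete, here's a rough container-spec sketch of the idea (untested; the image tag and peer hostnames are placeholders, as in the outline above):

- name: thanos-sidecar
  image: improbable/thanos:v0.1.0-rc.2
  args:
    - sidecar
    - "--cluster.peers=thanos-peers.cluster1.example.org:10900"
    - "--cluster.peers=thanos-peers.cluster2.example.org:10900"
    - "--cluster.advertise-address=$(NODE_IP):10900"
  env:
    - name: NODE_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP   # the node's IP via the downward API
  ports:
    - name: cluster
      containerPort: 10900
      hostPort: 10900   # bind the gossip port on the node itself, so $(NODE_IP):10900 is reachable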
I realize this is SUPER verbose and doesn't answer your question exactly. I'm also not finished testing this method to confirm it will work, but it's the last idea I've come up with for my stack. I hope it provides some direction for you.
Regarding the mention of a proxy above, I'm not entirely sure how that resolves anything. I'm rolling out Istio currently, and I'm not certain how this would bridge the gap unless their Envoys are communicating cross-cluster. That's cool for them, but as of now our pod network is isolated per cluster. I still stand by the biggest pain in the ass with Thanos + Kubernetes being its reliance on a static IP, which in the cloud is often ... difficult.
I'll update tomorrow when I've had some sleep and a chance to test these assumptions.
Best of luck
So my final working configuration is Thanos query running in one cluster. I set my sidecars to point to my ingress controller and map them to the node port. Then when the sidecars run they advertise their host's port with the above-mentioned configuration. Technically any node port will route the traffic, but by using NODE_IP you avoid extra hops.
- args:
    - query
    - --log.level=debug
    - --query.replica-label=ENVIRONMENT
    - --cluster.advertise-address=$(NODE_IP):10900
    - --grpc-advertise-address=$(NODE_IP):10901
  env:
    - name: NODE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
So the sequence I see is: the sidecar gossips to the query API, then the query API tries to connect back via gRPC. I'm not 100% sure whether the sidecar will ever initiate a gRPC connection to the query API. This should also work with the store API; I'm going to look at adding it next week.
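If it does, the same trick applied to the store component might look roughly like this (a sketch, assuming the store command accepts the same advertise flags; the data path and node ports here are made up):

- args:
    - store
    - --log.level=debug
    - --tsdb.path=/var/thanos/store                  # hypothetical data path
    - --cluster.advertise-address=$(NODE_IP):31900   # made-up node ports, distinct
    - --grpc-advertise-address=$(NODE_IP):31901      # from the query's ports
  env:
    - name: NODE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP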
I am currently deploying this; I found a solution in k8s:
args:
  - "sidecar"
  - "--log.level=debug"
  - "--tsdb.path=/var/prometheus"
  - "--prometheus.url=http://127.0.0.1:9090"
  - "--cluster.peers=$(PEERS)"
  - "--reloader.config-file=/etc/prometheus/prometheus.yml.tmpl"
  - "--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yml"
  - "--cluster.advertise-address=$(NODE_IP):30900"
  - "--grpc-advertise-address=$(NODE_IP):30901"
env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: status.hostIP
And a NodePort Service with externalTrafficPolicy: Local:

apiVersion: v1
kind: Service
metadata:
  name: thanos-peers-in-cluster
  namespace: thanos
spec:
  type: NodePort
  externalTrafficPolicy: Local
  ports:
    - name: cluster
      port: 10900
      targetPort: cluster
      nodePort: 30900
    - name: grpc
      port: 10901
      targetPort: grpc
      nodePort: 30901
  selector:
    # Useful endpoint for gathering all thanos components for common gossip cluster.
    thanos-peer: "true"
(See the Kubernetes docs on externalTrafficPolicy: Local.)
So, when a sidecar advertises itself as HOST_IP:30900, another component's request goes:
-> HOST_IP:30900
-> HOST_IP is a node of the other k8s cluster, port 30900
-> the Service maps nodePort 30900 -> port 10900
-> externalTrafficPolicy: Local keeps the traffic on that node
-> so it routes to 10900 on the same node
-> the sidecar
It works, but thanos-query complains:
level=debug ts=2018-10-02T01:39:48.240360133Z caller=stdlib.go:89 component=cluster caller="peers2018/10/02 01:39" msg=":48 [DEBUG] memberlist: Failed ping: 01CRS6VEFNA8NGYGSZCR8PA97B (timeout reached)"
level=debug ts=2018-10-02T01:39:48.740308321Z caller=stdlib.go:89 component=cluster caller="peers2018/10/02 01:39" msg=":48 [WARN] memberlist: Was able to connect to 01CRS6VEFNA8NGYGSZCR8PA97B but other probes failed, network may be misconfigured"
I am trying to figure out why; are there any clues?
The other problem is that you should use a different selector and NodePort for each Thanos component if they are deployed in the same cluster; otherwise traffic hitting a node's port can be routed to whichever matching pod is local, not necessarily the component that advertised it.
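For example, something like this for the store (a hypothetical sketch; the name, selector label, and node ports are made up and just need to avoid colliding with the sidecar's 30900/30901):

apiVersion: v1
kind: Service
metadata:
  name: thanos-store-in-cluster
  namespace: thanos
spec:
  type: NodePort
  externalTrafficPolicy: Local
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
      nodePort: 31901   # distinct from the sidecar's 30901
  selector:
    thanos-store-peer: "true"   # distinct label so only store pods match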