Linkerd2: Use tap data to add unmeshed resources to the octopus graph

Created on 4 Sep 2018 · 12 Comments · Source: linkerd/linkerd2

We only have stat data for meshed resources. But we can get the pod/owner from the tap data on the resource detail pages.

Use the tap data to display these unmeshed resources on the page.
Don't show any stats for them: tap samples requests, so any stats derived from tap would be sampled and would differ from the Prometheus stats.
Show a call to action to add those resources to the mesh.

area/web priority/P0 review

rmars

All 12 comments

Would another potential solution be to create an API for this which queries Prometheus and then uses the Kubernetes API to populate src and dst metadata?

I'm not suggesting we necessarily do this; I mostly want to check that my understanding of this issue is correct. I'm also wondering whether we should consider this in the future as a longer-term solution.

adleong on 4 Sep 2018

@adleong I think the major drawback of that approach is that we wouldn't be able to combine rows where the src and dst IPs map to the same metadata. For instance, if we wanted stats for an uninjected deployment that contained multiple pods, we wouldn't be able to produce a single percentile latency stat by combining the individual pod latencies returned by Prometheus. In order to calculate it correctly, we would need the source deployment label to exist in Prometheus.

klingerf on 4 Sep 2018
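
To illustrate the point about combining per-pod percentiles, here is a minimal Go sketch. It assumes latencies are recorded as histogram bucket counts, with made-up bucket bounds and counts, and it uses a simplified quantile estimate (the upper bound of the matching bucket) rather than Prometheus's interpolation. The names `quantile` and `merge` are hypothetical.

```go
package main

import "fmt"

// bounds are hypothetical latency bucket upper bounds in milliseconds.
var bounds = []float64{10, 50, 100, 500, 1000}

// quantile returns the upper bound of the first bucket whose cumulative
// count reaches q of the total. counts are per-bucket, not cumulative.
func quantile(q float64, counts []uint64) float64 {
	var total uint64
	for _, c := range counts {
		total += c
	}
	if total == 0 {
		return 0
	}
	target := q * float64(total)
	var seen uint64
	for i, c := range counts {
		seen += c
		if float64(seen) >= target {
			return bounds[i]
		}
	}
	return bounds[len(bounds)-1]
}

// merge sums bucket counts across pods, which is what a shared
// deployment label would allow on the query side.
func merge(pods ...[]uint64) []uint64 {
	out := make([]uint64, len(bounds))
	for _, p := range pods {
		for i, c := range p {
			out[i] += c
		}
	}
	return out
}

func main() {
	podA := []uint64{900, 95, 5, 0, 0} // busy, fast pod (illustrative data)
	podB := []uint64{0, 0, 0, 5, 5}    // quiet, slow pod (illustrative data)

	averaged := (quantile(0.99, podA) + quantile(0.99, podB)) / 2
	merged := quantile(0.99, merge(podA, podB))

	// Averaging per-pod p99 values ignores how much traffic each pod saw,
	// so it disagrees with the p99 of the merged buckets.
	fmt.Println("average of per-pod p99:", averaged)
	fmt.Println("p99 of merged buckets: ", merged)
}
```

With this sample data the averaged value is 525 ms while the merged value is 100 ms, because the quiet pod contributes only 10 of the 1010 requests.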

Ah, interesting. That's a good point, thanks!

adleong on 4 Sep 2018

Although this approach would not give us accurate metric values, I think it would be a more reliable way of showing the existence of unmeshed upstreams or downstreams, because tap could theoretically miss events to/from the unmeshed resource, especially if those requests are infrequent compared to the other traffic.

adleong on 4 Sep 2018

👍1
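
A rough way to see how easily an infrequent peer can be missed: the Go sketch below assumes a simplified model in which tap returns a fixed number of sampled requests and each one independently comes from the unmeshed resource with probability equal to that resource's share of traffic. This model and the name `missProbability` are illustrative assumptions, not a description of how tap actually samples.

```go
package main

import (
	"fmt"
	"math"
)

// missProbability returns the chance that none of `samples` sampled
// requests involve a peer responsible for `share` of the traffic,
// under the independence assumption stated above: (1 - share)^samples.
func missProbability(share float64, samples int) float64 {
	return math.Pow(1-share, float64(samples))
}

func main() {
	// Hypothetical traffic shares, each evaluated for 100 sampled requests.
	for _, share := range []float64{0.5, 0.05, 0.005} {
		fmt.Printf("share=%.3f miss=%.3f\n", share, missProbability(share, 100))
	}
}
```

Under this model, the smaller the peer's share of traffic, the more likely it is to be absent from a fixed-size sample.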

Hmm, yeah, that's a good point about sampling. It does seem like we could implement a separate API that queries source IPs from Prometheus and then converts them to a requested resource type using the Kubernetes API. From an implementation standpoint, however, it's more straightforward to just use tap data, since that data has the metadata we need and it's already being requested. But I agree that a new API to serve this data could be a more robust longer-term approach.

klingerf on 4 Sep 2018

👍1
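
The alternative API described above could be sketched roughly as follows, assuming source IPs were available from Prometheus. The sketch uses in-memory data in place of real Prometheus and Kubernetes clients, and `podInfo` and `ownersForIPs` are hypothetical names, not part of linkerd2.

```go
package main

import (
	"fmt"
	"sort"
)

// podInfo is a hypothetical, trimmed-down view of a pod as the
// Kubernetes API might describe it.
type podInfo struct {
	Name      string
	IP        string
	OwnerKind string
	OwnerName string
}

// ownersForIPs resolves each source IP to a pod and returns the distinct
// owners as sorted "kind/name" strings. IPs that match no known pod are
// returned separately, in input order.
func ownersForIPs(ips []string, pods []podInfo) (owners, unknown []string) {
	byIP := make(map[string]podInfo, len(pods))
	for _, p := range pods {
		byIP[p.IP] = p
	}
	seen := make(map[string]bool)
	for _, ip := range ips {
		p, ok := byIP[ip]
		if !ok {
			unknown = append(unknown, ip)
			continue
		}
		key := p.OwnerKind + "/" + p.OwnerName
		if !seen[key] {
			seen[key] = true
			owners = append(owners, key)
		}
	}
	sort.Strings(owners)
	return owners, unknown
}

func main() {
	// Illustrative data only.
	pods := []podInfo{
		{Name: "web-1", IP: "10.0.0.1", OwnerKind: "deployment", OwnerName: "web"},
		{Name: "web-2", IP: "10.0.0.2", OwnerKind: "deployment", OwnerName: "web"},
	}
	owners, unknown := ownersForIPs([]string{"10.0.0.1", "10.0.0.2", "10.0.0.9"}, pods)
	fmt.Println("owners:", owners)
	fmt.Println("unresolved IPs:", unknown)
}
```

Collapsing pods to their owner is enough to show that an unmeshed resource exists; as noted earlier in the thread, it would not be enough to compute correct aggregate latency percentiles for that resource.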

API proposal:

I'm thinking about reusing the ListPodsResponse and only populating the name, podIP, and owner fields... thoughts? Or should I just make a new Response type?

```protobuf
rpc ListUpstreams(ListUpstreamsRequest) returns (ListPodsResponse) {}

message ListUpstreamsRequest {
  ResourceSelection selector = 1;
}

// this is the same as ListPodsResponse
message ListUpstreamsResponse {
  repeated Pod pods = 1;
}

message Pod {
  string name = 1;
  string podIP = 2;
  oneof owner {
    string deployment = 3;
    string replica_set = 10;
    string replication_controller = 11;
    string stateful_set = 12;
    string daemon_set = 13;
    string job = 14;
  }
  string status = 4;
  bool added = 5; // true if this pod has a proxy sidecar (data plane)
  google.protobuf.Duration sinceLastReport = 6;
  string controllerNamespace = 7; // namespace of controller this pod reports to
  bool controlPlane = 8; // true if this pod is part of the control plane
  google.protobuf.Duration uptime = 9; // uptime of this pod
}
```

rmars on 5 Sep 2018

The protobuf looks good to me, but in order to avoid upstream/downstream confusion, I recommend:

rpc ListSourcePods(ListSourcePodsRequest) returns (ListPodsResponse) {}

I also think it makes sense to reuse ListPodsResponse, but it would probably be less code overall to continue to populate all of the fields in the same way that the ListPods endpoint does.

klingerf on 5 Sep 2018

👍1

Yeah, that's a good point about avoiding upstream/downstream confusion.
I like rpc ListSourcePods(ListSourcePodsRequest) returns (ListPodsResponse) {}.

There's a lot of code that can be reused from ListPods, for sure. The only fields that can't be hydrated in this way are uptime and added, which require a separate process_start_time Prometheus query.

Do you think it's better to have a separate type of Pod response with only the fields we care to hydrate? Or is it worth making the extra Prometheus query to hydrate uptime? (I'm inclined not to hydrate it, because we don't use it in the UI.)
(The other field that ListPods hydrates is added, but I can use the Prometheus results from this query to determine that and hydrate it here too.)

rmars on 5 Sep 2018

@rmars Ah, right, that totally makes sense. I think it's no prob at all to use the existing structs and only hydrate whichever fields are readily available from the Kubernetes API.

klingerf on 5 Sep 2018
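
The partial-hydration approach agreed on above might look roughly like this in Go. This is a sketch under assumptions: `Pod` only loosely mirrors a few fields of the proposed message, `kubePod` is a hypothetical stand-in for data read from the Kubernetes API, and `reporting` stands for the set of pod names seen in the Prometheus results.

```go
package main

import (
	"fmt"
	"time"
)

// Pod loosely mirrors a few fields of the Pod message from the proposal.
type Pod struct {
	Name   string
	PodIP  string
	Owner  string // e.g. "deployment/web"
	Status string
	Added  bool          // true if the pod has a proxy sidecar
	Uptime time.Duration // left at zero: would need an extra Prometheus query
}

// kubePod is a hypothetical stand-in for pod data from the Kubernetes API.
type kubePod struct {
	Name, IP, Owner, Phase string
}

// hydrate fills only the fields that are readily available. A pod whose
// name appears in reporting is treated as meshed (an assumption of this
// sketch); Uptime is deliberately left unset.
func hydrate(pods []kubePod, reporting map[string]bool) []Pod {
	out := make([]Pod, 0, len(pods))
	for _, p := range pods {
		out = append(out, Pod{
			Name:   p.Name,
			PodIP:  p.IP,
			Owner:  p.Owner,
			Status: p.Phase,
			Added:  reporting[p.Name],
		})
	}
	return out
}

func main() {
	// Illustrative data only.
	pods := []kubePod{
		{Name: "web-1", IP: "10.0.0.1", Owner: "deployment/web", Phase: "Running"},
		{Name: "legacy-1", IP: "10.0.0.7", Owner: "deployment/legacy", Phase: "Running"},
	}
	for _, p := range hydrate(pods, map[string]bool{"web-1": true}) {
		fmt.Printf("%s owner=%s added=%t uptime=%s\n", p.Name, p.Owner, p.Added, p.Uptime)
	}
}
```

Reusing one struct keeps the response type shared with ListPods, at the cost that callers must know which fields are intentionally left empty.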

I... realized we don't actually have source IPs in our prometheus data 😭
Filed #1592

rmars on 6 Sep 2018

D'oh -- sorry about that. I didn't think it through enough when suggesting the other approach. Thanks for filing that issue.

klingerf on 6 Sep 2018

Ahaha, np! I was running under the same assumptions as you and thought we did have that data! But we don't :(

rmars on 6 Sep 2018
