Title: Tag extraction doesn't work if resource name contains dots
Description:
_I've tested this for CLUSTER_NAME, but I assume it holds for others with similar regexes like listener, vhost, etc._
If you have a cluster name like foo.bar (this is every single cluster for us, since they are namespaced), then you'll have stats like cluster.foo.bar.upstream_cx_active. The tag extraction regex, however will match only foo and then use bar.upstream_cx_active for the stat name.
A reasonable workaround is to not use dots. However, if we take this approach, we should warn about this in the docs.
A fix will be a bit trickier. How do we possibly differentiate the resource name from the metric name when they both contain dots?
This issue might be considered an extension of #2687 and solving that in a general way would likely solve this. But it's a bit distinct in that there isn't a collision per se (in that two different stats don't resolve to the same thing), but it does have to do with tag extraction. Further, it isn't going to be solved with testing like in some of the other issues since it is dependent on the configuration. Either way, let me know if you think this is a dupe and I can move it to a comment on that issue.
Repro steps:
This repros on master with a trivial unit test change:
diff --git a/test/common/stats/tag_extractor_test.cc b/test/common/stats/tag_extractor_test.cc
index ac1c89c..7ca5dda 100644
--- a/test/common/stats/tag_extractor_test.cc
+++ b/test/common/stats/tag_extractor_test.cc
@@ -138,9 +138,9 @@ TEST(TagExtractorTest, DefaultTagExtractors) {
// Cluster name
Tag cluster_tag;
cluster_tag.name_ = tag_names.CLUSTER_NAME;
- cluster_tag.value_ = "ratelimit";
+ cluster_tag.value_ = "foo.bar";
- regex_tester.testRegex("cluster.ratelimit.upstream_rq_timeout", "cluster.upstream_rq_timeout",
+ regex_tester.testRegex("cluster.foo.bar.upstream_rq_timeout", "cluster.upstream_rq_timeout",
{cluster_tag});
// Listener SSL
@@ -163,7 +163,7 @@ TEST(TagExtractorTest, DefaultTagExtractors) {
cipher_suite.name_ = tag_names.SSL_CIPHER_SUITE;
cipher_suite.value_ = "ECDHE-RSA-AES128-GCM-SHA256";
- regex_tester.testRegex("cluster.ratelimit.ssl.ciphers.ECDHE-RSA-AES128-GCM-SHA256",
+ regex_tester.testRegex("cluster.foo.bar.ssl.ciphers.ECDHE-RSA-AES128-GCM-SHA256",
"cluster.ssl.ciphers", {cluster_tag, cipher_suite});
// ipv6 non-loopback (for alphabetical chars)
@ambuc could you investigate, since you're now de facto owner of this subsystem? :)
I think the right answer here is that cluster names can't contain dots. Should we add a check at runtime to reject configs which violate this (and update the docs) ? Like you said, this can't be detected with tests, since it has to fail on a specific "invalid" config.
Honestly, this seems like getting to the point where using a flat-name for the stat, computed from a format string, and then trying to unflatten it later, is breaking down. Should we consider storing/referencing stats by the non-flat name? They would be stored more like prometheus metrics then, where there's a basename and some attached labels.
I want to be able to store an fqdn (or similar) in the name of the cluster. Yes, I could encode the '.' in some way and decode it later, but it just feels like this is getting harder than it should be.
+1 to what @ggreenway said.. We use dots extensively in cluster names in Istio.. Heck, we also use '|'s as well. A non-flat name would work much better. Or how about using / as a separator between the components of a stat? WE could have a configurable option in bootstrap/CLI that allows the user to specify the stat separator.
It sounds to me like everyone is saying we want the stat key to be a proto :)
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This is affecting gRPC stats in my usage too.
I have gRPC services defined named with the project name, service name, and service version, e.g.:
syntax = "proto3";
package myapp.widgets.v0;
service Widgets {
...
}
In DataDog I am seeing stats like envoy.cluster.grpc.widgets.v0.Widgets.failure with a tag {envoy.grpc_bridge_service:myapp}
I am able to extract the full service name by defining a custom TagSpecifier regex:
stats_config:
stats_tags:
- regex: "^cluster(?=\\.).*?\\.grpc\\.((.*)\\..*\\.)" # greedy service name and trailing .Method name
tag_name: envoy.grpc_bridge_svc
However it would be nice to overwrite envoy.grpc_bridge_service but I get the error Tag name 'envoy.grpc_bridge_service' specified twice.
Most helpful comment
It sounds to me like everyone is saying we want the stat key to be a proto :)