This issue came up during the implementation of https://github.com/elastic/kibana/issues/39282
AWS CloudWatch metrics, and by extension metricbeat's aws.ec2 metricset, contain fields that have to be interpreted differently depending on whether the user has enabled (and paid for) "detailed" CloudWatch monitoring:
NetworkOut:

> The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from a single instance.
>
> The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
>
> Units: Bytes

(from https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html )
The affected metrics are NetworkIn, NetworkOut, DiskReadBytes, and DiskWriteBytes. They are mapped to the fields aws.ec2.network.in.bytes, aws.ec2.network.out.bytes, aws.ec2.diskio.read.bytes, and aws.ec2.diskio.write.bytes, respectively.
Whether "detailed" monitoring is enabled is captured in the field aws.ec2.instance.monitoring.state.
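The divide-by-300-or-60 rule from the CloudWatch docs, keyed off that field, boils down to something like this (a hypothetical TypeScript helper for illustration, not existing Kibana or metricbeat code):

```typescript
// Hypothetical helper illustrating the CloudWatch rule above: raw values like
// NetworkOut are per-period totals, and the period length depends on the
// monitoring level recorded in aws.ec2.instance.monitoring.state.
function bytesPerSecond(totalBytes: number, monitoringState: string): number {
  // 'disabled' => basic (five-minute) monitoring, 300s period;
  // otherwise  => detailed (one-minute) monitoring, 60s period.
  const periodSeconds = monitoringState === 'disabled' ? 300 : 60;
  return totalBytes / periodSeconds;
}

// The same 30000-byte total means very different rates:
bytesPerSecond(30000, 'disabled'); // 100 B/s
bytesPerSecond(30000, 'enabled'); // 500 B/s
```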
Currently there is no easy way to perform custom calculations on a metric that is fetched through TSVB models.
I did experiment with this code in the model definition (in kibana/x-pack/legacy/plugins/infra/server/lib/adapters/metrics/models/aws/aws_network_bytes.ts):
```ts
{
  id: 'tx',
  metrics: [
    {
      field: 'aws.ec2.network.out.bytes',
      id: 'max-net-out',
      type: InfraMetricModelMetricType.max,
    },
    {
      id: 'by-second-max-net-out',
      type: InfraMetricModelMetricType.calculation,
      variables: [
        { id: 'var-max', name: 'max', field: 'max-net-out' },
        {
          id: 'var-detailed-monitoring',
          name: 'detailed-monitoring',
          field: 'aws.ec2.instance.monitoring.state',
        },
      ],
      script: 'params.detailed-monitoring == "disabled" ? params.max / 300 : params.max / 60',
    },
  ],
  split_mode: 'everything',
}
```
This does not work, presumably because string comparison is not possible in the script field of a TSVB model.
This feature request is about adding this functionality to the node detail page and the corresponding metrics GraphQL resolver, and performing the correct calculations for AWS metrics.
As long as this isn't implemented, we shouldn't show the affected metrics.
Pinging @elastic/infra-logs-ui
It is possible that DiskWriteOps and DiskReadOps (aws.ec2.diskio.read.count and aws.ec2.diskio.write.count) are affected by this as well, though the documentation is unclear. (Cc: @kaiyan-sheng )
@kaiyan-sheng Another idea, could this maybe be solved in metricbeat directly?
@skh What if we added an attribute to the metadata endpoint that indicated whether or not they have detailed metrics? If we had that, then the two approaches I could see are:
Pass the metadata to the TSVB models, then use a template string to inject the correct value (60 or 300) into the script dynamically.

> It is possible that DiskWriteOps and DiskReadOps (aws.ec2.diskio.read.count and aws.ec2.diskio.write.count) are affected by this as well, though the documentation is unclear. (Cc: @kaiyan-sheng )
Yes, I tried yesterday to create an EC2 instance but couldn't get DiskWrite/ReadOps to be non-zero. I assume that for these two metrics the definition of "a specified period of time" in the documentation is the same as for DiskReadBytes and DiskWriteBytes.
> @kaiyan-sheng Another idea, could this maybe be solved in metricbeat directly?
@skh You mean calculating rate for all NetworkIn, NetworkOut, DiskReadBytes, DiskWriteBytes, DiskReadOps and DiskWriteOps? Since monitoringState is available, this shouldn't be hard.
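Sketched for illustration (metricbeat itself is written in Go; this is plain TypeScript, and `toRates` is a made-up helper, not metricbeat code), the shipper-side calculation would be roughly:

```typescript
// The monitoring state determines the CloudWatch reporting period.
function reportingPeriodSeconds(monitoringState: 'enabled' | 'disabled'): number {
  return monitoringState === 'disabled' ? 300 : 60;
}

// Convert raw per-period totals (e.g. NetworkIn, DiskReadOps) into
// per-second rates before indexing, so consumers never see the raw values.
function toRates(
  rawTotals: Record<string, number>,
  monitoringState: 'enabled' | 'disabled'
): Record<string, number> {
  const period = reportingPeriodSeconds(monitoringState);
  const rates: Record<string, number> = {};
  for (const [metric, total] of Object.entries(rawTotals)) {
    rates[metric] = total / period;
  }
  return rates;
}
```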
> You mean calculating rate for all NetworkIn, NetworkOut, DiskReadBytes, DiskWriteBytes, DiskReadOps and DiskWriteOps?
Yeah this sounds like the best solution since, as far as I can tell, we don't care about the "raw values" ever again. Then "whether the user has detailed monitoring" becomes completely irrelevant to all consumers of the data after it's been stored, if I understand correctly.
Yes. I would also favour this solution.
Another option (for now) would be to use a sum aggregation for these fields and always calculate the rate based on 300 seconds.
As we don't have variable bucket intervals yet (that would come with https://github.com/elastic/beats/issues/12616) this would be correct in the following two cases:
- 300s reporting period configured in metricbeat for the aws module (the default). In this case we have one value in our 300-second bucket and divide it by 300 to get the rate.
- 60s reporting period configured in metricbeat for the aws module. In this case we have five values in our 300-second bucket; we sum them up and then divide by 300 to get the rate. It would be the average rate over the last 300 seconds instead of the last 60 seconds, but it would be correct.

In the following case it would be incorrect:

- Detailed monitoring enabled, but a 300s reporting period configured in metricbeat. In this case we have one value in our 300-second bucket, but it is the value for the last 60 seconds, not the last 300. We would divide by 300 and the value would be one-fifth of what it should be.

We could point out in the documentation that the reporting period for the metricbeat aws module has to be adjusted when detailed monitoring is available.
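With made-up sample values, the three scenarios work out as follows (the 300-second bucket sum is always divided by 300):

```typescript
const bucketSum = (values: number[]) => values.reduce((a, b) => a + b, 0);

// Correct: basic monitoring, 300s reporting period -> one value per bucket.
// A single 300s total of 30000 bytes gives 100 B/s.
const basic = bucketSum([30000]) / 300; // 100

// Correct: detailed monitoring, 60s reporting period -> five values per bucket.
// Five 60s totals of 6000 bytes each also give 100 B/s, averaged over 300s.
const detailed = bucketSum([6000, 6000, 6000, 6000, 6000]) / 300; // 100

// Incorrect: detailed monitoring but a 300s reporting period -> only one 60s
// value lands in the bucket, so the result is one-fifth of the true rate.
const broken = bucketSum([6000]) / 300; // 20 instead of 100
```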
@simianhacker @kaiyan-sheng what do you think?
After further discussion, this is now implemented as:
1. Leave the bucket size hard-coded at >=5m.
2. Use a sum aggregation in every bucket. That way, if the user has detailed monitoring and a reporting period of 60s, we sum up even in the smallest possible buckets. If the user zooms out and the TSVB endpoint therefore uses larger buckets (note that the value we use is >=5m, not 5m), we also sum up if the user has only basic monitoring and a reporting period of 300s.
3. Calculate the cumulative sum in each bucket, i.e. each bucket holds the sum of all buckets before it plus its own value. This turns the metric into a monotonically increasing counter.
4. Take the derivative with unit: 1s to get the per-second rate regardless of bucket size.
Note that this will still be wrong if the user has detailed monitoring enabled in AWS but configured a 300s reporting period in metricbeat.
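The sum → cumulative sum → derivative pipeline can be sketched as follows (plain TypeScript standing in for the TSVB aggregations; `perSecondRate` is illustrative, not actual Kibana code):

```typescript
// Mimics the TSVB pipeline: per-bucket sums -> cumulative sum -> derivative.
function perSecondRate(bucketSums: number[], bucketSeconds: number): number[] {
  // Cumulative sum turns the metric into a monotonically increasing counter.
  const cumulative: number[] = [];
  let running = 0;
  for (const sum of bucketSums) {
    running += sum;
    cumulative.push(running);
  }
  // Derivative with unit 1s: difference between adjacent buckets, per second.
  return cumulative.slice(1).map((value, i) => (value - cumulative[i]) / bucketSeconds);
}

// Three 300s buckets each summing to 30000 bytes -> a steady 100 B/s.
perSecondRate([30000, 30000, 30000], 300); // [100, 100]
```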
Hi @skh, the PR for calculating rate metrics in the ec2 metricset is merged.
@skh @simianhacker with the changes that landed in https://github.com/elastic/beats/pull/13203 can we remove our workaround and just use the raw numbers? Will those beats changes have any effect on our workaround?