Kibana: [APM] Collect telemetry about data/queries

Created on 15 Nov 2019 · 22 Comments · Source: elastic/kibana

We currently don't have a lot of insight into the amounts of data that our customers have and how long it takes for our ES queries to be processed. This makes it hard to judge at which scale current and new functionalities need to operate. For instance, in some cases it might be reasonable to process things in memory on the Node server rather than in ES, which might simplify our implementation. Additionally, optimizing right now is hard because we don't know where our users are experiencing slowness.

Ideally we would have telemetry about:

  • The data volume (how many errors/error groups? how many transactions? how many transactions/spans per trace? how many services? etc.). This could be collected with a Kibana task that queries the data indices at a set interval and sends the data back home (see the sketch after this list).
  • Query response times. We could instrument our ES client facade to store telemetry about ES response times. There's also the possibility of using the nodejs agent to instrument Kibana, but the ongoing efforts are explicitly scoped to non-production usage: https://github.com/elastic/kibana/pull/43548
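
As a rough sketch of the first bullet (the callCluster signature, the apm-* index pattern and the field names here are assumptions for illustration, not the actual implementation), the periodic Kibana task could boil down to a fetch function like this:

type CallCluster = (endpoint: string, params: object) => Promise<any>;

interface ApmDataTelemetrySketch {
  services: number;
  transactions: number;
  errors: number;
}

// Queries the (configurable) APM data indices and derives a few counts.
export async function fetchApmDataTelemetry(
  callCluster: CallCluster
): Promise<ApmDataTelemetrySketch> {
  const response = await callCluster('search', {
    index: 'apm-*', // assumption: default index pattern; users can change this
    body: {
      size: 0,
      aggs: {
        services: { cardinality: { field: 'service.name' } },
        events: { terms: { field: 'processor.event' } },
      },
    },
  });

  const buckets: Array<{ key: string; doc_count: number }> =
    response.aggregations.events.buckets || [];
  const countFor = (event: string) =>
    (buckets.find((b) => b.key === event) || { doc_count: 0 }).doc_count;

  return {
    services: response.aggregations.services.value || 0,
    transactions: countFor('transaction'),
    errors: countFor('error'),
  };
}

A scheduled task (or a usage collector registered with the usageCollection plugin) would call this at an interval and hand the result to the existing telemetry shipping mechanism.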

@graphaelli: any idea if the monitoring data that we have provides an answer to any of these questions?

Labels: Telemetry, apm

All 22 comments

Pinging @elastic/apm-ui (Team:apm)

Not sure about this, but would it be apm-server telemetry that reports the number of transactions, errors and metrics? Or would it be better to query against the various document types in ES, so as not to put additional pressure on the server?

I'd think that the query should happen in ES. AFAIK, APM Server is stateless, and at most knows about the number of documents it is currently processing (and there could be multiple instances of APM Server as well). Telemetry is reported from Kibana as well, so I would imagine this as a Kibana task that queries ES and then sends the telemetry up to our telemetry cluster.

@dgieselaar OK, that makes sense

Thanks for creating this issue. Super good point about us needing better insights into users' data volumes and the pain points in querying that data.

Query response times
I'd really hate for us to re-build the nodejs APM agent all over again, so I'd much prefer if we could investigate ways to piggyback on the extensive auto-instrumentation it has.
As far as I remember, we ran into two problems when enabling the agent in prod:
1) PII: we were worried about sending sensitive data
2) secretToken needed to be bundled with kibana, which is not great

Regarding #1, I think the node agent is quite configurable, and we might be able to turn off most things except for the elasticsearch query performance numbers. We might also be able to enable it just for specific plugins.

Regarding #2, we might be able to disable the transport to APM Server and instead send the data as telemetry, thus avoiding the need for a secretToken. This would probably require us to parse it first, but it would still give us a solution that can potentially auto-instrument all of Kibana and send consistent performance telemetry.
I've emphasised consistent here because it's key that performance data is collected the same way across plugins for us to be able to compare it.
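
To make this a bit more concrete, here is a minimal sketch of the kind of agent configuration being discussed, using option names from the elastic-apm-node documentation; the instrumentation list is illustrative, and whether this actually covers the PII concern would still need to be verified:

// Must run before any other module is loaded so the agent can instrument them.
import apm from 'elastic-apm-node';

apm.start({
  serviceName: 'kibana',
  // Re #1 (PII): avoid capturing request bodies and headers, and disable most
  // instrumentations so that effectively only the elasticsearch timings remain.
  captureBody: 'off',
  captureHeaders: false,
  disableInstrumentations: ['http', 'express', 'hapi'], // illustrative list; keep 'elasticsearch'
  // Re #2 (secretToken): not solved here — the idea above is to read the collected
  // data and send it as telemetry instead of shipping it to an APM Server.
});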

If using the APM agent is not feasible in the short term I'm good with starting out doing it ourselves.

Data volume
The APM agent doesn't do anything in this area, so I feel much better about solving this on our own.
Again, it would be optimal if we could build something that other plugins can use too, to improve the consistency of the collected data.

Looks like nested could work for us, nice stuff:


Copy and paste this into the console

PUT apm-telemetry-test

PUT apm-telemetry-test/_mapping
{
  "properties": {
    "plugins": {
      "properties": {
        "apm": {
          "properties": {
            "has_any_services": {
              "type": "boolean"
            },
            "services_per_agent": {
              "properties": {
                "go": {
                  "type": "long"
                },
                "java": {
                  "type": "long"
                },
                "js-base": {
                  "type": "long"
                },
                "rum-js": {
                  "type": "long"
                },
                "nodejs": {
                  "type": "long"
                },
                "python": {
                  "type": "long"
                },
                "dotnet": {
                  "type": "long"
                },
                "ruby": {
                  "type": "long"
                }
              }
            },
            "endpoint_responses": {
              "type": "nested",
              "properties": {
                "path": {
                  "type": "keyword"
                },
                "status_code": {
                  "type": "integer"
                },
                "response_time_ms": {
                  "type": "double"
                }
              }
            }
          }
        }
      }
    }
  }
}

POST apm-telemetry-test/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

POST apm-telemetry-test/_doc
{
  "plugins": {
    "apm": {
      "endpoint_responses": [
        {
          "path": "POST /api/apm/foo",
          "status_code": 200,
          "response_time_ms": 100
        },
        {
          "path": "GET /api/apm/foo",
          "status_code": 200,
          "response_time_ms": 200
        },
        {
          "path": "GET /api/apm/bar",
          "status_code": 500,
          "response_time_ms": 300
        }
      ]
    }
  }
}

GET apm-telemetry-test/_search

GET apm-telemetry-test/_search
{
  "size": 0,
  "aggs": {
    "average_response_time_per_endpoint": {
      "nested": {
        "path": "plugins.apm.endpoint_responses"
      },
      "aggs": {
        "by_path": {
          "terms": {
            "field": "plugins.apm.endpoint_responses.path"
          },
          "aggs": {
            "average_response_time":  {
              "avg": {
                "field": "plugins.apm.endpoint_responses.response_time_ms"
              }
            }
          }
        }
      }
    }
  }
}

What about:

interface TimeframeMap {
  '1d': number;
  '1mo': number;
  '6mo': number;
  all: number;
}

interface APMDataTelemetry {
  has_any_services: boolean;
  services_per_agent: {
    go: number;
    java: number;
    'js-base': number;
    'rum-js': number;
    nodejs: number;
    python: number;
    dotnet: number;
    ruby: number;
  };
  data_characteristics: {
    transactions: TimeframeMap;
    spans: TimeframeMap;
    errors: TimeframeMap;
    metrics: TimeframeMap;
    transaction_groups: Pick<TimeframeMap, 'all'>;
    error_groups: Pick<TimeframeMap, 'all'>;
    traces: Pick<TimeframeMap, 'all'>;
    services: Pick<TimeframeMap, 'all'>;
    agent_configurations: Pick<TimeframeMap, 'all'>;
  };
  integrations: {
    alerting: boolean;
    ml: boolean;
  }
}

interface APMPerformanceMeasurement {
  path: string;
  response_time_ms: number;
  status_code: number;
}

interface APMPerformanceTelemetry {
  total_api_requests: number;
  requests: APMPerformanceMeasurement[];
}

type APMTelemetry = APMDataTelemetry & APMPerformanceTelemetry;
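
For the data_characteristics part, the TimeframeMap above could presumably be filled with a single filters aggregation per event type. A minimal sketch (the callCluster signature, index pattern and timestamp field are assumptions):

type CallCluster = (endpoint: string, params: object) => Promise<any>;

async function countPerTimeframe(
  callCluster: CallCluster,
  event: 'transaction' | 'span' | 'error' | 'metric'
): Promise<TimeframeMap> {
  const since = (gte: string) => ({ range: { '@timestamp': { gte } } });
  const response = await callCluster('search', {
    index: 'apm-*',
    body: {
      size: 0,
      query: { term: { 'processor.event': event } },
      aggs: {
        timeframes: {
          filters: {
            filters: {
              '1d': since('now-1d'),
              '1mo': since('now-1M'),
              '6mo': since('now-6M'),
              all: { match_all: {} },
            },
          },
        },
      },
    },
  });
  const buckets = response.aggregations.timeframes.buckets;
  return {
    '1d': buckets['1d'].doc_count,
    '1mo': buckets['1mo'].doc_count,
    '6mo': buckets['6mo'].doc_count,
    all: buckets.all.doc_count,
  };
}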

Not sure if this is feasible with telemetry or not, but for agent developers, it would be super helpful to know overall counts for

  • service.agent.name / service.agent.version
  • service.framework.name / service.framework.version
  • service.language.name / service.language.version
  • service.runtime.name / service.runtime.version

All of these come in pairs, so it would be most meaningful to count distinct combinations.

Having this data would give us information on adoption rate of new agent versions, as well as some data of which frameworks and language versions are being used, which could inform decisions on deprecating support for old versions (Python 2.7 support comes to mind).

cc @elastic/apm-agent-devs

This is what we have today:

{
  "requests": [
    {
      "path": "GET /api/apm/index_pattern/dynamic",
      "response_time": {
        "ms": 139
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/services/{serviceName}/transaction_types",
      "response_time": {
        "ms": 552
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/ui_filters/environments",
      "response_time": {
        "ms": 569
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/services/{serviceName}/agent_name",
      "response_time": {
        "ms": 72
      },
      "status_code": 200
    },
    {
      "path": "POST /api/apm/index_pattern/static",
      "response_time": {
        "ms": 65
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/ui_filters/local_filters/transactionGroups",
      "response_time": {
        "ms": 788
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/services/{serviceName}/transaction_groups/breakdown",
      "response_time": {
        "ms": 939
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/services/{serviceName}/transaction_groups/charts",
      "response_time": {
        "ms": 970
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/services/{serviceName}/transaction_groups",
      "response_time": {
        "ms": 940
      },
      "status_code": 200
    }
  ],
  "total_requests": 9,
  "total_response_time": {
    "ms": 5034
  },
  "counts": {
    "span": {
      "1d": 718028,
      "1M": 11504559,
      "6M": 16720490,
      "all": 16720490
    },
    "transaction": {
      "1d": 119024,
      "1M": 2014449,
      "6M": 2867622,
      "all": 2867622
    },
    "metric": {
      "1d": 51093,
      "1M": 969633,
      "6M": 1307368,
      "all": 1307368
    },
    "error": {
      "1d": 14721,
      "1M": 227677,
      "6M": 312524,
      "all": 312527
    },
    "onboarding": {
      "1d": 1,
      "1M": 25,
      "6M": 36,
      "all": 36
    },
    "agent_configuration": {
      "all": 3
    },
    "transaction_group": {
      "all": 24
    },
    "error_group": {
      "all": 866
    },
    "trace": {
      "all": 981677
    },
    "service": {
      "all": 8
    }
  },
  "has_any_services": true,
  "services_per_agent": {
    "js-base": 0,
    "rum-js": 0,
    "dotnet": 0,
    "go": 1,
    "java": 2,
    "nodejs": 1,
    "python": 1,
    "ruby": 1
  },
  "versions": {
    "apm_server": {
      "major": 8,
      "minor": 0,
      "patch": 0
    }
  },
  "integrations": {
    "ml": true
  }
}

requests is an aggregation of the requests made throughout the time period, right? Should there be a count on each request (the number of times it was called)?

Btw since the requests have status_code: can the same request show up twice with different status codes?

Discussed on Slack: we'll drop the performance measurements for now. It's going to be cumbersome to actually use this data, because we cannot use nested objects, as the xpack-phone-home indices use index sorting.

I've added some agent and index metrics as well. I've also uploaded a bunch of dummy data to https://apm.elstc.co that uses the same mapping as what is used for the telemetry cluster, so we can test out dashboards, queries etc over there.

Some highlights of what I'm collecting so far:

  • counts of processor events over various time ranges (1d, 1M, 6M, all time)
  • counts of common groupings we use in the UI (error groups, transaction groups, traces, services)
  • number of services per agent
  • per agent, the top 3 values for agent.version, service.framework.name, service.framework.version, service.language.name, service.language.version, service.runtime.name, and service.runtime.version (see the sketch after this list)
  • whether the cluster has APM-specific ML indices
  • the most recent version of APM server that is being used (kibana and ES versions are tracked separately)
  • stats about indices (document/shard count, disk size)
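
As a sketch of how the per-agent "top 3 values" bullet above could be collected (the agent list, index pattern and callCluster signature are assumptions, not necessarily the shipped implementation):

type CallCluster = (endpoint: string, params: object) => Promise<any>;

const AGENT_NAMES = ['go', 'java', 'js-base', 'rum-js', 'nodejs', 'python', 'dotnet', 'ruby'];
const FIELDS = [
  'agent.version',
  'service.framework.name',
  'service.framework.version',
  'service.language.name',
  'service.language.version',
  'service.runtime.name',
  'service.runtime.version',
];

// For every agent, ask for the top 3 terms of each field in one size:0 search.
async function fetchTopValuesPerAgent(callCluster: CallCluster) {
  const result: Record<string, Record<string, string[]>> = {};
  for (const agentName of AGENT_NAMES) {
    const response = await callCluster('search', {
      index: 'apm-*',
      body: {
        size: 0,
        query: { term: { 'agent.name': agentName } },
        aggs: Object.fromEntries(FIELDS.map((field) => [field, { terms: { field, size: 3 } }])),
      },
    });
    result[agentName] = Object.fromEntries(
      FIELDS.map((field) => [
        field,
        response.aggregations[field].buckets.map((bucket: { key: string }) => bucket.key),
      ])
    );
  }
  return result;
}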

Here's a sample document:


Sample document

{
  "counts": {
    "error": {
      "1d": 233186,
      "1M": 3453848,
      "6M": 3453854,
      "all": 3453854
    },
    "metric": {
      "1d": 613004,
      "1M": 9461527,
      "6M": 9461527,
      "all": 9461563
    },
    "span": {
      "1d": 7949662,
      "1M": 120103499,
      "6M": 120104792,
      "all": 120105948
    },
    "transaction": {
      "1d": 1599822,
      "1M": 31030690,
      "6M": 31030798,
      "all": 31030845
    },
    "onboarding": {
      "1d": 0,
      "1M": 15,
      "6M": 15,
      "all": 15
    },
    "sourcemap": {
      "1d": 0,
      "1M": 0,
      "6M": 0,
      "all": 0
    },
    "agent_configuration": {
      "all": 43
    },
    "max_error_groups_per_service": {
      "all": 293394
    },
    "max_transaction_groups_per_service": {
      "all": 25
    },
    "traces": {
      "1d": 1259637,
      "all": 25892950
    },
    "services": {
      "all": 9
    }
  },
  "tasks": {
    "processor_events": {
      "took": {
        "ms": 41142
      }
    },
    "agent_configuration": {
      "took": {
        "ms": 15
      }
    },
    "services": {
      "took": {
        "ms": 50393
      }
    },
    "versions": {
      "took": {
        "ms": 17
      }
    },
    "groupings": {
      "took": {
        "ms": 60768
      }
    },
    "integrations": {
      "took": {
        "ms": 19
      }
    },
    "agents": {
      "took": {
        "ms": 2996
      }
    },
    "indices_stats": {
      "took": {
        "ms": 58
      }
    }
  },
  "has_any_services": true,
  "services_per_agent": {
    "java": 1,
    "js-base": 2,
    "rum-js": 0,
    "dotnet": 1,
    "go": 2,
    "nodejs": 1,
    "python": 1,
    "ruby": 1
  },
  "versions": {
    "apm_server": {
      "major": 8,
      "minor": 0,
      "patch": 0
    }
  },
  "integrations": {
    "ml": {
      "has_anomalies_indices": true
    }
  },
  "agents": {
    "java": {
      "agent": {
        "version": [
          "1.11.1-SNAPSHOT"
        ]
      },
      "service": {
        "framework": {
          "name": [],
          "version": []
        },
        "language": {
          "name": [
            "Java"
          ],
          "version": [
            "10.0.2"
          ]
        },
        "runtime": {
          "name": [
            "Java"
          ],
          "version": [
            "10.0.2"
          ]
        }
      }
    },
    "js-base": {
      "agent": {
        "version": [
          "4.6.0",
          "4.5.1"
        ]
      },
      "service": {
        "framework": {
          "name": [],
          "version": []
        },
        "language": {
          "name": [
            "javascript"
          ],
          "version": []
        },
        "runtime": {
          "name": [],
          "version": []
        }
      }
    },
    "rum-js": {
      "agent": {
        "version": []
      },
      "service": {
        "framework": {
          "name": [],
          "version": []
        },
        "language": {
          "name": [],
          "version": []
        },
        "runtime": {
          "name": [],
          "version": []
        }
      }
    },
    "dotnet": {
      "agent": {
        "version": [
          "1.1.2"
        ]
      },
      "service": {
        "framework": {
          "name": [
            "ASP.NET Core"
          ],
          "version": [
            "2.2.0.0"
          ]
        },
        "language": {
          "name": [
            "C#"
          ],
          "version": []
        },
        "runtime": {
          "name": [
            ".NET Core"
          ],
          "version": [
            "2.2.7"
          ]
        }
      }
    },
    "go": {
      "agent": {
        "version": [
          "1.6.0"
        ]
      },
      "service": {
        "framework": {
          "name": [
            "gin"
          ],
          "version": [
            "v1.4.0"
          ]
        },
        "language": {
          "name": [
            "go"
          ],
          "version": [
            "go1.13.4",
            "go1.12.12"
          ]
        },
        "runtime": {
          "name": [
            "gc"
          ],
          "version": [
            "go1.13.4",
            "go1.12.12"
          ]
        }
      }
    },
    "nodejs": {
      "agent": {
        "version": [
          "3.2.0"
        ]
      },
      "service": {
        "framework": {
          "name": [
            "express"
          ],
          "version": [
            "4.17.1"
          ]
        },
        "language": {
          "name": [
            "javascript"
          ],
          "version": []
        },
        "runtime": {
          "name": [
            "node"
          ],
          "version": [
            "12.13.0"
          ]
        }
      }
    },
    "python": {
      "agent": {
        "version": [
          "5.3.1"
        ]
      },
      "service": {
        "framework": {
          "name": [
            "django"
          ],
          "version": [
            "2.1.13"
          ]
        },
        "language": {
          "name": [
            "python"
          ],
          "version": [
            "3.6.9"
          ]
        },
        "runtime": {
          "name": [
            "CPython"
          ],
          "version": [
            "3.6.9"
          ]
        }
      }
    },
    "ruby": {
      "agent": {
        "version": [
          "3.1.0"
        ]
      },
      "service": {
        "framework": {
          "name": [
            "Ruby on Rails"
          ],
          "version": [
            "5.2.3"
          ]
        },
        "language": {
          "name": [
            "ruby"
          ],
          "version": [
            "2.6.5"
          ]
        },
        "runtime": {
          "name": [
            "ruby"
          ],
          "version": [
            "2.6.5"
          ]
        }
      }
    }
  },
  "indices": {
    "shards": {
      "total": 10
    },
    "all": {
      "total": {
        "docs": {
          "count": 164334688
        },
        "store": {
          "size_in_bytes": 49911744350
        }
      }
    }
  }
}

@elastic/apm Any thoughts/suggestions here? Here's how you can help:

  • Tell me if you're missing any data, or if there's data being tracked that will definitely not be useful
  • Create dashboards/panels/sample queries on https://apm.elstc.co. Log in as elastic and use the xpack-phone-home Kibana index pattern. Discover example: https://apm.elstc.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-90d,to:now))&_a=(columns:!(_source),index:ae184d10-1742-11ea-99bb-47d16bed49fc,interval:auto,query:(language:kuery,query:''),sort:!(!(timestamp,desc)))
  • Tell me specific questions that you'd like this data to answer so I can create a dashboard/panel/sample query based on it

Would be great to have this feedback somewhere next week so we can finish this up.

This looks great. A few thoughts:

  • telemetry on the oldest retained data, per event type (transaction, metrics, ...), to help determine whether the 1M stats really cover that much time or actually cover much less
  • how did we pick 6M as a timeframe?
  • stack_stats.kibana.plugins.apm.counts.services.all broken down over time, to indicate whether the stats should be stable or whether we should expect large changes
  • are stack_stats.kibana.plugins.apm.counts.max_error_groups_per_service.all and stack_stats.kibana.plugins.apm.counts.agent_configuration.all correct? They look too large in the sample data
  • have you considered any spans-per-transaction or stack-frames-per-error/span telemetry (per language)?

@elastic/kibana-stack-services do you all have any advice here? The APM team would like to collect telemetry regarding their "apm data" and usage. The APM "data indices" are configurable, and the kibana_system role doesn't, and shouldn't, have access to read from these data indices, as elaborated upon here.

I believe that Elasticsearch itself writes its own telemetry data that Kibana then consumes. Is Logstash or any other application doing something similar?

cc: @elastic/pulse

@kobelb I have yet to read this whole thread, but to answer your question: yes. All stack_stats.xpack stats are reported directly from Elasticsearch.

kibana_system can also read from the monitoring indices, which we utilize to send usage data about apm, logstash, and beats.

@kobelb If I understood you correctly, you would be okay(-ish) with the kibana_system role having access to apm-* indices by default for the purpose of collecting telemetry, is that right? And then we'd have to accept that in some cases we cannot collect telemetry because the user has an incompatible configuration for their apm data indices.

@kobelb If I understood you correctly, you would be okay(-ish) with the kibana_system role having access to apm-* indices by default for the purpose of collecting telemetry, is that right? And then we'd have to accept that in some cases we cannot collect telemetry because the user has an incompatible configuration for their apm data indices.

Correct. If we do intend to go this route, we'll have to be extra careful when using the elasticsearch client's callWithInternalUser method and ensure it's never used to allow end-users themselves to query the APM data indices.
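
To illustrate the caution above (the callWithInternalUser / callWithRequest names follow the legacy Kibana Elasticsearch plugin's getCluster('data') API; the two functions below are hypothetical, not the actual APM code):

// Stand-in types so the sketch is self-contained outside Kibana.
type CallWithInternalUser = (endpoint: string, params: object) => Promise<unknown>;
type CallWithRequest = (request: unknown, endpoint: string, params: object) => Promise<unknown>;

// In legacy Kibana these would come from server.plugins.elasticsearch.getCluster('data').
declare const callWithInternalUser: CallWithInternalUser;
declare const callWithRequest: CallWithRequest;

// Background telemetry task: the only place the kibana_system read access to apm-*
// should ever be exercised.
async function collectApmTelemetry() {
  return callWithInternalUser('search', { index: 'apm-*', body: { size: 0 } });
}

// User-facing route handler: always goes through the user's own credentials, so the
// internal user's privileges can never leak to end users.
async function getApmDataForUser(request: unknown) {
  return callWithRequest(request, 'search', { index: 'apm-*', body: { size: 0 } });
}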

Hey @ogupte, is my understanding correct that this is something that you'd like to implement for 7.7? Is it safe to assume that https://github.com/elastic/kibana/pull/51612 is the PR which would implement this functionality?

The @elastic/pulse team today was discussing how we should handle sending telemetry data for products besides Kibana, and this seems to fall into that category. I don't want to necessarily stall this effort while we continue to have this discussion, but I did want to make sure we weren't working ourselves into a corner.

@kobelb We're aiming for 7.7, but status is not entirely clear atm - I'll send you a message.

The @elastic/pulse team today was discussing how we should handle sending telemetry data for products besides Kibana, and this seems to fall into that category.

@dgieselaar and @kobelb The information above is super helpful in fleshing out additional Pulse service requirements for products besides Kibana. I've made notes from the whole discussion and will take them into account during planning specifications for these.

Sample document

...
 "integrations": {
   "ml": {
     "has_anomalies_indices": true
   }
 },
...

Sorry to be commenting late in the day, but instead of has_anomalies_indices would it be better to have something like has_apm_job_use? The reason is that anomalies indices are an internal implementation detail whereas jobs are the public interface. In other words, is the high level requirement to report when an APM job has been used?

https://github.com/elastic/elasticsearch/pull/52917#discussion_r393775715 is related.

EDIT:

would it be better to have something like has_apm_job_use?

https://github.com/elastic/elasticsearch/pull/52917#discussion_r393902892 and https://github.com/elastic/elasticsearch/pull/52917#discussion_r394218454 are related to this. Based on those comments it sounds like a better name would be something like has_apm_job.
