Kibana: Measure APM agent impact on the platform performance

Created on 29 Sep 2020 · 33 Comments · Source: elastic/kibana

We are working on enabling APM agent on the prod build https://github.com/elastic/kibana/issues/70497. Before making this happen we want to understand what performance overhead it adds to the Kibana server. We might be able to re-use the setup introduced in https://github.com/elastic/kibana/issues/73189 to measure the average response time & number of requests Kibana can handle with and without APM agent enabled.

Core enhancement


All 33 comments

Pinging @elastic/kibana-platform (Team:Platform)

Setup

API performance testing is based on the setup from https://github.com/dmlemeshko/kibana-load-testing. I adjusted the number of requests so as not to overwhelm the APM server.

  setUp(
    scn.inject(
      constantConcurrentUsers(15) during (2 minute),
      rampConcurrentUsers(15) to (20) during (2 minute)
    ).protocols(httpProtocol)
  ).maxDuration(15 minutes)

Tests are run against 7.10.0-SNAPSHOT.

Results

The APM agent seems to add significant overhead (see the 95th percentile).

Without APM agent:

2020-10-08_11-11-25
download result in html 7.10.0-without-apm.zip

With APM agent:

2020-10-08_11-12-07
download result in html 7.10.0-with-apm.zip

The tested Kibana image doesn't contain the changes introduced in https://github.com/elastic/kibana/pull/78697, so I added breakdownMetrics: false to the dist APM config manually. It slightly improves the situation:
2020-10-08_11-33-55

https://www.elastic.co/guide/en/apm/agent/nodejs/master/performance-tuning.html provides some details on how to squeeze out a bit more performance.
The simplest way is to reduce the sample rate: https://www.elastic.co/guide/en/apm/agent/nodejs/master/performance-tuning.html#performance-sampling
It seems we can use 0.2-0.3 as a default value and adjust it via the config file.
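As a rough illustration, the kibana.yml override would look something like this (keys match the config snippet shared later in this thread; the 0.2 value is only an example, not a recommendation):

elastic.apm.active: true
# record only ~20% of transactions; lower values reduce agent overhead
# at the cost of trace completeness
elastic.apm.transactionSampleRate: 0.2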
Numbers:

  • sample ratio 0.1
    2020-10-08_12-38-47
  • sample ratio 0.2
    2020-10-08_12-09-58
  • sample ratio 0.3
    2020-10-08_12-16-32
  • sample ratio 0.5
    2020-10-08_12-37-31

Other config values don't seem to affect CPU as much as the sample rate does, so I decided not to change them. @vigneshshanmugam do you have anything to add?

As you have already figured out, transactionSampleRate is the go-to setting we recommend for tuning both the Node.js and RUM agents for performance, as it drops transactions based on this value.

Perf tuning RUM agent - https://www.elastic.co/guide/en/apm/agent/rum-js/current/performance-tuning.html

  • breakdownMetrics - Disabling it certainly helps a lot in the RUM agent for custom transactions vs page-load transactions. I don't know how the above test schedules the load and which browsers it runs, so I can't say for sure whether it's going to have a huge impact. But my recommendation would be to keep it set to false if it helps.

  • centralConfig - Disable this one, as it introduces one additional request to the APM server. Defaults to true in Node.js and false in RUM.

  • metricsInterval - Can you try increasing this interval or disabling metrics reporting and check if it helps? This controls metrics capturing in the Node agent. https://www.elastic.co/guide/en/apm/agent/nodejs/master/configuration.html#metrics-interval

I can't seem to find any other config that would help.
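For reference, these agent options map to Kibana's elastic.apm.* keys (the same keys appear in the kibana.yml snippet later in this thread); a minimal sketch, assuming the Node agent's documented behaviour that an interval of 0s disables metrics collection entirely:

elastic.apm.breakdownMetrics: false
elastic.apm.centralConfig: false
# report metrics less often, or use '0s' to turn metrics collection off
elastic.apm.metricsInterval: '120s'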

I don't know how the above test schedules the load and which browsers it runs, so I can't say for sure whether it's going to have a huge impact. But my recommendation would be to keep it set to false if it helps.

It tests the server-side API performance only.

centralConfig - Disable this one, as it introduces one additional request to the APM server. Defaults to true in Node.js and false in RUM.

already disabled in my tests

metricsInterval - Can you try increasing this interval or disabling metrics reporting and check if it helps? This controls metrics capturing in the Node agent.

A test with metricsInterval: '120s' and transactionSampleRate: 0.3 slightly improves the situation (compared to transactionSampleRate: 0.3 alone):
2020-10-08_14-46-02

So in summary, even with 'best' compromise configuration, 95th percentile is doubled, and 50th percentile tripled, right? This is... significant.

So in summary, even with 'best' compromise configuration, 95th percentile is doubled, and 50th percentile tripled, right? This is... significant

With the best configuration (transactionSampleRate: 0.1, breakdownMetrics: false, centralConfig: false, metricsInterval: '120s'), the 50th percentile roughly doubles from 118 ms to 225 ms, and the 95th percentile almost doubles from 574 ms to 950 ms.
It mostly affects query functionality when requesting the /api/saved_objects/* & /api/metrics/vis/data endpoints. The query timeseries data test case almost triples.

with APM enabled:

@TinaHeiligers you asked how to perform testing:

how to run Kibana with APM agent locally:

  • clone https://github.com/elastic/apm-integration-testing
  • cd apm-integration-testing
  • Run ES & APM servers with ./scripts/compose.py start master --no-kibana
  • cd ../kibana
  • you might need to change elasticsearch credentials (I used admin/changeme)
  • make sure APM agent is active and points to the local APM server - set in kibana.yml:
elastic.apm.active: true
elastic.apm.serverUrl: 'http://127.0.0.1:8200'
# elastic.apm.secretToken: ... <-- might be required in prod/cloud
# optional settings to adjust performance
# see https://www.elastic.co/guide/en/apm/agent/nodejs/master/configuration.html
elastic.apm.centralConfig: false
elastic.apm.breakdownMetrics: false
elastic.apm.transactionSampleRate: 0.1
elastic.apm.metricsInterval: '120s'
  • run Kibana: ELASTIC_APM_ACTIVE=true yarn start
  • you can see transactions in the APM app
  • stop Kibana
  • stop ES & APM servers: cd apm-integration-testing; ./scripts/compose.py stop

how to run load testing against Kibana: see https://github.com/elastic/kibana-load-testing (the mvn commands used are shown later in this thread)

how to test Kibana on Cloud

  • spin up Kibana v7.10 server on Cloud
  • adjust https://github.com/elastic/kibana-load-testing config to point to Cloud instance
  • run kibana-load-testing against v7.10 on Cloud to get numbers without APM agent enabled.
  • spin up APM server on Cloud
  • adjust the kibana.yml file to enable the APM agent (ask the Cloud team for assistance; elastic.apm.* settings aren't in the allow list) and point it to the APM server in Cloud (see the sketch after this list)
  • make sure APM agent works and Kibana communicates with APM server (see APM app in Kibana)
  • perform load testing and compare numbers with the previous results
  • feel free to adjust APM settings to compare how config values affect the results
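A minimal sketch of that kibana.yml change, reusing the keys from the local instructions above (the server URL and token are placeholders for whatever your Cloud APM deployment provides):

elastic.apm.active: true
# placeholder: use the APM endpoint URL shown for your Cloud deployment
elastic.apm.serverUrl: 'https://<your-cloud-apm-endpoint>:443'
# the secret token from the Cloud APM server settings is typically required
elastic.apm.secretToken: '<secret-token>'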

@restrry I've followed your instructions above and with a little tweaking, was able to run the load tests against a local Kibana instance with and without APM running (through Docker).

My setup thus far:

  • Kibana local on master
  • ES and APM server through Docker
  • Load testing against the default DemoJourney with version=8.0.0

I left the DemoJourney simulation as is regarding requests:

  setUp(
    scn
      .inject(
        constantConcurrentUsers(20) during (3 minute), // 1
        rampConcurrentUsers(20) to (50) during (3 minute) // 2
      )
      .protocols(httpProtocol)
  ).maxDuration(15 minutes)

In the screenshots below, I've highlighted the same queries in both cases for ease of comparison.
Without APM:
local_without_APM

Full Results:
local_Kibana_without_APM.zip

With APM, using the Kibana apm settings suggested in the instructions:
local_with_APM

Full Results:
local_Kibana_with_APM.zip

Summary:
We are indeed seeing an impact of APM on Kibana performance, with an increase in the 95th percentile response times.
I'll redo everything from v7.10-SNAPSHOT, after which I'll move on to Cloud unless I hear otherwise 😉 .

Looks good overall. The only outlier is the query dashboard list case, which is faster in the 95th percentile with the APM agent enabled.

I'll redo everything from v7.10-SNAPSHOT, after which I'll move on to Cloud unless I hear otherwise 😉 .

🚀

Progress was slow today; I really struggled to get Kibana 7.10 running and resorted to running Kibana from the distributable.

Load tests without APM:
Elasticsearch: snapshot v7.10
Kibana: 7.10 (distributable)

kibana-7_10-localDistributable-no-APM
Note: Nothing really useful from this setup as roughly half of the queries threw errors.

_Full results:_
demojourney-20201111231041086.zip

Load tests with APM:
Elasticsearch and APM run from Docker (v7.10)
Kibana: 7.10 (distributable) with apm configured

Kibana-7_10-localDistributable-with-APM

_Full results:_
demojourney-20201111234910265.zip

Summary:
There's a huge discrepancy in the results from the queries that were successful. I don't trust these results and am moving on to Cloud testing instead. Hopefully that will be more reliable 😉

Note: Nothing really useful from this setup as roughly half of the queries threw errors.

@dmlemeshko I experienced a similar problem when only the login scenario succeeded. What could be a reason for this?

@TinaHeiligers What Cloud settings did you use? There are recommended ones in https://github.com/elastic/kibana-load-testing

elasticsearch {
    deployment_template = "gcp-io-optimized"
    memory = 8192
}
kibana {
    memory = 1024
}

I fixed a login issue for 7.10 when running load testing with a new deployment; the Canvas endpoints also needed to be updated.
Here is my test run:

export API_KEY=<Key generated on Staging Cloud>
export deployConfig=config/deploy/7.10.0.conf
mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

Gatling Stats - Global Information 2020-11-12 12-26-54

demojourney-20201112111618663.zip

7.10.0.conf deploy config has the same memory values @restrry posted above

Another run with _rampConcurrentUsers_ changed to 20..150
Gatling Stats - Global Information 2020-11-12 12-52-25

demojourney-20201112113251442.zip

@restrry

What Cloud settings did you use?

I haven't tested on Cloud yet, I'll do that today with the recommended settings.

@dmlemeshko Thanks for fixing that issue! I reran the load test on a local Kibana 7.10 distributable and am no longer getting the errors seen previously.

Test setup for both runs:

setUp(
    scn
      .inject(
        constantConcurrentUsers(20) during (3 minute), // 1
        rampConcurrentUsers(20) to (50) during (3 minute) // 2
      )
      .protocols(httpProtocol)
  ).maxDuration(15 minutes)

Load tests without APM:
Elasticsearch: snapshot v7.10
Kibana: 7.10 (distributable)
Kibana-7_10-distributable-no-APM-test-2

Full result
demojourney-20201112153808121.zip

Load tests with APM:
Elasticsearch and APM run from Docker (v7.10)
Kibana: 7.10 (distributable) with apm configured
Kibana-7_10-distributable-with-APM-test2

Full result
demojourney-20201112161648302.zip

Summary:
With the exception of the request to discover and discover query 2, all response times increase when APM is enabled.
For the response times already starting at over 500 ms, the increase ranged between 12% and 40%, taking the login response time to over 1000 ms, with "query gauge data" approaching the 1000 ms mark.

On Cloud staging, using an existing deployment without APM:

Load test results without APM
Cloud_Staging_no_APM

Full Result
demojourney-20201114173519337.zip

I'm reaching out to the Cloud folks to add the apm* config to the Cloud deployment and will post the results when I have them.

On cloud staging, using an existing deployment with APM:

Test run:

mvn install
export env=config/cloud-tina-7.10.0.conf # contains the details of the cloud staging env
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

Load test results with APM
cloud_staging_with_APM

Full results
demojourney-20201117164239422.zip

On Cloud staging, creating a deployment as part of the test run:

export API_KEY=<Key generated on Staging Cloud>
export deployConfig=config/deploy/7.10.0.conf
mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

Script-created deployment
deploy-config:

version = 7.10.0

elasticsearch {
    deployment_template = "gcp-io-optimized"
    memory = 8192
}

kibana {
    memory = 1024
}

Load test results
Cloud_Kibana7-10-auto-deploymentcreation

Full Result
demojourney-20201112213550120.zip

On Cloud staging, creating a deployment as part of the test run: Not done

@restrry I've added the results from the Kibana load testing on the cloud (staging) test run where APM is enabled in Kibana.
The results are similar to what we've seen on local instances of Kibana with and without APM: An overall increase in the 95th percentile response times by ~16%.
For both test runs, the number of concurrent users was set to 20 during 3 min and the number of users was ramped up from 20 to 50 during a 3 minute interval.

Please let me know if I should repeat the tests with fewer/more concurrent users and/or change any of the APM settings.
I will document the steps to take to add configurations not exposed by default on Cloud. Please let me know where the best place is to add these (I don't think making it public in this issue is appropriate 😉 )

cc @joshdover

@TinaHeiligers @restrry
If you want "cleaner" test results, I suggest spinning up a VM in the same region where you create the stack deployment.

I can help with it, but if you are familiar with how to add a VM, the follow-up steps are:

# e.g. I run tests and create the VM in Frankfurt (europe-west3-a)
# zip the project and upload it to the VM
zip -r KibanaLoadTesting.zip .
gcloud compute scp ~/github/KibanaLoadTesting.zip root@<vm-name>:/home/<user-name>/test --zone=europe-west3-a
# start a docker image with JDK/maven in another terminal
sudo docker run -it -v "$(pwd)"/test:/local/git --name java-maven --rm jamesdbloom/docker-java8-maven
# run the tests with the same command you used locally
# download the test results
sudo tar -czvf my_results.tar.gz /home/<user-name>/test/KibanaLoadTesting/target/gatling/demojourney-<report-folder>
gcloud compute scp root@<vm-name>:/home/<user-name>/test/KibanaLoadTesting/target/gatling/my_results.tar.gz </local-machine-path-to-save-at> --zone=europe-west3-a

@dmlemeshko I'm not familiar with adding a VM and would greatly appreciate your help! I'm happy to watch you go through the process on Zoom. In the meantime, I'll work through the guide.

Why do we have such a significant difference between "On Cloud staging, using an existing deployment with APM" and "On Cloud staging, creating a deployment as part of the test run"?
I think it makes sense to spin up a new deployment for both Kibana & kibana-load-testing, as @dmlemeshko suggested in https://github.com/elastic/kibana/issues/78792#issuecomment-729123745.
I scheduled a call to discuss the testing strategy.

Here are the steps to spin up a Google Cloud VM and run tests on it:

Log in to https://console.cloud.google.com/ with your corp account
Create a CPU-optimized VM (4 CPUs, 16 GB memory is enough) with a Container-Optimized OS boot disk, e.g. _load-testing-vm_
Note: use the us-central1 region, same as for the stack deployment
Zip https://github.com/elastic/kibana-load-testing and copy it to the VM

Connect to the VM and create a _test_ folder:

gcloud beta compute ssh --zone "us-central1-a" "load-testing-vm" --project "elastic-kibana-184716"
mkdir test
chmod 777 test 

In another terminal, upload the archive to the VM:

sudo gcloud compute scp KibanaLoadTesting.tar.gz <user>@load-testing-vm:/home/<user>/test  --zone "us-central1-a" --project "elastic-kibana-184716"

In the first terminal (the VM), unzip the project and start a Docker container, mapping the local path into the container so you can later exit the container and keep the results on the VM:

cd test
tar -xzf KibanaLoadTesting.tar.gz
sudo docker run -it -v "$(pwd)":/local/git --name java-maven --rm jamesdbloom/docker-java8-maven

Now you are in the container and should be able to see the _test_ folder that contains the unzipped project. Run the tests as you would locally:

export API_KEY=<Your API Key>
export deployConfig=config/deploy/7.10.0.conf
mvn clean -Dmaven.test.failure.ignore=true compile
mvn gatling:test -Dgatling.simulationClass=org.kibanaLoadTest.simulation.DemoJourney

When the tests are done, type exit. Check target/gatling for your test results. Zip them and download to your local machine:

sudo tar -czvf results.tar.gz demojourney-20201118160915491/

From your local machine, run:

sudo gcloud compute scp  <user>@load-testing-vm:/home/<user>/test/target/gatling/results.tar.gz . --zone=us-central1-a

The results should be available in the current directory.

I think it'd also be worth understanding the difference between 7.11 w/ APM vs 7.10 and 7.9 w/o APM. Due to the many performance tweaks that were made to support Fleet, there may not be a large regression in 7.11 w/ APM enabled. If the difference is smaller, enabling this in 7.11 clusters may be an easier pill to swallow.

Next, I'd also like to experiment with tweaking some other settings to see if we get any performance improvements (sketched after the list below):

  • elastic.apm.asyncHooks: false
  • elastic.apm.disableInstrumentations

    • Modules to try disabling: bluebird, graphql

    • Full list of instrumented modules here

    • We could try disabling hapi, elasticsearch, and http, but I suspect those are the most useful ones. If disabling any of these improves the numbers, we may need to ask the APM agent team to help us optimize those.
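A minimal kibana.yml sketch of these experiments (the values are assumptions for illustration only; the Node agent accepts disableInstrumentations as a list or comma-separated string of module names):

# fall back to patch-based async context tracking instead of async_hooks
elastic.apm.asyncHooks: false
# skip instrumenting specific modules; bluebird and graphql are the candidates above
elastic.apm.disableInstrumentations: ['bluebird', 'graphql']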

If none of these result in improved performance, we may need to work directly with the APM team to look at some flamegraphs / profiles and see where most of the time is being spent in the APM agent code.

@dmlemeshko I'm stuck on the step:

In the first terminal (the VM), unzip the project and start a Docker container, mapping the local path into the container so you can later exit the container and keep the results on the VM

When I run the following (in the VM)

christianeheiligers@heiligers-loadtest-kibana:~/test$ sudo docker run -it -v "$(pwd)":/local/git --name java-maven --rm jamesbloom/docker-java8-maven

I'm getting:

sudo: docker: command not found

I don't know if I created the VM correctly (apparently, the instance doesn't know what 'docker' is and must therefore not have a container maybe??) and the link to the recording from your walkthrough hasn't been added to the meeting invite yet. I've been following your guide.
The VM I created is:

"deviceName": "heiligers-loadtest-kibana",

Any and all help will be greatly appreciated!

@joshdover

understanding the difference between 7.11 w/ APM vs 7.10 and 7.9 w/o APM

I'm struggling with the VM setup but I could tackle this in Cloud Staging if you don't mind the 'noise' generated when I run the tests locally (pointing to cloud instances).

@TinaHeiligers you are getting this error

sudo: docker: command not found

because Docker isn't pre-installed on the Ubuntu image. To fix it, you need to recreate your VM and change the boot disk:

Google Cloud Platform 2020-11-19 11-53-00

Currently you have Ubuntu, but it should be one of the Container-Optimized OS images.
You can still install Docker on Ubuntu, but it will be faster to simply create a new VM.

VM run instructions are now available in the repo.

I think it'd also be worth understanding the difference between 7.11 w/ APM vs 7.10 and 7.9 w/o APM. Due to the many performance tweaks that were made to support Fleet, there may not be a large regression in 7.11 w/ APM enabled.

I thought that work was done in v7.10, but it doesn't hurt to test with 7.11-SNAPSHOT as well.

7.9 w/o APM

I believe we only added support for APM on Cloud in 7.10: https://github.com/elastic/kibana/pull/77855

I thought that work was done in v7.10, but it doesn't hurt to test with 7.11-SNAPSHOT as well.

You're right, I got my versions mixed up. We should be comparing 7.10 w/ APM vs. 7.9 w/o APM (which is the only option on 7.9).

@dmlemeshko thank you so much for all your help! I've successfully created a VM and have initial results from a 7.10 deployment created during the test run.

Update:
@dmlemeshko When I try to run the tests against an existing deployment, I get BUILD FAILURE errors:


Errors in VM container

00:04:50.769 [ERROR] i.g.a.Gatling$ - Run crashed
java.lang.NullPointerException: null
    at java.io.Reader.<init>(Reader.java:78)
    at java.io.InputStreamReader.<init>(InputStreamReader.java:129)
    at scala.io.BufferedSource.reader(BufferedSource.scala:26)
    at scala.io.BufferedSource.bufferedReader(BufferedSource.scala:27)
    at scala.io.BufferedSource.charReader$lzycompute(BufferedSource.scala:37)
    at scala.io.BufferedSource.charReader(BufferedSource.scala:35)
    at scala.io.BufferedSource.scala$io$BufferedSource$$decachedReader(BufferedSource.scala:64)
    at scala.io.BufferedSource.mkString(BufferedSource.scala:93)
    at org.kibanaLoadTest.helpers.Helper$.readResourceConfigFile(Helper.scala:38)
    at org.kibanaLoadTest.simulation.BaseSimulation.<init>(BaseSimulation.scala:25)
    at org.kibanaLoadTest.simulation.DemoJourney.<init>(DemoJourney.scala:8)
    ... 16 common frames omitted
Wrapped by: java.lang.reflect.InvocationTargetException: null
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at io.gatling.app.Runner.run0(Runner.scala:74)
    at io.gatling.app.Runner.run(Runner.scala:60)
    at io.gatling.app.Gatling$.start(Gatling.scala:80)
    at io.gatling.app.Gatling$.fromArgs(Gatling.scala:46)
    at io.gatling.app.Gatling$.main(Gatling.scala:38)
    at io.gatling.app.Gatling.main(Gatling.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at io.gatling.mojo.MainWithArgsInFile.runMain(MainWithArgsInFile.java:50)
    at io.gatling.mojo.MainWithArgsInFile.main(MainWithArgsInFile.java:33)
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at io.gatling.mojo.MainWithArgsInFile.runMain(MainWithArgsInFile.java:50)
    at io.gatling.mojo.MainWithArgsInFile.main(MainWithArgsInFile.java:33)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at io.gatling.app.Runner.run0(Runner.scala:74)
    at io.gatling.app.Runner.run(Runner.scala:60)
    at io.gatling.app.Gatling$.start(Gatling.scala:80)
    at io.gatling.app.Gatling$.fromArgs(Gatling.scala:46)
    at io.gatling.app.Gatling$.main(Gatling.scala:38)
    at io.gatling.app.Gatling.main(Gatling.scala)
    ... 6 more
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:78)
    at java.io.InputStreamReader.<init>(InputStreamReader.java:129)
    at scala.io.BufferedSource.reader(BufferedSource.scala:26)
    at scala.io.BufferedSource.bufferedReader(BufferedSource.scala:27)
    at scala.io.BufferedSource.charReader$lzycompute(BufferedSource.scala:37)
    at scala.io.BufferedSource.charReader(BufferedSource.scala:35)
    at scala.io.BufferedSource.scala$io$BufferedSource$$decachedReader(BufferedSource.scala:64)
    at scala.io.BufferedSource.mkString(BufferedSource.scala:93)
    at org.kibanaLoadTest.helpers.Helper$.readResourceConfigFile(Helper.scala:38)
    at org.kibanaLoadTest.simulation.BaseSimulation.<init>(BaseSimulation.scala:25)
    at org.kibanaLoadTest.simulation.DemoJourney.<init>(DemoJourney.scala:8)
    ... 16 more
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.466s
[INFO] Finished at: Fri Nov 20 00:04:50 UTC 2020
[INFO] Final Memory: 17M/430M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal io.gatling:gatling-maven-plugin:3.0.5:test (default-cli) on project kibana-load-test: Gatling failed. Process exited with an error: 255 (Exit value: 255) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Any idea why running a test and creating a deployment at the same time works but not against an existing deployment?
The existing deployment is in the same region as the one that's created during the test and I can run the tests against existing deployments locally without an issue.

@restrry _If_ we can get the tests to run against an existing deployment, I'll recommend the following strategy for running tests from the VM for the following cases:

  • [x] 7.9 w/o APM
  • [x] 7.10 w/o APM
  • [x] 7.10 with APM
  • [ ] 7.11-SNAPSHOT w/o APM
  • [ ] 7.11-SNAPSHOT with APM

I'm not sure if we can create the instances on Staging first and only run the tests from the VM. It might take a little time to figure out how to inject the apm* config settings in the deployment template that creates an instance as part of the test.

cc @joshdover JFYI

Latest update:
I've managed to get the tests to run against an existing deployment on a VM in the same region and am repeating the tests for the following cases:

@dmlemeshko I'll need your help with running the load tests against a 7.11.0-SNAPSHOT deployment. (config has version = "7.11.0-SNAPSHOT")

The run crashes with:

16:31:42.078 [ERROR] i.g.a.Gatling$ - Run crashed
java.lang.IllegalArgumentException: Invalid version format
    at org.kibanaLoadTest.helpers.Version.<init>(Version.scala:7)
    at org.kibanaLoadTest.KibanaConfiguration.<init>(KibanaConfiguration.scala:50)
    at org.kibanaLoadTest.simulation.BaseSimulation.<init>(BaseSimulation.scala:25)
    at org.kibanaLoadTest.simulation.DemoJourney.<init>(DemoJourney.scala:8)
    ... 17 common frames omitted

The way the version is being parsed doesn't allow for -SNAPSHOT suffixes.

If I remove -SNAPSHOT from the version and _force_ a 7.11.0 version (config has version = "7.11.0"), the tests run, but the only request that doesn't have a 100% failure rate is login. 🤷‍♀️

7.11.0-SNAPSHOT deployment (run as a 7.11.0 version in the tests)
Screen Shot 2020-11-25 at 16 45 46

I've run these several times (locally) with different deployments and get the same result.
Have you seen this before and, if so, how do we fix it? We need to get 7.11.0-SNAPSHOT stats to compare with the 7.9.3 and 7.10.0 versions.

@TinaHeiligers I fixed an issue with snapshot builds and tested it with both new and existing deployments. Please pull the latest master.

Comments from a new Node Agent engineer here -- @sqren and @restrry asked us to drop by and lend our two cents on configuration scenarios.

Also -- this is mostly echoing things that @joshdover has already said.

As far as configuration goes, I'd definitely be curious to see if toggling the asyncHooks configuration helps or hurts. The Agent (like other APM agents) uses node's async_hooks module to track asynchronous context across transactions (i.e. "this callback/promise goes with this http request"). When this is disabled we fall back to using the patch-async module to track this async context. Under some workloads the latter is more performant. If Kibana is still using bluebird promises then disabling the bluebird instrumentation might yield positive perf results -- but at the possible cost of some lost transaction state.

If we go this route we'd want to investigate what a trace that involves bluebird transactions looks like with this both on and off.

Other than that (also as previously mentioned) -- transactionSampleRate is the main knob we have to turn when it comes to improving agent performance. Produce/record less data, improve performance.

Finally, if you're comfortable veering into the realm of superstition, installing a no-op logger might produce interesting results. This isn't based on any particular known problem with the elastic agent's logger -- just things I've seen elsewhere in the past.
