Beats: Elastic Agent does not open bootstrap port for Elastic Endpoint after reboot

Created on 30 Sep 2020 · 20 Comments · Source: elastic/beats

Elastic Agent does not serve gRPC bootstrap info over TCP 6788 after a reboot or restart. This results in a policy failure for the Endpoint because it cannot establish a connection.
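A quick way to check the symptom described above is to test whether anything accepts TCP connections on the bootstrap port. This is a minimal sketch, not part of the original report: the function name `port_is_open` is hypothetical, and checking against 127.0.0.1 is an assumption about where the Agent listens. It only shows that something is listening, not that it is the Agent or that gRPC is being served.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 6788 is the bootstrap port named in this issue; the host is an assumption.
    print("6788 open:", port_is_open("127.0.0.1", 6788))
```

Running this before and after a reboot of the host would show whether the port comes back without a configuration change.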

For confirmed bugs, please report:

Agent Ingest Management bug failed-test v8.0.0

All 20 comments

FYI @paul-tavares

Pinging @elastic/ingest-management (Team:Ingest Management)

@gogochan is Elastic Agent started? How was it installed? @EricDavisX I think we have covered that case in our "install/uninstall" test cases?

@ph yes, it did run and made a connection to Fleet. However, it didn't open port 6788 until we made a modification to the configuration.

OK, so it's installed and it came back up when the computer restarted, but it doesn't open the port. This is odd because I presume the local configuration "persisted" to disk by the Elastic Agent would tell it to have Endpoint running.

@blakerouse Can you take a look?

I think this could be because there was an issue with Dynamic Inputs that broke saving the inputs into the action_store.yml. With no inputs in the action_store.yml then on restart the Elastic Agent would not think Endpoint should be running so port 6788 would not be open.
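The reasoning above can be sketched as a toy model. This is not the Agent's actual code or the real action_store.yml schema: the `inputs` key, the `type: endpoint` value, and the function name `endpoint_expected` are all assumptions made up for illustration. The point is only that an empty persisted input list leads to "Endpoint is not expected", and so nothing would open port 6788.

```python
def endpoint_expected(persisted_config: dict) -> bool:
    """Toy model: Endpoint is expected only if a persisted input names it.

    The dict layout here is hypothetical and stands in for whatever the
    Agent reads back from disk on restart.
    """
    inputs = persisted_config.get("inputs") or []
    return any(item.get("type") == "endpoint" for item in inputs)


if __name__ == "__main__":
    # Inputs were lost when the store was saved (the suspected bug).
    print(endpoint_expected({"inputs": []}))
    # Inputs were saved correctly.
    print(endpoint_expected({"inputs": [{"type": "endpoint"}]}))
```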

This was fixed in https://github.com/elastic/beats/pull/21298. Can you confirm that your build includes that PR?

@ph to your question: we have restart of the agent tested in e2e-testing, but it doesn't have Endpoint in the config at the time. Bummer. We can add that, but should it be a separate test? Or should we update all relevant test cases to also have Endpoint enabled in the policy and test for the related expectations? If it's the latter, it will have an impact on the nomenclature and layout of the tests (I bet the Robots team will insist on keeping it in a top-notch logical layout). The one-off change is easily doable; it just needs someone's time to get done, too.

here is the line: https://github.com/elastic/e2e-testing/blob/master/e2e/_suites/ingest-manager/features/fleet_mode_agent.feature#L37

I've logged this ticket for us to improve this with priority:
https://github.com/elastic/e2e-testing/issues/336

@EricDavisX Yes probably a separate test seems the simplest route?

Is this still an issue or was it fixed by #21298?

I am not able to validate this as I cannot spawn an instance of 8.0 at the moment.

Validated using the latest_snapshot https://snapshots.elastic.co/8.0.0-af3cda3c/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-windows-x86_64.zip

Other than the Agent not restarting on system reboot, if started manually, the Agent does serve gRPC bootstrap information via 6788.

Closing this.

I don't doubt that the manual start of Endpoint works as Chan notes, but I am seeing this with a Linux endpoint on the latest code. The host has Endpoint up and running, and after it is rebooted it throws errors in the Agent log and has 'Agent Connectivity' errors in the Endpoint-side log:

[Screenshot: Screen Shot 2020-10-06 at 4 01 09 PM]
[Screenshot: Screen Shot 2020-10-06 at 3 59 35 PM]

I think we urgently need to pair program this on Agent + Endpoint. @ph and @ferullo - @gogochan and @blakerouse are you free to find an environment and check it out? I have one now and can give logs if helpful...

logs attached
agent-endpoint-reboot-test.txt

hash of agent is: af3cda3c from today's *just finished build artifacts here
Kibana info is: $ git show -s 6f983728d7f8c2cf065a6d5099157a5cfdc3cd08
Date: Tue Oct 6 09:46:56 2020 +0300

I installed it with the 'install' command, and while it took 5 minutes for it to come online (not reflected in the logs), it eventually did and then made the 1-minute check-in calls successfully. I wanted to see that before I rebooted it.

@EricDavisX Based on the error reported in the screenshot, it seems that you might actually have 2 Elastic Agents installed and running?

Are you sure you don't have both the *.deb and the install based installation running? From the log message it shows bind: address is already in use. So that means that Elastic Agent could not open that port for Elastic Endpoint to connect back.
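For anyone unfamiliar with that error, here is a minimal sketch of how it arises: a second listener binding a TCP port that another socket already holds fails with EADDRINUSE. This is generic socket behavior, not Elastic Agent code; the helper names `bind_listener` and `second_bind_errno` are made up, and an ephemeral port is used instead of 6788 so the sketch does not interfere with a running Agent.

```python
import errno
import socket
from typing import Optional


def bind_listener(port: int) -> socket.socket:
    """Bind and listen on 127.0.0.1:port; port 0 lets the OS pick one."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_errno() -> Optional[int]:
    """Bind a port twice and return the errno of the second attempt.

    Returns None if the second bind unexpectedly succeeds.
    """
    first = bind_listener(0)
    taken_port = first.getsockname()[1]
    try:
        second = bind_listener(taken_port)
    except OSError as exc:
        return exc.errno
    else:
        second.close()
        return None
    finally:
        first.close()


if __name__ == "__main__":
    print(second_bind_errno() == errno.EADDRINUSE)
```

So if two processes both try to own the bootstrap port, whichever starts second would log this kind of bind failure.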

It was online happily before I ran it, and I captured the ps ax output as such:
[zeus@mainqa-atlcolo-10-0-6-147 elastic-agent-8.0.0-SNAPSHOT-linux-x86_64]$ ps ax | grep elastic
23306 ? Ssl 0:03 elastic-agent
23419 ? Ssl 0:01 /opt/Elastic/Endpoint/elastic-endpoint run
23463 pts/0 S+ 0:00 grep --color=auto elastic
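To answer the "two Agents?" question from captured output like the above, the ps listing can be filtered mechanically. A minimal sketch, assuming the five-column `ps ax` layout shown (PID, TTY, STAT, TIME, COMMAND); the function name `elastic_processes` is hypothetical.

```python
from typing import List, Tuple


def elastic_processes(ps_output: str, needle: str = "elastic") -> List[Tuple[int, str]]:
    """Return (pid, command) pairs whose command contains `needle`.

    Lines that do not start with a numeric PID (shell prompt, header)
    and the grep process itself are skipped.
    """
    matches = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        pid, command = int(fields[0]), fields[4]
        if needle in command and not command.startswith("grep"):
            matches.append((pid, command))
    return matches


if __name__ == "__main__":
    captured = """\
23306 ? Ssl 0:03 elastic-agent
23419 ? Ssl 0:01 /opt/Elastic/Endpoint/elastic-endpoint run
23463 pts/0 S+ 0:00 grep --color=auto elastic
"""
    procs = elastic_processes(captured)
    agents = [p for p in procs if p[1].split()[0].endswith("elastic-agent")]
    print(len(agents), "elastic-agent process(es):", agents)
```

On the output pasted above, this finds one elastic-agent process and one elastic-endpoint process.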

Let's pair up, figure it out, and post back what we find. It's a long enough thread already, lol.

We decided we think this is fixed, and we are going to re-test tomorrow with a full 'regular' snapshot build that has the needed .asc files.

@EricDavisX have you been able to test 7.10.0 BC1 to see if this is fixed? I have not been able to reproduce it on my Windows testing.

I have not tested the 7.10 BC yet. I saw a good report that the 7.10 Agent tests were all passing, though, so it's likely fixed. Let me review the specific VM / test case later on both 7.10 and 8.0, and we can close it out.

Reassigned to @EricDavisX. Send it back to us if it's still an issue.

It is not reproducible as noted here with the 7.10 BC1 build of Agent. I have other issues, but this is fixed. Closing.
