Triplea: Bots disappear from lobby after 48 hours

Created on 24 Sep 2018  路  32Comments  路  Source: triplea-game/triplea

@DanVanAtta @ssoloff @RoiEXLab @ron-murhammer
https://github.com/triplea-game/triplea/issues/4094

Problem Urgent

Most helpful comment

The root cause is #3676 (1.9.0.0.10937).

What's happening is, on the second and subsequent times the lobby watcher is created, the triplea.lobby.host and triplea.lobby.port system properties are not set because they were cleared during the first lobby watcher creation. InGameLobbyWatcher#newInGameLobbyWatcher() returns null in this case.

The code removed in #3676 would save these system properties to "temporary" system properties, which were conveniently restored between shutting down the previous lobby watcher and starting the new lobby watcher. That is, the lobby host and port properties would be available on the second newInGameLobbyWatcher() call, so that it returned a non-null reference.

All 32 comments

I put a copy of the JDK on the Atlanta bot server under the admin account. Unfortunately, I can't run any of the tools (e.g. jstack) to grab thread dumps because I can't run the tools as the triplea user due to sudo not being configured for those paths.

I'm going to be AWK for a few hours if anyone with access wants to give it a try.

I can sudo it @ssoloff just tell me what to run

@prastle Sorry, don't have time to explain in detail right now. See if this tutorial helps. If not, I'll check in later today.

I would say if you can't get the thread dumps, go ahead and reboot all the bot servers except for one of them, so we have some frozen bots that we can analyze later.

To be clear just PM the commands you want run on the forum
@ssoloff or ssh in as admin with your sudo.

triplea user no longer has those powers. and ok
I will see what I can do

@ssoloff admin has the functions you need sadly I don't know which pid I need. Lobby is slow ill refire some if needed but atm leaving for you to look at. SSH in as Admin should work for you. Or perhaps it's just my old sudo account password still works as sudo.

sadly we are now down to 6 bots out of 28
I will refire some
@ssoloff
refired 6 of 7 London bots
I also refired Atlanta other two are yours to disect

Please note it took 9 minutes to parse maps

I logged in to the Atlanta and Tokyo bot Linodes as root using the Lish console, from which I was able to su into both the admin and triplea accounts. I've attached thread dumps for bots 401 and 601. When I checked the lobby, 401 was connected, while 601 was not. So I'm assuming @prastle had already restarted the Atlanta bots but the Tokyo bots were left in the MIA state from earlier.

I don't know when I'll be able to take a detailed look, but I'll shoot for later this evening. Anyone else should feel free to take a look at the logs to see if there's anything noteworthy.

bot-401-thread-dump.log
bot-601-thread-dump.log

log into the 400 is London
300 also has not been restarted

300 should have your info as well

One thing I quickly noted is that the 401 bot has three lobby watcher threads running (Decoder, NIO Reader, NIO Writer), while the 601 bot doesn't. I don't think that's surprising given the symptom here.

Next thing to do would be to look at the 601 bot logs to see what caused the lobby watcher threads to terminate and see if we can repro that condition.

600 is completely mia
suggest use that as test base
and we refire 300 for now?

@prastle I agree about the 600 series. If it's okay to leave those offline for possibly the next 24 hours for debugging, that would be great.

Sorry, I don't know what to tell you about the 300 series.

kk

The logs for 601 didn't help much. For 9/24, all I saw was one user connect and disconnect a few minutes later. There were no exceptions.

Unfortunately, the Lish console is kinda useless for this purpose--it truncates, not wraps, text after 80 columns, so I'm only seeing a portion of each line. I don't have access to Papertrail, but maybe looking there will provide more insight?

@ssoloff I have never accessed papertrail either. Perhaps @DanVanAtta or @RoiEXLab or @ron-murhammer could have a look at it. One piece of good news is that I decided to leave my own headless bot up for 24 hrs running .12102 . It seems to still be working fine.

I left the 300 series with the missing bots alone as well if you want to have a look at it.

London has a missing 407 bot again.

Reposting in this thread, as this is the appropriate place for discussing fixing the ghost bots:


Just want to clarify that the bots aren't actually crashing--only the Lobby Watcher component is. I confirmed I can directly connect to bot 407 (which, ATM, is not visible in the lobby) and start a game. The Lobby Watcher is what allows players to "see" the bot in the lobby and connect to it without knowing it's IP and port.

So, we need to figure out what's causing the Lobby Watcher specifically to stop running.

There appear to be four ways the Lobby Watcher can be shut down:

  1. If the lobby cannot reach the bot, the Lobby Watcher will shut down. I don't think this is the case we're seeing because, if it does happen, the entire bot ultimately shuts down, not just the Lobby Watcher, and we've already verified that the MIA bots are still running.
  2. If a communications error occurs between the lobby and the Lobby Watcher. The Lobby Watcher simply shuts down in this case. However, it doesn't appear that the exception that caused the error is ever logged, which would explain why we're not seeing anything in the logs.
  3. If a bot owner executes the Lobby > Remove Game From Lobby command from within the running game. This obviously isn't the case we're dealing with since we're concerned with unattended bots.
  4. By default, every 12 hours, the Lobby Watcher is restarted. It's possible something's going wrong during the restart. However, it does appear that any exception thrown during the restart will be logged. Therefore, if it's failing to restart, it must be due to something other than an exception since we're not seeing anything in the logs.

There are a few other cases, but they appear to be related to the bot owner manually stopping the bot on the game setup screen. Those shouldn't be applicable in this case, again because we're dealing with unattended bots.

I'm thinking item (4) might be the best theory to consider first. It should be easy to change the system property that controls how often the Lobby Watcher is restarted. We could try to reduce that to a really small value (say 60 s) to run some kind of stress test locally and see if that trips up the Lobby Watcher restart. I think that's going to require a code change, though, because there is code that enforces a minimum value (6 hours).

I'm thinking item (4) might be the best theory to consider first.

Well, that was quick. :smile: I seem to have reproduced the problem after the second attempt to restart the Lobby Watcher (I set my timer to restart every 30 s; hopefully that short delay is not tripping a false negative). I confirmed the Lobby Watcher threads are no longer running, but the server continues to try and start a new Lobby Watcher every 30 s. The bot does not appear in the lobby, as expected.

I'm going to be busy for the next few hours. I'll try to dig deeper into the root cause later this evening.

@prastle If I truly reproduced the problem, then your 48-hour estimate is spot on. Per the infrastructure scripts, we restart the bot lobby watcher every 48 hours (172,800 seconds):

https://github.com/triplea-game/infrastructure/blob/7b1df339f2c26cedf4141c61e6d89d39ca97d708/roles/bot/files/run_bot.sh#L57

The root cause is #3676 (1.9.0.0.10937).

What's happening is, on the second and subsequent times the lobby watcher is created, the triplea.lobby.host and triplea.lobby.port system properties are not set because they were cleared during the first lobby watcher creation. InGameLobbyWatcher#newInGameLobbyWatcher() returns null in this case.

The code removed in #3676 would save these system properties to "temporary" system properties, which were conveniently restored between shutting down the previous lobby watcher and starting the new lobby watcher. That is, the lobby host and port properties would be available on the second newInGameLobbyWatcher() call, so that it returned a non-null reference.

@ssoloff Nice work. Thoughts on essentially just reverting #3676 since it was just clean up anyways? I'm sure this could be improved but might be better to just get everything back to functioning first.

Home again and yes it is a roughly 48 hr interval. Great stuff! But I still recommend moving forward to a later pre release in the lobby and bots to fix the other issues. thanks for the hard work @ssoloff

500 series completely missing again should I restart them? @ssoloff

login as: admin
Authenticating with public key "deleted"
Passphrase for key "deleted":
Last login: Mon Sep 24 18:09:52 2018 from 24.53.232.156
admin@bot55_london_uk:~$ sudo service triplea-bot@07 status

Warning: Journal has been rotated since unit was started. Log output is incomple
lines 1-11/11 (END)

I can confirm your previous statement that they are still running in the background.

refired those bots 48 hrs to go

@ron-murhammer Unfortunately, a clean revert isn't possible due to some code being deleted between now and then. However, it still appears to be fairly straightforward to restore that functionality. I started on it earlier, trying to do it in a way where we won't think it's unused code again in the future. Hopefully, I'll have something submitted before I go offline.

@prastle Yeah, go ahead and restart any bots you see missing. We no longer need them for debugging.

Unfortunately, for now, you're going to have to keep restarting the bots every 48 hours. Even if a fix goes in tonight, we can't deploy it until #4087 gets the infrastructure issues straightened out. :disappointed:

@prastle All bots were upgraded to 1.9.0.0.12226 and restarted. That should fix both the player ordering bug and the bots leaving the lobby every 48 hours.

It seems the problem has been repaired by our GREAT! DEVS!
I think we can close this @ssoloff @ron-murhammer Opps ya did nice work guys.

Was this page helpful?
0 / 5 - 0 ratings