Triplea: Branch Model of Infrastructure

Created on 22 Sep 2018 · 19 Comments · Source: triplea-game/triplea

Branch model & deployment strategy have some details here: https://github.com/triplea-game/infrastructure#deployments

A key implication of GitLab flow is that all changes flow through master before going to prod. For us, 'master' is effectively 'qa', which is another way of saying that any production change first goes to QA (AKA prerelease).
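As a rough illustration of that rule (a sketch only, not something taken from the infrastructure repository): if the commit history is modelled as a mapping from each commit to its parents, "everything flows through master" means that any commit reachable from prod but not from master must be a merge commit created by promoting master. The commit names and the `commits_bypassing_master` helper below are hypothetical.

```python
from typing import Dict, List, Set


def ancestors(parents: Dict[str, List[str]], tip: str) -> Set[str]:
    """Return every commit reachable from `tip`, including `tip` itself."""
    seen: Set[str] = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def commits_bypassing_master(
    parents: Dict[str, List[str]], master_tip: str, prod_tip: str
) -> List[str]:
    """List non-merge commits that are on prod but not reachable from master.

    Merge commits (two or more parents) are allowed on prod because
    promoting master may create them. Anything else is a cherry-pick or a
    direct commit to prod.
    """
    only_on_prod = ancestors(parents, prod_tip) - ancestors(parents, master_tip)
    return sorted(c for c in only_on_prod if len(parents.get(c, [])) < 2)


# Hypothetical history: a <- b <- c is master; "m" merges master (at b) into
# prod; "x" is a commit made directly on prod afterwards.
history = {"a": [], "b": ["a"], "c": ["b"], "m": ["a", "b"], "x": ["m"]}
```

With this history, checking prod at "m" reports nothing, while checking prod at "x" flags "x" as having bypassed master. When promotions are plain fast-forwards, `git merge-base --is-ancestor prod master` answers a similar question on a real repository.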

Questions for the group:

  • Any questions on this topic?
  • Is the documentation looking okay? (let's perhaps spend more time reviewing documentation updates and leave those PRs open so we can all read them)
  • Should we add any additional environments to the flow?
  • Should we rename branch: 'master' to 'QA'?

All 19 comments

I think this broke down about a month ago because we had a hotfix that needed to be pushed to prod, but there were already a few commits on master that had not yet been pushed to prod, and those commits were incompatible with the current bot/lobby versions in prod.

That led to cherry-picking some changes from master to prod for such hotfixes. That then further broke down to committing changes directly to prod to fix emergency issues. As noted, it didn't help either that triplea-game/infrastructure#122 was stalled due to PreProd being broken.

One question that needs to be resolved is what will be our process to deploy hotfixes into production when we can't merge master -> prod outright?

@DanVanAtta The issue is that there doesn't seem to be an easy way to test anything so master is essentially useless. Just a filler step that takes time.

We currently can't upgrade the bots any further because of the artifact renames; we need the changes here: https://github.com/triplea-game/infrastructure/pull/122. But it's unclear whether we can do a rolling upgrade once that is done (upgrade one set of the bots at a time).

When trying to update the bots earlier this week, the lobby version was updated in the infrastructure scripts, which restarted everything. The bots then didn't reconnect due to other errors, so we had roughly a 2-hour outage.

The 2 current problems with bots are (11996):

  • There was a player ordering bug in setup screen that is now fixed in https://github.com/triplea-game/triplea/pull/4066
  • Bots seem to be disappearing from lobby and not reconnecting (unknown reason but over the 2 or so days that 11996 has been on the lobby/bots most of them disappeared)

@DanVanAtta It is interesting to note that none have disappeared since the manual restart of the missing bots.

It appears they wipe out after 48 hrs. Just a general observation.
Multiple restarts all week. @DanVanAtta @ssoloff

It should also be noted that my private headless bot has run for 4 days now? Short of power outage it seems stable. Running pre .12102

@prastle Would you say that your bot is being used by players just as much as the standard bots?

It has had a few games; I have joined one or two and watched it play on.
I never checked or counted to see if it has had as much use as the bots. All I can say is it hasn't crashed. I host my own when I play while leaving it running.
But to be direct... most bots are often empty as well.

So since many bots are empty, why are they disappearing?

ATM the same bot has just had 2 players start a game and it is round 4. The bot has been running 60? hrs, maybe longer?

They started a new game a few hrs ago.

Just want to clarify that the bots aren't actually crashing; only the Lobby Watcher component is. I confirmed I can directly connect to bot 407 (which, ATM, is not visible in the lobby) and start a game. The Lobby Watcher is what allows players to "see" the bot in the lobby and connect to it without knowing its IP and port.
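For anyone wanting to repeat the "can I still reach the bot directly?" check without opening the game client, a minimal sketch is a plain TCP probe. It only shows that something is accepting connections on the bot's port; it says nothing about whether the Lobby Watcher is running. The `can_connect` helper is hypothetical, and the host and port have to come from whoever runs the bot.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and connects, honouring the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A bot that answers this probe but is missing from the lobby list would match the symptom described above: the game server process is up, and only the lobby registration has stopped.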

So, we need to figure out what's causing the Lobby Watcher specifically to stop running.

Correct
u have found and identified better than me :)

@ssoloff I can also confirm that by status all disconnected bots have always still been "running"

I think this broke down about a month ago because we had a hotfix that needed to be pushed to prod, but there were already a few commits on master that had not yet been pushed to prod, and those commits were incompatible with the current bot/lobby versions in prod.

@ssoloff I would agree this is where the breakdown occurred. In hindsight, we could have:

  • upgraded the version number in infrastructure to latest as soon as we merged breaking changes to triplea-game
  • verified those changes after they were merged to master and auto-deployed to prerelease
  • merged master to prod
  • restarted bots at our convenience, if at all
  • merged the 'hotfix' to master
  • merged master to prod
  • restarted bots when needed to pick up the hotfix change

That led to cherry-picking some changes from master to prod for such hotfixes.

Indeed, direct merges to master mean we've lost already; there should never be a need to cherry-pick to prod.

That then further broke down to committing changes directly to prod to fix emergency issues.

This is a concern that I think could potentially happen again if we are not diligent to get breaking changes pushed to prod. @RoiEXLab @ssoloff @ron-murhammer any thoughts of tracking those updates with a story-board? https://github.com/triplea-game/triplea/projects/5 Any such PRs we can mark as part of that project and then would have tracking as we stage and then push the updates to prod.

One question that needs to be resolved is what will be our process to deploy hotfixes into production when we can't merge master -> prod outright?

Always deploy changes to master, then merge master to prod wholesale. If master can't be deployed, then roll back the game updates that were incompatible. In essence, we should always have a deployable master. Master is always deployed to pre-prod; if master is broken, then pre-prod is broken. In those cases we should fix quickly or roll back, and not leave pre-prod broken. If we have a broken pre-prod, what business have we to deploy to prod?
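Spelled out as a small sketch, the rule reads roughly as follows. The `hotfix_steps` function and the step wording are hypothetical; they only restate the paragraph above and do not correspond to any existing script.

```python
from typing import List


def hotfix_steps(master_deployable: bool) -> List[str]:
    """Return the ordered steps for shipping a hotfix under the branch model.

    `master_deployable` stands for "pre-prod is healthy with current master".
    """
    steps: List[str] = []
    if not master_deployable:
        # Master must be made deployable first; nothing skips ahead to prod.
        steps.append("fix or roll back the incompatible changes on master")
        steps.append("confirm pre-prod is healthy again")
    steps.append("merge hotfix to master")
    steps.append("verify hotfix on pre-prod")
    steps.append("merge master to prod")
    return steps
```

In both cases the last step is the same wholesale merge of master to prod, so a cherry-pick or a direct commit to prod never appears in the plan.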

@DanVanAtta The issue is that there doesn't seem to be an easy way to test anything so master is essentially useless. Just a filler step that takes time.

Master is auto-deployed to prerelease. The pre-prod server crashed because it ran out of disk space, as these auto-deployments were not being cleaned up.
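One way to keep old auto-deployments from filling the disk is a periodic prune that keeps only the newest few. The sketch below assumes each deployment lives in its own directory under a common root, which may not match the real pre-prod layout; `prune_old_deployments` and its `keep` parameter are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def prune_old_deployments(root: Path, keep: int = 3) -> List[Path]:
    """Delete all but the `keep` most recently modified directories in `root`.

    Returns the directories that were removed.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Newest first, judged by directory modification time.
    deployments = sorted(
        (p for p in root.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    removed: List[Path] = []
    for old in deployments[keep:]:
        shutil.rmtree(old)
        removed.append(old)
    return removed
```

Running something like this after each auto-deployment (or from cron) would bound disk usage to roughly `keep` deployments.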

When trying to update the bots earlier this week, the lobby version was updated in the infrastructure scripts, which restarted everything. The bots then didn't reconnect due to other errors, so we had roughly a 2-hour outage.

This behavior was fixed on master. It was not intended for there to be merges directly to master that would bypass that update.

Let's stay focused on the branch model in this thread. Notably, the 2-hour outage mentioned seems to have been so extended in part because we were out of sync. Deploying a 'hotfix' ideally would have been very quick. As part of the branch model, requiring 1 review is also intentional, to ensure we are doing the right thing, spread knowledge, and avoid mistakes. Unless it is an emergency, please do not use the admin override for this.

@DanVanAtta I'm fine with the branch model if it's actually functional. To my knowledge the pre-prod server still isn't working. Generally, changes are being reviewed; the 1-2 hour outage was a mistake, but it was reviewed by another person, so that isn't the issue. The main issue is that there are a lot of scripts and the documentation isn't complete enough to make it obvious how to do many of the necessary tasks.

A broken pre-prod server is ideally treated as a stop-the-presses project blocker. We're not quite there yet. I'll note that master was auto-deployed there for a few months, but we are getting on the same page now.

The 1-2 hour outage was exacerbated by us being out of sync and the branch model not being clear, though that was certainly not the whole story.

The branch model calls for changes to be merged to master, reviewed, then master is 'promoted' to prod and reviewed once more.

The main issue is that there are a lot of scripts and the documentation isn't complete enough to make it obvious how to do many of the necessary tasks.

Indeed; infrastructure is a lot of automation. It's mostly the loose bits and ends that were hanging around our installation steps, which I had to reverse engineer. To some extent the scripting is the automation: the only update needed to upgrade versions, deploy new bots, or remove bots is to the host_control.sh file.

I don't know if more documentation is going to help; it would be helpful for me to know where there is confusion and where there is clarity. Is the branch model at least clear?

For other infrastructure questions, IMO a new issue thread would be easiest to tackle those topics, but we can get into those questions here too.

Summarizing open items here:

  • clarify/discuss any remaining branch model topics that are still fuzzy
  • ensure we have this documented/spend some time clarifying & cleaning up that documentation
  • follow up with any remaining questions about infrastructure and fix up documentation gaps (https://github.com/triplea-game/triplea/issues/4117)