Phinx: Support pre and post migrations for zero-downtime deployments

Created on 25 Jun 2020 · 28 Comments · Source: cakephp/phinx

It is common to run migrations at deploy time, and those migrations are often split into pre- and post-deploy migrations. Pre-deploy migrations affect the current version of the app for a few moments before the new version is deployed. Post-deploy migrations are only applied after the new version has been deployed. This is to ensure no errors occur during a live, zero-downtime deployment.

For example, new columns are usually added during pre-deploy, because if those columns are not present the moment the new version goes live, errors will occur because the app will try to access non-existent columns. Deleted columns should often be removed during post-deploy, because if they are removed whilst the current app version is running, errors will occur because the current version may still be trying to access them.

To support these two types of migrations, there needs to be a mechanism to mark a given migration as either pre or post, and the migration command must accept an option to select which migration subgroup is being targeted. This changes migration command behaviour from one-pass to two-pass style, where the command must be invoked twice - once for each migration subgroup - to perform the complete migration.
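A minimal sketch of that mechanism, written in Python for brevity rather than as Phinx code. Everything here is an assumption made for illustration: Phinx has no `stage` attribute and no `--stage` option, and the migration names are invented.

```python
import argparse
from dataclasses import dataclass


# Hypothetical model: Phinx itself has no "stage" concept.
@dataclass(frozen=True)
class Migration:
    version: int
    name: str
    stage: str  # "pre" or "post"


def select_pending(migrations, applied_versions, stage):
    """Return the unapplied migrations of the given stage, ordered by version."""
    return sorted(
        (m for m in migrations
         if m.stage == stage and m.version not in applied_versions),
        key=lambda m: m.version,
    )


def parse_stage(argv):
    """Parse a hypothetical --stage option for a migrate-style command."""
    parser = argparse.ArgumentParser(prog="migrate")
    parser.add_argument("--stage", choices=["pre", "post"], required=True)
    return parser.parse_args(argv).stage


# Illustrative migrations only.
MIGRATIONS = [
    Migration(20200601, "AddEmailColumn", "pre"),
    Migration(20200602, "DropLegacyEmailColumn", "post"),
]
```

With this shape, the command would be invoked once with `--stage pre` before the code switchover and once with `--stage post` after it.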

Key discussion points

Should we implement this feature?
How should migrations be marked as _pre_ and _post_?
Should a new migration command be added or should the existing command be repurposed to support this behaviour with additional arguments or options?

Source

Bilge

Most helpful comment

In my experience, managing pre and post migration stages requires too much human oversight and is still prone to mistakes. Instead, having only a pre stage and doing an additional deploy covers your 'post' migration stage.

markstory on 18 Jul 2020

👍2

All 28 comments

An opinion ... my team looked at how best to handle this sort of thing, and the conclusion we came to was to put this process into the workflow and devops pipelines, rather than putting it in the production pipeline, where it could have a negative impact, or attempting to coerce pipeline aspects into Phinx.

To that end we came up with a silly acronym - ProMiReD (you can google that and see the top non-ad link! Rare to be number 1 on Google for anything useful!).

It may not be a perfect generic solution for all, but it is one that works for us.

We did explore the idea of tagging before and after migrations, but invariably, by the time you've chained in code changes and repeatables, the idea that Phinx alone can solve this problem is demonstrated as non-viable. At least with our team.

But having implemented ProMiReD, we have commit pipelines that reject PRs that affect more than one of migrations, repeatables, or code.
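A rough sketch of what such a commit-pipeline check could look like, assuming the changed file paths are already available as a list. The directory prefixes are hypothetical and this is not the actual ProMiReD tooling.

```python
# Hypothetical repository layout; adjust the prefixes to your own project.
CATEGORY_PREFIXES = {
    "migrations": "db/migrations/",
    "repeatables": "db/repeatables/",
}


def categorize(path):
    """Map a changed file path to 'migrations', 'repeatables', or 'code'."""
    for category, prefix in CATEGORY_PREFIXES.items():
        if path.startswith(prefix):
            return category
    return "code"


def check_pull_request(changed_paths):
    """Return (accepted, categories); accept only if at most one category is touched."""
    categories = {categorize(p) for p in changed_paths}
    return len(categories) <= 1, sorted(categories)
```

A pipeline step could call `check_pull_request` with the list of changed files and fail the build when the first element of the result is `False`.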

The advantage is that each task deals with 1 aspect of the reason for the change. Smaller tasks. Easier to review and test. Easier to deploy. Speed up all round.

So. Like I said, an opinion.

rquadling on 27 Jun 2020

@rquadling I'm not going to pretend I understood most of that self-published article, but I think I understand the core of your message, which is simply that you do not need post-migration actions because you perform such actions in a subsequent release rather than the current one. That is, rather than delete an obsolete column during the release it becomes obsolete, it is deleted during a subsequent release (assuming anyone remembers to do so). That effectively means your position is just to do nothing, which is a valid position that doesn't enhance Phinx in any way. I still think it's a worthwhile enhancement to have the database in the correct state immediately instead of carrying the burden of clean-up into the future.

Bilge on 27 Jun 2020

You don't need to understand the article, but if you work in a multi-dev/multi-branch team, with CI/CD processes, then you need a good set of rules to reduce the possibility of downtime. Like I said, this was the conclusion of our dev/infra team. Something that works VERY well for us.

I did suggest a similar tagging mechanism back in 2017 : https://github.com/cakephp/phinx/issues/1044 : and though I'd have coded a solution for Phinx, my team and I never really formed one and instead formulated a simpler process that was easier to develop and more flexible for us as we moved from a manual process to an automated one.

The point I was attempting to raise is if you incorporate such a pre/post solution to Phinx, you will essentially have to run Phinx twice ... so why bother with the pre/post option as you would have to have run the code release in the middle of the two steps.

Imagine this simple scenario:
Task 1 : Introduce new columns with appropriate default - Phinx does this.
Task 2 : Introduce new code/UI to get new column in full play - Nothing to do with Phinx.
Task 3 : Run backfill if the column requires it - Phinx does this.
Task 4 : Release updated reports that rely on the new column with the backfill complete - Nothing to do with Phinx.

Task 1 could happen near instantly upon deployment.
Task 2 could take several hours depending upon the architecture.
Task 3 could take several hours, depending upon the row count/logic of backfill.
Task 4 could take several hours depending upon the architecture.

For our team, the solution for the above would be to have 2 tasks in Jira. In this example, the reviewer would happily accept the migration and code changes in 1 branch as they would be able to see no conflict in the order of release - it obeys our ProMiReD pattern.

As would the second task.

And the pipeline can run the migrations, the repeatables, and release the code and nothing is broken.

For this pattern to use a pre/post setup we would STILL be running Phinx twice, we would still be needing to do 2 code releases.

I think the simplest solution is to just have an appropriate task management process in place.

rquadling on 27 Jun 2020

Yes, Phinx would run twice; this is clearly stated in my original proposal. You have failed to establish why this is bad or why it is a problem.

For this pattern to use a pre/post setup we would STILL be running Phinx twice, we would still be needing to do 2 code releases.

This is patently wrong; there is one release, during which Phinx is called twice. That is all. The zero-downtime deploy looks like this:

Run pre-deploy migrations (i.e. additions)
Symlink switchover to new code
Run post-deploy migrations (i.e. removals)

Bilge on 27 Jun 2020

I'm not sure the complexity of this feature is worth the benefit. Your argument seems

I still think it's a worthwhile enhancement to have the database in the correct state immediately instead of carrying the burden of clean-up into the future.

Typically you would verify the deployment so there is a manual validation step between the pre migration and the second (post) migration. Phinx would need to verify that any previous post migrations weren't skipped because now the migration is split between manual devops steps and phinx automation.

What happens if there is a pre + post pair and then a subsequent migration merged before deployment? How should phinx resolve this when deploying pre migrations?

I agree that having to do multiple deploys only to clean-up a zero downtime migration seems like more work -- but that clean-up doesn't have to be deployed immediately.

othercorey on 27 Jun 2020

👍1

I also think this should then be solved on the deployment layer, not in phinx.

dereuromark on 28 Jun 2020

What does a deployment-layer solution look like?

Bilge on 28 Jun 2020

The solution we have is that we have a team agreed process that prohibits (in the main) mixing of migrations, repeatables, and code. Each task covers the specific area of concern. We use workflow management to make sure that the list of tasks ("the story" if you will) includes the clean-up process. "Forgetting" means the story isn't complete and so the project owner and management will see this in the sprint reports.

Fundamentally, what we are all saying is that this would be introducing complexity into Phinx to balance a lack of deployment control.

Phinx (and tools like it) are much better for NOT having to do everything under the sun.

It was also why I didn't attempt to implement repeatables in Phinx. We had tried, but the complexity and restrictions required clearly indicated to my team that a separate standalone solution was a far better fit.

rquadling on 28 Jun 2020

👍1

Closing as there is nothing we can currently do or achieve on this it seems.
PRs for smaller parts of this are welcome, if anyone wants to try to enhance the API.

dereuromark on 18 Jul 2020

Nothing we can do? I don't think that's true at all, unless you mean nothing we want to do.

Bilge on 18 Jul 2020

The responses were pretty clear on that: it would make it much more complex than it currently is.
You are welcome to make a PR that showcases it and we can rediscuss based on that implementation.
But this ticket itself would just stay open as is forever, as we all are indeed not sure how to approach this.

dereuromark on 18 Jul 2020

I don't have a lot of energy to argue about this, but that seems like the wrong approach to feature development. We should be able to discuss the technical approach and whether it's worthwhile without jumping into code. The downside of jumping into code is that we thereafter conclude it's not worth doing and then someone has wasted a lot of time. The discussion should come first. The only reason to jump into code first is when nobody can determine whether a technical solution is even feasible, in which case a prototype is needed, but that's certainly not the case here; the possibility is clear and an implementation of any nature would be relatively straightforward.

Bilge on 18 Jul 2020

@Bilge I presented concerns. You ignored them. What else do you expect to discuss here?

othercorey on 18 Jul 2020

Looks like the discussion is not quite finished after all.

dereuromark on 18 Jul 2020

I ignored it because I couldn't make any sense of anything you wrote.

Typically you would verify the deployment so there is a manual validation step between

That's a constraint you've invented.
I cannot see how, even if that constraint was a requirement, it would inhibit the pre/post migration strategy in any way.

Phinx would need to verify that any previous post migrations weren't skipped

Why? There is nothing magical about pre and post migrations, they are still just migrations. Phinx already has a mechanism to determine whether a migration has been run or not. There is nothing special or extra going on here, besides designating migrations as either pre or post and running migrations matching one of those two groups. Think of it like tagging and only running the pending migrations matching that tag. Incidentally, tagging is another possible implementation approach.

What happens if there is a pre + post pair and then a subsequent migration merged before deployment? How should phinx resolve this when deploying pre migrations?

What is there to resolve? It just runs all the pending pre migrations when it is instructed to do so, and all the pending post migrations when it is instructed to do so. That's it. There's nothing more complicated going on here.
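The scenario in the question can be traced with a small simulation. This is a sketch under stated assumptions, not Phinx behaviour: the stage tags, the tuple layout, and the in-memory `applied` list (standing in for a migration log table) are all hypothetical.

```python
def run_stage(migrations, applied, stage):
    """Apply every pending migration tagged with `stage`, in version order.

    `migrations` is a list of (version, name, stage) tuples and `applied`
    is the list of versions already run. Returns the names run in this pass.
    """
    ran = []
    for version, name, tag in sorted(migrations):
        if tag == stage and version not in applied:
            applied.append(version)
            ran.append(name)
    return ran


migrations = [
    (1, "AddNewColumn", "pre"),
    (2, "DropOldColumn", "post"),
    (3, "AddIndex", "pre"),  # merged after the pre/post pair, before deployment
]
applied = []
pre_pass = run_stage(migrations, applied, "pre")
# ... the code switchover would happen here ...
post_pass = run_stage(migrations, applied, "post")
```

Under these assumptions the pre pass runs versions 1 and 3 and the post pass runs version 2, so the execution order is 1, 3, 2 rather than strict version order. A later pre migration therefore must not depend on an earlier post migration.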

Bilge on 18 Jul 2020

I ignored it because I couldn't make any sense of anything you wrote.

It sounds like you don't want to discuss anything, but simply complain someone else doesn't implement a feature they don't understand.

othercorey on 18 Jul 2020

I made it very clear in the initial post what I expect the discussion points to be and so far nobody has engaged any of those points. That's not to say you cannot raise additional concerns or questions but it would help if you would take the time to understand the concept before raising such questions.

It sounds like you don't want to discuss anything

On the contrary, I am happy to engage in good faith, which is more than can be said for someone who only quotes the opening line of my response and follows with personal attacks in lieu of all the following technical discussion. I wish to discuss this feature but only with others who wish to do the same.

Bilge on 18 Jul 2020

I'm sorry but you ignored my reply for 21 days. I don't see that as being happy to engage. I'm not interested in discussing the feature any further.

othercorey on 18 Jul 2020

@Bilge The tagging you just mentioned was discussed above and it does not seem to be a very viable solution.
But if it can be an opt-in topic/feature for some, I don't see why we couldn't check it out.
Usually we try to include features that the majority of users can and possibly want to use, though.

To sum it up: Tagging/Grouping and running specific such groups is a possible feature that could be implemented.
But as this is a community driven library, this means the main stakeholders (in this case @Bilge ) would have to be interested in doing the leg work here, creating the initial PR and helping to drive the feature.
Otherwise this is very likely not going to be implemented.

So I can only say again: We need the PR here, as further discussion here on theoretical topics will not bear more fruit.
Most here voiced concern over such a feature as they faced issues then down the line or see concerns in complexity.
A concrete draft or a more detailed summary of how it would work is needed either way. This could help to convince others more than the very abstract debate here so far.

dereuromark on 18 Jul 2020

Why isn't tagging viable? Tagging is just an expanded version of pre/post, where "pre" and "post" are just two fixed, application-defined "tags", as opposed to a more open "tagging" system where tags are not fixed and can be any label the user wishes. I don't know if such a use-case for tagging actually exists, but both could be implemented in a similar fashion so I'm unclear as to why one is viable and the other isn't.

You seem to be under the impression this is something I just invented and therefore I have to carry the torch. I don't believe either is true. Pre- and post- migrations are a cornerstone of the zero-downtime deployment strategy used at many businesses where I've consulted, to the point where I mistakenly assumed it was ubiquitous and this topic would need little introduction. That's my mistake. But it's nevertheless likely that there are many more people out there with an interest in this feature, whether or not I end up needing to be the one to implement it.

Bilge on 18 Jul 2020

markstory on 18 Jul 2020

👍2

@markstory Thanks for your input. I can accept that may be a good solution and it certainly is the only solution possible at present. My concern is then, how do you ensure candidate post migrations actually get written and run? I suppose we can add them to a pending directory that has no special meaning to Phinx and can be moved into the migrations folder in the next release. How do you normally manage this?
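The pending-directory idea could be sketched roughly as follows. The directory names and the `.php` filter are assumptions for illustration, and nothing here is a Phinx feature.

```python
import shutil
from pathlib import Path


def promote_pending(pending_dir, migrations_dir):
    """Move every .php file from the holding directory into the live migrations directory.

    Both directory arguments are hypothetical; the holding directory is one
    that the migration tool is not configured to read.
    """
    pending, target = Path(pending_dir), Path(migrations_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(pending.glob("*.php")):
        shutil.move(str(path), str(target / path.name))
        moved.append(path.name)
    return moved
```

This would be run while preparing the next release, so the former post migrations are picked up as ordinary migrations on that deploy.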

Bilge on 18 Jul 2020

@Bilge Write two pull requests and rely on the team/developers to merge them. We have relied on planning and task management tools/processes to ensure all the work was done.

markstory on 18 Jul 2020

What guarantees migrations run in the correct order? (i.e. pre before post)

Bilge on 18 Jul 2020

@Bilge They are included in two separate deploys. For example if migration 'seedling' is your pre-migration, and 'tree' is your post-migration, we would merge and deploy 'seedling'. Then once the first deploy is complete, the changes containing the 'tree' migration would be merged and deployed.

Only having a single migration stage makes it simpler to know when a migration will be run, as they are always run before new code is deployed. For significant schema changes this can require some additional planning as you need to ensure that each release can function on its own. Even with pre and post migration models that thinking is required once you move beyond a single webserver as you can no longer deploy and restart atomically to all servers.

markstory on 18 Jul 2020

Why would multiple webservers be a problem?

Run all pre-migrations.
Symlink new code.
Run all post-migrations.

Each step is synchronized across all nodes before proceeding to the next.
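A sketch of that sequencing, assuming a single shared database (so each migration pass runs once) and a per-node switchover. The `run_migrations` and `switch_code` callbacks are hypothetical placeholders for whatever invokes the migration tool and swaps the symlink.

```python
from concurrent.futures import ThreadPoolExecutor


def deploy(nodes, run_migrations, switch_code):
    """Two-pass deploy: database steps run once, the switchover runs on every node.

    Each step finishes completely before the next one starts.
    """
    run_migrations("pre")
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # list() forces every node's switchover to finish (or raise) before continuing.
        list(pool.map(switch_code, nodes))
    run_migrations("post")
```

If any node fails to switch, the exception propagates and the post pass never runs, which leaves the database compatible with both code versions.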

Bilge on 18 Jul 2020

Yes, that model would work well with pre & post migrations. I've seen other setups which don't take into account the need to symlink and restart all servers before running post migrations. I've also seen post migrations that assume there is an ~= 0 second delay between symlink + post migration execution.

Having the gap between symlink + post migrations is one of the things that led teams I worked on to dropping the post migration stage, as managing that stage added more process and complexity to deploys, which could be reduced by only having pre migrations, and doing two deploys.

markstory on 20 Jul 2020

To answer the point regarding follow up tasks being coded, for us, this is project management.

If the developer has identified follow up tasks that are out of scope of the current task, and do not follow the deployment process and so cannot be incorporated into the current branch, then a new task is generated, pointed, added to the appropriate sprint and you move on. If the follow up task is critical then that can be escalated within the team and a human can manually make sure it all happens in the right order.

If the work somehow started out as a post deploy and then the dev realised there needed to be some pre deploy logic then the work is cut between 3 branches and the project management tool is used to do all the appropriate task blocking.

What we have found is that we have not had that pattern of a migration MUST be run IMMEDIATELY after code deployment.

Not all solutions are technical. Training and management tools do a LOT of the work for us.

There are any number of suitable alternatives for running deployments in the right order.

We use Jira for project management and so tasks are ready to be merged only if they have been approved and non blocking. We have 2 states of blocking. Blocking a release and blocking a merge. We have found that sometimes the code doesn't require a full release as such just that some coded pre-condition needs to exist for the merge to take place.

We use Jenkins to do all the heavy testing pre/post-merge.

We use BitBucket with a triggered deployment once all the post-merge testing has been done. No one really gets involved in the process as, for us, it is clearly defined.

If you are worried about a developer not doing their job, then that is a completely separate issue and not one that Phinx can answer.

rquadling on 30 Jul 2020
