Is your feature request related to a problem? Please describe.
Today, when a PR is merged, _all_ of the dirty commit history of that PR is merged into the commit history of the mainline code. This leads to several problems:
The latter point is far and away the largest problem. With so many commits, (and the long build time of libcudf) it takes an impossible amount of time to try and use git bisect to triage bugs, performance regressions, etc. This is harmful to development.
There's no reason for us to continue doing this when other major OSS projects use a squash commit strategy for the same reasons I outlined above.
Describe the solution you'd like
We should switch to squashing all of the commit history from a PR branch into a single commit when merging it into the mainline branch.
Describe alternatives you've considered
The status quo of a dirty commit history is untenable. Without being able to use tools like git bisect it makes developer's jobs much more difficult.
@harrism mentioned a potential downside of squash merges.
You end up with one huge commit that could be thousands of lines of changes.
I believe this concern can be addressed by enforcing smaller PRs in the review process and encouraging developers to break up large pieces of work into pieces.
FWIW I'm very much in favor of this. git annotate would become much more useful. The commit comment will describe the overall change and link directly to the PR, rather than mention something along the lines of "address review comment" or "fix test". As another benefit, it's much more likely that any particular commit is _buildable_.
Some helpful additional reading material on the topic:
https://blog.dnsimple.com/2019/01/two-years-of-squash-merge/
https://medium.com/@elliotchance/comparison-of-merging-strategies-in-github-2f948c3b8fdc
Another advantage of squash merges is that the CHANGELOG.md no longer needs to be manually maintained. Since each PR will only have a single commit in the mainline branch, the CHANGELOG.md would be redundant with the commit history.
We could either remove the CHANGELOG or automate it's contents from the commit history.
Yet another reason for a tidy commit history is "continuous benchmarking" of running a benchmark suite against every commit. If our mainline commit history is dirty with all of the intermediate commits from feature branchs, then it becomes all but impossible to do continuous benchmarking since there would be far too many commits to test, and there is no guarantee that state of each commit is blessed.
FWIW I raised this issue (as it relates to this project) with @harrism a while ago https://github.com/apache/arrow/pull/5062#issuecomment-521023727. In my experience using the green button (or using plain git merge) almost exclusively does harm to the development process. Whatever information is "lost" by the squash is almost never useful for development.
Another benefit of the one-PR-one-commit policy is that cherry-picking PRs into branches and creating patch releases is substantially easier.
I think at this point everybody who has commented is in support. However, we really have to convince our devops team because they are the ones who have the power to enable squash merges (currently disabled).
Yet another reason for a tidy commit history is "continuous benchmarking" of running a benchmark suite against every commit. If our mainline commit history is dirty with all of the intermediate commits from feature branchs, then it becomes all but impossible to do continuous benchmarking since there would be far too many commits to test, and there is no guarantee that state of each commit is blessed.
This is not necessarily true as you can easily filter out the merge commits from the intermediate commits. Then you can do a divide and conquer on testing the set of merge commits instead of relying on one machine to do all of the git bisect
It is bascially what @dillon-cullinan and @rlratzel have been doing with our internal ASV/benchmarking work. I'm happy for us to work with conbench as it develops and be early adopters. Until then we need more input from the teams on these situations so we can leverage the tools we've already built or adapt them to help you out.
Another advantage of squash merges is that the
CHANGELOG.mdno longer needs to be manually maintained. Since each PR will only have a single commit in the mainline branch, theCHANGELOG.mdwould be redundant with the commit history.We could either remove the CHANGELOG or automate it's contents from the commit history.
The sticking point here as I have argued for years now is there is no way for us to know when something is a new feature, improvement, or bug fix automatically. Labeling is not being used currently by the teams for us to even automate this, nor do I have confidence that it would be. There would be similar complaints about failing a check to have the correct label before merging.
In the next release as a part of the new build process we're proposing to ignore all PRs that do not have a CHANGELOG entry. For the developer to get results they must put the entry in and update the log. This combined with being able to selectively rerun tests for conda, gpu, notebooks, and style checks will make the dev process more interactive instead of just let it run.
I also feel like there is a larger problem here. The developers have tools to do git rebase -i to clean up their commits before pushing to reduce the number of commits in a PR. The challenge here is there is a culture of just push to gpuCI and wait. This does nothing to change that and just encourages it more as it will hide all of the "Trying again" and "oops that didn't work" which I've seen as actual commit messages.
Devs should be doing local development and local testing prior to pushing to gpuCI. That would significantly reduce the number of commits in a PR. My other concern is the possibility for conflicts and challenges to devs that have in progress PRs more so than we have today. I'll let @raydouglass explain more.
One concern I have with squashing is the following scenario. I know this was common in my previous jobs.
Two devs are working together, one is building off of the other's feature. So there is branch feature-a with several commits, then the second dev branches off of feature-a to create feature-b and adds commits. When feature-a is squash merged into branch-0.x, feature-b now has conflicts with branch-0.x because they have diverged histories now. Merge commits do not have this issue.
FWIW we have allowed squash merges on two smaller RAPIDS libraries in the past. It resulted in a dozen or more times where we (ops) had to resolve the conflicts above and others in git and force push to resolve. So we switched those repos to the merge commits now where everyone can resolve the conflicts within github (except for limited cases). Since then the times we need to step in is very limited.
This is not necessarily true as you can easily filter out the merge commits from the intermediate commits. Then you can do a divide and conquer on testing the set of merge commits instead of relying on one machine to do all of the git bisect
Sure, you can filter out merge commits, but if I merge a PR with 100 commits, now the mainline branch has 100 new commits in it. Furthermore, there is zero guarantee that all of those 100 commits are stable. Extrapolating that out to having dozens of PRs per release... that's a difference of thousands of commits vs. dozens of commits in the history.
The sticking point here as I have argued for years now is there is no way for us to know when something is a new feature, improvement, or bug fix automatically. Labeling is not being used currently by the teams for us to even automate this, nor do I have confidence that it would be.
That's a valid concern. How confident are you that people are putting the CHANGELOG entry in the right place with today's process? Oftentimes when I'm creating an entry in the CHANGELOG, I'm not always sure where I should put it because the line between those 3 things can be fuzzy sometimes.
Labeling is not being used currently by the teams for us to even automate this, nor do I have confidence that it would be
It's not used because there isn't any need or requirement to use it. If it were required, it would be used. Using tags to automate the changelog will be imperfect. However, I do not think it will be any worse than today's manual system.
The developers have tools to do git rebase -i to clean up their commits before pushing to reduce the number of commits in a PR.
That's true, but that requires _every_ developer to: 1) understand what a rebase is 2) how to rebase 3) remember to do it. That's a lot of manual effort multiplied by the number of developers vs. just letting GitHub do squash merges.
Devs should be doing local development and local testing prior to pushing to gpuCI. That would significantly reduce the number of commits in a PR.
In my development workflow, I use lots of small commits. But it's not because I rely on gpuCI for my testing, it's just what I learned as a git best practice to do small, atomic commits. So my PRs will always have lots of commits.
This does nothing to change that and just encourages it more as it will hide all of the "Trying again" and "oops that didn't work" which I've seen as actual commit messages.
I agree the change to squash commits will not solve this issue, but I think it is orthogonal.
One concern I have with squashing is the following scenario. I know this was common in my previous jobs.
Two devs are working together, one is building off of the other's feature. So there is branch
feature-awith several commits, then the second dev branches off offeature-ato createfeature-band adds commits. Whenfeature-ais squash merged intobranch-0.x,feature-bnow has conflicts withbranch-0.xbecause they have diverged histories now. Merge commits do not have this issue.
Yes, this is a real issue that needs a bit of careful handling.
This blog does a good job of describing it: http://www.thecaucus.net/#/content/caucus/tech_blog/516
Few things to keep in mind:
Regarding rebasing: in my experience if you rebase after opening a PR, the only way to push is to force push. If you force push after someone has reviewed your code, then the GitHub review comments are disassociated with the code in the new force push, so they get lost and therefore are likely to not get addressed. In the past I've effectively had to review PRs twice when a developer rebases and force pushes. It's not good GitHub practice.
Regarding many small commits: like Jake, I use this practice. It makes it easier to find where I introduced bugs during development of a feature. It makes it easier to find where I introduced perf regressions when doing optimizations in sensitive code.
I don't think the changelog is a good reason not to squash. I think the changelog exists because the commit history is not useful as a changelog.
I didn't get the impression the proposal is to squash anything while a PR is being reviewed. I understood the proposal as squashing commits when a PR is merged. That does make the commit history very useful.
@jlowe correct, but @mike-wendt suggested we wouldn't have a cluttered commit history (and therefore wouldn't need to squash) if people always rebased onto the target branch rather than merging the latest from the target branch. I'm arguing that rebasing once reviewers have started commenting is bad practice.
In http://github.com/apache/arrow (same as Apache Spark and a number of other large projects) we use a tool that squashes on merge and deals with multi-author PRs as well (> 6000 PRs handled so far)
https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py
It produces commits that look like this
https://github.com/apache/arrow/commit/96217193fc726b675969e91e86a63407bc8dce99
Very rarely do contributors rebase during the course of working on a patch. You can of course use the GitHub UI to do squash merging. One of the benefits of the merge tool we use is that it captures essential information about a PR:
If you use the GitHub UI you don't really get any of this in the git log (the commit message anyway, there is the author and committer field of git but you drop information in some cases)
Squash merges are now enabled and the default. Thanks @mike-wendt !
Thank you. It's much easier to understand your changelog now
This doesn't yet affect our changelog (changelog.MD). :)
Sorry I meant git log!
Most helpful comment
Another advantage of squash merges is that the
CHANGELOG.mdno longer needs to be manually maintained. Since each PR will only have a single commit in the mainline branch, theCHANGELOG.mdwould be redundant with the commit history.We could either remove the CHANGELOG or automate it's contents from the commit history.