Omr: Use Clang to format the OMR source

Created on 19 Nov 2019  路  56Comments  路  Source: eclipse/omr

Background

The OMR codebase contains several components each of which has its own coding style. The inconsistencies between (and within) components, coupled with undocumented styles has the potential to consume a lot of human time during reviews to get PRs into shape for merging. To improve code consistency and avoid stylistic noise between reviewers/contributors during semantic PR reviews we propose to add automated tooling to the project to enforce a predefined code format style which will be checked on each contribution.

In summary automatically formatted code is:

  • Easier to write

    • You don't have to worry about formatting as a tool will take care of it for you

  • Easier to read

    • When all code is consistent you know exactly what to expect and you don't have to switch mentally of thinking a different way based on which part of the codebase you are in

  • Easier to maintain

    • Your commits will not contain changes unrelated to your diff because your editor chose to format a bunch of whitespace in surrounding areas based on your editor configuration

  • Uncontroversial

    • We won't have to propose stylistic changes in PRs anymore thus saving reviewer and contributor resources and getting code merged faster

Goals

  • Format the entire OMR C/C++ code base using clang-format based on a community agreed style
  • Add automation via GitHub hooks to validate code format of every commit within a PR
  • Provide optional git hooks for users to be able to automatically format their code on git commit
  • Provide instructions on setting up clang-format on the most popular code editors used in the OMR community
  • Minimize spending reviewer/contributor resources on any stylistic PR comments and allow us to focus on the semantics of the change
  • Allow easier integration with other automation such as clang-tidy, clang-modernize, etc.

Discussion topics

To be discussed during the OMR Architecture (#4560) call. Here are some topics to get us started:

  • Is it worth it?
  • "A formatting process which is both manual and insufficient is doomed to be abandoned."

    • Discuss the need for automation

    • Format once and loosely enforce or always and strictly enforce?

    • Format before commit or after merge?

    • Should we add GitHub hooks which fail/warn on unformatted code?

  • What style to use?
  • Should we format existing code?

    • How to do it, i.e. one fell swoop, component based, etc?

    • What about all PRs in flight at the time of format?

  • Fixing on a particular clang-format version
  • Will git blame work after a format?
  • Is the code formatting going to break something?
  • Do we format third party code?
  • Can we exclude files or pieces of code from formatting?

    • How lenient should we be on this?

  • Can Clang handle all of our existing (and future) code?
  • If we want to do this when do we do it?

Case studies

Where we go from here

To be discussed during the OMR Architecture (#4560) call and updated subsequently.

discussion

Most helpful comment

Code after formatting can not be reviewed properly due the large volume

Probably the best way to deal with this is to sample the codebase, and especially files that you think/know are formatted a certain way, to see how it is affected. If there's a strong reason to keep the formatting, there are comments that can be put in the code to indicate sections of code that the formatting tool shouldn't touch.

Formatting would invalidate all outstanding branches/forks because of changes in white spaces. Every single outstanding change line would have a conflict required to be resolved. It would be huge impact for people doing research work (for example most of GC Team members)

This, I believe, is pretty easy to deal with; devs would just have to run clang format on their branches and then rebase.

We would not have ability compare code "before" and "after" used a lot for debugging

https://github.com/eclipse/omr/issues/4577#issuecomment-556507639 should address that issue.

Also I would not underestimate subjective personal factor. The 90% of developer's work is a code reading. Even we move out of brackets discussion what is "correct" or "wrong" look of the code (coding standards document prescribes basics, is not it?) look of reformatted code would be "unusual" for code owner and other developers use to work with it.

This is definitely true, and is often downplayed. However, at the same time, as devs, we're pretty good at adapting to things. Plus, just as the intangible human subjective/personal preferences play a role, I believe the intangible psychological benefit of one unified project must also play a role. It may seem silly, but a unified codebase right from the coding standard up to the quality bar has to, at some level, strengthen the project (be it the code or the community); we're all experts in different components, but we contribute the same quality code to the same project with the same expectation as anyone else.

All 56 comments

Note: You should be able to use git-hyper-blame from the Chromium repo, which to quote the linked man page:

git hyper-blame is like git blame but it can ignore or "look through" a given set of commits, to find the real culprit.

This is useful if you have a commit that makes sweeping changes that are unlikely to be what you are looking for in a blame, such as mass reformatting or renaming. By adding these commits to the hyper-blame ignore list, git hyper-blame will look past these commits to find the previous commit that touched a given line.

(TIL: This has been implemented in upstream git: https://github.com/git/git/commit/ae3f36dea16e51041c56ba9ed6b38380c8421816)

I have summarized the the concerns raised during our OMR Architecture call below.

  • [37:51] [Concern regarding auto-formatting of deeply nested conditionals in the optimizer](https://youtu.be/21yPv8GsvY4?t=2271)

    • CompilationThread.cpp, EscapeAnalysis.cpp, OSRData.cpp, and one of the codegen tree evaluators

    • Concerns about hard column limits hampering readability of the code

    • Proposed solution by some contributors/committers would be to refactor/rewrite such complex conditionals



      • Concerns about this proposed solution not being feasible



  • [42:02] [Feedback of OpenJ9 applying formatting fixes incrementally following a written coding standard before going open source](https://youtu.be/21yPv8GsvY4?t=2522)

    • Only format pieces being actively worked on (method by method)

    • Cannot be enforced by a tool

    • Some code will never be touched, or won't be touched for years

    • Will introduce inconsistency blend of new and old styles

    • Does not work well with git blame as you cannot ignore a single formatting commit, so formatted code in a method that is not related to the change is hard to be ignored

    • Some feedback from other projects regarding this topic can be found under _Format-only changes_ section of the following blog post:


  • [43:50] [Use // clang-format off and // clang-format on to disable formatting on hand formatted areas we really care about](https://youtu.be/21yPv8GsvY4?t=2630)

    • Should only be a handful of these

    • Should not be used liberally

  • [45:26] [How widely does clang-format run? Does it run on all the platforms we currently have developers on?](https://youtu.be/21yPv8GsvY4?t=2726)

    • clang-format runs on Linux, MacOS, and Windows

    • clang-format is supported on most IDEs/editors including VSCode, Visual Studio, CLion, Eclipse, Atom, Sublime Text, Vim



      • It can also be configured to run as a pre-commit git hook which can be provided by the project



  • [48:36] [Concerns about local changes being different than what is PR merged by an auto-formatter into master](https://youtu.be/21yPv8GsvY4?t=2916)

    • An automated PR check will validate each commit conforms to the specified clang-format style



      • PR check will fail if the style does not match and committers will not merge the change until it is fixed


      • This means your local changes will match what ends up being merged



    • The PR check can provide a patch file which you can apply to your changes to format them correctly, or you can run clang-format yourself

    • Proffered workflow is to run auto-formatting before you open a PR



      • Concerns that this raises the bar to entry





        • Contributors should follow the contribution guides, so whether a tool tells them the change requires formatting fixups or a reviewer does is indifferent



        • Arguably having a tool defined format lowers the bar to entry since the user does not have to read a document describing what the format is and can just run the tool (or apply the patch produced by PR the automation telling the contributor what is expected)






  • [55:05] [Use automation to give user a warning that if the PR is merged as is it will get formatted upon merge](https://youtu.be/21yPv8GsvY4?t=3305)

    • Will result in a ton of "format commits" which brings us back to the state of local changes not matching what gets merged discussed above

  • [59:16] [What timeline may such a change land at](https://youtu.be/21yPv8GsvY4?t=3556)
  • [59:28] [What happens to all in-flight PRs at the time the formatting commit lands? Will there be merge conflicts on every PR?](https://youtu.be/21yPv8GsvY4?t=3568)
  • [1:01:35] [Discussion on the need for an explicit committer vote on a concrete proposal](https://youtu.be/21yPv8GsvY4?t=3695)

    • Will need to nail down the style and concretely define steps which will be taken to achieve the final goal as well as necessary documentation to aid developer workflow



      • Eclipse Project Handbook has a section of such proposals which we will make use of



  • [1:04:00] [Is there a consequence for OpenJ9 in OMR making this decision to introduce automated code formatting?](https://youtu.be/21yPv8GsvY4?t=3840)

    • Concerns regarding contributors to both projects having to conform to two different standards across the two projects

    • Past experience from Smalltalk automated formatted code was not pleasant due to less readable formatting



      • Past formatting was on a per-method basis and not as easily controlled



  • [1:06:20] [Concerns about developers getting the new tool setup](https://youtu.be/21yPv8GsvY4?t=3979)

    • A concern which may not be shared by all but some developers have had problems getting compile environments setup and every extra tool introduced makes it that much harder

    • Developers don't have to install clang-format locally



      • They can try to match the tool formatted style manually and fixup changes once automation on the PR side tells them it needs to be changed





        • A tedious solution to the problem






  • [1:10:04] [Is there a plan yet for how we will roll this out?](https://youtu.be/21yPv8GsvY4?t=4204)

Before forging ahead with an official vote and proposal I think it is important to settle on what exactly the format we will be proposing. To aid in the search for a reasonable format which we can get accustomed to I have ran clang-format with all major built-in formats on the OMR codebase and created branches so we can look at the output.

Some styles are preferred to others. Please state your preferences and examine the branches below and hyperlink any areas in which the code formatting makes code hard to read in a particular format, or propose alterations to existing styles which we can prototype. I hope that after some iterations of configuration we can land at a style which we can put to a vote. I will tentatively propose a 1 week cut-off for discussions and circle back at that point to evaluate whether we are making forward progress and steps to take from there.

> clang-format-9 -version
clang-format version 9.0.0-2 (tags/RELEASE_900/final)
  1. Using LLVM style:
  2. Using Google style:
  3. Using Chromium style:
  4. Using Mozilla style:
  5. Using WebKit style:
  6. Using style from previous attempt at automated formatting from #3390 (WebKit based):

Given that Mozilla has adopted Google style, I would expect that style to start to rot slowly over time.

I would recommend avoiding it.

Regardless of the style chosen, in the compiler the forward declarations for classes in the TR namespace consume a lot of real estate when reformatted. Prior to reformatting, I'll have someone sweep through the code and turn them into:

namespace TR {
class A;
class B;
class C;
}

It will be less tedious to do prior to reformatting. It should be a safe and easy thing to do.

what column width were these run at - there is a heck of a lot of ugly line wrapping going on in at lest the LLVM sample...

The webkit format is nearly unreadable because of how the Boolean conditions are wrapped https://github.com/fjeremic/omr/blob/51db3486144d561fb44a94f5fb899d5c540d3c29/compiler/optimizer/VirtualGuardHeadMerger.cpp#L539
Looks like nearly everything is on one line and you have to scroll forever...

Do any of the styles allow the Boolean operators to be at the start of a line rather than the end? It is a style that was adopted in the optimizer a while ago and it makes reading a lot easier since you can more quickly see how things combine... None of these styles seem to have that?

Regardless of the style chosen, in the compiler the forward declarations for classes in the TR namespace consume a lot of real estate when reformatted.

Sure, this can easily be done on any style we end up proposing.

what column width were these run at

All the styles were ran with default column width values. I believe they all use column width 80 or unlimited. Style 6. based on WebKit from our previous try at clang-format which most closely matches some of our current patterns uses a column width of 120 which matches that of the GitHub review interface. You can find the same line in that style here:

https://github.com/fjeremic/omr/blob/78db3731f245ede0829df93d0e0b9400dd716c2c/compiler/optimizer/VirtualGuardHeadMerger.cpp#L577-L587

which has boolean operators broken at the start of a line if they overflow the column limit.

I'm going to reference #350 as well since there is some useful information there from the previous attempt, most notably some experimentation and a mailing list thread with a ton of discussion about formatting that we had before [1].

There was also more discussions on some of the choices in a Slack thread which can be found in [2].

[1] https://www.eclipse.org/lists/omr-dev/msg00023.html
[2] https://eclipse-omr.slack.com/archives/CCQ8B4B39/p1545061929006600

Hi,

Couple of comments:
a) You should not try to choose a new format. The approach should be to get clang to generate existing format as far as possible. This will minimise impact on everyone. I strongly recommend against bike shedding conversations on format choice.
b) The only real issue is the impact on existing pull requests - this needs to be addressed first.

Regards

a) You should not try to choose a new format. The approach should be to get clang to generate existing format as far as possible. This will minimise impact on everyone. I strongly recommend against bike shedding conversations on format choice.

Thanks for the feedback, I would tend to agree. Based on previous discussions in [1] and [2] that format would be the one we landed on in #3390 which was 6.

[1] https://www.eclipse.org/lists/omr-dev/msg00023.html
[2] https://eclipse-omr.slack.com/archives/CCQ8B4B39/p1545061929006600

b) The only real issue is the impact on existing pull requests - this needs to be addressed first.

This problem can be solved following the technique outlined by the MongoDB folks described in detail in [3] (under _Rescuing stranded changes_ section). We will adopt the same approach for all in-flight PRs.

[3] https://engineering.mongodb.com/post/succeeding-with-clangformat-part-3-persisting-the-change

Do any of the styles allow the Boolean operators to be at the start of a line rather than the end?

Webkit seems to do that https://github.com/fjeremic/omr/blob/clang-format-3390/compiler/control/OMROptions.cpp#L4277-L4280

The webkit format is nearly unreadable because of how the Boolean conditions are wrapped

Wonder how in the link above, it did wrap the lines.

EDIT: actually @andrewcraik , in webkit, VirtualGuardHeadMerger is formatted with boolean operators at the beginning of the line, and with line breaks: https://github.com/fjeremic/omr/blob/clang-format-3390/compiler/optimizer/VirtualGuardHeadMerger.cpp#L577-L579

Comments on nested elements are unexpectedly indented, see
https://github.com/fjeremic/omr/blob/clang-format-webkit/port/common/omrcuda.cpp#L1223-L1227

Perhaps that's just a bug in 5 (I have a hard time believing it was intentional)?

That raises another question, though: what do people think we should do when the chosen format changes (or is 'fixed')? Do we re-apply it to the entire repository? Or if the change is minor, allow the code to be corrected as it is modified?

EDIT: actually @andrewcraik , in webkit, VirtualGuardHeadMerger is formatted with boolean operators at the beginning of the line, and with line breaks:

I have no idea how I saw it as broken as it was when I viewed it originally.

It is interesting that the original format for that block was:

   else if (
            !tailSplitEnd->getOpCode().isReturn() &&
            !tailSplitEnd->getOpCode().isJumpWithMultipleTargets() &&
             tailSplitEnd->getOpCodeValue() != TR::athrow &&
            !(tailSplitEnd->getNumChildren() >= 1 && tailSplitEnd->getFirstChild()->getOpCodeValue() == TR::athrow)
           )

The leading operators are an improvement, but it is interesting that the separation onto new lines was thrown out. I'm guessing the style tries to compact code even if it is explicitly line broken by the dev?

Perhaps that's just a bug in 5 (I have a hard time believing it was intentional)?

It is a not a bug, just the way WebKit formats things:

> clang-format-9 -style=WebKit -dump-config | grep ColumnLimit
ColumnLimit:     0

According to the documentation for ColumnLimit when it is set to 0:

A column limit of 0 means that there is no column limit. In this case, clang-format will respect the input鈥檚 line breaking decisions within statements unless they contradict other rules.

So by default WebKit style will not re-flow comments, meaning if there is a tab within a comment (as is such in the case you pointed out) it will not change the tabs. The tab before the start of the comment is changed however, hence why we get the mix that you observe. This has been addressed in #3390 already as Irwin pointed out.

[1] https://clang.llvm.org/docs/ClangFormatStyleOptions.html

The leading operators are an improvement, but it is interesting that the separation onto new lines was thrown out. I'm guessing the style tries to compact code even if it is explicitly line broken by the dev?

That's correct. Developer line brakes are not followed. Line breaking rules will apply only when the line exceeds the set column limit (this is true for all styles).

That raises another question, though: what do people think we should do when the chosen format changes (or is 'fixed')? Do we re-apply it to the entire repository? Or if the change is minor, allow the code to be corrected as it is modified?

Great question. I don't know the answer but I would imagine since clang-format at that point will be part of the workflow it won't nearly be as intrusive since we'll be accustomed to tool formatted code already. I do however believe applying such changes piece-wise may be difficult as automation will check formatting on a file basis, rather than diff basis (this is not easy maintain from what I've read). So I would think if there are future changes to the style they will be applied in bulk at once.

It is a not a bug, just the way WebKit formats things:
According to the documentation for ColumnLimit when it is set to 0:

A column limit of 0 means that there is no column limit. In this case, clang-format will respect the input鈥檚 line breaking decisions within statements unless they contradict other rules.

So by default WebKit style will not re-flow comments, meaning if there is a tab within a comment (as is such in the case you pointed out) it will not change the tabs. The tab before the start of the comment is changed however, hence why we get the mix that you observe. This has been addressed in #3390 already as Irwin pointed out.

If it had left the tabs in comments alone, I could understand, but it expanded the tabs differently (to 8 spaces) in the comment compared with outside the comment (4 spaces). Either tabs are 4 or 8 spaces, but not both.

That raises another question, though: what do people think we should do when the chosen format changes (or is 'fixed')? Do we re-apply it to the entire repository? Or if the change is minor, allow the code to be corrected as it is modified?

Great question. I don't know the answer ...

I wasn't asking anyone to predict the future; rather I was asking for opinions on what should be done in that situation.

I wasn't asking anyone to predict the future; rather I was asking for opinions on what should be done in that situation.

This is because the IndentWidth and ContinuationIndentWidth in vanilla WebKit is set to 4 but TabWidth is set to 8:

> clang-format-9 -style=WebKit -dump-config | grep -E "^(ContinuationIndentWidth|IndentWidth|TabWidth|UseTab)"
ContinuationIndentWidth: 4
IndentWidth:     4
TabWidth:        8
UseTab:          Never

And UseTab is set to Never, but because ColumnWidth is set to 0 it does weird things to tabs within and around comments. It is unfortunate, but we cannot get away without fiddling with the base styles since our codebase was not formatted from the beginning.

What does the formatter do with tabs elsewhere, like within string literals?
IMO, such things should be expunged from the source, but we don't live in an ideal world.
I do think, however, that we should be using TabWidth: 4.

What does the formatter do with tabs elsewhere, like within string literals?

It does not touch string literals.

I do think, however, that we should be using TabWidth: 4.

We can make this amendment to #3390.

I would like to use c++11-style template angle brackets. In #3390 the belief was that xlc didn't support the new syntax, is that actually true?

I do think, however, that we should be using TabWidth: 4.

We can make this amendment to #3390.

Since style 6 (#3390) doesn't use tabs, I'm not sure the tab-width is going to affect anything.

IMHO what really matters is that, if we use tabs:

  • they should only be used for indentation (never for alignment)
  • we should only allow indent/dedent increments which are a multiple of tab width

If we can do this, then tab-width will be inconsequential. Style 6 sidesteps the problem by banning tabs altogether.

The problem with tabs was the conversion of (leading) tabs in a multi-line comment to 8 spaces instead of 4 to be consistent with the indent increment of 4.

In the cuda comment example, the leading tabs were left unconverted.

Hi everyone,

I've built up the automation on my personal fork as an example of what can be done. The automation runs an extra job as part of the Travis CI build which clang-format checks that each commit within a PR is in conformance with the project defined style. The automation then uses the GitHub Status API to tag the HEAD commit in the PR with a separate status badge indicating pending/success/failure of the clang-format job.

You can find the sample automation here (note the two PR checks that happen as part of a single Travis build):
https://github.com/fjeremic/omr-clang-format-automation/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc

As part of the automation I've implemented MongoDBs algorithm for rebasing PRs which were opened before the formatting took place onto commits representing the same changes after the formatting. This is all done via automation so hopefully not much action for the end-user other than running the script. There are several PR examples on my fork which demonstrate the tooling in place. You can check them all out in the above link.

Lastly I will be putting together an official committer vote with a full proposal to implement automated code formatting early next week to keep the momentum going.

_Edit:_

Forgot to mention. In case the automation fails, i.e. a commit is not formatted correctly the automation provides a git patch file which the user can grab and apply to their commit to fix up the formatting without even having clang-format installed.

You can see and example of this in the following PR:
https://github.com/fjeremic/omr-clang-format-automation/pull/3

By following the "Details" link on the failed _continuous-integration/travis-ci/clang-format_ job and looking at the build output for patch file which is displayed. Applying the patch file correctly formats the commit in a way that the formatting check will pass.

One thing I have noticed with style 6 is that some of the optimizer and control files can get some pretty high levels of indent:
https://github.com/fjeremic/omr/blob/78db3731f245ede0829df93d0e0b9400dd716c2c/compiler/compile/OSRData.cpp#L1314
https://github.com/fjeremic/omr/blob/78db3731f245ede0829df93d0e0b9400dd716c2c/compiler/optimizer/Inliner.cpp#L895
https://github.com/fjeremic/omr/blob/78db3731f245ede0829df93d0e0b9400dd716c2c/compiler/optimizer/VPHandlers.cpp#L541
https://github.com/fjeremic/omr/blob/78db3731f245ede0829df93d0e0b9400dd716c2c/compiler/optimizer/VPHandlers.cpp#L2246
Use of nested loops and conditions happens in a lot of places. Could we consider an indent of 2 rather than 4 to help reduce line wrapping and reduce the overall 'width' of the code? It seems like it might make things a bit easier to grok... It will make life on narrow screens a lot easier too...

There are also a few weirdly indented hanging parens (see https://github.com/fjeremic/omr/blob/78db3731f245ede0829df93d0e0b9400dd716c2c/compiler/optimizer/Inliner.cpp#L1157 as an example

This ternary also looks quite odd - almost like a format bug https://github.com/fjeremic/omr/blob/78db3731f245ede0829df93d0e0b9400dd716c2c/compiler/optimizer/Inliner.cpp#L1762

It is also sad that the bracket fixups don't seem to work with end of line comments leading to some inconsistency in the braces like https://github.com/fjeremic/omr/blob/78db3731f245ede0829df93d0e0b9400dd716c2c/compiler/optimizer/Inliner.cpp#L3117. That is a limitation of the tool and something I can live with, it just seems suboptimal and desirable to fix...

Overall, the style is, in general, readable on the files I have checked. I will check some more tomorrow to see if there are any fun format issues lurking.

Use of nested loops and conditions happens in a lot of places. Could we consider an indent of 2 rather than 4 to help reduce line wrapping and reduce the overall 'width' of the code? It seems like it might make things a bit easier to grok... It will make life on narrow screens a lot easier too...

Here is a branch with indentation level of 2 spaces and the same lines of code noted:
https://github.com/fjeremic/omr/tree/clang-format-3390-indent2

https://github.com/fjeremic/omr/blob/clang-format-3390-indent2/compiler/compile/OSRData.cpp#L1318
https://github.com/fjeremic/omr/blob/clang-format-3390-indent2/compiler/optimizer/Inliner.cpp#L910
https://github.com/fjeremic/omr/blob/clang-format-3390-indent2/compiler/optimizer/VPHandlers.cpp#L532
https://github.com/fjeremic/omr/blob/clang-format-3390-indent2/compiler/optimizer/VPHandlers.cpp#L2201

There are also a few weirdly indented hanging parens (see fjeremic/omr:compiler/optimizer/Inliner.cpp@78db373#L1157 as an example

This is an outlier due to an #ifdef inside of an if expression. The parenthesis can be removed here. As we encounter such code moving forward we can definitely improve it as one will be able to see what the formatter will do to the code as we're writing it.

This ternary also looks quite odd - almost like a format bug fjeremic/omr:compiler/optimizer/Inliner.cpp@78db373#L1762

The formatter will try to respect developer line breaks as long as they don't break any of the formatting rules applied. In this case it doesn't break any rule. You can see the same line break exists in today's master branch so the formatted code is definitely not a regression:
https://github.com/eclipse/omr/blob/master/compiler/optimizer/Inliner.cpp#L1914-L1917

It is also sad that the bracket fixups don't seem to work with end of line comments leading to some inconsistency in the braces like fjeremic/omr:compiler/optimizer/Inliner.cpp@78db373#L3117. That is a limitation of the tool and something I can live with, it just seems suboptimal and desirable to fix...

This is something which we will unfortunately not be able to easily fix. The tool does not play nice with comments following statements because it does not reflow code from before to after comments or vice versa. This is something we would fix as developers are working on the code in the area. Over time I would imagine these will not be an issue.

Just for the record I vote against such reformatting and if I have right of veto I would use it to block, at least for GC part of the code.
There number of problems are on the surface:

  • Code after formatting can not be reviewed properly due the large volume
  • Formatting would invalidate all outstanding branches/forks because of changes in white spaces. Every single outstanding change line would have a conflict required to be resolved. It would be huge impact for people doing research work (for example most of GC Team members)
  • We would not have ability compare code "before" and "after" used a lot for debugging

Also I would not underestimate subjective personal factor. The 90% of developer's work is a code reading. Even we move out of brackets discussion what is "correct" or "wrong" look of the code (coding standards document prescribes basics, is not it?) look of reformatted code would be "unusual" for code owner and other developers use to work with it. It might reduce effectiveness of development significantly.
I believe such reformatting initiative is risky and there is no direct demand to do it here and now.
This initiative could potentially compromise development and serviceability.

Code after formatting can not be reviewed properly due the large volume

Probably the best way to deal with this is to sample the codebase, and especially files that you think/know are formatted a certain way, to see how it is affected. If there's a strong reason to keep the formatting, there are comments that can be put in the code to indicate sections of code that the formatting tool shouldn't touch.

Formatting would invalidate all outstanding branches/forks because of changes in white spaces. Every single outstanding change line would have a conflict required to be resolved. It would be huge impact for people doing research work (for example most of GC Team members)

This, I believe, is pretty easy to deal with; devs would just have to run clang format on their branches and then rebase.

We would not have ability compare code "before" and "after" used a lot for debugging

https://github.com/eclipse/omr/issues/4577#issuecomment-556507639 should address that issue.

Also I would not underestimate subjective personal factor. The 90% of developer's work is a code reading. Even we move out of brackets discussion what is "correct" or "wrong" look of the code (coding standards document prescribes basics, is not it?) look of reformatted code would be "unusual" for code owner and other developers use to work with it.

This is definitely true, and is often downplayed. However, at the same time, as devs, we're pretty good at adapting to things. Plus, just as the intangible human subjective/personal preferences play a role, I believe the intangible psychological benefit of one unified project must also play a role. It may seem silly, but a unified codebase right from the coding standard up to the quality bar has to, at some level, strengthen the project (be it the code or the community); we're all experts in different components, but we contribute the same quality code to the same project with the same expectation as anyone else.

but a unified codebase right from the coding standard up to the quality bar has to, at some level, strengthen the project

Let's be clear though - this discussion is NOT about the coding standard. It's about whitespace placement which is the least interesting thing to focus on.

Let's be clear though - this discussion is NOT about the coding standard.

I was under the impression that coding standard was interchangeable with coding convention; seeing as it's not, I'll change what I said above to:

It may seem silly, but a unified codebase right from the coding convention up to the quality bar has to, at some level, strengthen the project (be it the code or the community); we're all experts in different components, but we contribute the same quality code to the same project with the same expectation as anyone else.

clang-format handles whitespace only and doesn't address the bigger picture about how code is written - things like the rules in the omr coding standard:

Never combine an assignment and a test, except in a while condition. In a while loop, always use an explicit test.

When possible, try to avoid negative conditions. Positive conditions are easier to understand.

Macros should be "contained" and "context-free".

Formatting doesn't address any of them or many other conventions in the OMR coding standard that help to write readable and debuggable code.

I think it's important to be clear that this issue is addressing formatting only as there are a lot of mental connections to "coding standard" that won't be addressed by this issue

I think it's important to be clear that this issue is addressing formatting only as there are a lot of mental connections to "coding standard" that won't be addressed by this issue

Yeah that makes a lot of sense. The title of this PR does say "format", but I suppose it is still worth being very explicit about exactly what that means.

I think it's important to be clear that this issue is addressing formatting only as there are a lot of mental connections to "coding standard" that won't be addressed by this issue

Absolutely agree. Formatter using "code as tape" concept and manipulate with white spaces. However the only fractional part of coding standards addresses issues like "where I should put bracket". Most of coding standards applying rational to improve correctness of code but not it's readability only

As a major contributor to GC code I officially asking do not apply automatic formatting to OMR/GC code @rwy0717

How automated formatting tool supposed to be used from Eclipse? Do you have recommendations?

How automated formatting tool supposed to be used from Eclipse? Do you have recommendations?

Did some digging around [1] [2]. Given that one of the goals above is

Provide optional git hooks for users to be able to automatically format their code on git commit

After installing cygwin, clang, and the git hook that will be provided (added to <root>/.git/hooks/pre-commit), JGit/EGit should just work.


[1] https://zauner.nllk.net/post/0001-git-hooks-currently-supported-by-jgit/
[2] https://www.eclipse.org/forums/index.php/t/1067404/

How automated formatting tool supposed to be used from Eclipse? Do you have recommendations?

Outside of configuring a git pre-commit hook the easier route for Eclipse is just to install the CppStyle plugin which runs clang-format on the repository using the repository defined style:
https://marketplace.eclipse.org/content/cppstyle

They also have a website and GitHub repo with tons of info:
http://www.cppstyle.com/

Does pre-commit script actually change the content in 'master' files or create a temporary variant that is to be committed?

Does it give feedback what kind of formatting was applied so that a developer learn them over time?

Does per-commit script actually change the content in 'master' files or create a temporary variant that is to be committed?

There are different ways to implement this. [1] and [2] are two such examples. There are others of course. These are very simple hooks to implement. [1] formats all the files which were touched when you execute git commit and it then re-adds the files changed into the commit. With [1] you don't see any of the changes. [2] just formats all the files changed and leaves the changes as unstaged. The user than manually amends the commit with the changes.

Does it give feedback what kind of formatting was applied so that a developer learn them over time?

It depends on what kind of hook you install, or if you even use a hook. If you want to see the changes you can just install a hook which will leave the formatting fixes unstaged so you can see the diff.

Personally however I don't use a git hook and rather just configure my editor to clang-format on save. So I write code trying to match the style of the repository and as soon as I save my editor runs clang-format and I'm done with it. This is typical of most editor plugins such as the Eclipse one linked previously, VS Code [3], vim [4], etc.

Most editors also support selecting (highlighting code) and running clang-format on it if you would like to see what the formatter does to a piece of code.

[1] https://github.com/andrewseidl/githook-clang-format
[2] https://gist.github.com/alexeagle/c8ed91b14a407342d9a8e112b5ac7dab
[3] https://marketplace.visualstudio.com/items?itemName=xaver.clang-format&ssr=false#overview
[4] https://github.com/rhysd/vim-clang-format

Tried cppstyle. Easy to install in Eclipse (good). This one re-formats on explicit ctrl-shift-f, not on a save (ah, well). But it did not quite format it all that way I expected. Removal of extra spaces and lines kind of worked, but repositioning of curly braces did not. Wonder if there is more to it, that i'm missing (EDIT: probably clang-format itself)...

Overall, it comes down to how much investment (effort in inital rebase, effort per code line/block/commit, and updating/configuring the tool) vs how much return (just space formatting).

I don't have strong opinion (as usual, for process related things), but for a typical GC developer (where code is already reasonable well formatted and standardized) it feels like it's gonna be a bit of annoyance for little benefit.

As a major contributor to GC code I officially asking do not apply automatic formatting to OMR/GC code

+1. Probably OMR/GC + OMR/port.

+1. Probably OMR/GC + OMR/port.

Any particular reasons for why you think we should not proceed? If you can help us understand some of your concerns perhaps there is a technical solution to the problems which may have already been discussed. We can learn a lot from the many projects which have gone down this road before us.

Overall, it comes down to how much investment (effort in inital rebase, effort per code line/block/commit, and updating/configuring the tool) vs how much return (just space formatting).

The return on this investment can be summarized with the four bullet points in the original post:

In summary automatically formatted code is:

  • Easier to write

    • You don't have to worry about formatting as a tool will take care of it for you

  • Easier to read

    • When all code is consistent you know exactly what to expect and you don't have to switch mentally of thinking a different way based on which part of the codebase you are in

  • Easier to maintain

    • Your commits will not contain changes unrelated to your diff because your editor chose to format a bunch of whitespace in surrounding areas based on your editor configuration

  • Uncontroversial

    • We won't have to propose stylistic changes in PRs anymore thus saving reviewer and contributor resources and getting code merged faster

In my opinion the biggest one is the reviewer/contributor time savings. Countless hours of churn are spent nitpicking spacing and formatting issues during reviews which consumes both reviewer and contributor time. Every time formatting issues have to be fixed testing needs to be re-spun and there is a time delay between reviewer asking the contributor to actually make the changes. Not to mention the reviewer missing some formatting nitpicks and asking for fixes multiple times.

This is especially a problem with new contributors to the project as we don't currently have a defined style for the various components in OMR, so a contributor has to guess by looking at nearby code what it should be, or even worse just write code in their own personal style which will typically block the merge of a PR due to reviewers flagging that the code formatting is inconsistent with the code surrounding it.

Automation eliminates this problem and removes the noise from the PR reviews which lets the reviewers/contributors focus on the technical contribution, rather then the stylistic formatting thereof. It arguably lowers the bar to entry for new contributors to the project by making use of an industry-standard tool (used by LLVM, Google, WebKit, Mozilla, et.).

Other large benefit is the consistency. No matter which part of the codebase you look at, tool formatted code will look consistent, so you know exactly what to expect, and exactly how certain patterns of code look like, because each instance will be identical. There are no surprises and no mental "switch" needs to be made when writing/reading different parts of the repository.

Any particular reasons for why you think we should not proceed? If you can help us understand ...

As long as the code is correct and readable, the rest is not that important to me. I am not interested in how to format/put white spaces.

As long as the code is correct and readable, the rest is not that important to me. I am not interested in how to format/put white spaces.

@hangshao0 I ask this question with all due respect, as I'm just trying to understand your point of view: If the code after formatting will also be readable and you have no interest in how to format/put white spaces, if the proposed formatting will change only where the white spaces are placed (the end result being the four points listed by @fjeremic above), I don't see why you're against it. Again, I'm not trying to be disrespectful, I'm just trying to get a better understanding.

if the proposed formatting will change only where the white spaces are placed (the end result being the four points ...

I am good with that. I understand there are people who care about the white spaces. I am just wondering if it is worth the effort (certainly "yes" for some people).

I am just wondering if it is worth the effort

That's a reasonable concern. Thanks for clarifying @hangshao0 :)

This is really well put together! I've been hoping to do something similar on my team's codebases. Glad to see such a detailed and informative discussion about the pros and cons - I'll be sharing it at work.

Overall, it comes down to how much investment (effort in inital rebase, effort per code line/block/commit, and updating/configuring the tool) vs how much return (just space formatting).

I don't have strong opinion (as usual, for process related things), but for a typical GC developer (where code is already reasonable well formatted and standardized) it feels like it's gonna be a bit of annoyance for little benefit.

I agree with Aleks and "formatting the GC code" is less favorable for me, for most existing GC code. there are no outstanding formatting issue, I'd rather keep the old code no change until I fully understand the code, also without formatting will be easy for us to trace back history for debugging.

There is some strong feedback in this issue but much of it seems to be from the point of view of a single component or a set of components that currently use the same code format (references like "formatting the GC code" or "OMR/GC + OMR/port").

I'm wondering whether anyone has any feedback to offer on how easy it is to work in OMR on both the compiler and some other component, requiring a developer to directly accommodate the two very different coding formats? Please try to keep any feedback to the trouble working with two formats rather than on perceived good/bad aspects of each format.

I'm wondering whether anyone has any feedback to offer on how easy it is to work in OMR on both the compiler and some other component, requiring a developer to directly accommodate the two very different coding formats? Please try to keep any feedback to the trouble working with two formats rather than on perceived good/bad aspects of each format.

Most recent example in OMR can be found in #4503. Note that each of them happens after a force push, so some were addressed, new ones introduced, etc. etc. I'll list the first 10 formatting related reviews, but there are more:

I get bitten by this often. Although not in OMR here is an example from my latest commit to OpenJ9:

This pretty much happens every time as I can never seem to get the formatting perfect.

Hi everyone,

The Eclipse OMR committers had a vote in #4631 to adopt style 6. (from the original post) on the entire OMR codebase. The vote did _not_ pass, however I'm still hopeful we can find a proposition which will. I am suggesting we leave this issue open for discussion for a while longer so we can discuss possible options as a community if we still see value in pursuing this further.

Thanks to everyone who contributed thus far. Looking forward to hearing from more of you!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

aviansie-ben picture aviansie-ben  路  6Comments

sajidahmed21 picture sajidahmed21  路  3Comments

samolisov picture samolisov  路  5Comments

fjeremic picture fjeremic  路  3Comments

Leonardo2718 picture Leonardo2718  路  6Comments