Go: cmd/compile: branch elimination opportunites when comparing constants

Created on 2 Mar 2020  路  20Comments  路  Source: golang/go

What version of Go are you using (go version)?

latest

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output

$ go env
linux/ppc64le but I see it in the SSA on x86

What did you do?


Inspecting code for other reasons, found cases where there are constant compares and subsequent branches against those constants, resulting in several unnecessary branches.

What did you expect to see?

No unnecessary branches.

What did you see instead?

Unnecessary branching due to constant compares of known values.

This example comes from the ssa output of bytes.(Buffer).WriteByte. Here are some snippets:

v18 00011 (+107) MOVD 8(R6), R7
v38 00012 (107) MOVD 16(R6), R5
v26 00013 (107) SUB R7, R5, R8
v70 00014 (107) CMP R8, $1
// This could be BLT 36. Block starting at 46 sets R3 to 0 then branches to a CMP R3, $0.
b1 00015 (107) BLT 46 <--- Change to BLT 36

v54 00019 (108) MOVD R4, 8(R6)
// Once the block at 46 is eliminated as a predecessor of this block, then the next 3 instructions are not needed, since the CMPW has only a single predecessor and it is comparing 0 against 1 and the BEQ will never happen.
v45 00020 (108) MOVD $1, R3 <---- Remove
v52 00021 (+265) CMPW R3, $0 <---- Remove
b9 00022 (+266) BEQ 36 <---- Remove

v24 00023 (+269) PCDATA $1, $1
v24 00024 (+269) MOVD 8(R6), R4
v34 00025 (+269) PCDATA $0, $2
v34 00026 (269) MOVD (R6), R5
v88 00027 (269) CMPU R4, R7

v63 00036 (267) PCDATA $1, $0
v63 00037 (+267) MOVD R6, 32(R1)
v81 00038 (267) MOVD $1, R3
v65 00039 (267) MOVD R3, 40(R1)
v66 00040 (+267) CALL "".(*Buffer).grow(SB)
v68 00041 (267) MOVD 48(R1), R7
v36 00042 (269) PCDATA $0, $1
v36 00043 (269) PCDATA $1, $1
v36 00044 (269) MOVD "".b(R1), R6
b11 00045 (267) JMP 23

// The MOVD $0, R3 and JMP 21 could be replaced by JMP 36 since the successor is CMP $0, R3 followed by BEQ 36. Once it is decided that 36 is the successor, I think the MOVD $0, R7 could also go away since the block starting at 36 doesn't use it. That also means the predecessor of this block could branch directly to 36.
v21 00046 (267) PCDATA $1, $0 <--- Remove
v21 00047 (267) MOVD $0, R7 <--- Remove
v25 00048 (267) MOVD $0, R3 <--- Remove
b2 00049 (+265) JMP 21 <---- Remove

This scenario is common in stdlib functions. It looks like the same situation occurs in x86, so it is not specific to ppc64/ppc64le.

@josharian I think I reported this problem last fall and you commented on it, but I can't find the issue where that was done.

NeedsInvestigation Performance

Most helpful comment

I'm working on this.

All 20 comments

Parts of this sound like the shortcircuit pass. Is that working correctly here?

Here is the email that I wrote to @dr2chase this morning. It looks apropos. (Note that the situation I describe in the email occurs in block b9 in the shortcircuit pass of GOSSAFUNC="(*Buffer).WriteByte" go build bytes.)

Hi!

You know way more than me about this stuff, so I have a question for you...

I was investigating why CL 220684 causes cmd/compile to shrink. It
turns out that it enables the shortcircuit pass to work on a few
functions, and that that causes significant (>5% text size)
improvements.

So I went to look at beefing up the shortcircuit pass. As a reminder,
we're dealing with situations like:

p1         p2         (maybe p3, p4, etc.)
  \         /
   \       /
       b
       phi (true, v1)
       if phi s1 else s2
     /     \
    /       \
s1          s2

Since going from p1 to b guarantees that we'll then go to s1, we
rewrite the p1 outbound branch to go straight to s1.

This currently requires that the phi be the only value in b, and that
it have only one use. That way we can disregard it after the CFG
change.

The case I am looking at is one in which there are also other phis in
b. This isn't necessarily a lost cause, depending on where the other
uses of those phis are.

If the use is dominated by s2, but not reachable from s1 without going
through b, then we know that we went from b to s2. And we know that if
we came in to b from p1, then we would have gone from b to s1 instead.
So we know that we came in to b from p2. And therefore we know the
value of the phi is its second arg.

If the use is dominated by s1, but not reachable from s2 without going
through b, then we know that we went from b to s1. And thus, mutatis
mutandis, that we arrived at s1 through the same predecessor we would
have arrived at b though. Therefore we can move the phi value from b
to s1.

So if there aren't any phi uses whose provenance are ambiguous, then
we can rewrite all phi uses, and thus still do the shortcircuit
optimization of moving the edge.

My questions are:

If there a name for the CFG property of being dominated by s2 but not
reachable by s1 without going through b? (It's sort of like being
dominated by an edge rather than a node.) Is there a nice data
structure for calculating it? Or should I be thinking about this in a
different way?

Sorry for the long email. This stuff is easier with a whiteboard. :(

Thanks!

-josh

Sorry not to reply earlier.
I don't know of a name for this offhand, but there could be one.
I might want to do a critical-edges transformation to make it easier to reason about, that way "dominated by an edge" is also "dominated by a node", and we know that s1 does not dominate s2 and vice-versa -- which gives us that each use is either

  1. dominated by s1; move xb := phi(x1,x2) to s1.
  2. dominated by s2; s/xb/x2
  3. dominated by neither, in code that is properly dominated by b
  4. dominated by neither, input to a phi that is not properly dominated by b (might be b).

I think case 3 is potentially a mess, it might require a lot of phi insertion depending on the shape of the graph.

Case 4 depends on the source of the input flow edge; it is some block, either falling into case 1, 2, or 3.
1 and 2 are easy, 3 is a mess.

Slightly off-topic, I recall recently seeing a formulation for SSA where blocks with phis are like "functions" and their predecessors "call" them, passing parameters appropriately. This simplifies the description of case 4 above because it is converted to 1, 2, or 3, as appropriate.

Sorry, David. I didn鈥檛 mean to imply that I had a sub-24-hour SLA on random personal emails asking for help!

The idea of doing a critical edge removal pass first is an intriguing one, thanks!

I'm working on this.

@laboger, where you referring to #33713 in the initial message?

@laboger, where you referring to #33713 in the initial message?

Yes. Not sure if they are the same, i.e., will the same fix resolve both? BTW, this does occur quite a bit in the stdlib functions. I saw it several times without looking too hard.

This is proving to be a rich source of optimizations. The challenge is that every time I try a deeper analysis, it catches more cases and is also slower. Soon (I hope today) I'll start mailing some of the cheap should-definitely-do-this cases, and we can work incrementally from there.

Change https://golang.org/cl/222917 mentions this issue: cmd/compile: rename a local variable in shortcircuitBlock

Change https://golang.org/cl/222918 mentions this issue: cmd/compile: remove loop in shortcircuit

Change https://golang.org/cl/222919 mentions this issue: cmd/compile: more minor cleanup in shortcircuitBlock

Change https://golang.org/cl/222923 mentions this issue: cmd/compile: handle some additional phis in shortcircuit

Since the 1.15 window has passed, I'm going to jot down a few more related things and test cases I've noticed, just to have them written down somewhere.

(1) The shortcircuit pass can be strengthened considerably, at some compiler performance risk, by figuring out the locations of all the uses of the non-control phis. I have CLs locally that play with this.

(2) The phiopt pass could be strengthened. For example, in runtime.atoi, we find code like this:

    neg := false
    if s[0] == '-' {
        neg = true
        s = s[1:]
    }

This should be ripe for the phiopt pass to convert into code like:

  neg := s[0] == '-'
  if neg {
    s = s[1]
  }

But it doesn't because of the second conditions[0] contains a bounds check. In this case, prove later removes that bounds check. But there are other cases in the tree in which it cannot. To find them, at the end of the phiopt pass, add something like:

    for _, b := range f.Blocks {
        for _, v := range b.Values {
            if v.Op != OpPhi {
                continue
            }
            all := true
            for _, a := range v.Args {
                if a.Op != OpConstBool {
                    all = false
                    break
                }
            }
            if !all {
                continue
            }
            fmt.Println(f.Name, b, v)
        }
    }

(3) The shortcircuit pass currently looks for situations in which the phi control value of a block is known for a particular inbound edge because it is a constant along that edge.

But there is a structurally similar scenario that can occur in which a control value of a block is known for a particular inbound edge because its predecessor branched on that exact same control value. For example, runtime.atoi contains this code:

    if !neg && un > uint(maxInt) {
        return 0, false
    }
    if neg && un > uint(maxInt)+1 {
        return 0, false
    }

After CSE, the first thing we do is branch on !neg. And if it is false, we immediately branch again on !neg. We should be able to proceed directly to un > uint(maxInt)+1. That is, we should optimize this code to:

    if !neg {
        if un > uint(maxInt) {
            return 0, false
        }
    } else {
        if un > uint(maxInt)+1 {
            return 0, false
        }
    }

Prove can catch some repeated block controls, but not conditional ones; that's where a modified short circuit pass would help.

(4) We should consider making phiopt part of the fuse pass loops, or at least after prove. cmd/compile/internal/types.(*Type).IsInteger is a good test case. It is:

func (t *Type) IsInteger() bool {
    switch t.Etype {
    case TINT8, TUINT8, TINT16, TUINT16, TINT32, TUINT32, TINT64, TUINT64, TINT, TUINT, TUINTPTR:
        return true
    }
    return false
}

This switch case should be optimized into a single unsigned comparison. But it's too complicated for phiopt. Prove simplifies it, but it's too late for phiopt. Fuse can't handle it. We end up with pointless branching.

We might want to make phiopt a fuse function, like we recently did for shortcircuit, and run it after prove.

(5) Running shortcircuit again later helps. The trick is doing so cheaply enough without having to run a deadcode afterwards. I tried this in CL 229058 and failed. The failure is very interesting, and are the effects of doing that CL piecemeal. I still think it's a good idea, but it needs some thought and investigate before trying again.

That's a bunch of stuff. But it all falls under the umbrella of "look at the CFG and do less branching". Help with any/all of these is of course welcome.

cc @mundaym

A long time ago, I found that compared with gcc, the generation instruction of function math.IsInf is very cumbersome and there are many jumps, see https://godbolt.org/z/n8wWyG and https://godbolt.org/z/GcWh9S. There doesn't seem to be no improvement at the moment. I look forward to making it as concise as the instructions generated by gcc. I am interested in this case. But I won't be free until next quarter.

Hi, I'm doing the third one.

If there are no other volunteers, I will look into the second one. Thank you. 馃槉

Change https://golang.org/cl/244737 mentions this issue: cmd/compile/ssa: optimize the derivable known branch of If block

Change https://golang.org/cl/252937 mentions this issue: cmd/compile/internal/ssa: strengthen phiopt pass

Change https://golang.org/cl/257981 mentions this issue: cmd/compile: optimize the Phi values

This comment is an explanation of CL https://go-review.googlesource.com/c/go/+/244737 @randall77
Take math.IsInf as an example.
func IsInf(f float64, sign int) bool {
return sign >= 0 && f > MaxFloat64 || sign <= 0 && f < -MaxFloat64
}
The IR graph before shortcircuit:
image
(before)
The control value of b2 is v15, a phi op, one of its argument is v11, which is the control value of b1. So if b2 is coming from b1, we know v11 is false, then v15 is false, and we can simply redirect b1 to b5. This is what shortcircuit does. CL_244737 can't do this optimization, because CL_244737 can't infer the value of v15 from v11. From the op of v11, CL_244737 knows the relationship of v10 and v8, from the edge b1->b5, it also knows the v11 is false, but it doesn't know the value of v15 from the above knowledge.
What is missing here is to derive the output value of phi through its inputs. This is exactly what shortcircuit does. So CL_244737 needs to be based on shortcircuit.

After shortcircuit, the graph becomes as follow, then CL_244737 can play a role.
image
(after)
If b5 if coming from b1, then from the control value (b11) of b1, CL_244737 knows that v10 > v8, the control value of b5 is v17, so CL_244737 knows that b5's true branch is the taken branch. So we can redirect b1 to b7 directly.
It needs to be pointed out that prove pass can't handle what CL_244737 does, because b5 has multiple predecessors.

Was this page helpful?
0 / 5 - 0 ratings