Rust: Equality Comparison of u128 produces less ideal assembly

Created on 21 May 2019 · 7 Comments · Source: rust-lang/rust

Compiled on stable in release mode.

#[no_mangle]
pub fn foo(a: u128, b: u128) -> bool
{
    a == b
}

#[no_mangle]
pub fn foo2(a: u128, b: u128) -> bool
{
    ((a >> 64) as u64) == ((b >> 64) as u64) && (a as u64) == (b as u64)
}

#[no_mangle]
pub fn foo3(a: u128, b: u128) -> bool
{
    bar((a >> 64) as u64, a as u64, (b >> 64) as u64, b as u64)
}

#[no_mangle]
pub fn bar(a1: u64, a2: u64, b1: u64, b2: u64) -> bool
{
    a1 == b1 && a2 == b2
}

Produces:

foo:
    movq    xmm0, rcx
    movq    xmm1, rdx
    punpcklqdq  xmm1, xmm0
    movq    xmm0, rsi
    movq    xmm2, rdi
    punpcklqdq  xmm2, xmm0
    pcmpeqb xmm2, xmm1
    pmovmskb    eax, xmm2
    cmp eax, 65535
    sete    al
    ret

foo2:
    xor rsi, rcx
    xor rdi, rdx
    or  rdi, rsi
    sete    al
    ret

foo3:
    xor rsi, rcx
    xor rdi, rdx
    or  rdi, rsi
    sete    al
    ret

bar:
    xor rdi, rdx
    xor rsi, rcx
    or  rsi, rdi
    sete    al
    ret

It seems like all three methods should produce the same ASM.

A-LLVM I-slow T-compiler

All 7 comments

I'm definitely not an expert on this, but my understanding is that each 128-bit integer is passed in two 64-bit registers. As such, all 3 functions should be doing equivalent work, so I think all 3 (with optimizations on) should produce the same instructions, unless the SIMD version is somehow faster. At a glance it looks like both more work and more instructions, though, so I suspect it isn't.
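To make the register-pair view concrete, here is a minimal sketch of the equality test written the way the scalar assembly for foo2, foo3, and bar computes it: XOR each pair of 64-bit halves, OR the two results, and check for zero. The names `split` and `eq_xor_or` are hypothetical and do not appear in the issue; this only illustrates the xor/xor/or/sete sequence and is not a claim about how the compiler lowers `==` internally.

```rust
/// Hypothetical helper (not from the issue): split a u128 into its
/// (high, low) 64-bit halves, mirroring the two-register view.
fn split(x: u128) -> (u64, u64) {
    ((x >> 64) as u64, x as u64)
}

/// Equality in the shape of the scalar assembly:
/// XOR each pair of halves, OR the results, and test for zero.
pub fn eq_xor_or(a: u128, b: u128) -> bool {
    let (a_hi, a_lo) = split(a);
    let (b_hi, b_lo) = split(b);
    // Any differing bit in either half leaves a nonzero bit after the OR.
    ((a_hi ^ b_hi) | (a_lo ^ b_lo)) == 0
}
```

For any pair of inputs, `eq_xor_or(a, b)` should agree with `a == b`, which is why the optimizer is free to emit the same five instructions for every variant.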

Kind of surprised nobody noticed this bad codegen before...

llc trunk: https://godbolt.org/z/9FqY3w

Seems to be a regression between LLVM 4.0.1 and LLVM 5.

LLVM-4.0.1:

foo:                                    # @foo
# BB#0:
        xor     rsi, rcx
        xor     rdi, rdx
        or      rdi, rsi
        sete    al
        ret

LLVM-5.0:

foo:                                    # @foo
# BB#0:
        movq    xmm0, rcx
        movq    xmm1, rdx
        punpcklqdq      xmm1, xmm0      # xmm1 = xmm1[0],xmm0[0]
        movq    xmm0, rsi
        movq    xmm2, rdi
        punpcklqdq      xmm2, xmm0      # xmm2 = xmm2[0],xmm0[0]
        pcmpeqb xmm2, xmm1
        pmovmskb        eax, xmm2
        cmp     eax, 65535
        sete    al
        ret
                                        # -- End function
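For readers unfamiliar with the SSE sequence, the following is a rough portable model of what the LLVM 5 output computes, assuming the usual semantics of pcmpeqb (per-byte equality) and pmovmskb (one mask bit per byte lane). The function names are hypothetical, and the loop only models the result, not the actual instructions. It shows where the constant 65535 comes from: all 16 byte lanes must compare equal.

```rust
/// Portable model of `pcmpeqb` followed by `pmovmskb`:
/// bit i of the result is set when byte i of `a` equals byte i of `b`.
pub fn byte_eq_mask(a: u128, b: u128) -> u32 {
    let a_bytes = a.to_le_bytes();
    let b_bytes = b.to_le_bytes();
    let mut mask = 0u32;
    for i in 0..16 {
        if a_bytes[i] == b_bytes[i] {
            mask |= 1 << i;
        }
    }
    mask
}

/// Models `cmp eax, 65535` / `sete al`: equal only if all 16 lanes matched.
pub fn eq_via_mask(a: u128, b: u128) -> bool {
    byte_eq_mask(a, b) == 65535
}
```

The result is the same as the xor/or form, but reaching it requires first moving four general-purpose registers into vector registers, which is the extra work visible in the listing.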

I was going to say that perhaps the xmm version has better timing characteristics still, but it does not, even on architectures where movqs are free.

Fixed on nightly: https://godbolt.org/z/1KHG1S
