Rust: Rust 1.25.0 regressed the performance of encoding_rs's UTF-8 validation on i686

Created on 11 Apr 2018 · 34 comments · Source: rust-lang/rust

When Firefox switched from Rust 1.24.0 to Rust 1.25.0, the win32 performance of encoding_rs's UTF-8 validation function dropped by 12.5% on ASCII input. encoding_rs's UTF-8 validation function is a fork of the Rust standard library's validation function. The fork replaces the ASCII-acceleration ALU trick (which autovectorizes on x86_64 but not on i686, and works only in the aligned case) with explicit SIMD code that handles both the aligned and unaligned cases.
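
For reference, the standard library's ALU trick works roughly like this (a minimal Rust sketch under my own naming, not the actual library or encoding_rs code): test the high bit of every byte lane of a machine word at once, so a whole word of ASCII is cleared with one AND and one compare.

```rust
/// Returns the length of the leading all-ASCII prefix of `bytes`.
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    // 0x8080...80: the high bit of every byte lane in a usize.
    const HIGH_BITS: usize = usize::MAX / 255 * 0x80;
    const WORD: usize = std::mem::size_of::<usize>();
    let mut i = 0;
    // Word-at-a-time loop; the real code additionally cares about alignment.
    while i + WORD <= bytes.len() {
        let mut w = [0u8; WORD];
        w.copy_from_slice(&bytes[i..i + WORD]);
        let word = usize::from_ne_bytes(w);
        if word & HIGH_BITS != 0 {
            break; // some byte in this word is >= 0x80
        }
        i += WORD;
    }
    // Byte-at-a-time tail; also pins down the exact index inside a failing word.
    while i < bytes.len() && bytes[i] < 0x80 {
        i += 1;
    }
    i
}
```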

When the input is all ASCII, the function should stay in either the aligned-case or the unaligned-case inner loop, which loads 16 bytes using movdqa or movdqu, respectively, performs pmovmskb on the xmm register, and compares the result to zero, jumping back to the start of the loop if it is zero.
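
The intended fast path can be sketched in Rust roughly as follows (a simplified illustration showing only the unaligned load, not encoding_rs's actual implementation; the function name is hypothetical):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

/// Returns the index of the first non-ASCII byte, or `bytes.len()` if all ASCII.
pub fn ascii_valid_up_to_sketch(bytes: &[u8]) -> usize {
    let mut offset = 0;
    #[cfg(target_arch = "x86_64")]
    unsafe {
        // The hot loop: movdqu + pmovmskb + test + conditional jump, 16 bytes per trip.
        while offset + 16 <= bytes.len() {
            let chunk = _mm_loadu_si128(bytes.as_ptr().add(offset) as *const __m128i);
            // pmovmskb gathers the high bit of each of the 16 bytes into an i32.
            let mask = _mm_movemask_epi8(chunk);
            if mask != 0 {
                // The lowest set bit (bsf) marks the first non-ASCII byte.
                return offset + mask.trailing_zeros() as usize;
            }
            offset += 16;
        }
    }
    // Scalar tail for the remaining < 16 bytes (and for non-x86_64 targets).
    while offset < bytes.len() && bytes[offset] < 0x80 {
        offset += 1;
    }
    offset
}
```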

When compiled for i686 Linux with opt level 2 (which Firefox uses) using Rust 1.24.0, the result is exactly as expected.

Unaligned:

.LBB12_3:
    movdqu  (%edx,%eax), %xmm0
    pmovmskb    %xmm0, %ebp
    testl   %ebp, %ebp
    jne .LBB12_9
    addl    $16, %eax
    cmpl    %ebx, %eax
    jbe .LBB12_3
    jmp .LBB12_5
    .p2align    4, 0x90

Aligned:

.LBB12_7:
    movdqa  (%edx,%eax), %xmm0
    pmovmskb    %xmm0, %ebp
    testl   %ebp, %ebp
    jne .LBB12_9
    addl    $16, %eax
    cmpl    %ebx, %eax
    jbe .LBB12_7
    .p2align    4, 0x90

(Windows wouldn't let me see the asm due to LLVM deeming the IR invalid with --emit asm.)

When compiled with Rust 1.25.0, the result is more complicated:

  1. There are two instances of movdqa and two instances of movdqu suggesting that the first trip through the loop has been unrolled to be a separate copy from the loop proper.
  2. In the actual loop, ALU instructions have been moved around including placing one between the SSE2 instructions.

Both of these transformations look like plausible optimizations, but considering the performance result from Firefox CI, it seems these transformations made performance worse.

.LBB16_1:
    movl    %edx, %ebp
    leal    (%ecx,%edi), %ebx
    movl    $0, %esi
    subl    %edi, %ebp
    cmpl    $16, %ebp
    jb  .LBB16_22
    leal    -16(%ebp), %eax
    testb   $15, %bl
    movl    %eax, 20(%esp)
    je  .LBB16_9
    movdqu  (%ebx), %xmm0
    movl    %edx, 12(%esp)
    xorl    %eax, %eax
    pmovmskb    %xmm0, %edx
    testl   %edx, %edx
    jne .LBB16_7
    movl    24(%esp), %eax
    xorl    %esi, %esi
    leal    (%eax,%edi), %ecx
    .p2align    4, 0x90
.LBB16_5:
    leal    16(%esi), %eax
    cmpl    20(%esp), %eax
    ja  .LBB16_20
    movdqu  (%ecx,%esi), %xmm0
    movl    %eax, %esi
    pmovmskb    %xmm0, %edx
    testl   %edx, %edx
    je  .LBB16_5
.LBB16_7:
    testl   %edx, %edx
    je  .LBB16_12
    bsfl    %edx, %esi
    jmp .LBB16_13
.LBB16_9:
    movdqa  (%ebx), %xmm0
    xorl    %ecx, %ecx
    pmovmskb    %xmm0, %eax
    testl   %eax, %eax
    je  .LBB16_15
    testl   %eax, %eax
    je  .LBB16_19
.LBB16_11:
    bsfl    %eax, %esi
    addl    %ecx, %esi
    jmp .LBB16_14
.LBB16_12:
    movl    $32, %esi
.LBB16_13:
    movl    12(%esp), %edx
    addl    %eax, %esi
.LBB16_14:
    movb    (%ebx,%esi), %al
    jmp .LBB16_24
.LBB16_15:
    movl    $16, %esi
    .p2align    4, 0x90
.LBB16_16:
    cmpl    20(%esp), %esi
    ja  .LBB16_22
    movdqa  (%ebx,%esi), %xmm0
    addl    $16, %esi
    pmovmskb    %xmm0, %eax
    testl   %eax, %eax
    je  .LBB16_16
    addl    $-16, %esi
    movl    %esi, %ecx
    testl   %eax, %eax
    jne .LBB16_11

The asm was obtained by compiling encoding_rs (Firefox uses 0.7.2) using RUSTC_BOOTSTRAP=1 RUSTFLAGS='-C opt-level=2 --emit asm' cargo build --target i686-unknown-linux-gnu --release --features simd-accel and searching for utf8_valid_up_to in the .s file.

A-LLVM I-slow P-medium WG-codegen regression-from-stable-to-stable

All 34 comments

Note that I observed something similar on x86-64 (and Firefox has seen a regression on x86-64 too).

The loop is inlined from validate_ascii.

The presumed unrolling of the first loop trip happens before LLVM, but the inner loop (which has to be what's slow to get the 12.5% drop in perf) looks the same on the IR level, so this is probably an LLVM 4.0 vs. LLVM 6.0 thing.

IR from 1.24.0:

; encoding_rs::Encoding::ascii_valid_up_to
; Function Attrs: readonly uwtable
define i32 @_ZN11encoding_rs8Encoding17ascii_valid_up_to17h3e9b04dc68770813E([0 x i8]* noalias nonnull readonly %bytes.0, i32 %bytes.1) unnamed_addr #6 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %0 = icmp ugt i32 %bytes.1, 15
  br i1 %0, label %bb3.i.i, label %bb30.preheader.i.i

bb3.i.i:                                          ; preds = %start
  %1 = add i32 %bytes.1, -16
  %2 = ptrtoint [0 x i8]* %bytes.0 to i32
  %3 = and i32 %2, 15
  %4 = icmp eq i32 %3, 0
  br i1 %4, label %bb5.i.i.preheader, label %bb18.i.i.preheader

bb18.i.i.preheader:                               ; preds = %bb3.i.i
  br label %bb18.i.i

bb5.i.i.preheader:                                ; preds = %bb3.i.i
  br label %bb5.i.i

bb5.i.i:                                          ; preds = %bb5.i.i.preheader, %bb10.i.i
  %offset.0.i.i = phi i32 [ %10, %bb10.i.i ], [ 0, %bb5.i.i.preheader ]
  %5 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %offset.0.i.i
  %6 = bitcast i8* %5 to <16 x i8>*
  %7 = load <16 x i8>, <16 x i8>* %6, align 16, !alias.scope !6596, !noalias !6601
  %8 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %7) #7
  %9 = icmp eq i32 %8, 0
  br i1 %9, label %bb10.i.i, label %bb15.sink.split.i.i.loopexit

bb10.i.i:                                         ; preds = %bb5.i.i
  %10 = add i32 %offset.0.i.i, 16
  %11 = icmp ugt i32 %10, %1
  br i1 %11, label %bb30.preheader.i.i.loopexit, label %bb5.i.i

bb15.sink.split.i.i.loopexit:                     ; preds = %bb5.i.i
  br label %bb15.sink.split.i.i

bb15.sink.split.i.i.loopexit37:                   ; preds = %bb18.i.i
  br label %bb15.sink.split.i.i

bb15.sink.split.i.i:                              ; preds = %bb15.sink.split.i.i.loopexit37, %bb15.sink.split.i.i.loopexit
  %.sink35.i.i = phi i32 [ %8, %bb15.sink.split.i.i.loopexit ], [ %15, %bb15.sink.split.i.i.loopexit37 ]
  %offset.1.sink.i.i = phi i32 [ %offset.0.i.i, %bb15.sink.split.i.i.loopexit ], [ %offset.1.i.i, %bb15.sink.split.i.i.loopexit37 ]
  %12 = tail call i32 @llvm.cttz.i32(i32 %.sink35.i.i, i1 false) #7
  %13 = add i32 %offset.1.sink.i.i, %12
  br label %_ZN11encoding_rs5ascii17ascii_valid_up_to17h5c1d65b6f83b80ecE.exit

bb18.i.i:                                         ; preds = %bb18.i.i.preheader, %bb23.i.i
  %offset.1.i.i = phi i32 [ %17, %bb23.i.i ], [ 0, %bb18.i.i.preheader ]
  %14 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %offset.1.i.i
  %simd.0.ptr.sroa_cast.i.i.i = bitcast i8* %14 to <16 x i8>*
  %simd.0.copyload.i.i.i = load <16 x i8>, <16 x i8>* %simd.0.ptr.sroa_cast.i.i.i, align 1, !alias.scope !6596, !noalias !6601
  %15 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %simd.0.copyload.i.i.i) #7
  %16 = icmp eq i32 %15, 0
  br i1 %16, label %bb23.i.i, label %bb15.sink.split.i.i.loopexit37

bb23.i.i:                                         ; preds = %bb18.i.i
  %17 = add i32 %offset.1.i.i, 16
  %18 = icmp ugt i32 %17, %1
  br i1 %18, label %bb30.preheader.i.i.loopexit38, label %bb18.i.i

bb30.preheader.i.i.loopexit:                      ; preds = %bb10.i.i
  br label %bb30.preheader.i.i

bb30.preheader.i.i.loopexit38:                    ; preds = %bb23.i.i
  br label %bb30.preheader.i.i

bb30.preheader.i.i:                               ; preds = %bb30.preheader.i.i.loopexit38, %bb30.preheader.i.i.loopexit, %start
  %offset.2.ph.i.i = phi i32 [ 0, %start ], [ %10, %bb30.preheader.i.i.loopexit ], [ %17, %bb30.preheader.i.i.loopexit38 ]
  %19 = icmp ult i32 %offset.2.ph.i.i, %bytes.1
  br i1 %19, label %bb33.i.i.preheader, label %_ZN11encoding_rs5ascii17ascii_valid_up_to17h5c1d65b6f83b80ecE.exit

bb33.i.i.preheader:                               ; preds = %bb30.preheader.i.i
  br label %bb33.i.i

bb33.i.i:                                         ; preds = %bb33.i.i.preheader, %bb35.i.i
  %offset.247.i.i = phi i32 [ %23, %bb35.i.i ], [ %offset.2.ph.i.i, %bb33.i.i.preheader ]
  %20 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %offset.247.i.i
  %21 = load i8, i8* %20, align 1, !alias.scope !6596, !noalias !6601
  %22 = icmp slt i8 %21, 0
  br i1 %22, label %_ZN11encoding_rs5ascii17ascii_valid_up_to17h5c1d65b6f83b80ecE.exit.loopexit, label %bb35.i.i

bb35.i.i:                                         ; preds = %bb33.i.i
  %23 = add nuw i32 %offset.247.i.i, 1
  %24 = icmp ult i32 %23, %bytes.1
  br i1 %24, label %bb33.i.i, label %_ZN11encoding_rs5ascii17ascii_valid_up_to17h5c1d65b6f83b80ecE.exit.loopexit

_ZN11encoding_rs5ascii17ascii_valid_up_to17h5c1d65b6f83b80ecE.exit.loopexit: ; preds = %bb35.i.i, %bb33.i.i
  %_0.0.i.ph = phi i32 [ %bytes.1, %bb35.i.i ], [ %offset.247.i.i, %bb33.i.i ]
  br label %_ZN11encoding_rs5ascii17ascii_valid_up_to17h5c1d65b6f83b80ecE.exit

_ZN11encoding_rs5ascii17ascii_valid_up_to17h5c1d65b6f83b80ecE.exit: ; preds = %_ZN11encoding_rs5ascii17ascii_valid_up_to17h5c1d65b6f83b80ecE.exit.loopexit, %bb15.sink.split.i.i, %bb30.preheader.i.i
  %_0.0.i = phi i32 [ %bytes.1, %bb30.preheader.i.i ], [ %13, %bb15.sink.split.i.i ], [ %_0.0.i.ph, %_ZN11encoding_rs5ascii17ascii_valid_up_to17h5c1d65b6f83b80ecE.exit.loopexit ]
  ret i32 %_0.0.i
}

IR from 1.25.0:

; encoding_rs::utf_8::utf8_valid_up_to
; Function Attrs: uwtable
define internal fastcc i32 @_ZN11encoding_rs5utf_816utf8_valid_up_to17he6ec0248b61d42c5E([0 x i8]* noalias nonnull readonly %bytes.0, i32 %bytes.1) unnamed_addr #4 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  br label %"_ZN4core5slice74_$LT$impl$u20$core..ops..index..Index$LT$I$GT$$u20$for$u20$$u5b$T$u5d$$GT$5index17he71b553e81d1e1aeE.exit.i"

bb1.i.i.i.i:                                      ; preds = %bb73.i
; call core::slice::slice_index_order_fail
  tail call void @_ZN4core5slice22slice_index_order_fail17hd587cac96d801a96E(i32 %94, i32 %bytes.1)
  unreachable

"_ZN4core5slice74_$LT$impl$u20$core..ops..index..Index$LT$I$GT$$u20$for$u20$$u5b$T$u5d$$GT$5index17he71b553e81d1e1aeE.exit.i": ; preds = %bb73.i, %start
  %index.0151.i = phi i32 [ 0, %start ], [ %94, %bb73.i ]
  %0 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %index.0151.i
  %1 = sub i32 %bytes.1, %index.0151.i
  %2 = icmp ugt i32 %1, 15
  br i1 %2, label %bb3.i.i, label %bb29.i.i

bb3.i.i:                                          ; preds = %"_ZN4core5slice74_$LT$impl$u20$core..ops..index..Index$LT$I$GT$$u20$for$u20$$u5b$T$u5d$$GT$5index17he71b553e81d1e1aeE.exit.i"
  %3 = add i32 %1, -16
  %4 = ptrtoint i8* %0 to i32
  %5 = and i32 %4, 15
  %6 = icmp eq i32 %5, 0
  %7 = bitcast i8* %0 to <16 x i8>*
  br i1 %6, label %bb5.preheader.i.i, label %bb18.preheader.i.i

bb18.preheader.i.i:                               ; preds = %bb3.i.i
  %simd.0.copyload.i46.i.i = load <16 x i8>, <16 x i8>* %7, align 1, !alias.scope !27, !noalias !32
  %8 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %simd.0.copyload.i46.i.i) #10
  %9 = icmp eq i32 %8, 0
  br i1 %9, label %bb23.i.i.preheader, label %bb22.i.i

bb23.i.i.preheader:                               ; preds = %bb18.preheader.i.i
  br label %bb23.i.i

bb5.preheader.i.i:                                ; preds = %bb3.i.i
  %10 = load <16 x i8>, <16 x i8>* %7, align 16, !alias.scope !27, !noalias !37
  %11 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %10) #10
  %12 = icmp eq i32 %11, 0
  br i1 %12, label %bb10.i.i.preheader, label %bb9.i.i

bb10.i.i.preheader:                               ; preds = %bb5.preheader.i.i
  br label %bb10.i.i

bb5.i.i:                                          ; preds = %bb10.i.i
  %13 = getelementptr inbounds i8, i8* %0, i32 %20
  %14 = bitcast i8* %13 to <16 x i8>*
  %15 = load <16 x i8>, <16 x i8>* %14, align 16, !alias.scope !27, !noalias !37
  %16 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %15) #10
  %17 = icmp eq i32 %16, 0
  br i1 %17, label %bb10.i.i, label %bb9.i.i

bb9.i.i:                                          ; preds = %bb5.i.i, %bb5.preheader.i.i
  %offset.0.lcssa.i.i = phi i32 [ 0, %bb5.preheader.i.i ], [ %20, %bb5.i.i ]
  %.lcssa34.i.i = phi i32 [ %11, %bb5.preheader.i.i ], [ %16, %bb5.i.i ]
  %18 = tail call i32 @llvm.cttz.i32(i32 %.lcssa34.i.i, i1 false) #10, !range !40
  %19 = add i32 %18, %offset.0.lcssa.i.i
  br label %bb7.i.sink.split

bb10.i.i:                                         ; preds = %bb10.i.i.preheader, %bb5.i.i
  %offset.043.i.i = phi i32 [ %20, %bb5.i.i ], [ 0, %bb10.i.i.preheader ]
  %20 = add i32 %offset.043.i.i, 16
  %21 = icmp ugt i32 %20, %3
  br i1 %21, label %bb29.i.i, label %bb5.i.i

bb18.i.i:                                         ; preds = %bb23.i.i
  %22 = getelementptr inbounds i8, i8* %0, i32 %27
  %simd.0.ptr.sroa_cast.i.i.i = bitcast i8* %22 to <16 x i8>*
  %simd.0.copyload.i.i.i = load <16 x i8>, <16 x i8>* %simd.0.ptr.sroa_cast.i.i.i, align 1, !alias.scope !27, !noalias !32
  %23 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %simd.0.copyload.i.i.i) #10
  %24 = icmp eq i32 %23, 0
  br i1 %24, label %bb23.i.i, label %bb22.i.i

bb22.i.i:                                         ; preds = %bb18.i.i, %bb18.preheader.i.i
  %offset.1.lcssa.i.i = phi i32 [ 0, %bb18.preheader.i.i ], [ %27, %bb18.i.i ]
  %.lcssa38.i.i = phi i32 [ %8, %bb18.preheader.i.i ], [ %23, %bb18.i.i ]
  %25 = tail call i32 @llvm.cttz.i32(i32 %.lcssa38.i.i, i1 false) #10, !range !40
  %26 = add i32 %25, %offset.1.lcssa.i.i
  br label %bb7.i.sink.split

bb23.i.i:                                         ; preds = %bb23.i.i.preheader, %bb18.i.i
  %offset.147.i.i = phi i32 [ %27, %bb18.i.i ], [ 0, %bb23.i.i.preheader ]
  %27 = add i32 %offset.147.i.i, 16
  %28 = icmp ugt i32 %27, %3
  br i1 %28, label %bb29.i.i, label %bb18.i.i

bb29.i.i:                                         ; preds = %bb23.i.i, %bb10.i.i, %"_ZN4core5slice74_$LT$impl$u20$core..ops..index..Index$LT$I$GT$$u20$for$u20$$u5b$T$u5d$$GT$5index17he71b553e81d1e1aeE.exit.i"
  %offset.2.i.i = phi i32 [ 0, %"_ZN4core5slice74_$LT$impl$u20$core..ops..index..Index$LT$I$GT$$u20$for$u20$$u5b$T$u5d$$GT$5index17he71b553e81d1e1aeE.exit.i" ], [ %20, %bb10.i.i ], [ %27, %bb23.i.i ]
  %29 = icmp ult i32 %offset.2.i.i, %1
  br i1 %29, label %bb33.i.i.preheader, label %bb5

bb33.i.i.preheader:                               ; preds = %bb29.i.i
  br label %bb33.i.i

bb33.i.i:                                         ; preds = %bb33.i.i.preheader, %bb35.i.i
  %offset.342.i.i = phi i32 [ %33, %bb35.i.i ], [ %offset.2.i.i, %bb33.i.i.preheader ]
  %30 = getelementptr inbounds i8, i8* %0, i32 %offset.342.i.i
  %31 = load i8, i8* %30, align 1, !alias.scope !27, !noalias !41
  %32 = icmp slt i8 %31, 0
  br i1 %32, label %bb7.i, label %bb35.i.i

bb35.i.i:                                         ; preds = %bb33.i.i
  %33 = add nuw i32 %offset.342.i.i, 1
  %34 = icmp ult i32 %33, %1
  br i1 %34, label %bb33.i.i, label %bb5

bb7.i.sink.split:                                 ; preds = %bb9.i.i, %bb22.i.i
  %.sink66 = phi i32 [ %26, %bb22.i.i ], [ %19, %bb9.i.i ]
  %35 = getelementptr inbounds i8, i8* %0, i32 %.sink66
  %36 = load i8, i8* %35, align 1, !alias.scope !27, !noalias !41
  br label %bb7.i

bb7.i:                                            ; preds = %bb33.i.i, %bb7.i.sink.split
  %_12.sroa.1294.1.ph.i = phi i32 [ %.sink66, %bb7.i.sink.split ], [ %offset.342.i.i, %bb33.i.i ]
  %_12.sroa.8.1.ph.i = phi i8 [ %36, %bb7.i.sink.split ], [ %31, %bb33.i.i ]
  %37 = add i32 %_12.sroa.1294.1.ph.i, %index.0151.i
  br label %bb9.i

bb9.i:                                            ; preds = %bb72.i, %bb7.i
  %first.0.i = phi i8 [ %_12.sroa.8.1.ph.i, %bb7.i ], [ %92, %bb72.i ]
  %index.1.i = phi i32 [ %37, %bb7.i ], [ %47, %bb72.i ]
  %38 = zext i8 %first.0.i to i32
  %39 = getelementptr inbounds [256 x i8], [256 x i8]* @_ZN11encoding_rs10utf_8_core15UTF8_CHAR_WIDTH17hf8319a77b9cec8e4E, i32 0, i32 %38
  %40 = load i8, i8* %39, align 1
  switch i8 %40, label %bb5 [
    i8 2, label %bb11.i
    i8 3, label %bb12.i
    i8 4, label %bb13.i
  ]

bb11.i:                                           ; preds = %bb9.i
  %41 = add i32 %index.1.i, 1
  %42 = icmp ult i32 %41, %bytes.1
  br i1 %42, label %bb20.i, label %bb5

bb12.i:                                           ; preds = %bb9.i
  %43 = add i32 %index.1.i, 1
  %44 = icmp ult i32 %43, %bytes.1
  br i1 %44, label %bb25.i, label %bb5

bb13.i:                                           ; preds = %bb9.i
  %45 = add i32 %index.1.i, 1
  %46 = icmp ult i32 %45, %bytes.1
  br i1 %46, label %bb48.i, label %bb5

bb15.i:                                           ; preds = %bb67.i, %bb43.i, %bb20.i
  %index.2.i = phi i32 [ %84, %bb67.i ], [ %59, %bb43.i ], [ %41, %bb20.i ]
  %47 = add i32 %index.2.i, 1
  %48 = icmp eq i32 %47, %bytes.1
  br i1 %48, label %bb5, label %bb71.i

bb20.i:                                           ; preds = %bb11.i
  %49 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %41
  %50 = load i8, i8* %49, align 1, !alias.scope !42, !noalias !43
  %51 = and i8 %50, -64
  %52 = icmp eq i8 %51, -128
  br i1 %52, label %bb15.i, label %bb5

bb25.i:                                           ; preds = %bb12.i
  %53 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %43
  %54 = load i8, i8* %53, align 1, !alias.scope !42, !noalias !43
  %cond9.i = icmp eq i8 %first.0.i, -32
  %55 = icmp ult i8 %54, -64
  %56 = and i8 %54, -32
  %57 = icmp eq i8 %56, -96
  %58 = and i1 %cond9.i, %57
  br i1 %58, label %bb26.i, label %bb30.i

bb26.i:                                           ; preds = %bb37.i, %bb34.i, %bb30.i, %bb25.i
  %59 = add i32 %index.1.i, 2
  %60 = icmp ult i32 %59, %bytes.1
  br i1 %60, label %bb43.i, label %bb5

bb30.i:                                           ; preds = %bb25.i
  %first.0.off101.i = add i8 %first.0.i, 31
  %61 = icmp ult i8 %first.0.off101.i, 12
  %62 = icmp slt i8 %54, 0
  %or.cond76.i = and i1 %61, %62
  %or.cond77.i = and i1 %55, %or.cond76.i
  br i1 %or.cond77.i, label %bb26.i, label %bb34.i

bb34.i:                                           ; preds = %bb30.i
  %cond10.i = icmp eq i8 %first.0.i, -19
  %or.cond78.i = and i1 %cond10.i, %62
  %63 = icmp ult i8 %54, -96
  %or.cond79.i = and i1 %63, %or.cond78.i
  br i1 %or.cond79.i, label %bb26.i, label %bb37.i

bb37.i:                                           ; preds = %bb34.i
  %64 = and i8 %first.0.i, -2
  %65 = icmp eq i8 %64, -18
  %or.cond81.i = and i1 %65, %62
  %or.cond82.i = and i1 %55, %or.cond81.i
  br i1 %or.cond82.i, label %bb26.i, label %bb5

bb43.i:                                           ; preds = %bb26.i
  %66 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %59
  %67 = load i8, i8* %66, align 1, !alias.scope !42, !noalias !43
  %68 = and i8 %67, -64
  %69 = icmp eq i8 %68, -128
  br i1 %69, label %bb15.i, label %bb5

bb48.i:                                           ; preds = %bb13.i
  %70 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %45
  %71 = load i8, i8* %70, align 1, !alias.scope !42, !noalias !43
  %cond.i = icmp eq i8 %first.0.i, -16
  %.off.i = add i8 %71, 112
  %72 = icmp ult i8 %.off.i, 48
  %73 = and i1 %cond.i, %72
  br i1 %73, label %bb49.i, label %bb53.i

bb49.i:                                           ; preds = %bb57.i, %bb53.i, %bb48.i
  %74 = add i32 %index.1.i, 2
  %75 = icmp ult i32 %74, %bytes.1
  br i1 %75, label %bb62.i, label %bb5

bb53.i:                                           ; preds = %bb48.i
  %76 = icmp ult i8 %71, -64
  %first.0.off.i = add i8 %first.0.i, 15
  %77 = icmp ult i8 %first.0.off.i, 3
  %78 = icmp slt i8 %71, 0
  %or.cond86.i = and i1 %77, %78
  %or.cond87.i = and i1 %76, %or.cond86.i
  br i1 %or.cond87.i, label %bb49.i, label %bb57.i

bb57.i:                                           ; preds = %bb53.i
  %cond8.i = icmp eq i8 %first.0.i, -12
  %or.cond88.i = and i1 %cond8.i, %78
  %79 = icmp ult i8 %71, -112
  %or.cond89.i = and i1 %79, %or.cond88.i
  br i1 %or.cond89.i, label %bb49.i, label %bb5

bb62.i:                                           ; preds = %bb49.i
  %80 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %74
  %81 = load i8, i8* %80, align 1, !alias.scope !42, !noalias !43
  %82 = and i8 %81, -64
  %83 = icmp eq i8 %82, -128
  br i1 %83, label %bb64.i, label %bb5

bb64.i:                                           ; preds = %bb62.i
  %84 = add i32 %index.1.i, 3
  %85 = icmp ult i32 %84, %bytes.1
  br i1 %85, label %bb67.i, label %bb5

bb67.i:                                           ; preds = %bb64.i
  %86 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %84
  %87 = load i8, i8* %86, align 1, !alias.scope !42, !noalias !43
  %88 = and i8 %87, -64
  %89 = icmp eq i8 %88, -128
  br i1 %89, label %bb15.i, label %bb5

bb71.i:                                           ; preds = %bb15.i
  %90 = icmp ult i32 %47, %bytes.1
  br i1 %90, label %bb72.i, label %panic7.i, !prof !44

bb72.i:                                           ; preds = %bb71.i
  %91 = getelementptr inbounds [0 x i8], [0 x i8]* %bytes.0, i32 0, i32 %47
  %92 = load i8, i8* %91, align 1, !alias.scope !42, !noalias !43
  %93 = icmp sgt i8 %92, -1
  br i1 %93, label %bb73.i, label %bb9.i

bb73.i:                                           ; preds = %bb72.i
  %94 = add i32 %index.2.i, 2
  %95 = icmp ugt i32 %94, %bytes.1
  br i1 %95, label %bb1.i.i.i.i, label %"_ZN4core5slice74_$LT$impl$u20$core..ops..index..Index$LT$I$GT$$u20$for$u20$$u5b$T$u5d$$GT$5index17he71b553e81d1e1aeE.exit.i"

panic7.i:                                         ; preds = %bb71.i
; call core::panicking::panic_bounds_check
  tail call void @_ZN4core9panicking18panic_bounds_check17h24c830ebecd41baaE({ [0 x i32], { [0 x i8]*, i32 }, [0 x i32], i32, [0 x i32], i32, [0 x i32] }* noalias readonly dereferenceable(16) bitcast ({ { [0 x i8]*, i32 }, i32, i32 }* @panic_bounds_check_loc.C to { [0 x i32], { [0 x i8]*, i32 }, [0 x i32], i32, [0 x i32], i32, [0 x i32] }*), i32 %47, i32 %bytes.1)
  unreachable

bb5:                                              ; preds = %bb29.i.i, %bb35.i.i, %bb67.i, %bb64.i, %bb62.i, %bb49.i, %bb57.i, %bb13.i, %bb43.i, %bb26.i, %bb37.i, %bb12.i, %bb20.i, %bb11.i, %bb9.i, %bb15.i
  %_0.0 = phi i32 [ %37, %bb67.i ], [ %37, %bb64.i ], [ %37, %bb62.i ], [ %37, %bb49.i ], [ %37, %bb57.i ], [ %37, %bb13.i ], [ %37, %bb43.i ], [ %37, %bb26.i ], [ %37, %bb37.i ], [ %37, %bb12.i ], [ %37, %bb20.i ], [ %37, %bb11.i ], [ %37, %bb9.i ], [ %bytes.1, %bb15.i ], [ %bytes.1, %bb35.i.i ], [ %bytes.1, %bb29.i.i ]
  ret i32 %_0.0
}

Code generated by the current nightly looks like the code generated by 1.25.0.

opt-level=3 does not fix this.

cc @rust-lang/wg-codegen

What instruction scheduling model does rustc use by default?

The change looks like it's a result of instruction scheduling trickery, and the LLVM 6.0 release notes say: "Added instruction scheduling information for Intel Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and Skylake CPUs."

Notably, when running cargo bench on Haswell outside the Gecko context (but with opt-level=2), I was unable to reproduce the performance degradation on the x86_64 Linux target.

How can I choose an instruction scheduling model without also enabling the instruction set extensions of the named microarchitecture?

My guess (can't check atm, not at a computer) is that instruction scheduling is chosen by LLVM depending on the target CPU specified in the target files, and for x86_64 that is the most conservative option of... x86_64 (no particular CPU, and definitely not Haswell+). My guess would be that within the Gecko context, flags that may influence this selection (-Ctarget-cpu, -Ctarget-feature, etc.) are being set to some value that corresponds to a more recent CPU.

As far as I can tell, Gecko doesn't set target-cpu or target-feature using the -C flag.

LLVM seems to have microarchitecture-specific scheduling models from Sandy Bridge upwards, and between LLVM 4.0 and 6.0 Intel engineers landed major rewrites of the Sandy Bridge and Haswell scheduling models, citing information obtained from the architects of those microarchitectures.

It's unclear to me where the default x86 and x86_64 scheduling models reside in the LLVM source.

Note that I observed something similar on x86-64 (and Firefox has seen a regression on x86-64 too).

The code generated for x86_64 for the unrolled innermost loop looks closer to expected, though. Aligned case:

.LBB16_14:
    cmpq    %r15, %rdx
    ja  .LBB16_21
    movdqa  (%r11,%rdx), %xmm0
    pmovmskb    %xmm0, %ebx
    addq    $16, %rdx
    testl   %ebx, %ebx
    je  .LBB16_14
    addq    $-16, %rdx
    testl   %ebx, %ebx
    jne .LBB16_12
    jmp .LBB16_17

It seems that for the i686 targets, the base CPU is pentium4. Changing it to pentium-m does not help. However, changing it to x86-64, which is the base CPU for the x86_64 targets, results in instruction scheduling similar to actually targeting 64-bit code. (Changing it to haswell without also changing target-feature results in different instruction selection.)

  1. There are two instances of movdqa and two instances of movdqu suggesting that the first trip through the loop has been unrolled to be a separate copy from the loop proper.
  2. In the actual loop, ALU instructions have been moved around including placing one between the SSE2 instructions.

I checked old nightlies. The change occurred exactly when rustc switched from LLVM 4.0 to 6.0.

Effect 1., the unrolling, happens on the IR level regardless of target-cpu. Effect 2. seems to be tied to CPU model-specific instruction scheduling for Pentium 4 and Pentium M at least, so effect 2. can be avoided by picking a different target-cpu.

Does rustc expose controls for omitting certain LLVM IR transformation passes? How should I go about locating the cause of effect 1.?

Here are the LLVM passes (64-bit target) from the last LLVM 4.0 build of rustc and the first LLVM 6.0 build of rustc.

I'm not familiar with LLVM internals, so the only thing that stands out to me is that the Natural Loop Information pass seems to have moved relative to other passes in some cases, and in some cases there are additional Machine Natural Loop Construction passes.

Remedying effect 2 isn't much of a remedy according to Gecko benchmarks.

It seems implausible that the unrolled single trip through the loop could be slow enough to explain the original slow-down, so I assume the slow-down has to be attributable to the unrolling causing code movement within the inner loop, specifically moving the index-based loop exit check above the buffer content check in the basic block forming the inner loop.
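
To make the hypothesis concrete, here is a scalar analogy (hand-written illustration, not actual compiler output or encoding_rs source) of the two inner-loop shapes. Both functions compute the same result, but the second pays an extra compare-and-branch before the useful work on every iteration:

```rust
// Expected shape: content check first, index-based exit test at the bottom (the latch).
fn find_non_ascii_fast(bytes: &[u8]) -> usize {
    let mut i = 0;
    if !bytes.is_empty() {
        loop {
            if bytes[i] >= 0x80 {
                return i; // buffer content check does the useful work
            }
            i += 1;
            if i >= bytes.len() {
                break; // exit test stays in the latch
            }
        }
    }
    bytes.len()
}

// Regressed shape: the exit test has moved above the content check, so every
// iteration branches on the index before looking at the data.
fn find_non_ascii_slow(bytes: &[u8]) -> usize {
    let mut i = 0;
    loop {
        if i >= bytes.len() {
            return bytes.len(); // exit test first
        }
        if bytes[i] >= 0x80 {
            return i;
        }
        i += 1;
    }
}
```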

The LLVM 4.0 to 6.0 switch happened between nightly-2018-02-10 and nightly-2018-02-11.

@hsivonen do you have the data necessary to reproduce the runtime regression here? I managed to extract ascii_valid_up_to into a single file so we can hopefully send it to upstream LLVM when we've reduced it enough, but with some cursory testing I'm unable to reproduce the time regression seen here.

Thank you. I minimized the test case even further, tweaked the benchmark to match Firefox memory alignment and iterations, added scripts to run the code easily and managed to reproduce the regression in x86_64 mode on two different microarchitectures.

I filed an LLVM bug.

@hsivonen You might want to attach some LLVM IR samples to the LLVM bug report - or even a C(++) source, if it's easy enough to do - to make it easier for LLVM developers to test potential fixes.

(In my experience, LLVM developers don't tend to try LLVM changes with custom Rust builds, but they can easily have clang within their LLVM build tree to quickly try various C(++) testcases.)

I've commented with a pure C example to help the LLVM devs dig in.

This is caused by the loop rotation pass. Using the IR https://gist.github.com/nikic/d55cb75635a52a29790248b6bcd24be8 running opt -loop-rotate shows the issue. It's somewhat fragile though, e.g. -simplifycfg -loop-rotate won't work. On LLVM trunk -O2 also no longer seems to exhibit the problem.

I haven't tested, but I suspect that backporting https://github.com/llvm-mirror/llvm/commit/4c64dfea37812732f39e68ca444b2eb809d78dcc would fix this regression.

The problem is as follows:

  • Loop rotation does not happen if the (unique) loop latch is loop exiting (with some exceptions).
  • At the point where loop rotation occurs, the loop latch is an unconditional branch from an empty block. Without indirection through this dummy block, the loop latch would be loop exiting.
  • Prior to the commit referenced above, CFG simplification with canonical loop preservation was not able to remove this extra block.
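
For readers unfamiliar with the pass: loop rotation turns a top-tested loop into a guard plus a bottom-tested (do-while-style) loop, so the backend can keep the hot exit test in the latch. A rough Rust-level analogy of the two forms (hand-written here purely for illustration):

```rust
// Non-rotated form: the exit test runs at the top, before each body execution.
fn sum_while(xs: &[i32]) -> i32 {
    let mut i = 0;
    let mut acc = 0;
    while i < xs.len() {
        acc += xs[i];
        i += 1;
    }
    acc
}

// Rotated form: one guard up front, then the exit test moves to the bottom
// of the loop (the latch), which is what the codegen wants for a tight loop.
fn sum_rotated(xs: &[i32]) -> i32 {
    let mut acc = 0;
    if !xs.is_empty() {
        let mut i = 0;
        loop {
            acc += xs[i];
            i += 1;
            if i >= xs.len() {
                break;
            }
        }
    }
    acc
}
```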

@nikic nice =)

Maybe someone can take a stab at evaluating that?

I can try that later tonight, but I feel like I'm going to come back and it will have been done already hah.

Verified that cherry-picking the commit fixes the regression.

Thank you! I verified that the performance of the test case improved with the latest nightly.

Is this kind of LLVM cherry-pick eligible to be nominated for backporting into Rust beta? How does one nominate beta backports?

I've added the beta-nominated tag to https://github.com/rust-lang/rust/pull/50827, which is how we process backports (the compiler team is also tagged so this comes up during triage).

The backport seems to be causing issues on beta that don't occur on nightly; the origins of this are not known, perhaps some other fixes that landed in the meantime? (The failing test is myriad-closures, and it fails only on 32-bit Windows, afaik; it may be a memory usage issue?)

@hsivonen how much do you care that this is backported? We're trying to decide how many resources to devote to tracking this down vs fixing other things.

Firefox is not updating its compiler from 1.24 (https://bugzilla.mozilla.org/show_bug.cgi?id=1447116) in part because of this. (@glandium can say more about the other blocker.)

@nikomatsakis myriad-closures is indeed a stress test, and memory usage changes make sense.

Firefox is going to stay on 1.24 until it can jump to 1.28. So, for Firefox, it doesn't matter at all that this makes it to beta.

@glandium where's the discussion/rationale on that? I searched Bugzilla but couldn't find it.

@djc beta is 1.27, and 1.27 doesn't allow getting the allocation size on OOM, which is important to us. 1.28 does.

@nikomatsakis, given what @glandium said about jumping to 1.28 anyway, I no longer care about a backport.
