Rust: Miscompilation while working on PR #50882

Created on 25 Jul 2018  路  17Comments  路  Source: rust-lang/rust

(Moving this to a separate issue because PR #50882 is already full of noise)
Reduced STR:

  • Take current rust master
  • Apply the following patch:
diff --git a/src/liballoc/alloc.rs b/src/liballoc/alloc.rs
index 84bd275..4df7730 100644
--- a/src/liballoc/alloc.rs
+++ b/src/liballoc/alloc.rs
@@ -172,7 +172,7 @@ pub(crate) unsafe fn box_free<T: ?Sized>(ptr: Unique<T>) {
     // We do not allocate for Box<T> when T is ZST, so deallocation is also not necessary.
     if size != 0 {
         let layout = Layout::from_size_align_unchecked(size, align);
-        dealloc(ptr as *mut u8, layout);
+        Global.dealloc(NonNull::new_unchecked(ptr).cast(), layout);
     }
 }

  • Compile rust against the system llvm (5 or 6)
  • Run libstd tests (x.py test src/libstd --stage 1)

What happens then is that sync::once::tests::wait_for_force_to_finish fails with:

thread '<unnamed>' panicked at 'assertion failed: t1.join().is_ok()', libstd/sync/once.rs:582:9

The disassembly for wait_for_force_to_finish only contains one call to std::thread::JoinHandle::join, instead of the expected two when the code is not miscompiled. That call is followed by a test that jumps to a panic when the result of JoinHandle::join is ... Ok(()):

   93371:       e8 aa fd 07 00          callq  113120 <_ZN41_$LT$std..thread..JoinHandle$LT$T$GT$$GT
$4join17h0fb3b129ec2e38aeE>
   93376:       48 89 c3                mov    %rax,%rbx
   93379:       49 89 d7                mov    %rdx,%r15
   9337c:       48 85 db                test   %rbx,%rbx
   9337f:       74 1b                   je     9339c <_ZN3std4sync4once5tests24wait_for_force_to_fin
ish17h6199051fcaa3ff6aE+0x21c>

That JoinHandle::join returns a Result<(), Box<Any>>, and Ok(()) is represented as (0, whatever). The destination of that jump is the panic code.

At some point, I'm not entirely sure with what state of the tree, it was even better. The error was:

thread '<unnamed>' panicked at 'assertion failed: t2.join().is_ok()', libstd/sync/once.rs:583:9

And there were two calls to std::thread::JoinHandle::join, as expected, but they didn't have the same result handling:

   93c79:       e8 32 c5 07 00          callq  1101b0 <_ZN41_$LT$std..thread..JoinHandle$LT$T$GT$$GT$4join17hc6e7f9bb7d72483aE>
   93c7e:       48 89 c3                mov    %rax,%rbx
   93c81:       49 89 d7                mov    %rdx,%r15
   93c84:       48 85 db                test   %rbx,%rbx
   93c87:       75 5e                   jne    93ce7 <_ZN3std4sync4once5tests24wait_for_force_to_finish17h13c0ef8dd5eb6a3aE+0x267>
(snip)
   93c9f:       e8 0c c5 07 00          callq  1101b0 <_ZN41_$LT$std..thread..JoinHandle$LT$T$GT$$GT$4join17hc6e7f9bb7d72483aE>
   93ca4:       48 89 c3                mov    %rax,%rbx
   93ca7:       49 89 d7                mov    %rdx,%r15
   93caa:       48 85 db                test   %rbx,%rbx
   93cad:       74 1b                   je     93cca <_ZN3std4sync4once5tests24wait_for_force_to_finish17h13c0ef8dd5eb6a3aE+0x24a>

The destination of both jumps is panic code. The first jump, corresponding to t1.join().is_ok() is correct, and the second, corresponding to t2.join().is_ok() is broken, thus the test failure.

Even better: this doesn't happen when compiling with the bundled llvm. It also doesn't happen when extracting the test from libstd and compiling with a faulty compiler. It seems the fact that it's part of libstd, and that most of libstd is compiled along the test, plays a role.

I'm trying to get the corresponding mir and llvm-ir.

A-LLVM

Most helpful comment

It appears the miscompilation doesn't happen after #55426.

All 17 comments

There are, interestingly, tons of differences between the mir between a build that works, and a build that doesn't. Both with rust master + the patch quoted above, the only difference being one was built with system llvm 5, and the other with the bundled llvm. I'm puzzled.

Some more data:

  • I checked out a one-month-old commit, around the time I did the last push to PR #50882 that didn't hit the error, applied the same patch, and tried again with system llvm 5. It was similarly broken.
  • But back then, travis builds were using system llvm 3.9, so I tried that too, and that was not broken.

So this could very well be a case of llvm versions being broken (or compiler integration of said versions).

Even more data:
- With static system llvm trunk built manually: not broken
- With static system llvm 6.0 built manually: not broken
- With shared system llvm 6.0 built manually: not broken
- With shared system llvm 5.0 built manually: not broken

All the above built with clang 5.0. So... tried again with GCC 5.4:
- With shared system llvm 5.0 built manually: not broken

So... this is only happening with Ubuntu builds of llvm (and presumably Debian)?
@sylvestre do you know what's special about them?

Ok, It's too late and I'm tired, all those last attempts were without the patch applied, so they don't count. I'll try again tomorrow.

Not sure. Some ideas:

This reminds me that it would nice to be able to duplicate/"virtualize" lang items for testing purposes, so the old and new standard library (fragments) are completely separated. I should open an issue door that.

So, restarting afresh:

  • With shared system llvm 5.0 built manually with clang 5: broken
  • With shared system llvm 6.0 built manually with clang 5: broken
  • With shared system llvm trunk built manually with clang 5: not broken

So at least this is consistent with the initial results with ubuntu system llvm and bundled llvm. Time for a bisect.

Bisection identified https://github.com/llvm-mirror/llvm/commit/a57b9dfaee5a3485318bf0d1785e0a2e81310f01 as having fixed this. Which leads me to more questions:

  • what does that have anything to do?
  • are there more things that might be miscompiled than this one libstd testcase?
  • what's the path forward here? Apparently, system llvm 5 and 6 can miscompile for whatever reason, and that's what some downstreams (Linux distros, BSDs, ...) would tend to use...

The LLVM patch needs to be backported officially to new 5.* and 6.* patch versions if it fixes miscopilations IMO.

I don't think a new dot release of 5 will happen, 6 is also unlikely.
However, I don't mind taking the patch to debian & ubuntu packages.

Having an upstream bug would help also.

Some more notes:

  • the upstream patch applies cleanly on llvm 6, and fixes the miscompilation. On llvm 5, it has only a few straightforward conflicts, but fails to compile. It looks like it depends on at least https://github.com/llvm-mirror/llvm/commit/f39b8ab08cec19712ac59572d7b4eb8cd758fd87, which has tons of conflicts.
  • beta is using llvm 6, but for some reason, it's not affected. Well, it may actually be affected, but it's not visible on this testcase. Who knows.
  • travis builds for pull requests are using system llvm 5. Those are essentially never going to be fixed if they stay on that version. What do we do about that?

Needless to say, #50882 is blocked until the latter is addressed in some way.

It's not really a matter of what travis provides, it's a matter of what minimal version of system llvm rustc supports, and as of #51899, it's llvm 5.

Interestingly, if the minimal supported version was still 3.9, this would have gone undetected.

Backported on the llvm packages https://salsa.debian.org/pkg-llvm-team/llvm-toolchain/commit/1b9a00759063d68cb8b646df91807043643c33a7
Will update this patch when it is uploaded

It appears the miscompilation doesn't happen after #55426.

Was this page helpful?
0 / 5 - 0 ratings