Rust: Regression in rlua crate tests between rustc 1.23.0 and 1.24.0, only on windows.

Created on 16 Feb 2018  ·  64Comments  ·  Source: rust-lang/rust

[The following explanation of the problem contains a lot of assumptions about the problem's nature, I could be extremely wrong! Take this explanation with a grain of salt]

Wild guess, it has something to do with https://github.com/rust-lang/rust/pull/46833.

So, I'm not sure this is a rust bug exactly, but I wanted to make you aware of a crate regression that's causing me quite a lot of problems. Ultimately, I believe this comes from me doing some things that are not.. exactly definitely kosher, but that I don't have a convenient workaround for.

Ultimately, the problem stems from the Lua C API having error handling fundamentally based around setjmp / longjmp. In order to write a usable wrapper in Rust, without having to write inconvenient, slow C shims, Rust must, at times, call into Lua C API functions that in turn trigger a longjmp, and that longjmp must necessarily pass over Rust frames.

I realize how unsafe this is! However, there are some limitations here that I think make it at least somewhat reasonable to be able to do:

1) There are only Copy types on the rust stack frame being jumped over.
2) Rust is never calling setjmp, only triggering a longjmp through a Lua C API call (such as lua_error).
3) Making this unsafe to do means that it is impossible to wrap a C API such as Lua's without resorting to writing more C.

This last point is unfortunate, and causes me a lot of headaches, and at least one strange practical problem. We at Chucklefish build rlua for consoles, but for consoles the gcc crate does not work, and we have to build Lua out of tree (also to include Lua console specific patches). Needing to include C shims in rlua means that we would also have to copy paste those shims into the console projects, and this is a less than ideal situation.

The reason I think this issue has to do with https://github.com/rust-lang/rust/pull/46833 is that I can actually fix the regression if I simply add an #[unwind] attribute to each rust function that calls lua_error. However, this is only necessary for some reason on windows, and the attribute is currently unstable!

I think it's completely reasonable to need to mark functions as #[unwind] if they will trigger a longjmp, but I'm not exactly sure what to do in the meantime until unwind_attributes becomes stable. Maybe this behavior on windows IS simply a bug? Maybe it's a bug on linux / macos that this doesn't abort? I would especially not like to have to write C shims if ultimately I can get rid of them when unwind_attributes becomes stable.

(Side Note: Chucklefish actually uses the stable rust compiler in production now, so this is especially annoying. We were very excited about the 1.24 release, because it includes incremental compilation and rustfmt-preview while also being stable, and we noticed this bug in trying to switch to the stable compiler on windows.)

Edit: If people have to dig into the rlua source, I'm sorry, it is a tangled nest of "unsafe, unsafe everywhere". The best example to look at is the test_error test, and the method Lua::create_callback_function. Adding the #[unwind] attribute to: callback_call_impl, safe_pcall, and safe_xpcall fixes that test on windows, on unstable.

Edit 2: Edited for clarity, but also I need to point out that I need to do something, because this bug currently makes rlua just basically broken on windows. It's not "just" a test failure, error handling in general will just abort.

C-bug O-windows P-high T-compiler regression-from-stable-to-stable

Most helpful comment

Sorry for the breakage y'all are seeing here @kyren! We've long wanted to improve our ecosystem-wide regression testing to include Windows as well to catch issues like this sooner, but we still have yet to implement that :(

Your explanation of what's happening was quite helpful to me, and it also reminded me of what I believe is an existing bug in Rust! I'm unfortunately traveling right now and not at a Windows computer, but I can try to verify this information when I get back home in a week. In any case though I believe this is actually a fixable bug in Rust without actually adding any new attributes or stabilizing anything. I agree with @nikomatsakis that we'll probably want to have some form of dealing with exceptions at the FFI boundary eventually but the longer we can put that off the better! (in my opinion at least)

The tl;dr; is that I believe we need to implement exception filters in our landing pads. Rust landing pads should, in theory, only run for Rust exceptions. Not for all SEH exceptions.

I'm pretty certain this bug has come up before in the past but I was unfortunately unable to dig up the issue report. The crux of this issue is that code like this:

struct A;

impl Drop for A {
    fn drop(&mut self) {
        println!("hello!");
    }
}

fn main() {
    let _a = A;
    cause_a_segfault();
}

fn cause_a_segfault() {
    unsafe { *(0x1 as *mut u32) = 3; }
}

I think will actually print out "hello!". The bug may be more nuanced than this but I'm pretty certain something like this is a problem, I can try to create an actual reproduction of this sort of issue when I get back home to a Windows computer to try out some various pieces.

Basically what this means is that our "landing pads" are executing for any and all SEH exceptions. This seems like it includes setjmp/longjmp, which is a bug! (a very longstanding bug!) This is part of the whole "we don't want you unwinding into Rust" in that making this work is tricky and whatnot, but it turns out that this is our own problem where there's a whole bunch of ways to "unwind into Rust" on MSVC using SEH that we must deal with.

For those familiar with LLVM I believe the problem looks like this:

fn foo2() {}

#[no_mangle]
pub extern fn bar() {
    foo2();
}

On MSVC this code currently generates IR that looks like:

define void @bar() unnamed_addr #1 personality i32 (...)* @__CxxFrameHandler3 {
start:
; invoke foo::foo2
  invoke void @_ZN3foo4foo217h18e6b2931a4bedc6E()
          to label %bb2 unwind label %funclet_bb1

bb1:                                              ; preds = %funclet_bb1
  call void @llvm.trap()
  unreachable

bb2:                                              ; preds = %start
  ret void

funclet_bb1:                                      ; preds = %start
  %cleanuppad = cleanuppad within none []
  br label %bb1
}

The problem (as I understand it), is that if foo2 happens to call a lua function which unwinds, Rust causes the process to explode via llvm.trap. The fix, I believe, is to fix this line:

  %cleanuppad = cleanuppad within none []

(or something about that line I believe)

I remember that when implementing unwinding on MSVC I never did figure out how to place a filter here on "only run this for Rust exceptions". Other platforms like MinGW which use a custom personality function have filtering enabled which (IIRC) only runs cleanups for Rust-originating exceptions (what we want). On MSVC I never got a custom personality working (LLVM only recognizes a few I think?) so sticking to one of the standard personality functions meant we got something working quickly at least!

Now all is not lost I think! I believe what we may want to do is alter the default codegen for landing pads. We already have implemented a method of "only catch Rust exceptions" as that's why catch_unwind doesn't catch anything but rather just Rust exceptions (or at least I hope I'm remembering correctly that's it's behavior). I'd naively expect that we could mirror that sort of logic for all our landing pads to only run them on Rust exceptions, not on exceptions like longjmp.


Phew! That's quite a lot of information. To summarize, though:

  • To get the benefits of 1.24 I think it's ok @kyren for y'all to use RUSTC_BOOTSTRAP=1 (but don't tell others)
  • I believe this is an existing bug in the Rust compiler which we "should have always fixed" and now have a very good reason to fix. I also believe that the bug shouldn't be too hard to fix as we've got the scaffolding in rustc already, we just need to figure out how to apply it to the other locations.
  • I believe we must ship a fix for 1.25. If the above fix doesn't land in time we should revert #46833, and otherwise I think we can backport the fix above if it works.

All 64 comments

One possibility as to why that it only happens on windows is that the implementation of longjmp on that platform uses the same exception framework that Rust uses for panics.

One possibility as to why that it only happens on windows is that the implementation of longjmp on that platform uses the same exception framework that Rust uses for panics.

That is interesting! That WOULD explain the behavior I'm seeing.

I'm not exactly sure what to do in the meantime until unwind_attributes becomes stable.

It's possible to temporarily unlock unstable features on stable channel with env var (RUSTC_BOOTSTRAP=1), seems acceptable if you have a critical situation like this.

It's possible to temporarily unlock unstable features on stable channel with env var (RUSTC_BOOTSTRAP=1), seems acceptable if you have a critical situation like this.

That's incredibly helpful, thank you!

Here is a summary of the current situation as I understand it: https://github.com/chucklefish/rlua/issues/71#issuecomment-366462764

I may have gotten some details wrong, if I have feel free to yell at me point that out.

I guess when they say it's a "C API" they are not kidding. :p

Would it make sense to make an standalone wrapper, called "Universal API for Lua, with C calling conventions"?

I guess when they say it's a "C API" they are not kidding. :p

Lua's API is not my favorite :(

Hmm. I think there's a good case to be made for disabling https://github.com/rust-lang/rust/pull/46833 -- or at least making it opt-in -- until there are stable attributes to opt-out. @wycats points out that this likely to affect Helix as well. I would definitely prefer not to have public packages relying on RUSTC_BOOTSTRAP=1 -- it's a recipe for undermining the whole Rust stability system.

Re-reading the comments on that PR, I see that I wrote this:

Note though that this is a change in behavior -- albeit only quasi-defined behavior -- and it feels like it ought to go through the RFC process. Still, it'd be good to have a working implementation so that we can do a crater run and assess possible impact.

But this got overlooked.

I would definitely prefer not to have public packages relying on RUSTC_BOOTSTRAP=1 -- it's a recipe for undermining the whole Rust stability system.

I knew that was wrong when I did it, I'll forget this hack exists :P

Note though that this is a change in behavior -- albeit only quasi-defined behavior -- and it feels like it ought to go through the RFC process. Still, it'd be good to have a working implementation so that we can do a crater run and assess possible impact.

But this got overlooked.

Something I wanted to point out, is that even if a crater run was done, it wouldn't have caught my issue at least because as I understand it they're only done on linux?

If there was some kind of donation box for running crater on windows I would contribute to it!

@kyren I believe crater actually tests windows these days, but I could be wrong.

Also, cross-posting from the chucklefish/rlua repo:

I'm trying to understand this a bit better -- it seems like the problem is pretty specific to SEH. That is, ordinarily, longjmp wouldn't trigger this panic, precisely because it doesn't unwind in a structured fashion, but rather just directly pops off the frames. In other words, we can't really "catch" a longjmp normally. But I guess that in windows longjmp is "exception-handling aware". That means, if I understand correctly, that ironically SEH is probably the one environment where a longjmp would actually be safe in Rust, since we would run dtors like normal. But precisely because we can catch it, it's also the one environment where longjmp will panic. Maybe I'm missing some subtlety though. @alexcrichton may remember more about the details of SEH.

Can anyone verify if this is correct?

I think in my ideal world, we would not require any attributes -- rather, we would have a way to skip the panic in the case where Rust is using the native unwind facility (but you would opt-in to that, probably, similar to #[repr(C)]). I suppose that is what the #[unwind] attribute basically says, but it seems a bit wrong. (That is, it says "this function uses Rust's unwind abilities", but I think I'd rather that you declare how Rust should implement unwinding...? Have to think about it.)

I talked a bit more with @wycats about the Helix situation, which is somewhat different, though very much related:

Ruby also sometimes longjmps. @wycats wanted some form of attribute that could be used to label functions that longjmp so that the compiler could verify that there is nothing on the stack with a dtor. This seems like it would also help @kyren in verifying this claim that they made:

There only Copy types on the rust stack frame being jumped over.

At this point, it seems pretty clear to me that:

  • There is a need to address exceptions-at-FFI-boundaries and longjmp specifically more holistically. It's sad that it exists, but it does, and plenty of C APIs use it, so we should have a real story here.

    • It would be better if the "proof burden" around longjmp doesn't fall entirely on the user, imo.

    • It actually isn't obvious that those two things would be entangled, but they are, thanks to SEH.

  • https://github.com/rust-lang/rust/pull/46833 is a noble effort, and probably part of the solution we ultimately want, but too simplistic.
  • We probably want some kind of champion to work through a design that covers the various use cases and push it through the RFC process. I would think that we want the following:

    • help people deal with longjmp in those environments where it does not run dtors

    • allow interop with "native" unwind mechanisms where possible (e.g., in C++)

    • this will always be a portability question as well

In the short term, I think we should amend #46833 so that the "panic wrapper" uses an (unstable) opt-in -- e.g., #[unwind(never)] or something -- so that we unbreak people.

Is it possible to adjust the implementation to only catch Rust-generated unwinding? I seem to remember that we did something like this in the past to not conflict with C++ exceptions but I may be misremembering.

I believe crater actually tests windows these days, but I could be wrong.

I'm sorry to say that it doesn't :( off the top of my head it's going to require a reasonable amount of work as well.

Sorry for the breakage y'all are seeing here @kyren! We've long wanted to improve our ecosystem-wide regression testing to include Windows as well to catch issues like this sooner, but we still have yet to implement that :(

Your explanation of what's happening was quite helpful to me, and it also reminded me of what I believe is an existing bug in Rust! I'm unfortunately traveling right now and not at a Windows computer, but I can try to verify this information when I get back home in a week. In any case though I believe this is actually a fixable bug in Rust without actually adding any new attributes or stabilizing anything. I agree with @nikomatsakis that we'll probably want to have some form of dealing with exceptions at the FFI boundary eventually but the longer we can put that off the better! (in my opinion at least)

The tl;dr; is that I believe we need to implement exception filters in our landing pads. Rust landing pads should, in theory, only run for Rust exceptions. Not for all SEH exceptions.

I'm pretty certain this bug has come up before in the past but I was unfortunately unable to dig up the issue report. The crux of this issue is that code like this:

struct A;

impl Drop for A {
    fn drop(&mut self) {
        println!("hello!");
    }
}

fn main() {
    let _a = A;
    cause_a_segfault();
}

fn cause_a_segfault() {
    unsafe { *(0x1 as *mut u32) = 3; }
}

I think will actually print out "hello!". The bug may be more nuanced than this but I'm pretty certain something like this is a problem, I can try to create an actual reproduction of this sort of issue when I get back home to a Windows computer to try out some various pieces.

Basically what this means is that our "landing pads" are executing for any and all SEH exceptions. This seems like it includes setjmp/longjmp, which is a bug! (a very longstanding bug!) This is part of the whole "we don't want you unwinding into Rust" in that making this work is tricky and whatnot, but it turns out that this is our own problem where there's a whole bunch of ways to "unwind into Rust" on MSVC using SEH that we must deal with.

For those familiar with LLVM I believe the problem looks like this:

fn foo2() {}

#[no_mangle]
pub extern fn bar() {
    foo2();
}

On MSVC this code currently generates IR that looks like:

define void @bar() unnamed_addr #1 personality i32 (...)* @__CxxFrameHandler3 {
start:
; invoke foo::foo2
  invoke void @_ZN3foo4foo217h18e6b2931a4bedc6E()
          to label %bb2 unwind label %funclet_bb1

bb1:                                              ; preds = %funclet_bb1
  call void @llvm.trap()
  unreachable

bb2:                                              ; preds = %start
  ret void

funclet_bb1:                                      ; preds = %start
  %cleanuppad = cleanuppad within none []
  br label %bb1
}

The problem (as I understand it), is that if foo2 happens to call a lua function which unwinds, Rust causes the process to explode via llvm.trap. The fix, I believe, is to fix this line:

  %cleanuppad = cleanuppad within none []

(or something about that line I believe)

I remember that when implementing unwinding on MSVC I never did figure out how to place a filter here on "only run this for Rust exceptions". Other platforms like MinGW which use a custom personality function have filtering enabled which (IIRC) only runs cleanups for Rust-originating exceptions (what we want). On MSVC I never got a custom personality working (LLVM only recognizes a few I think?) so sticking to one of the standard personality functions meant we got something working quickly at least!

Now all is not lost I think! I believe what we may want to do is alter the default codegen for landing pads. We already have implemented a method of "only catch Rust exceptions" as that's why catch_unwind doesn't catch anything but rather just Rust exceptions (or at least I hope I'm remembering correctly that's it's behavior). I'd naively expect that we could mirror that sort of logic for all our landing pads to only run them on Rust exceptions, not on exceptions like longjmp.


Phew! That's quite a lot of information. To summarize, though:

  • To get the benefits of 1.24 I think it's ok @kyren for y'all to use RUSTC_BOOTSTRAP=1 (but don't tell others)
  • I believe this is an existing bug in the Rust compiler which we "should have always fixed" and now have a very good reason to fix. I also believe that the bug shouldn't be too hard to fix as we've got the scaffolding in rustc already, we just need to figure out how to apply it to the other locations.
  • I believe we must ship a fix for 1.25. If the above fix doesn't land in time we should revert #46833, and otherwise I think we can backport the fix above if it works.

Such a wonderful, detailed explanation, thank you @alexcrichton!

To get the benefits of 1.24 I think it's ok @kyren for y'all to use RUSTC_BOOTSTRAP=1 (but don't tell others)

I'm totally fine with this hacky fix until 1.25! If you're okay with it, I'd like to do a release of my crate with this hack for the time being, and then when 1.25 hits remove it and just require rustc >= 1.25. However, let's be realistic, I'd obviously prioritize the entire rust ecosystem over my one little crate, so if you think that's a problem I don't have to do that. This is not such a problem for Chucklefish, since we have sort of a chucklefish_hax branch anyway where this hack could live, so I'm fine if it's just too gross to release publicly with that hack.

I'm actually relieved really, because the Lua C API had always been in this "legal gray area" w.r.t. Rust. @nikomatsakis suggested changes regarding having the compiler help about Drop types being on the stack, and inter-operating with other unwinding mechanism obviously sound really great and helpful, but even just knowing that calling into the Lua C API should be supported at all is plenty for right now :D

For the record, I think having a public API that relies on something like longjmp or C++ exceptions is really, exceptionally gross (pun intended), but I'm glad that Rust is attempting to deal with the unfortunate reality of it. I'm completely fine with it though if the finer details of it take a while to work out.

@kyren
Not sure if it will be of any help in this situation, but you can actually longjmp on MSVC without performing unwinding by using two-argument int __intrinsic_setjmp(jmp_buf buf, void* frame_address) with NULL as a second argument instead of standard setjmp.
I've seen it used for jumping from jitted code living on a separate stack (you can't unwind between stacks) - people basically used #define setjmp(buf) __intrinsic_setjmp((buf), NULL) on top of their code.

@nikomatsakis said:

Ruby also sometimes longjmps. @wycats wanted some form of attribute that could be used to label functions that longjmp so that the compiler could verify that there is nothing on the stack with a dtor.

For what it's worth, I think this space deserves a more full-fledged RFC. As @nikomatsakis pointed out, Helix has similar but not identical concerns, and it would be great to get a conversation about longjmps and "unwinding past the FFI boundary" into one place so we can work out the requirements and solve them holistically.

@alexcrichton do you object to reverting this change while we sort it out?

My take is that we should revert now but possibly bring the feature back with @alexcrichton's filtering suggestion -- although it's not entirely obvious to me that filtering is what end users will want, either. (But it's probably a decent default.)

@kyren do you think (by any chance) it would be possible to make a standalone regression test?

This is my interim plan. I have PR to revert on stable (https://github.com/rust-lang/rust/pull/48378) and will prepare a similar PR for beta. Then I have a more expansive diff that reverts but allows one to opt back in to the wrapper by doing #[unwind(panics)]. That is not meant to be a permanent design, but a "holding pattern" while we figure out best plan.

This PR adds the explicit #[unwind(...)] attributes for master: https://github.com/rust-lang/rust/pull/48380

Currently on pc-windows-msvc Rust uses the C++ personality function which in my opinion is actually better than filtering to only Rust exceptions. Until that PR which aborted on FFI boundaries, users could have a C++ exception unwind across Rust code and Rust destructors would fire like normally, and vice versa. I really hope we can keep and eventually commit to having this interop available in a stable form.

Note that the C++ personality function does not trigger landing pads on all SEH exceptions. Things like access violations and division by zero and illegal instructions trigger a separate class of SEH exceptions which have a different personality function.

@alexcrichton LLVM implementation of SEH stuff does not allow for specifying “filter expressions” – code that would allow to filter out non-Rust “exceptions”. That being said, such filtering in landingpad scheme cannot be expressed either and is instead done within the personality function.

The only option I see to properly fix this is feeding LLVM our own personality function somehow. Cursory look at the LLVM codebase seems to indicate that we’d need to classify our own personality properly here, after adding one more here. Otherwise, a custom personality will abort LLVM.

@retep998

I really hope we can keep and eventually commit to having this interop available in a stable form.

I agree, this seems like a use case that is worth supporting.

Also, please please please, when we do figure out a better way to handle exception interop with other languages, can we finally add a way to handle system SEH exceptions like access violations and so on?

@kyren do you think (by any chance) it would be possible to make a standalone regression test?

Okay, I spent an embarrassingly long time trying to come up with a standalone reproduction and failing miserably. After much trial and error, I think I figured it out:

Version 1, does not abort:

#[link(name = "a", kind = "static")]
extern "C" {
    fn do_catch(c: unsafe extern "C" fn());
    fn do_throw() -> !;
}

fn main() {
    unsafe extern "C" fn callback() {
        do_throw()
    }

    unsafe { do_catch(callback); }
}

Version 2, aborts:

#[link(name = "a", kind = "static")]
extern "C" {
    fn do_catch(c: unsafe extern "C" fn());
    fn do_throw() -> !;
}

fn main() {
    unsafe fn inner() {
        do_throw()
    }

    unsafe extern "C" fn callback() {
        inner()
    }

    unsafe { do_catch(callback); }
}

Version 3, does not abort:

#![feature(unwind_attributes)]

#[link(name = "a", kind = "static")]
extern "C" {
    fn do_catch(c: unsafe extern "C" fn());
    fn do_throw() -> !;
}

fn main() {
    unsafe fn inner() {
        do_throw()
    }

    #[unwind]
    unsafe extern "C" fn callback() {
        inner()
    }

    unsafe { do_catch(callback); }
}

Version 4, does not abort:

#[link(name = "a", kind = "static")]
extern "C" {
    fn do_catch(c: unsafe extern "C" fn());
    fn do_throw() -> !;
}

fn main() {
    unsafe extern "C" fn inner() {
        do_throw()
    }

    unsafe extern "C" fn callback() {
        inner()
    }

    unsafe { do_catch(callback); }
}

I don't have any good theories as to why this is. If I substitute the longjmp-ing do_throw with a call to panic, suddenly Version 1, 2, and 4 will all abort.

You can see the WIP regression test here.

Edit: The fact that extern "C" functions do not seem to exhibit the bad longjmp catching behavior and the fact that this behavior differs for regular panics is really making me doubt my understanding of the issue.

Note: The C++ standard says (n3690 18.10/4): The function signature longjmp(jmp_buf jbuf, int val) has more restricted behavior in this International Standard. A setjmp/longjmp call pair has undefined behavior if replacing the setjmp and longjmp by catch and throw would invoke any non-trivial destructors for any automatic objects.

MSVC may or may not choose to implement setjmp/longjmp as having the same behavior of catch/throw. This means that using the C++ personality function for Rust introduces undefined behavior into Rust in a way that is beyong Rust's control. It should be possible to implement SEH for Rust using entirely Rust specific exceptions and handlers, though, that's more work.

I started an FCP request to merge the various "revert PRs" that I have pending.

triage: P-high

Folks,
I'm quite sad to see this improvement being backed out, without people even trying to keep me in the loop and asking for my opinion about it.

FWIW, the proper fix is either to make #[unwind] stable, or to make a new ABI (e g "C-with-unwind") which does the same. Or possibly to mark all extern C fn as unwindable, although I'm not sure if that will have any implications that I can't foresee right now.

In any case, the fix is not to regress back to the UB which just happened to work for rlua by accident.

Our plan right now is to revert on beta/stable since it's clear that the abort behavior is going to break crates in practice, and then implement one of your suggestions, or possibly @alexcrichton's suggestion -- which sounds plausible if there are no implementation difficulties.

As @nikomatsakis said above the PR that you wrote was great, but seems like not quite the behavior we want -- and as it is today, we're not ready to stabilize unwind or any other attribute immediately. While rlua is depending on UB, it does work in practice today -- and we should avoid breaking crates without any good workarounds.

Sorry for not bringing you into the loop earlier!

@diwic

without people even trying to keep me in the loop and asking for my opinion about it.

argh, I thought I cc'd you! I sincerely apologize for this. Was acting a bit "under the gun".

I want to emphasize what I wrote earlier: This feature is awesome, we just have to move a bit slower and make sure we enable it in the right way, so as to ensure existing users have ways to keep their code working on stable (and are given time to migrate).

I'm quite sad to see this improvement being backed out

Note also that I very intentionally did not back out the code, just made it opt-in for the time being. =)

@nikomatsakis @Mark-Simulacrum

Apologies accepted. Please keep me in the loop for further discussions about this matter.

What you're not considering are the people who now start relying on the new improved behaviour, where they, if happy with the abort behaviour, now remove their catch_unwind for improved performance, code size, and code readability. With reverting, you're now about to cause UB for these people, which is a quite severe regression if you're asking me.

This feature is awesome, we just have to move a bit slower and make sure we enable it in the right way, so as to ensure existing users have ways to keep their code working on stable (and are given time to migrate).

Hmm, but I don't think there is a way to give people time to migrate other than the normal beta cycle, which give people 6 weeks to try their crates against beta. We also did a crater run against beta, and a few crates were changed accordingly to avoid relying on UB.

Note also that I very intentionally did not back out the code, just made it opt-in for the time being. =)

Noted. I also noted that "opt-in" means rebuilding the compiler...

we're not ready to stabilize unwind or any other attribute immediately. While rlua is depending on UB, it does work in practice today -- and we should avoid breaking crates without any good workarounds.

So, if you insist on this being more important than avoiding UB, can we at least do some damage control? I e, if this only is a problem for the MSVC target, why should all other targets have to suffer?
Also consider that the change to mark all extern C fn as unwindable (i e, no nounwind attribute in the LLVM codegen for these functions) does not require stabilizing any attribute.

What you're not considering are the people who now start relying on the new improved behaviour, where they, if happy with the abort behavior, now remove their catch_unwind for improved performance, code size, and code readability. With reverting, you're now about to cause UB for these people, which is a quite severe regression if you're asking me.

I'm hoping that we can release 1.24.1 early enough that they'll be able to revert relevant commits, if they exist, fairly quickly. I do agree that ideally we'd avoid regressing like this, but I don't know of a great solution for that. If we can come up with something that will permit rlua to work and not reintroduce UB, then I'm happy to hear it -- as I understand it, no such (simple) solution exists that avoids stabilizing new functionality.

Noted. I also noted that "opt-in" means rebuilding the compiler...

I agree this is unfortunate, but I don't think we have a great way of avoiding this today.

I e, if this only is a problem for the MSVC target, why should all other targets have to suffer?

We considered making the patches MSVC-only but I'm concerned about diverging on the behavior here between platforms since it'll be somewhat harder for users to write code that works everywhere that way since catch_unwind will still be necessary but only on MSVC, which is kind of not nice.

Also consider that the change to mark all extern C fn as unwindable (i e, no nounwind attribute in the LLVM codegen for these functions) does not require stabilizing any attribute.

I'm not clear what you mean here. I probably don't quite know enough about LLVM to know what effect this'll have; I was sort of under the impression that this is similar to what our current set of patches does.

I'm hoping that we can release 1.24.1 early enough that they'll be able to revert relevant commits, if they exist, fairly quickly. I do agree that ideally we'd avoid regressing like this, but I don't know of a great solution for that. If we can come up with something that will permit rlua to work and not reintroduce UB, then I'm happy to hear it -- as I understand it, no such (simple) solution exists that avoids stabilizing new functionality.

I really regret not discovering this issue before the 1.24 release so this could be discussed properly beforehand. I'm seriously kicking myself, because I actually had internal bug reports that directly pointed at this issue, but I just didn't understand them until it was too late.

I know I said this before, but just to reiterate -- if this seems like something that can be solved properly in the next Rust stable, I'm okay with relying on sweet hacks for a few weeks and just waiting it out. I'm obviously going to prioritize the rust ecosystem at large over one crate, or maybe a couple of crates that have to deal with ridiculous C APIs. I don't know if other people affected by this feel the same way, however.

If a potential solution is further away than that though, I'm not exactly sure what I will do. Full honesty, I probably will not have the time or energy to modify rlua to the extent required to fix it for.. a while. Probably, I would just make rlua require nightly for the time being and wait for either a solution to stabilize, or for me to develop the energy and motivation to work around it. If you're interested, you can read me rant and complain about how hard it is in the rlua issue page, but it's ultimately not impossible.

if it's UB, should you be allowed to rely on EITHER behaviour?

Why not make it random which functions get an abort and which functions don't, and have it change every time you recompile?

Okay. So having read @alexcrichton 's comment a few times I think that means that for 1.25, the behaviour for MSVC is that foreign exceptions (including longjmp) will pass right through our landingpads and not be affected by the new abort landing pad. To my understanding, this would allow rlua to keep its current design, with the same restrictions (only Copy types on the stack etc), while keeping the new abort behaviour for Rust exceptions/panics. Is this correct? If so, this seems to be a reasonable way forward.

As for 1.24.1, we can choose to disable the patch and cause UB for people relying on the new behaviour, or keep things working for people who relied on UB to actually do something specific. The release team seems to have made its decision about this already, so I guess we just have to agree to disagree on that part. :-/

@Mark-Simulacrum

Noted. I also noted that "opt-in" means rebuilding the compiler...

I agree this is unfortunate, but I don't think we have a great way of avoiding this today.

Unless we want to implement a --make-unwinding-into-FFI-UB-again switch, but that's perhaps overkill...

We considered making the patches MSVC-only but I'm concerned about diverging on the behavior here between platforms since it'll be somewhat harder for users to write code that works everywhere that way since catch_unwind will still be necessary but only on MSVC, which is kind of not nice.

Indeed. This was meant as some kind of band-aid solution for 1.24.1, not something permanent.

Also consider that the change to mark all extern C fn as unwindable (i e, no nounwind attribute in the LLVM codegen for these functions) does not require stabilizing any attribute.
I'm not clear what you mean here. I probably don't quite know enough about LLVM to know what effect this'll have; I was sort of under the impression that this is similar to what our current set of patches does.

We mark all extern C fn as nounwind in our LLVM IR, which means that unwinding through them is UB. Exactly why we do this? I'm not sure. But if the abort-on-panic behaviour is reverted, we could consider removing this mark as well. From a practical standpoint, I'm not sure how much difference this will actually make on the generated code, as the UB that rlua relies on seems to have worked in practice.

From what I understand the new behaviour is still UB? So it shouldn't be relied on.

@SoniEx2

From what I understand the new behaviour is still UB? So it shouldn't be relied on.

The new behaviour is to abort when unwinding from Rust to FFI. Aborting is not UB.

I mean to say "unwinding from Rust to FFI and vice-versa is UB", so the compiler is allowed to abort on it, and even change the behaviour between patch releases.

(Arguably it's even allowed to change the behaviour between compiler executions.)

So having read @alexcrichton 's comment a few times I think that means that for 1.25, the behaviour for MSVC is that foreign exceptions (including longjmp) will pass right through our landingpads and not be affected by the new abort landing pad.

This is right. Probably not all the foreign exceptions for the first pass, but exception types used specifically to implement longjmp as well as system exceptions (access violations, etc).


It is true that, considering current compiler implementation, rlua is invoking undefined behaviour, however it has already been established that this is a compiler bug that makes this undefined, rather than usage of longjmp. Unwinding across a FFI boundary as a result of the usual language exceptions is still and will continue to be undefined (until we decide on providing some compatibility).

@nagisa

It is true that, considering current compiler implementation, rlua is invoking undefined behaviour, however it has already been established that this is a compiler bug that makes this undefined, rather than usage of longjmp.

I disagree. What has been stated is that, at least pre 1.24, unwinding from and to FFI is UB. If some C library - let's call it MSVC - decides to implement longjmp as unwinding, then it follows that longjmp-ing across Rust is UB, too.

Continuing my thoguhts from earlier here's some more investigation. I wanted to see what C++ did and how it approached longjmp/exceptions and whatnot. To that end I found out two primary things:

  • It looks like when using noexcept then cl.exe does not abort and programs using longjmp don't trigger noexcept checking. When using Clang, however, programs using longjmp trigger the noexcept logic, causing the program to abort. I'm not sure if this is a bug.

  • It looks like when a C++ object has a destructor both cl.exe and Clang will run the destructor as longjmp runs past the frame.

This may mean that my previous thinking of "let's just add a filter" may not be a great idea to apply universally. It looks like neither cl.exe nor Clang add filters to cleanups related to drop. It may or may not be a bug in clang that noexcept triggers on longjmp (will wait to hear back from them). Note that noexcept is particularly interesting here (I think) as we've basically implemented the exact same feature for Rust and it's what's causing this regression (only for us we'd like it opt-out rather than C++'s opt-in).

If Clang agrees that its behavior on noexcept is a bug then we can mirror the solution implemented in Clang and I believe that'll fix this issue. Otherwise we can probably follow the footsteps of cl.exe and implement a filter only for our "noexcept-like-behavior" on extern functions.

After some discussion on IRC it was concluded that Clang's behavior here is not a bug so we're on our own here for a solution.

I'm currently investigating adding custom personality functions for exception handling on MSVC so we can just skip all non-Rust exceptions at the source. Failing that I think we'll want to alter the method of abort-on-panics-in-extern to look like the "catch" part of the try intrinsic rather than using a standard cleanup pad. This latter solution means we'd run Rust destructors on longjmp which I'd personally prefer not to do, but it's better than ceasing working! In any case, I'll see how the custom personalities go.

I've opened https://github.com/rust-lang/rust/pull/48567 which should fix the regression here. I would be pretty uncomfortable backporting that to beta/stable, but if we revert the "abort on panic through extern function" change on beta/stable then I think we should be able to entirely fix the regression and still get all the benefits of the original change!

@kyren the revert for this is now currently on the beta channel, if you get a chance can you test to make sure it works?

@kyren the revert for this is now currently on the beta channel, if you get a chance can you test to make sure it works?

As of rustc 1.25.0-beta.4 (62ad16c7d 2018-02-26), rlua tests pass on windows! The revert seems to have 100% fixed the problem.

I've opened #48567 which should fix the regression here. I would be pretty uncomfortable backporting that to beta/stable, but if we revert the "abort on panic through extern function" change on beta/stable then I think we should be able to entirely fix the regression and still get all the benefits of the original change!

I've been following the "saga" haha, and I know you're now working on #48572 now instead, but I'm hopeful that you can get that change into beta / stable since that seems to be the best of all worlds.

This bug had a lot more tendrils than I expected, thank you guys so much for all the work you've been doing on this!

As for 1.24.1, we can choose to disable the patch and cause UB for people relying on the new behaviour, or keep things working for people who relied on UB to actually do something specific.

FWIW, this requires reverting a bunch of changes to gtk-rs and related crates which are now running into UB with 1.24.1 due to this being reverted.

@sdroege

this requires reverting a bunch of changes to gtk-rs and related crates

Say more? Did they have their own panic guards which were removed?

this requires reverting a bunch of changes to gtk-rs and related crates

Say more? Did they have their own panic guards which were removed?

@nikomatsakis Yes, exactly that :) But fortunately not in any released version yet (unless there was a release of some crate outside of the gtk-rs organization that I'm not aware of). It was planned to release this really soon now though...

I don't really know the state of it or whether it would be appropriate for a stable point release, but it would be nice if #48572 could be merged instead of a straight rollback, to avoid missing out on the feature for another rust stable version.

I do feel a bit guilty for triggering the reverted behavior, not having to worry about panic unwinds at extern function boundaries is quite nice, even if I'm not currently relying on it.

We plan to discuss this in the core team meeting tonight. Given #48572, we may forego the revert. @kyren, is it ok w/ you to wait until the next beta to have a fix? (I'm not sure how I feel about backporting #48572 to beta, but I guess it's ok -- backporting to stable seems too risky to me, personally.)

@kyren Did you try @petrochenkov 's suggestion to change longjmp to __instrinsic_longjmp ? If so, did it work?

@kyren
Not sure if it will be of any help in this situation, but you can actually longjmp on MSVC without performing unwinding by using two-argument int __intrinsic_setjmp(jmp_buf buf, void* frame_address) with NULL as a second argument instead of standard setjmp.
I've seen it used for jumping from jitted code living on a separate stack (you can't unwind between stacks) - people basically used #define setjmp(buf) __intrinsic_setjmp((buf), NULL) on top of their code.

So I finally got around to trying this and I can confirm that this DOES work, thank you very much for the suggestion!

Since I basically have a fix for this, I'm not in a huge rush to have this change in the rust stable compiler, but this is only speaking for me. This issue probably causes problems for other people that have to deal with at least any of Lua or Ruby or maybe libpng, depending on how strict they are about handling errors properly or how deep into the APIs they go.

I would LIKE it if the fix was in the next stable version of Rust, but even then it's not critical because like I said, I have a fix. The main thing I really needed from all of this was an assurance that longjmp APIs are at least supposed to be compatible with Rust in general, which just practically saves me a lot of heartache. If #48572 lands in nightly and the regression test for longjmp stays long term, but I have to wait two rust stable cycles to remove the now small rlua hack, that's certainly not the end of the world. The only way this would be a problem were if the fix for longjmping APIs ended up requiring an unstable feature that was a very long way off, or the hacky fix I have broke for some other reason in the interim, or if #48572 or similar was not merged, or if there was another decision that similarly changed direction so that I can't reliably call into a longjmp based API. If the fix did make it into beta though that would certainly feel a lot more "certain" to me and allow me to relax about it a lot more :D

@kyren Did you try @petrochenkov 's suggestion to change longjmp to __instrinsic_longjmp ? If so, did it work?

I was typing this reply as I saw your comment :D

triage: leaving assigned to niko but it might be good for him to delegate to alex or someone else if there's someone else taking point on this now.

Given #48572, we may forego the revert.

1.24.1 was released with this reverted now.

So, we talked about this in the core team meeting, and -- as you can see -- decided to revert after all. It was a bit of a tough call, but we felt like when in doubt, we ought to bias towards "back off and try again more carefully". The plan is to land @alexcrichton's change in #48572, which should enable us to re-enable the abort-guards by default. We are also keeping the explicit #[unwind] attributes I added in https://github.com/rust-lang/rust/pull/48380 (#[unwind(aborts)] and #[unwind(allowed)]), though those remain unstable.

(I still think we ought to address the interop question in a more thorough way, however.)

@sdroege I do apologize for whatever churn this caused gtk-rs and anyone else!

@nikomatsakis No problem, "git revert" is cheap and there was no release yet. All options were bad, let's just hope it sticks the next time :)

Since the revert made it into 1.24.1, this is now fixed! Thanks for the great work, closing!

Was this page helpful?
0 / 5 - 0 ratings