Julia: Wrong stacktrace shown when rethrowing an exception after catching another one

Created on 11 Jan 2017 · 25Comments · Source: JuliaLang/julia

function foo()
    try
        error("error in foo")
    catch e
    end
end

function bar()
    try
        error("error in bar")
    catch e
        foo()
        rethrow(e)
    end
end

bar()

produces the misleading

ERROR: error in bar
Stacktrace:
 [1] foo() at ./REPL[1]:3
 [2] bar() at ./REPL[2]:5

This just hit me in a really wicked situation: The stacktrace actually lead me to a line erroneously throwing the reported exception. I fixed it (to throw the correct exception), but the same error kept showing, leaving me completely puzzled for quite some time...

error handling

Source

martinholters

Most helpful comment

all that's required

Hah, I should have known better :-) . I started prototyping a solution, but restoring the exception can't happen inside jl_pop_handler / jl_eh_restore_state because that occurs before the code in the catch block (duh!). There's also some careful gc rooting required.

I now think we should generalize the exception_in_transit to be a stack of (exception,backtrace) states in transit. If you throw Exception2 while handling Exception1, you'd like to know about Exception1 as the root cause while further handling Exception2. This would mean jl_throw needs to push the new exception onto the stack. The lowering of catch blocks would need a final step added to pop the exception stack back to the state it was when the try was entered. I think I can see a trick to give JL_CATCH the same behavior.

To address the task switching problems, the current exception stack would need to be saved and restored during context switching which means more copies of bt_data floating around.

If we did all this, I believe catch_backtrace() can be made reliable and there would be no measurable performance penalty outside a catch block.

c42f on 15 Jan 2018

❤1 👍1

All 25 comments

We should somehow clarify that if you care about the backtrace of an exception, you need to call catch_backtrace immediately.

JeffBezanson on 13 Apr 2017

I've been trying to figure out why the backtraces are mostly corrupted when my Retry.jl package is used. I have always assumed that my naive macro hacking was messing up the line number metadata in the AST somehow. I spent a long time chasing that down, but I now think the root of the problem is this issue.

It seems there are three problems: the backtrace can be empty; the rethrown backtrace can be wrong; or the rethrown error and backtrace can be wrong.

The minimal cases I've boiled it down to follow:

`test1()` -- works

_Julia Version 0.7.0-DEV.2098 (2017-10-10 11:37 UTC)_

function f()
    throw("Error in f()!")
end

function test1()
    try
        f()
    catch e
        if e == "foo"
            println("Got foo!")
        end
        rethrow(e)
    end
end

ERROR: LoadError: "Error in f()!"
Stacktrace:
 [1] f() at /Users/sam/julia/v0.5/Retry/src/extest.jl:2
 [2] test1() at /Users/sam/julia/v0.5/Retry/src/extest.jl:8
...
in expression starting at /Users/sam/julia/v0.5/Retry/src/extest.jl:50



md5-030ecba154a2af0ddd85287fa1f06262

catch!
ERROR: LoadError: "Error in f()!"
in expression starting at /Users/sam/julia/v0.5/Retry/src/extest.jl:50

### `test3()` -- Wrong stacktrace

```julia
function isfoo(x)
    try
        x.code == "foo"
    catch e
        false
    end
end


function test3()
    try
        f()
    catch e
        if isfoo(e)
            println("Got foo!")
        end
        rethrow(e)
    end
end

ERROR: LoadError: "Error in f()!"
Stacktrace:
 [1] isfoo(::String) at /Users/sam/julia/v0.5/Retry/src/extest.jl:29
 [2] test3() at /Users/sam/julia/v0.5/Retry/src/extest.jl:43
...
in expression starting at /Users/sam/julia/v0.5/Retry/src/extest.jl:50



md5-eaba1d49975ac58dc4697fe857c3c82d

ERROR: LoadError: type String has no field code
Stacktrace:
[1] isfoo(::String) at /Users/sam/julia/v0.5/Retry/src/extest.jl:29
[2] test3() at /Users/sam/julia/v0.5/Retry/src/extest.jl:43
```

samoconnor on 15 Nov 2017

@JeffBezanson I'm not sure how to apply your advice that "you need to call catch_backtrace immediately".

I see that the runtime calls catch_backtrace() in _start().

It seems like maybe I need to do tmp = catch_backtrace() at the top of catch and set_global_catch_backtrace(tmp) before rethrow() to undo any backtrace corruption that happens in the catch block, but I can't see anything like set_global_catch_backtrace.

I found that exec_program in repl.c does a manual save and restore of the ptls->bt_data.

Maybe record_backtrace should save the old ptls->bt_data before overwriting it if there is already a current ptls->exception_in_transit ?

samoconnor on 15 Nov 2017

Based on reading the issues references above, and looking at the code,
It seems that rethrow only works if the catch block does not:

cause a task switch,
throw an exception or
call anything that might cause a task switch or an exception.

Is that correct? @vtjnash @yuyichao @Keno @s2maki

samoconnor on 15 Nov 2017

related to #20784 ?

13947

samoconnor on 15 Nov 2017

I've definitely seen the case for test4 cause the wrong exception to be thrown, and the missing stack trace problem from test2. Our code that exhibits this behavior definitely seems to fit these examples.

s2maki on 27 Nov 2017

I think the fastest way to fix this is to add a method of rethrow that accepts a backtrace; that will at least allow it to be preserved manually via

catch e
    bt = catch_backtrace()
    ...
    rethrow(e, bt)
end

JeffBezanson on 27 Nov 2017

Here's a patch for 0.6:

--- a/src/task.c
+++ b/src/task.c
@@ -576,6 +576,16 @@ JL_DLLEXPORT void jl_rethrow_other(jl_value_t *e)
     throw_internal(e);
 }

+JL_DLLEXPORT void jl_rethrow_other_bt(jl_value_t *e, jl_value_t *bt)
+{
+    size_t btsz = jl_array_len(bt);
+    size_t sz = btsz > JL_MAX_BT_SIZE ? JL_MAX_BT_SIZE : btsz;
+    jl_ptls_t ptls = jl_get_ptls_states();
+    memcpy(ptls->bt_data, jl_array_data(bt), sz * sizeof(void*));
+    ptls->bt_size = sz;
+    throw_internal(e);
+}
+
 JL_DLLEXPORT jl_task_t *jl_new_task(jl_function_t *start, size_t ssize)

With wrapper:

rethrow(e, bt::Array{Ptr{Void},1}) = ccall(:jl_rethrow_other_bt, Union{}, (Any,Any), e, bt)

This can be added to 0.7 as well though the patch will be significantly different. In fact we should probably fully deprecate rethrow to throw (allowing 1 or 2 arguments).

JeffBezanson on 28 Nov 2017

👍1

Naively it seems possible to fix this and make catch_backtrace/rethrow reliable by saving and restoring the backtrace data appropriately - somewhat similarly to jl_apply_with_saved_exception_state but with various complications.

Is it worth considering this approach or is there some fundamental reason why it won't work?

c42f on 14 Jan 2018

The problem is that collecting the backtrace is expensive, so you only want to do that if you're actually gonna save it. It is quite possible however to have some sort of magic symbol that the compiler looks for to see if it needs to save the backtrace (which could but need not be syntactically exposed). @JeffBezanson said there was an issue about that at some point, but I couldn't find it.

Keno on 14 Jan 2018

I spent a few hours poking around in the runtime. The backtrace may be expensive, but it looks like the cost of recording it is unconditionally borne inside jl_throw. The only way to prevent that is to somehow have a flag set in the error handler state. A syntactic hint like this old proposal: https://gist.github.com/vtjnash/9cd53434cc7f7e2f7a29 would allow that.

But that seems like a distraction - what I'm proposing is that if there's a current backtrace, we should record it when entering a try block, and restore it when leaving. Seems like a bit of pointer shuffling may be all that's required in jl_enter_handler/jl_pop_handler? And it should only hurt if you've already thrown an exception.

For context, I'm just mulling over what to do about Jeff's comment at https://github.com/JuliaLang/julia/pull/25370#discussion_r160065706 and wondering if a solution to the underlying problem is within reach.

c42f on 14 Jan 2018

all that's required

To address the task switching problems, the current exception stack would need to be saved and restored during context switching which means more copies of bt_data floating around.

If we did all this, I believe catch_backtrace() can be made reliable and there would be no measurable performance penalty outside a catch block.

c42f on 15 Jan 2018

❤1 👍1

@c42f a stack of backtraces is what I had in mind https://github.com/JuliaLang/julia/issues/19979#issuecomment-344448380 ...

Maybe record_backtrace should save the old ptls->bt_data before overwriting it if there is already a current ptls->exception_in_transit ?

samoconnor on 19 Jan 2018

Yes, a stack of backtraces is a good idea. That would let us keep catch_backtrace and rethrow, and fix finally as well. rethrow() is also more efficient than having the user (or compiler) call catch_backtrace() and pass the result manually, since that requires allocating more julia objects to reflect the data safely.

JeffBezanson on 20 Jan 2018

👍1

This reminds me of an old niggling thought: can we do better that the catch + rethrow API?
Can we do it in a way that discourages accidental loss (non-rethrowing) of important errors?
Can it be done in a way that makes requests for type-based catch redundant?

peek e as an alternative to catch e but with no need to rethrow. i.e. the error still propagates up the stack.

try
    ...
peek e
    @warn "my code caused an error" e
end

catch e where automatic rethrow is the default unless delete!(e) is called

try
    ...
catch e
    e isa IOError && e.code == :FOO && delete!(e)
end

samoconnor on 20 Jan 2018

@samoconnor that's an interesting idea; I think the key thing you're doing with the peek is capturing state from the scope where the peek occurs, and communicating it to an outer scope.

I think a better way to do this might be to add a new exception to the stack - that is, instead of the peek, to simply have:

try
    ...
catch
    throw(MyException("my code caused an error"))
end

Notably, because this throw occurs before the end of the catch block, the previous exceptions in the stack are preserved. So in the outer scope you have the full context, but you can also capture and transport arbitrary information in MyException.

In general I think the pattern of logging an exception and rethrowing it isn't great: Somewhere at an outer scope the exception will also be caught and reported in some way (probably via logging but without rethrowing). Logging and rethrowing just leads to duplicate error information due to several catch/log/rethrows down the stack. This is very confusing in log files. I'd say it's much better to preserve the information and make it available as a full stack of exceptions at outer scope instead.

c42f on 25 Jan 2018

I now think we should generalize the exception_in_transit to be a stack of (exception,backtrace) states in transit.

Hi @c42f,
Have you been able to make any progress towards this?
I have a protocol layer stack where there are multiple layers that need to: catch errors, inspect error type, clean up layer-local resources, then retrhow. As things stand, the stack traces are almost always corrupted or missing by the time the exceptions reach the application layer. As things stand, I often find myself commenting-out all the intermediate try/catches to get a clean look at the original stack trace. This is especially frustrating because of catch catching all exception types. I have this hierarchy of fine grained IO exception handlers, but I have to disable it all if I make a typo in some code that leads to an argument error, or and undefined var error or a bounds error.

It would be _really awesome_ to have rethrow working :)

(This reminds me of: https://github.com/JuliaLang/julia/pull/15906)

samoconnor on 19 Feb 2018

It's an interesting problem but I haven't made any implementation progress. My original motivation was to solve https://github.com/JuliaLang/julia/pull/25370#discussion_r160065706, but in the end it didn't seem that fixing catch_backtrace() would help much. (Not to detract from the host of other good reasons to do this.)

c42f on 22 Feb 2018

I had a 24 hour plane journey so I started looking into this in more detail. Here's some notes I made for myself about how exceptions and backtraces are implemented in the runtime (perhaps could become devdocs if there's interest?)

Capturing backtraces

In the runtime, backtraces are recorded by calling the low level function
jl_unw_stepn. jl_unw_stepn can fill in two C arrays:

ip which is an array of native instruction pointers mixed with interpreter
state where necessary. This array tells us which code was being executed
at the time of the backtrace.
sp which is an array of pointers to native stack frames containing the
program data. This optional array tells us which data was in use at the
time of the backtrace.

For use from julia code, jl_unw_stepn is wrapped by jl_backtrace_from_here
in order to return arrays of julia types. For use internally by the runtime,
jl_unw_stepn is called directly from various functions. For instance,
throwing an exception using jl_throw records the backtrace into a raw ip
buffer attached to the thread local storage. This is a bare minimum of raw
information for efficiency and needs to be further converted into julia-level
information when calling catch_backtrace and associated functions.

Format of the raw instruction pointer array

The julia runtime executes programs as a mixture of interpreted and compiled
code. Therefore, a comprehensive instruction pointer array must contains a
mixture of interpreter program counters and native instruction pointers:

Interpreter program counters are coded as a triple consisting of a special
sentinel value (uintptr_t)-1, followed by a jl_value_t pointer to the
MethodInstance being interpreted, and an integer program counter indexing
into the code for that method.
Native program instruction pointers are a single void pointer to the native
code being executed.

To get stack traces to work when interpreted and compiled code are mixed
requires a bit of black magic to detect the interpreter frames using the raw
instruction pointer, and extract the interpreter state from the native stack.
This is done by always entering the interperter via a special thunk
enter_interpreter_frame, whose code address can be known to the runtime. The
stack layout in this hand written assembly function is also carefully
controlled so that the interpreter state can be extracted in a well defined
way.

Exception handling

Try-catch mechanism

Julia exception handling is built on top of the C setjmp/longjmp mechanism
with some additional logic in jl_eh_restore_state to restore the state of the
task when a longjmp occurs. This additional logic includes:

Removing GC roots in abandoned frames by popping the GC stack
Restoring task local GC state
Unlocking any mutexes which were held in the abandoned frames
Restoring the signal mask

The state to be restored must be recorded at the try site and be accessible at
the catch site both in C and in julia. In both cases it's ultimately stored on
the stack in a jl_handler_t which is initialized by jl_enter_handler and
consumed by jl_eh_restore_state.

The C mechanism is as follows:

Use of the JL_TRY macro introduces a local variable of type jl_handler_t,
initializes it before calling setjmp, and calls jl_eh_restore_state if no
exception is thrown.
JL_CATCH adds the else clause for entry when an exception triggers a
longjmp. This includes a call to jl_eh_restore_state which uses the
local jl_handler_t variable to restore task local state.

The julia codegen mechnism essentially mirrors the C version, but is a little
more complicated as lowering, code generation, and an LLVM optimization pass
are involved. Briefly, lowering turns try/catch into
Expr(:enter) and Expr(:leave) in the lowered linearized code. These are
then translated during LLVM codegen into calls to the C runtime functions given
above, by way of an intermediate dummy function @julia.enter and the LLVM
pass LowerExcHandlers. (Note that local jumps caused by break and @goto
also abandon a try block and are also lowered to Expr(:leave).)

Throwing exceptions

From julia

When an exception is raised with jl_throw:

A backtrace is recorded into the thread local state
Some GC safe point stuff happens which I don't understand
The exception object is recorded in the thread local state
The jl_handler_t for the current task is used to get the approriate
jmp_buf which will reset execution to the current exception handler up the
stack.

From the runtime

The runtime can raise exceptions using jl_throw; this works the same way as
from julia.

There's also some special circumstances which require extra care:

Stack overflows
Divide by zero errors
User initiated interruptions (^C)
Attempts to write into read-only memory.

These are caught by the runtime using signal handlers (for example, stack
overflows are detected by protecting a memory page on the known stack boundary,
and catching the SIGSEGV which results if the stack spills onto this page).
Crucially, all code inside a signal handler must be async-signal-safe - for
example, we cannot allocate memory using malloc or call any function which
does so.

c42f on 2 Jul 2018

To follow on from the above, a possible implementation plan:

First, fix the task-vs-thread local state bugs in the current implementation. This should give experience going to the next step:

[x] Fix thread local exception problems. Either
- [x] Move exception_in_transit from thread local to task local storage in _jl_task_t, or
- [x] Leave as is, but persist the exception state into the task right after the catch, or during task switching.
  
  What's the difference here? There's two things:
- A potential perf difference, trading task switching perf for task startup time and memory usage.
- Putting exception state directly on the task, we can't throw exceptions before the root task is initialized. Is this a problem?
[x] Fix thread local backtrace buffer. See above.

To tie things off, replace the task-based exception/backtrace with a stack:

[x] Add stack of (exception,backtrace) pairs to the task, perhaps based on a linked list? Allow the first node to be reused so that perf doesn't crater when try is entered but no exception occurs?

c42f on 2 Jul 2018

I've been slowly poking away at this. It looks reasonably easy to preserve the exception and backtrace data during task switching - that's just a few changes to the runtime C code.

The more tricky problem is to insert an exception stack pop intrinsic (which I've currently called Expr(:pop_exc)) into the lowered code and more particularly to get it flowing through the various layers of optimization correctly. The desired lowering is something along the lines of

Before()
try
    InTry()
catch
    InCatch()
end
After()

lowers to

  Before()
  enter c1
  InTry()
  leave 1
  goto e1
c1:
  leave 1
  InCatch()
  pop_exc    #<- Needs to pop exception stack back to the state at associated `enter`
e1:
  After()

The state of the stack at enter can be recorded in the jl_handler_t and will be dynamically available when the associated leave executes. We need a little of this state available in the associated pop_exc.

I've got three ideas:

I've tried storing the label of the associated enter in the pop_exc expression but it ended up feeling slightly unnatural to propagate that label during SSAIR processing, particularly since it's not used as a jump target.
Another idea is to use a token which is yielded by the enter, and make pop_exc take that token as an argument. As I understand it, this is the kind of thing that the LLVM Token type was introduced for, and would then have a nice lowering to LLVM IR before we get to the LowerExcHandlers pass.
We could just re-deduce the association from the nesting context in LowerExcHandlers, provided the IR isn't reordered by any optimizations. I'm starting to think this would be the easiest as LowerExcHandlers already has similar assumptions.

@keno any thoughts?

c42f on 20 Jul 2018

It turns out that of the ideas above, producing a token from enter plays extremely nicely with the implicit SSA variables in SSAIR, but also is easy to work with in both the interpreter and in codegen. So I'm going with that for now.

c42f on 27 Jul 2018

Here's a PR which fixes this: #28878

c42f on 24 Aug 2018

Fixed in #28878

c42f on 24 Oct 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Implicit "end" for file-wide modules.

manor · 3Comments

+(x::T,y::T) where {T} not parsing correctly

musm · 3Comments

Broken booleans as numbers?

TotalVerb · 3Comments

Literal NaNs in ASTs

Keno · 3Comments

The devdocs page for functions is outdated

ararslan · 3Comments

Julia: Wrong stacktrace shown when rethrowing an exception after catching another one

Most helpful comment

All 25 comments

test1() -- works

13947

Capturing backtraces

Format of the raw instruction pointer array

Exception handling

Try-catch mechanism

Throwing exceptions

From julia

From the runtime

Related issues

`test1()` -- works