Instead of maintaining our own GC stack frames (which is slow and easy to mess up, and has likely accounted for a large portion of our tough-to-track-down core Julia bugs in the past years), we've occasionally talked about doing conservative stack scanning: scanning the stack and conservatively assuming that anything that happens to look like a pointer to a valid heap-allocated Julia object actually is one.
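For concreteness, the core of the scheme would look something like this (a minimal sketch with hypothetical helper names, not Julia's actual GC interface):

#include <stdint.h>

/* Hypothetical queries into the GC: is this address a live heap-allocated
 * Julia object, and if so mark it (plus everything reachable from it). */
extern int  gc_is_valid_object(void *p);
extern void gc_mark_root(void *obj);

static void conservative_scan_stack(void **lo, void **hi)
{
    for (void **slot = lo; slot < hi; slot++) {
        void *candidate = *slot;
        /* Any word whose bit pattern happens to match a live object
         * conservatively keeps that object alive. */
        if (gc_is_valid_object(candidate))
            gc_mark_root(candidate);
    }
}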
Pros:

- much simpler
- faster (#11508, #11284)
- makes stack allocation much easier (#8134)
- immutables with references trivially become as efficient as reference-free immutables
- fewer segfaults

Cons:

- random values on the stack that happen to look like pointers might retain objects that should be freed
- objects whose only reference has been handed to C code and stored somewhere other than the stack would not be seen by the scanner, so they still need explicit rooting
The second con seems like it can be avoided with the same kind of rooting we do now, except only needed at the ccall entry point. The chances of the first con happening (i.e. random values on the stack pointing at valid heap objects) seem pretty slim. There might be situations like when you take a pointer to an array and then use it, but in that case you probably _want_ the array to remain rooted while someone has a pointer to it. Are there other potential cons that I'm not seeing?
Related: https://github.com/JuliaLang/julia/pull/10725, https://github.com/JuliaLang/julia/issues/10317.
cc: @carnaval, @vtjnash, @JeffBezanson
The only thing I know about conservative vs. precise stack scanning is based on this, and I quote:
...... Language implementations with automatic memory management often start out using exact rooting. At some point, all the careful rooting code gets to be so painful to maintain that a conservative scanner is introduced. The embedding API gets dramatically simpler, everyone cheers and breathes a sigh of relief, and work moves on (at a quicker pace!)
Sadly, conservative collection has a fatal flaw: GC things cannot move. If something moved, you would need to update all pointers to that thing, but the conservative scanner can only tell you what might be a pointer, not what definitely is a pointer. Your program might be using an integer whose value happens to look like a pointer to a GC thing that is being moved, and updating that "pointer" will corrupt the value. Or you might have a string whose ASCII or UTF8 encoded values happen to look like a pointer, and updating that pointer will rewrite the string into a devastating personal insult.
I know that we are probably even further from compacting, but I'd like to know what the experts think about this...
Our large amount of C interop makes it fairly unlikely that we'll ever be able to do moving GC.
It's not clear that we need to move objects. If we did, it would already require a lot of work, since the current rooting API assumes non-moving. That article seems to assume that moving is required for generational GC, but that isn't strictly true. You only need it for compacting or for a copying _implementation_ of generational gc.
For us it's possible the conservative stack scan would be a bigger win than moving objects, especially given how important C interop is.
To the pros I would add that we would be able to do many of the optimizations we do for bitstypes on immutable types with no additional work. That's huge.
On the moving debate: we will never be able to do a fully relocating collector, because of C interop as Stefan said. However, it would be perfectly ok to do a so called "mostly copying" scheme, and we would need conservative stack scanning (to pin "C" objects) anyway.
The thing I don't agree on is that the chance of something on the stack falsely looking like a pointer is slim. The problem is that the stack is never cleaned up, so as it grows back you can end up overwriting parts (because of padding and alignment) of what was a pointer before. If you overwrite only the least significant bytes, you are on the contrary almost guaranteed to end up pointing inside a (different) julia object.
On the implementation side, the part taking care of pooled objects is already "done" in my PR (which is where I encountered the partial pointer stuff, btw). It's the easy one though, because they are fixed size. To do this for "big" objects we need a data structure to query whether a pointer falls into a range of memory.
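For the "big" object case, one option would be something along these lines (a sketch with made-up names: keep the address ranges of all big allocations sorted and binary-search candidate pointers):

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t start;
    uintptr_t end;   /* one past the last byte of the allocation */
} big_range_t;

/* ranges[] is sorted by start address and the ranges do not overlap. */
static big_range_t *find_big_object(big_range_t *ranges, size_t n, uintptr_t p)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (p < ranges[mid].start)
            hi = mid;
        else if (p >= ranges[mid].end)
            lo = mid + 1;
        else
            return &ranges[mid];   /* p falls inside this allocation */
    }
    return NULL;                   /* not a pointer into any big object */
}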
We can only properly evaluate false retention once we have a full implementation, and measure memory use in real workloads. Until then I'm not worried.
As we already discussed, I'm not worried too much about the memory usage of false retention, but about the angry "my finalizer was not run after gc()" crowd. Anyway, you're right, we would need to implement big object conservative scanning first to see it in real life.
Well, you know how I feel about that. The only real solution is to use finalizers less and rely on them less.
@JeffBezanson If you don't have reliable finalization, how do you handle externally managed memory reliably? Since C interop is so important, you need some way to be able to call C release code before something is GCed. Or am I missing something?
@carnaval I agree that there'd be a high probability of seeing false positives of things looking like valid pointers still on the stack.
"high probability" of seeing at least one when the programs run, sure. Now the thing which is hard to evaluate is : how long would false positive remain on the stack (after all if it is freed at the next call who cares), and with what density.
Depends a lot on the application... if you have processes whose lifetime is measured in months or years, with the amount of stack fluctuating a lot, I think it might be more of an issue...
How reliable would it be to insert some mark on the stack to distinguish managed code and unmanaged code?
@carnaval Is the "almost moving" scheme a compacting collector that pins everything on the stack?
You can mix conservative and precise stack frames, but the point of this is to get rid of the gcframe entirely, both in the runtime and the generated code. We can also have conservative scanning of C frames and precise scanning of julia frames using the llvm statepoint thing (more work).
The mostly copying stuff is a general technique, you just have a part of the roots you are not sure about. It can be the whole stack or only a part of it, or even some heap objects you wouldn't know the layout of (although we don't need this).
The canonical mostly copying (I know of) is http://www.cs.cornell.edu/home/fms/mcc/bartlett.html if you want to use copying as a marking segregation ("replacing" the marked bit / mark queue). Copying has other uses : compaction, generation segregation, ...
@carnaval Thanks for the explanation (and the pointer). I didn't know before that rooting could be classified at such a fine grain.
My feeling before was that compacting and the growing ability in LLVM to identify GC roots were the two things incompatible with a conservative stack scan. Good to know that both of them are still possible. (Although they are still as hard to implement as they are now...)
It's important that if we choose to do conservative stack scanning now, we get all of the benefits immediately and can do more work later to get back a partial ability to do moving GC (which we're not even doing right now).
I totally agree; I found it very promising when @carnaval first mentioned it to me.
It's just that when I try to find a more detailed study of it, the internet is flooded with moving collectors and there aren't many places even mentioning non-moving generational collectors...
The thing I don't agree on is that the chance of something on the stack falsely looking like a pointer is slim. The problem is that the stack is never cleaned up, so as it grows back you can end up overwriting parts (because of padding and alignment) of what was a pointer before. If you overwrite only the least significant bytes, you are on the contrary almost guaranteed to end up pointing inside a (different) julia object.
I'm trying to get a better understanding of this concern. Is the issue that you might have a pointer stored in a register or stack frame and then reuse the low byte or something like that and end up with the same register holding something that still looks like a pointer but points at something else entirely?
I should mention that at some point I had talked with @carnaval about wanting the ability to have interior pointers into an object pin the whole object (particularly useful with arrays). That would obviously make it _much_ more likely that some accidentally pointer-like value would pin an object, since you don't have to point exactly at the head of the object: pointing anywhere inside of it pins it. But I think the benefits of conservative GC outweigh that, so I'd be willing to forgo interior pointers for conservative stack scanning.
In that case, are we sure that compiler optimizations will not optimize out a store of the original pointer if we do not explicitly root it?
Right, that's another concern: can we force LLVM to store pointers in a recognizable form on the stack?
But then this doesn't feel so different from the current approach. I would expect a naive implementation to have a similar performance hit to the current rooting. It is probably a different story with LLVM GC support, but I guess that's not even on the table at this point...
And we would also need to do the same with our C/C++ code.
You have to conservatively scan registers too.
jsc (WebKit's JS engine) actually uses a Bartlett-style collector with LLVM, so it can be done. Regarding false retention, their blog post says (under "Integrating LLVM with garbage collection"):
Retention of some dead objects is of course possible, and we can tolerate it so long as it corresponds to only a small amount of space overhead. In practice, we've found that so long as we use additional stack sanitization tricks (such as ensuring that our compilers use compacted stack frames and occasionally zeroing regions of the stack that are known to be unused) the amount of dead retention is negligible and not worth worrying about.
But Julia workloads are probably a bit different from JS workloads.
Yep, you have to scan registers too. Then, if you are careful to check for interior pointers (because llvm might transform a beginning ptr + offset into a single pointer), I don't see what kind of transformation could hide the pointers from us.
LLVM would have to actively "encrypt" those on the stack, which while a legal transform, I doubt is common.
because llvm might transform a beginning ptr + offset into a single pointer
Excellent point; I hadn't thought of that. That is indeed very troubling.
And I guess we have to check for negative offsets as well, because of the tag?
Well, any pointer inside the memory block keeps it alive. The only problem is for objects which consist only of their tag, where the pointer might look like it's keeping the next object in the pool alive. Those are only singleton instances, right? Then they are kept alive by the type object anyway.
Another question that comes to my mind just now: if we don't explicitly root local variables anymore, is it possible that an object becomes unreachable in LLVM (making LLVM decide to overwrite the pointer) before it goes out of scope in julia code, causing the finalizer to be called too early?
This is probably not an issue if we have better support for managing scope-local resources and do not provide any guarantee about when the finalizer is going to be called (#11207).
I don't think it's an issue; we make no guarantees on when objects are finalized except that you can never access the object after it has been finalized. In particular there is no mention of scope anywhere.
So basically, if we want to do conservative stack scanning, we need to support interior pointer pinning, so we might as well use it, right? But then we have to deal with the fact that it's relatively easy for a random pointer-like value to pin an object accidentally by pointing anywhere into it. But there's no particular concern about a pointer to an object accidentally being "tweaked" to point into another object.
Another con to consider: We will not be able to move stacks. This would (prematurely in my opinion) preclude some concurrency models.
Go's elegant support for concurrency never requires the user to declare stack sizes for "go routines", which are essentially concurrent coroutines. The Go run-time starts stacks small (8 KB if I remember right) and grows a stack by moving it to a bigger chunk of memory. This is straightforward to do with precise GC information for the stack. Go deals with C interop by always running C code on a thread's original stack, and not allowing C code to call Go code.
Go deals with C interop by always running C code on a thread's original stack, and not allowing C code to call Go code.
I really don't think we want to/can do that: being able to pass Julia callbacks to C APIs is a killer feature.
Maybe we can lighten up the Go constraint by just making sure (via escape analysis) that addresses of stack locations are not passed to C code. E.g., not stack-allocate objects whose address might be seen by C code.
I agree that Julia callbacks to C APIs is critical (I was using that from the beginning of my Julia experiences). I hope that Arch's idea can work.
Just so that we don't forget: @vtjnash stumbled upon the dual stack ("safe stack") support in LLVM while scrolling LangRef. We could hack this to have it use the safe stack only for gc pointers (and spills of gc pointers in registers) so that we have a conservative scanner which is almost not conservative. Don't know how crazy this is.
@carnaval Do you have a pointer to the feature, or could you elaborate a little more? I only found a link to this clang doc.
and spills of gc pointers in registers
Does this mean that the original pointers will be spilled and we don't need to check for interior pointers anymore? (assuming we still explicitly root them in C code)
SafeStack support landed two weeks ago in llvm upstream.
http://reviews.llvm.org/rL239761
http://reviews.llvm.org/rL240473
http://dslab.epfl.ch/proj/cpi/
:-1: from me. The main difficulty is that it makes a fully moving collector impossible. Some advanced GC algorithms require this.
Furthermore, LLVM now has safepoint support. This allows for precise GC via stack maps compiled by LLVM. This should be faster than conservative collection.
I've been experimenting with conservative stack scanning in the Julia GC. I've implemented something that so far seems to work ok for single threads and has basic threading support [1]. It's not currently a pull request since it's still a work in progress (some help with figuring out the remaining problems would actually be great) and I see that this ticket is still marked "decision" anyway.
Performance seems to be so far unchanged; as I've only added new code that is called on top of the existing one, that is not surprising (it shouldn't have a huge impact). One goal is (obviously) to eliminate GC stack frames as a result and gain performance benefits from that. I haven't (yet) dug more deeply into that: for the time being, my goal was primarily to have solid evidence that this is feasible in Julia without reengineering the GC.
Would this be the right place to have a discussion about where to go from here? Is there still generally an interest in doing this, if the expected performance benefits materialize?
I'll note that there are two reasons why I've been working on that. One is that we seem to have some performance problems related to the current explicit GC frames. The other is that I'm part of a large external project [2], where we would like to integrate Julia with various computer algebra systems, such as GAP [3]; the specific goal is to have Julia actually handle the memory for part or all of those systems. Having conservative stack scanning as an option would greatly facilitate this.
I can talk more about our application if anyone is interested, but it's not really important here (except that we are also interested in the related issue of having support for marking foreign objects in the Julia GC, for which I can open another ticket to discuss [4]).
Note also that some of the technical details are discussed briefly in doc/src/devdocs/stackscan.md.
[1] https://github.com/rbehrends/julia (branch: rb/conservative-stackscan)
[2] https://wbhart.blogspot.de/2016/11/new-computer-algebra-system-oscar_20.html
[3] https://www.gap-system.org/
[4] We are generally interested in having support for Julia's GC to handle foreign objects directly where possible in order to eliminate the performance cost of additional indirections and allocations.
I strongly disagree with having a conservative stack scan unless we've exhausted all possible other implementations.
Having conservative stack scanning as an option would greatly facilitate this.
Why is that?
support for marking foreign objects in the Julia GC
An integrated way to do that is to declare a corresponding julia type.
@rbehrends: first of all, welcome; it's cool that you've already had some success with trying out conservative stack scanning. Please don't take @yuyichao's thumbs down as anything but a disagreement with the means; it's merely a technical judgement that conservative stack scanning is not the best way to achieve shared performance goals. @yuyichao: you've mentioned on a number of occasions that you don't believe conservative stack scanning is necessary or the best way forward, but I don't really understand what your intended alternative is. So that we can get everyone on the same page and not have people like @rbehrends wasting their efforts, could you possibly write up a short description of your plan for addressing this issue? Perhaps we can redirect some of these efforts towards making that happen then.
I don't really understand what your intended alternative is
What issue?
Your intended alternative to conservative stack scanning: what is it? A written plan for this issue would be welcome.
Your intended alternative to conservative stack scanning.
I mean alternative to conservative stack scanning to fix what?
And for the "pros" in the first post
much simpler
Not necessarily. We also need to make sure finalizers are not run too early, since objects can have uses invisible to LLVM.
faster (#11508, #11284)
Minimal after Keno's PR.
makes stack allocation much easier (#8134)
immutables with references trivially become as efficient as reference-free immutables
As mentioned in stackoverflow.com/questions/43932741/stack-allocation-of-isbits-types-in-julia this is the easiest part of anything. So any "effort" should go to the typeinf and codegen part.
fewer segfaults
With aggressive testing and static analysis tools, I'm confident enough about the current rooting that I think random compiler transformations hiding pointers might cause more segfaults than whatever missing roots we might still have.
What I'm saying is that we need a public document somewhere that lists these issues and provides a plan of how they are all going to be handled. I'm not saying that the information doesn't exist somewhere in the universe, but it is currently scattered across the internet and the heads of a half dozen people. It should be in one place where people outside the "inner circle" can see it, react to it, and most importantly help make progress towards it.
There is no public document related to this since this is not in the plan anywhere AFAICT. In fact, the Stack Overflow response was the first time in months that I remembered we still have an open issue for something we shouldn't do.
The multiple issues linked in this issue all have their own issues/PRs for discussion.
faster (#11508, #11284)
Well, exactly those, and the TODO would be to just revive the PRs. I think I have a pretty detailed explanation of what I did in the PRs and can certainly respond to questions there in case anyone wants to do it.
makes stack allocation much easier (#8134)
Explained in that PR and the linked PR at https://github.com/JuliaLang/julia/pull/12205 (only the last commit in the current commit list in that PR is real) which doesn't need conservative scanning. What needs to be done there in the GC really depends on what codegen wants to do, so there's no independent description of the necessary changes in the GC. As with other issues, anyone who wants to work on that issue should just comment there instead.
A possible implementation of it would just be
diff --git a/src/gc.c b/src/gc.c
index 371919569b..1eb827e50b 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -1809,8 +1809,16 @@ stack: {
else {
new_obj = (jl_value_t*)gc_read_stack(&rts[i], offset, lb, ub);
}
- if (!gc_try_setmark(new_obj, &nptr, &tag, &bits))
+ if (gc_ptr_tag(new_obj, 1)) {
+ new_obj = (jl_value_t*)gc_ptr_clear_tag(new_obj, 1);
+ nptr = 0;
+ tag = (uintptr_t)jl_typeof(new_obj);
+ bits = GC_MARKED;
+ meta_updated = 1;
+ }
+ else if (!gc_try_setmark(new_obj, &nptr, &tag, &bits)) {
continue;
+ }
i++;
if (i < nr) {
// Haven't done with this one yet. Update the content and push it back
immutables with references trivially become as efficient as reference-free immutables
This is basically not true... I realize this is not actually linked, but this is https://github.com/JuliaLang/julia/pull/18632. FWIW, I don't think we actually have a consistent full picture of this matter yet.
fewer segfaults
And not true.
So basically it's only the existence of this issue that might be confusing people. The related issues are not particularly well documented, since it usually takes someone actually trying to implement it to see what the issue really is and what needs to be done, which is also why a lot of the linked issues are actually PRs.
Given that @StefanKarpinski agrees to close this elsewhere I'll close this to reduce further confusion.
And reading through the issues/PRs I linked above again, I realized that the issue of stack allocation in general that I know about is already pretty well documented in https://github.com/JuliaLang/julia/pull/18632, which also has a reference from the easier case at https://github.com/JuliaLang/julia/issues/6080#issuecomment-268859420 . Anyone who wants to help with stack allocation should read that / comment there.
As for conservative scanning itself, I don't think we want to do that unless there's no other way to do something we want. https://github.com/JuliaLang/julia/issues/11714#issuecomment-301764461 . As an alternative GC, I don't think it's practical to have that unless someone wants to co-maintain and regularly test the non-shared part.
@rbehrends, another welcome and thank you for delving into internals! :)
@rbehrends Thanks for working on this. Glancing through your patch, it seems to be very high quality. This should be useful for comparing approaches to improving the GC. There are a lot of unknowns here, so having actual implementations will help a lot in moving things forward.
First of all thanks @rbehrends for posting that patch. It does seem like a very good patch (plus it has documentation!). However, I do agree with @yuyichao that conservative stack scanning is not a good way forward. I do think we should be able to do whatever we need with precise rooting, especially after #21888. Frankly, I'm not entirely convinced that it is possible to implement conservative stack scanning 100% correctly in the presence of arbitrary llvm optimizations (e.g. it might only store a derived pointer, or it may reuse some bits of a pointer, etc.).
It looks like the issues @rbehrends is raising here are:

1. Performance: do we need JL_GC_PUSH everywhere?
2. Interop --- this seems to be the key one. Is there anything that can be done other than conservative scanning or manual JL_GC_PUSH everywhere?

Some possibilities:

- Running our GC root placement pass on clang-generated C code.
- Separate heaps with different GC semantics (e.g. refcount or no-GC).
- Depending on the language the other code is written in, whatever metaprogramming that language has available (e.g. C++ could probably use smart pointers to emulate manual GC pushes/pops).
For more context, let me explain the concrete problem that I'm interested in trying to solve, which is better interoperability with foreign languages for some non-trivial use cases.
Julia interoperability with foreign languages is pretty straightforward if the underlying object can be directly represented by a Julia type. However, this is not always the case: examples are C unions or containers with internal free lists. One real example that I'm dealing with at the moment involves dealing with an object representation where tag bits are used to distinguish between pointers and immediate data: most machine words can contain either pointers or immediate data and the tag bits are used to figure out which it is.
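To make that concrete, the representation is along these lines (an assumed layout for illustration, not the actual one used by any particular system):

#include <stdint.h>

typedef uintptr_t Obj;   /* one machine word: either a pointer or immediate data */

static inline int      obj_is_immediate(Obj o) { return (o & 1) != 0; }   /* tag bit set */
static inline intptr_t obj_to_int(Obj o)       { return (intptr_t)o >> 1; }
static inline Obj      int_to_obj(intptr_t i)  { return ((Obj)i << 1) | 1; }
static inline void    *obj_to_ptr(Obj o)       { return (void *)o; }       /* tag bit clear */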
The simple way to interface with such a foreign object is to write a wrapper object in Julia, keeping the foreign object's representation opaque. However, this comes with performance implications: chiefly the extra allocations and indirections introduced by the wrapper objects, plus the finalizers needed to release the underlying foreign objects.
This overhead is generally acceptable when you're dealing with large, long-lived objects, but for small and short-lived objects, it becomes significant.
Additionally, if you also have references going back from foreign objects into Julia, there is a risk of having either memory leaks (if you keep Julia objects alive from foreign code) or dangling pointers (if you don't).
It would therefore help if one could use jl_gc_alloc() to directly allocate such objects. However, using the Julia GC in such a fashion directly from foreign code currently has its own performance and software engineering implications (assuming you could properly mark these objects).
Specifically, whenever using objects in foreign code, you're forced to deal with the JL_GC_PUSH*/POP family of macros, as there's no way to portably and easily determine GC roots for C/C++/Fortran stack frames [1]. This is an issue even for foreign code handling Julia objects or foreign objects that can be cleanly represented as Julia types, of course.
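For reference, this is roughly what the manual rooting looks like for C code calling back into Julia today (a sketch using the public julia.h API; the call through Base's + is just a placeholder):

#include <julia.h>

jl_value_t *sum_boxed(int64_t x, int64_t y)
{
    jl_value_t *a = NULL, *b = NULL;
    JL_GC_PUSH2(&a, &b);                 /* root a and b across allocations */
    a = jl_box_int64(x);
    b = jl_box_int64(y);                 /* this allocation must not free a */
    jl_function_t *plus = jl_get_function(jl_base_module, "+");
    jl_value_t *r = jl_call2(plus, a, b);
    JL_GC_POP();
    return r;                            /* the caller is responsible for rooting r */
}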
This is where conservative stack scanning comes into play. The goal here is to (a) get rid of the performance penalty of manually maintaining GC stacks in foreign code and (b) not having to rewrite a large existing codebase; in particular, (b) often makes conservative stack scanning the only realistic option for retrofitting GCed allocation to such a codebase.
I'll note that it is not actually necessary to scan the entire stack conservatively for this purpose. One could (at least in theory) combine precise stack scanning of Julia stack frames with conservative scanning of foreign stack frames.
[1] There are alternative approaches to using Henderson's method for precise stack scanning in an uncooperative environment, such as lazy pointer stacks, but they're generally even more difficult to implement manually.
Specifically, whenever using objects in foreign code, you're forced to deal with the JL_GC_PUSH*/POP family of macros, as there's no way to portably and easily determine GC roots for C/C++/Fortran stack frames [1].
This could be done automatically in C++, and it doesn't solve the general issue since there can be global roots from C/C++. The more standard way to deal with this AFAICT is a (simulated) ref-count interface. That's what both jlcore and pypy do.
If the C/C++ code base can be recompiled with Clang, then it's possible that better integration with Julia's GC could be achieved without rewriting the code. @Keno's comment is referring to that, I think.
Running our GC root placement pass on clang-generated C code.
I fear this is not always practical. A few examples: object references in the system I described are of type uintptr_t **, which reference polymorphic objects, but the referenced type can in the general case only be determined by dispatching through a type tag at run time. On top of that, if the low bits are set, the same memory word can also represent an integer or finite field element.

Separate heaps with different GC semantics (e.g. refcount or no-GC).
This carries with it the aforementioned wrapper overhead and cannot deal properly with cycles that involve both Julia and foreign objects.
I think I need to emphasize here that I'm not simply talking about Julia calling self-contained C code (for which this is a workable approach), but also calling back to Julia and/or embedding Julia in another piece of software. This leads to Julia code -> foreign code -> Julia code call chains and Julia object -> foreign object -> Julia object reference chains.
I'm personally pretty agnostic about whether to use conservative or precise stack scanning for a programming language; there are pros and cons to both approaches and many trees have died for research papers arguing the respective merits. :) But for advanced C interoperability, the story tends to be different.
P.S.: I would also like to apologize if I inadvertently upset an apple cart. My intent wasn't to start a major argument.
P.S.: I would also like to apologize if I inadvertently upset an apple cart. My intent wasn't to start a major argument.
No worries, this seems like a healthy debate. I don't think anyone has any objection to having the ability to do conservative GC in Julia for C code that it's interoperating with. That's actually a rather different and fairly independent concept from using conservative GC for Julia's own collection.
Incidentally, we started out using the Boehm collector but had some troubles with it (possibly our own fault), switched to a simple mark-and-sweep collector and used that for quite a long time. It would be interesting to explore the option of using the Boehm collector for the C code we call.
Can somebody convince me that it's even possible to do conservative stack scan correctly in the presence of arbitrary compiler optimizations? I'm not convinced that that is the case.
It feels like the best we can do is provide hooks into the marking phase of the GC and allow external integrations to do whatever marking they can by telling our GC where all the julia objects it knows about are. How it does that doesn't really have to be our business.
Also, @rbehrends: we're generally open to any ways we can improve interop with C code, including cases like the union types you're working with, and adding hooks into our GC. Of course, I suspect you need solutions that work now, not solutions that we can provide in a future release, so it's understandable if that's not quite sufficient in the short term. The GAP project is really exciting and I'm personally happy to do what we can to make sure it's a successful application of Julia.
Can somebody convince me that it's even possible to do conservative stack scan correctly in the presence of arbitrary compiler optimizations? I'm not convinced that that is the case.
I'm convinced that compilers are allowed to spill partial registers and break it. Also, callee-save registers make partitioning the stack scanning very hard.
As for providing hooks into the GC, I've thought about it a lot, but I do not want to have that now; at least not a stable one. The concurrency requirements for it can also change.
And another issue for C interoperability is the write barrier, which basically means that we can't support that usage in general (edit: i.e. you'll need to patch the C code). You'd have to treat part of the heap differently, and I'm not sure that's even possible without also affecting the write barrier code in julia directly...
Also, speaking of better C interoperability in terms of memory management, we've already talked about this for the buffer type (i.e. treating a C object as a buffer, similar to unsafe_wrap). We could have a list of these objects (effectively a separate heap) which would minimize the allocation/finalizer overhead.
Another option for C interop with respect to memory allocation is using something like LD_PRELOAD to replace libc's malloc and free with Julia-aware versions. I've used a similar approach in the past to replace libc blocking I/O functions with versions that are implemented in Julia and yield to the scheduler. It would be interesting to explore having an entire replacement libc that plays nicely with Julia: malloc and free that integrate with our GC, and blocking functions that yield to the scheduler. You could load arbitrary shared libraries and have them automatically play nice with Julia.
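For reference, the usual shape of such an interposer (a generic LD_PRELOAD sketch; the julia_aware_* functions are hypothetical stand-ins for whatever GC-integrated allocator would back them):

#include <stddef.h>

extern void *julia_aware_malloc(size_t sz);   /* hypothetical */
extern void  julia_aware_free(void *p);       /* hypothetical */

/* Built as a shared library and loaded via LD_PRELOAD, these definitions
 * shadow libc's malloc/free for every library in the process. */
void *malloc(size_t sz) { return julia_aware_malloc(sz); }
void  free(void *p)     { julia_aware_free(p); }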
That's exactly what my mentioning of write barrier is based on. It has been tried for gmp and it doesn't work because you need to get GMP to trigger the write barrier.
It's probably a moot point now, but doesn't the Mono GC rely on LLVM not applying arbitrary optimisations that break conservative scanning? At least LLVM might provide an option to disable optimisation passes that break such assumptions.
If you have control over the compiler/compiler options, you can pretty much do whatever you want (including adding an IR pass that adds automatic rooting). The question was whether one can take arbitrary compiled C code and use conservative stack scanning to have it root julia objects that it uses. Both @yuyichao and I think that this is in fact not possible due to partial register spilling.
Can somebody convince me that it's even possible to do conservative stack scan correctly in the presence of arbitrary compiler optimizations? I'm not convinced that that is the case.
No, because arbitrary compiler optimizations can break conservative stack scanning. This has been known since the early days.
The counter-argument here is based on pragmatism, i.e. that a lot of existing computing infrastructure would go up in flames if such a thing happened and that the authors of LLVM and GCC know that. Ruby (on which, inter alia, GitHub relies), WebKit, and Blink (Chrome's rendering engine) all depend on the soundness of conservative stack scanning, at least in part.
The generally bigger problem with conservative stack scanning (let alone entirely conservative GC) tends to be that on 32-bit architectures with large heaps and large objects, the risk of false positives can become pretty high (which is why the Boehm GC, inter alia, uses blacklists to mitigate that risk).
I assume partial register spilling is where the compiler splits a register into pieces that are smaller than a pointer and saves them to the stack to free up a partial register, e.g. EAX on an x86_64 machine. I tried to look up at what optimisation level the gcc and clang compilers do this, without success.
The term to search for is "subregister liveness tracking". Looks like the only architecture where LLVM currently does it is PowerPC. However, that's just for the spilling part of it. LLVM will happily insert an inttoptr, and a later pass will happily split an i64 into two i32s for you. Sure, maybe it doesn't happen particularly often in practice, but there are no semantic guarantees WHATSOEVER that you'll be able to find all your pointers.
Of course, I suspect you need solutions that work now, not solutions that we can provide in a future release, so it's understandable if that's not quite sufficient in the short term. The GAP project is really exciting and I'm personally happy to do what we can to make sure it's a successful application of Julia.
Thank you, though obviously I also understand that your priority has to be the long-term health of Julia.
I don't actually need a solution right now (we're on what is essentially a 12-year project) and GAP is just one part of the whole thing where interoperability is a concern (which is why I'd like to focus on more general issues of interoperability), but we're at a stage where we're developing a roadmap so we need to figure out our options and their respective viability.
It feels like the best we can do is provide hooks into the marking phase of the GC and allow external integrations to do whatever marking they can by telling our GC where all the julia objects it knows about are. How it does that doesn't really have to be our business.
This is something that I had actually thought about. A minimal set of hooks that permitted this would be the following (I think):

1. An ability for foreign code to supply its own roots (which would be nice to have even with precise stack scanning).
2. Alternative ways to declare safepoints (this has little to do with conservative stack scanning as such and more with the needs of foreign code, which may be unable to inject safepoints at arbitrary locations). If there's already such a way that I've overlooked, feel free to correct me.
3. The aforementioned marking of foreign objects.
4. Registering the allocation and deallocation of bigval_t instances.
5. Hooking into the marking of tasks.

The first three would be generally good ideas to have for better C interop, I think (and I'm planning on writing up a separate feature request for those), but the last two would expose internals of Julia and would at a minimum have to come with a big "can be deprecated at any time" warning label (any such external library would have to depend on some implementation details of the Julia GC, anyway).
Alternative ways to declare safepoints
What more alternatives do you need? There's jl_gc_safepoint and I don't see why we need a different kind. (Or, I guess, if you mean an alternative relative to allocation, then jl_gc_safepoint is the one.)
A minimal set of hooks
You also need to patch your code to issue write barriers.
What more alternatives do you need? There's jl_gc_safepoint and I don't see why we need a different kind.
The current safepoint mechanism (again, correct me if I'm wrong) only allows for one point in time to be marked as a safepoint, not an interval. Like JIT-compiled code in the JVM, it will do a forced load from a specific page that triggers a segfault, and the signal handler will then proceed to report to the GC thread that it is no longer working on any GCed memory. However, this does not account for cases such as long-running computations in third-party libraries or I/O where you cannot inject regular calls to jl_gc_safepoint(). For this, you want a mechanism where you say "we're not doing anything that interferes with GC between point A and point B".
You also need to patch your code to issue write barriers.
Obviously. I was talking about what was needed on the Julia side of things, and jl_gc_wb() and jl_gc_wb_back() are already in julia.h.
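For illustration, the pattern C code would have to follow when storing a Julia reference into a heap object (jl_gc_wb is the macro from julia.h; the wrapper struct here is just a made-up example):

#include <julia.h>

typedef struct {
    jl_value_t *field;   /* a Julia reference held inside a heap-allocated object */
} wrapper_t;

/* owner is the Julia object whose memory contains w. */
void wrapper_setfield(jl_value_t *owner, wrapper_t *w, jl_value_t *val)
{
    w->field = val;
    jl_gc_wb(owner, val);   /* record the potential old-to-young reference */
}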
For this, you want a mechanism where you say "we're not doing anything that interferes with GC between point A and point B".
That's a GC state transition, and we do have that with jl_gc_safe_enter / jl_gc_safe_leave.
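Roughly, the pattern is (a sketch; these are internal runtime macros, so the exact signatures and the ptls argument shown here may differ):

#include <julia.h>

extern void long_running_foreign_work(void);   /* hypothetical */

void call_foreign_work(jl_ptls_t ptls)
{
    int8_t old = jl_gc_safe_enter(ptls);   /* "not touching GC-managed memory from here on" */
    long_running_foreign_work();           /* a collection may proceed concurrently */
    jl_gc_safe_leave(ptls, old);           /* back to managed (GC-unsafe) execution */
}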
and jl_gc_wb() and jl_gc_wb_back() are already in julia.h.
What I mean is that you cannot do this without patching the C code. Not even with conservative stack scanning.
That's a GC state transition, and we do have that with jl_gc_safe_enter / jl_gc_safe_leave.
Ah, thanks for the correction.
What I mean is that you cannot do this without patching the C code. Not even with conservative stack scanning.
Yes, I am well aware of that.
That's exactly what my mentioning of write barrier is based on. It has been tried for gmp and it doesn't work because you need to get GMP to trigger the write barrier.
It's unclear who you're replying to here.
It's unclear who you're replying to here.
The comment right above it; in particular, the first sentence, which is what that comment is suggesting AFAICT:
Another option for C interop with respect to memory allocation is using something like LD_PRELOAD to replace libc's malloc and free with Julia-aware versions.
Yes, I am well aware of that.
OK, so this is why I don't think that's a good reason for doing a conservative stack scan. FWIW, there are usually many more heap mutations than local variables (obviously there are more stack mutations, but the JL_GC_PUSH<n> macros only need to be called on variable creation, and not on each mutation), so adding the wb is the hard part. It's also much easier to find missing roots using either static analysis or stress testing; we've done both with good results. WB errors are much harder to catch.
The other requests all more or less make sense:
An ability for foreign code to supply its own roots (which would be nice to have even with precise stack scanning).
This is what a simulated ref counting interface is used for. A way to add callbacks to supply roots is fine too, but I think a ref counting API is easier to use.
Alternative ways to declare safepoints (this has little to do with conservative stack scanning as such and more with the needs of foreign code, which may be unable to inject safepoints at arbitrary locations). If there's already such a way that I've overlooked, feel free to correct me.
So we already do. Incidentally, this is also why I don't want to have an API for the next one right now, since this is the part that still needs a lot of tweaking.
The aforementioned marking of foreign objects.
This actually hooks much deeper into the GC than any other ones. It's in principle doable but certainly not via a stable API anytime soon.
Registering the allocation and deallocation of bigval_t instances.
Actually not sure what this one means.
Hooking into the marking of tasks.
I assume this is either (1) or conservative scanning, so this would be either a no or already covered...
For what it is worth, GAP already has issues with GMP (a dependency). Their solution at present is to reallocate GMP integers before each call to the mpz interface so that no reallocation occurs within calls to GMP. It's probably technically feasible to do a similar thing with MPFR, but it is not a general solution for all C libraries, obviously.
OK, so this is why I don't think that's a good reason for doing a conservative stack scan. FWIW, there are usually many more heap mutations than local variables (obviously there are more stack mutations, but the JL_GC_PUSH<n> macros only need to be called on variable creation, and not on each mutation), so adding the wb is the hard part.
I have to disagree here, based on my own experience working with code that uses manual insertion of write barriers. Importantly, the proper use of write barriers is amenable to bespoke static verification even in C code (bespoke, because you may have to adjust your tools to the specifics of how a particular piece of C code represents data), which incurs no runtime penalty other than the write barrier itself. This is different from explicitly exposing local variables.
This is what a simulated ref counting interface is used for. A way to add callbacks to supply root is fine too but I think a ref counting API is easier to use.
Both approaches have their pros and cons, obviously, but a callback has the advantage that the roots can be computed lazily, i.e. if they get updated more frequently than collections occur. In any event, I think this may be a discussion for a different place, once I get to write up a feature request for it.
I have to disagree here, based on my own experience working with code that uses manual insertion of write barriers. Importantly, the proper use of write barriers is amenable to bespoke static verification even in C code (bespoke, because you may have to adjust your tools to the specifics of how a particular piece of C code represents data), which incurs no runtime penalty other than the write barrier itself. This is different from explicitly exposing local variables.
Runtime penalty is not what I'm talking about. I'm only talking about
(b) not having to rewrite a large existing codebase; in particular, (b) often makes conservative stack scanning the only realistic option for retrofitting GCed allocation to such a codebase.
which is the only correctness issue that doesn't already have a clear solution that doesn't require conservative scanning and I'm just saying that conservative scanning can't do that either so that's not a good reason to do it.
For performance, I'll repeat that we should not do conservative scanning unless we've tried all other solutions.
Runtime penalty is not what I'm talking about. I'm only talking about
(b) not having to rewrite a large existing codebase; in particular, (b) often makes conservative stack scanning the only realistic option for retrofitting GCed allocation to such a codebase.
which is the only correctness issue that doesn't already have a clear solution that doesn't require conservative scanning and I'm just saying that conservative scanning can't do that either so that's not a good reason to do it.
Very well. There are three distinct issues here: performance, verification, and ease of adaptation.
As for performance, conservative stack scanning avoids separate GC frames entirely.
Regarding verification, the correctness of write barriers is largely a local property (and can in the absence of tools be also ascertained manually through code review). In order to ensure that a sufficient subset of local variables is put on the GC stack along all possible program paths, you need global analysis, probably combined with interprocedural dataflow analysis. This would have to work with multiple source files and function pointers.
I believe that ease of adaptation is also better (by which I don't mean it's trivial). Recall that we only need a write barrier when we're writing a pointer to a heap object. Assuming we can supply GC roots, we do not need a write barrier for assigning to a global or thread-local variable; this already reduces the amount of code that we have to deal with. Also, the write barrier only has to be performed before the next safepoint, which gives us a fair amount of leeway.
For local pointer variables we have to do more; recall that I'm not just talking about calling C code from Julia, where Julia references only occur as arguments at defined entry points of a module, but about C -> Julia -> C -> Julia call chains and cases where foreign objects or global variables can contain Julia references. This means I cannot just guard the entry points to a C module, but (assuming that I don't want to use JL_GC_PUSH*() for every single local pointer variable) will possibly have to worry about every pointer read from a global variable, thread-local variable, or the heap (this is necessary because even if you trace them as roots, the contents of those locations can change before the next GC, at which point the only live reference is on the stack). This, in essence, has the complexity of a read barrier, on top of which we still have to deal with the actual write barrier.
For performance, I'll repeat that we should not do conservative scanning unless we've tried all other solutions.
I don't disagree insofar as Julia code itself is concerned. But I am talking about the performance of foreign code here, which is a separate issue.
Regarding verification, the correctness of write barriers is largely a local property (and can in the absence of tools be also ascertained manually through code review). In order to ensure that a sufficient subset of local variables is put on the GC stack along all possible program paths, you need global analysis, probably combined with interprocedural dataflow analysis. This would have to work with multiple source files and function pointers.
Local variable rooting is also pretty local, since it has defined behavior at function call sites. The write barrier is also not as local, since there are cases where we remove write barriers due to known old/young objects.
Also, the write barrier only has to be performed before the next safepoint, which gives us a fair amount of leeway.
This is not going to be the case.
assuming that I don't want to use JL_GC_PUSH*() for every single local pointer variable
From local information, you can do only a subset of it.
When it comes to GMP and MPFR, we've considered having these types implemented as native Julia objects instead and only calling out to the libraries for complex operations. And of course, there's a ton of performance to be gained by immediate representation of smaller values, which presumably is what you're using the C-union between pointer and immediate values with tagging for. We should really work towards having efficient Julia-native support for this and only call out to GMP/MPFR when necessary. That would eliminate a lot of these problems in a much cleaner and more efficient way.
Local variable rooting is also pretty local, since it has defined behavior at function call sites.
Unfortunately, that is often insufficient. Without looking at callers, any argument to a function can be the sole live reference to an object, which means that if you're using purely local analysis, at a minimum all pointer arguments need to end up in GC frames. A simple example is:
f(pop(&global_stack));
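To spell out the rooting this forces inside f (a sketch; g and the boxed integer are just placeholders):

#include <julia.h>

extern void g(jl_value_t *, jl_value_t *);   /* hypothetical callee */

void f(jl_value_t *x)
{
    /* Locally, f cannot know whether its caller still roots x, so if anything
     * in the body may allocate, x has to go into a GC frame. */
    JL_GC_PUSH1(&x);
    jl_value_t *tmp = jl_box_int64(0);   /* any allocation may trigger a collection */
    g(x, tmp);
    JL_GC_POP();
}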
Generally, we can eliminate JL_GC_PUSH*() if we can prove that another variable keeps the object alive or if a garbage collection will not occur within the scope of the local variable. In the general case, purely local analysis is insufficient for that.
write barrier is also not as local since there are cases we remove write barrier due to known old/young objects.
Write barriers are also relatively rare, as they are only needed for pointer writes to the heap (for example, GAP, which has its own generational GC with manually inserted write barriers, has about 700 write barriers in some 160 KLOC worth of C code). It is also less of a constraint for the optimizer, as a write barrier can be just another read+write to the same part of the heap that was already updated (often the same cache line), whereas GC frames interfere with register allocation and live variable analysis. And if there's a performance problem along a hot path with write barriers, that can still be manually tuned.
I also note again that precise stack scanning doesn't help with write barriers, which is an orthogonal issue that arises on top of finding GC roots stored in local variables. Even if we adopted precise stack scanning for foreign code, the insertion of write barriers would still be necessary.
f(pop(&global_stack));
This is invalid.
Even if we adopted precise stack scanning for foreign code, the insertion of write barriers would still be necessary.
Right. And even if you implement conservative scanning, the insertion of write barriers would still be necessary too so
not having to rewrite a large existing codebase
Isn't true.
Hmm. I believe that we may be simply having some miscommunication here.
not having to rewrite a large existing codebase
Isn't true.
By rewrite I literally meant rewrite, as in wholesale alterations; I don't mean that no alterations are necessary, just that they are kept at a manageable level. Not needing any changes at all would be unrealistic: even dropping in a fully conservative GC (such as the Boehm GC) cannot generally be done entirely without changes; you may have to worry about zeroing a free list to prevent leaks, for example.
The rationale behind this is the software engineering cost involved, which need not be zero, just reasonable. Of particular interest is the case where one controls the upstream version, too, and Julia integration is essentially a configuration flag.
I can see how this could be read differently, but that's not what I was getting at.
By rewrite I literally meant rewrite, as in wholesale alterations
OK.
Still, I think https://github.com/JuliaLang/julia/issues/11714#issuecomment-302065356 is a better way forward, since it is way too early to expose any part of the julia GC in a public API.
@rbehrends: my understanding from what you said previously was that for at least some of your dependencies, literally no changes could be made, down to the shared binary library level. If that's not the case, that certainly opens up many more options.
On Wed, May 17, 2017 at 11:45:19AM -0700, Stefan Karpinski wrote:
When it comes to GMP and MPFR, we've considered having these types implemented as native Julia objects instead and only calling out to the libraries for complex operations. And of course, there's a ton of performance to be gained by immediate representation of smaller values, which presumably is what you're using the C-union between pointer and immediate values with tagging for. We should really work towards having efficient Julia-native support for this and only call out to GMP/MPFR when necessary. That would eliminate a lot of these problems in a much cleaner and more efficient way.
That is what GHC does in its integer-gmp library.
@StefanKarpinski It's not that we're dealing with a single situation. We will likely have to deal with both dependencies where we indeed can't change a thing (GMP has been mentioned) and where we have to work around this limitation, but we'll also have code that's custom-written for interacting with Julia where we can do whatever we want; for yet other dependencies we may have different stakeholders with different (and possibly conflicting) goals involved.
But that's why I wanted to keep focused on the technical aspects of FFIs, with a look towards having some flexibility in how to deal with interoperability. Having more (or more flexible) options generally means that one is less constrained by backwards compatibility concerns, software engineering constraints, or (in the worst case) project politics.
But that's why I wanted to keep focused on the technical aspects of FFIs, with a look towards having some flexibility in how to deal with interoperability. Having more (or more flexible) options generally means that one is less constrained by backwards compatibility concerns, software engineering constraints, or (in the worst case) project politics.
Having to provide more flexible options also means that julia itself will have to be constrained by all of the above. As a recent example, if any part of the GC marking internals had been exposed, I couldn't have done https://github.com/JuliaLang/julia/pull/21590 without either breaking public API or having to emulate it.
For interfacing with C, I believe the best option is to use C libraries the way C would.
There is C code that is written for multiple languages/users (GMP for example), and there is other code that can rely on / link to the julia runtime. For the foreseeable future I believe the vast majority of libraries will not link to libjulia.
The C libraries that already exist are usually written with C memory management in mind and are meant to be used that way, so I believe we'll get the largest benefit if we can provide better support for that, and that's exactly https://github.com/JuliaLang/julia/issues/11714#issuecomment-302065356
C libraries that can be modified and adapted to julia should do that by incrementally porting the C code to julia. Work is being done to make that easier (e.g. https://github.com/JuliaLang/julia/pull/21912, which will make it much easier to declare and manipulate nested C structures). Even for libraries that can't link with julia, if they have a reasonable low-level API we should be able to port away from their high-level APIs and implement some of those in julia. This is effectively what we want to do with BigInt.
Neither of these two approaches relies on deep changes to the language, and both can benefit all users instead of one particular use case.
On the other hand, if we try to efficiently support the memory management of one particular C library, it'll put a lot of constraints on what we can do/change in julia. Note that this is about literally the lowest-level implementation in julia, since the GC is the only part of the runtime that doesn't call into any other major part (obviously other than true external dependencies). There's also usually no standard way to do all kinds of advanced tricks (e.g. tagged pointers) in C, so having direct support for them would not be useful for FFI in general unless we decide to also make use of them in julia itself for other reasons.
We are quite different from many other scripting languages like python in that you should write glue code in julia and not C. This is possible due to the JIT and is also why pypy prefers cffi over cpyext. If there's something missing that makes it harder to wrap C code efficiently, the preferred way to fix it is to support that in julia rather than exposing more C APIs (i.e. the two ways mentioned above).
@yuyichao I think our project has quite different requirements than interfacing with a C library. Gap is written in C, but it is itself a garbage collected (interpreted) programming language with additional dependencies. A large amount of effort has already been expended by @rbehrends to write bespoke tooling to insert write barriers in the Gap codebase (@rbehrends is the principal architect of HPC Gap).
You've indicated that it may be too early to formalise any Julia GC API. But my gut feeling is that we may still be better off offering to contribute to the design and implementation of an experimental/unsupported interface in a branch, rather than abandon entirely the approach @rbehrends has been steering us towards.
@StefanKarpinski Our project is a collaboration between existing projects, including those of us who are already implementing in Julia. We don't have control over the repository of Gap or its dependencies. Some of their key developers are excited about Gap<->Julia interop and visited us for a week to hack on some potential ways to do that.
it is itself a garbage collected (interpreted) programming language with additional dependencies
So it cannot be used from C?
@yuyichao That is one of the things we investigated with the Gap developers when they visited recently. It turns out that there is a project called libgap, which will make it possible to call into Gap from C, essentially. We prototyped calling this from Julia (without GC considerations for now), and calling Julia from Gap (again, in a simple way, without GC considerations). Most of the functionality in Gap is actually implemented in the Gap language, not the C kernel of Gap. This is one of the things that makes Julia so attractive for them. Gap is not JIT compiled, but interpreted, in a very dynamic way. Julia would be fantastic as a Gap library language for serious developers.
I should probably add that Gap is only one of a number of languages that we will make use of in our collaboration. Another is Singular, where @rbehrends has again invested much expertise and time into concurrency for us. There is also Polymake, which is basically a hacked version of Perl 5. So Gap should be seen as an example of something more general, not a singularity.
We do also have a number of large C libraries that we are also using, so we of course care about good C interop too. So the suggestions you and @StefanKarpinski are making with regard to C are also of interest to us.
This might be a bit out there, but have you considered writing a Gap interpreter in julia?
@Keno Yes, in fact a Gap developer actually started working on such a thing, but unfortunately abandoned it.
At present we haven't allocated funding to do such a thing for Gap, but we have allocated funding to do such a thing for Singular, which is a much simpler language. The real problem with Gap is the exceptionally dynamic nature of its type system. It doesn't map 1-1 with Julia. The much more interesting problem, for them and us, appears to be using Julia as a library language.
Having said that, there is great interest in rewriting a mathematical language, called Cap, which is currently implemented in Gap, in Julia. That will likely go ahead, from what I heard most recently.
Sorry to get right off topic here.
This might be a bit out there, but have you considered writing a Gap interpreter in julia?
This and related approaches (such as JIT-compiling GAP code or having a statically typed and compiled extension of the GAP language) have been considered in the past, but so far, this has gone nowhere.
Challenges would be:
I'll also note that if it were just about GAP <-> Julia interoperability, we could simply ship a patched Julia with GAP (enabled to support conservative stack scanning). But the Oscar platform is intended to be more modular and extensible than that, so we'd like to get interop right.
(Separately, there's still the current issue of performance related to GC frames within Julia proper, which is an independent concern, but my understanding is now that these issues will be addressed through LLVM-supported precise stack scanning.)