V Concurrency Considerations

Created on 23 Feb 2020 · 82 comments · Source: vlang/v

Transferred from a document posted here by @cristian-ilies-vasile:

V concurrency high level design

After careful consideration of the proposed concurrency models and the suggestions expressed on the Discord v-chat channel, the high level design is based on the Go language model (message passing via channels), which in turn is a variation of communicating sequential processes.

I did read papers on the actor model, but it seems that coders do not use all the primitives provided by the language and resort to threads and queues (see the “Why Do Scala Developers Mix the Actor Model” paper).

Almost all high level requirements are taken from Proper support for distributed computing, parallelism and concurrency, published on GitHub.

Because there are many words used more or less interchangeably (coroutine, goroutine, fiber, green thread, etc.), I will use the term _green thread_.

Key points

  • the language will provide primitives able to deal with green threads – the green threads monitor/scheduler will be part of the language
  • there will be no yield() points in code; the scheduler can suspend / resume / block / cancel any green thread in a preemptive mode
  • depending on the memory model, a green thread will be stackless or stackful
  • one green thread could spawn n other green threads
  • a green thread must be reentrant
  • the scheduler could start a new instance of a green thread in order to accommodate load bursts (elastic computing)
  • Green threads communicate using message passing (channels) instead of shared variables (a minimal sketch follows below)
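
Since the design is explicitly modeled on Go, here is a minimal Go sketch of that communication style; V's eventual syntax and channel API are still open questions, so this is only illustrative:

    package main

    import "fmt"

    // producer runs as a lightweight thread and hands its results to the
    // consumer over a channel instead of touching shared variables.
    func producer(out chan<- int) {
        for i := 0; i < 5; i++ {
            out <- i * i // send each result over the channel
        }
        close(out) // signal that no more values will come
    }

    func main() {
        ch := make(chan int, 2) // small buffered channel (a ring buffer)
        go producer(ch)         // spawn the lightweight thread
        for v := range ch {     // receive until the channel is closed
            fmt.Println(v)
        }
    }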

Open points

Feature Request


All 82 comments

I very very much want to see channels in V! They're one of the best ideas in Go; they're both simple and powerful, which is rare.

@ylluminate See related discussion: https://github.com/vlang/v/issues/1868

@elimisteve
It's there.
Almost all high level requirements are taken from Proper support for distributed computing, parallelism and concurrency, published on GitHub.

@ylluminate @cristian-ilies-vasile @elimisteve thank you for summarizing the high level goals/requirements - they're pretty much what I envision :wink:. I'll keep an eye on this to see how everything will evolve.

An amendment:

the green threads monitor/scheduler will be part of the language

should read:

the green threads monitor/scheduler will be part of the language, but will be extensible by the programmer to allow changing the default scheduling behavior

@dumblob
I updated the document shared on google drive with your comment.

https://golang.org/doc/go1.14#runtime
_Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection._

@cristian-ilies-vasile that's definitely interesting (see the specification), but it's still by far not fully preemptible. That's why it's called asynchronously preemptible: it actually still just inserts safe points, but starting from Go 1.14 they are inserted in many more places, carefully enough to not increase the overhead much. They tuned it so that in the majority of cases the overhead is not measurable, though in some cases it seems to be enormous, as seen in this benchmark and its equivalent in plain C, which doesn't suffer from any such bottlenecks.

It's also good to note here that the implementation of "asynchronously preemptible" in Go is really tricky and is definitely not universally applicable (hehe) to other languages (e.g. because of the problem on Windows, which they kind of "solved" for Go's semantics and internals).

Though I generally like the Go design with "safe points", I deliberately didn't go into details like this when describing the "interleaving between concurrency and parallelism" as outlined in https://github.com/vlang/v/issues/1868#issue-489433130, because it's a lot of work with "little benefit" (as you see, even Go tackles preemptiveness, and to date still just partially, only now after many years of development by hundreds of engineers, many of them full-time Go devs).

Not sure if this adds any value to this conversation, but ScyllaDB is built on top of this async framework: https://github.com/scylladb/seastar
There is also this Redis clone that uses it https://github.com/fastio/1store

Maybe instead of fibers etc. this kind of library solution could be used and added to vlib?

Regarding structured concurrency, I would suggest reading https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ first, which explains reasons to use structured concurrency instead of go-style concurrency.

@pquentin thanks for the pointer - didn't know about the Trio concurrency library (and the "nursery concept"). It's a very good approach to concurrency programming indeed. In the context of V I have these observations:

  1. I'm curious how long-running "background" tasks would be implemented in V, as the authors of the nursery concept recommend using yield to make a generator, but there are no generators/yield in V

  2. Assuming my proposal gets implemented in V, nursery objects might be "simulated" using the pluggable scheduler from my proposal. The scheduler state can be tinkered with, i.e. read from and written to, at any time, pretty much as nursery objects have to be passed around, though it'll be less convenient to traverse the one global graph in the scheduler state than to have an explicit nursery object from each "split" where spawning happened. That said, my proposal allows for a cheap abstraction in the form of an additional API modeled after nurseries, but doesn't enforce it.

    I think the only thing missing is to specify that the pluggable scheduler has to handle "exceptions" (i.e. optionals "raised" by return error() and panics "raised" by panic(), if panic() will stay in V). @cristian-ilies-vasile could you please add this to your overview?

    Care would also have to be taken when implementing the automagical (de)spawning of reentrant go routines together with nurseries, as that actually means changing the nursery objects dynamically at arbitrary times (maybe making nursery objects just "live views" of the pluggable scheduler state?).

  3. One thing to contemplate is enhancing go routines to immediately return something. E.g. a handle which could then provide methods like value<t>() t, which would be a blocking method returning exactly what the routine would return if it wasn't executed with a prepended go (including optionals, of course); running() bool, which would check in a non-blocking manner whether the routine has already finished; wait( s float ) to wait for completion with a given timeout; etc. Such a handle would need to support reentrancy though, so the methods above would need to distinguish which instance we're talking about :wink:.

    This might ease an implementation of the nursery concept on top of the proposed primitives (with the chance of them getting included into the V standard library, and maybe even as a built-in).
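
To make point 3 concrete, here is a rough sketch in Go of such a handle (the names are hypothetical, taken from the comment above; reentrancy, i.e. distinguishing instances, is left out for brevity):

    package handle

    import "time"

    // Handle lets the spawner query a routine's result after a go-style spawn.
    type Handle[T any] struct {
        done chan struct{}
        val  T
        err  error
    }

    // Spawn runs f in its own goroutine and immediately returns a handle to it.
    func Spawn[T any](f func() (T, error)) *Handle[T] {
        h := &Handle[T]{done: make(chan struct{})}
        go func() {
            defer close(h.done)
            h.val, h.err = f()
        }()
        return h
    }

    // Value blocks until the routine finishes, like the proposed value<t>().
    func (h *Handle[T]) Value() (T, error) {
        <-h.done
        return h.val, h.err
    }

    // Running reports in a non-blocking manner whether the routine still runs.
    func (h *Handle[T]) Running() bool {
        select {
        case <-h.done:
            return false
        default:
            return true
        }
    }

    // Wait waits for completion with a timeout and reports whether it finished.
    func (h *Handle[T]) Wait(d time.Duration) bool {
        select {
        case <-h.done:
            return true
        case <-time.After(d):
            return false
        }
    }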

@pquentin Interesting concept, but preventing goroutines from being spawned without permitting execution of the main thread/goroutine is _extremely_ limiting and would prevent a huge percentage of the value gained from having them in the first place.

The argument made against goroutines is honestly a severe straw man, as the normal and common way to achieve a nursery-like pattern in Go is to use a WaitGroup, or to pass a "done channel" (done := make(chan struct{})) to each function or method spawned in a goroutine, enabling each to signal when it's done doing its work (both shown below).
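
For reference, a compact Go example of both patterns just mentioned:

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        done := make(chan struct{}) // "done channel" for a single worker

        // WaitGroup pattern: the parent waits for all children to finish.
        for i := 0; i < 3; i++ {
            wg.Add(1)
            go func(n int) {
                defer wg.Done()
                fmt.Println("child", n, "finished")
            }(i)
        }

        // Done-channel pattern: the worker signals when its work is done.
        go func() {
            defer close(done)
            fmt.Println("worker finished")
        }()

        wg.Wait() // nursery-like join point for the three children
        <-done    // wait for the worker's signal
    }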

The fact that general solutions exist in the ecosystem is eventually admitted in the middle of the 4th footnote:

[Edit: I've also been pointed to the highly relevant golang.org/x/sync/errgroup and github.com/oklog/run in Golang.]

That said, the point about stack traces only going back to the point where the goroutine was launched, rather than going all the way back to main, is an interesting one I hadn't realized :+1:.

I disagree. The argument against unstructured go is not a straw man, just like the arguments against "goto" a generation ago weren't straw men. Enforced structure allows new ways of reasoning about your code, thus enabling you to code the Happy Eyeballs algorithm in 40 lines of code instead of 400. This directly translates to fewer bugs.

Yes, cool, the ecosystem has solutions, but if it's way easier to not use them (e.g. because they require a heap of boilerplate in every function call, can't handle cancellation, and whatnot), they're ultimately not helpful. In Trio, "waiting for your child tasks" and "leaving the scope of a nursery" are exactly the same thing and require zero lines of code (just the end of a block; in Python that's a de-indent), which again helps reduce the bug count and reduces the cognitive load on the programmer.

@dumblob The yield thing is just a convenient wrapper to package "run some code that returns a value (SOME_CODE), then run some more code to clean up no matter how SOME_CODE exited" in a single function. The caller says with wrapper() as value: SOME_CODE, and Python guarantees that the cleanup code always runs when you leave the scope of that with block.

In current V (or Go) you'd use a defer block for cleanup, but that's potentially buggy – you must never forget the defer line, otherwise you have a resource leak, an unreleased lock, or whatever. Clean-up should be encapsulated in the object – the caller (i.e. the person writing the calling code) shouldn't have to think about whether, let alone how, to do it. Contrast Python's

    with open("/some/path") as file:
        write_to(file)

with

    file := os.open('/some/path') or { return }
    defer { file.close() }
    write_to(file)

IMHO that second line shouldn't exist. Not for files, not for locks, and not for nurseries.

In Go, starting a goroutine is as easy as it gets – but cleaning it up is not, and needs to be done manually, especially when error handling is involved. Making coroutine creation slightly more difficult (by requiring a scoped nursery object to attach them to) is a no-brainer when the dividend you get from this is trivial cleanup.

@elimisteve

preventing goroutines from being spawned without permitting execution of the main thread/goroutine

I don't understand that sentence.

In current V (or Go) you'd use a defer block for cleanup, but that's potentially buggy – you must never forget the defer line, otherwise you have a resource leak, an unreleased lock, or whatever.

Exactly, that's the reason why I'm curious how we would implement the long-running "background" tasks using the nursery concept in V, as there are currently no means in V guaranteeing the existence of a proper defer. And having no guarantees totally defeats the purpose of the whole nursery concept :wink:.

Having no guarantees is detrimental to a whole lot of other constructs too. Locks for instance, or not leaving a file handle open when I'm done with using it.

So that's not an argument against nurseries, that's an argument for statically-bounded-visibility of objects – with a destructor that is guaranteed to be called exactly once. As V doesn't seem to have that concept, perhaps the first step should be to add it.

Having no guarantees is detrimental to a whole lot of other constructs too. Locks for instance, or not leaving a file handle open when I'm done with using it.

I don't think it's comparable. I argue that everything from unsafe {} in V (e.g. locks) can be easily avoided by other safe constructs V offers, and thus I won't discuss nor assume that in this discussion (that's the whole point of the existence of unsafe{} in V). What's left? Not much. Actually I think only files, and then a group of an infinite number of semantically high-level cases which we can safely ignore in this discussion, because they're not solvable by any technology, but only by the programmer.

Regarding files, yes, that's undoubtedly a hole in the safety. On the other hand I argue that open files (be it virtual unix files or any other files) are by far not as detrimental as concurrency issues.

@medvednikov what about putting file handling also into unsafe{}? It's not as ridiculous as it sounds, as the file system tree hierarchy is slowly getting "replaced" by other DB-like interfaces - especially in the context of the Web (WebAssembly as well as WASI is agnostic to the concept of a filesystem hierarchy, and support for a filesystem hierarchy is just a plain optional module, pretty much like any other DB API module etc.).

In summary, implementing a destructor-like concept in V wouldn't IMHO solve anything (there have been several discussions in this issue tracker and elsewhere - feel free to join them as this thread is about something else).

Back to the topic. Because I like the nursery concept, I'm asking for ideas on how to implement the "background" long-running tasks with guarantees, but without introducing yield in V, and a bit more syntactically readable than https://godoc.org/github.com/oklog/run#example-Group-Add-Context (assuming a larger graph of such "actors"). Anyone?

@dumblob
initial requirements:
_the scheduler could start a new instance of a green thread in order to accommodate load bursts (elastic computing)_
comments on https://discordapp.com/channels/592103645835821068/592842126304477184
o the green thread monitor will spawn a 2nd consumer if the first cannot cope with the load.
o how exactly will the scheduler/monitor know what to spawn?
If it needs to do that in a general way, does that mean that you would have to register a custom spawning function for each channel? Also, what would happen if the system does not have enough resources for the new consumer?

  • how exactly will the scheduler/monitor know what to spawn?

The scheduler is pluggable, so it's up to the programmer. If you're asking about defaults, then I'd follow the principle that only good defaults determine success. So I'd say we really want to provide basic scaling ("elasticity") by default, and we should provide a scheduler working moderately well for common scenarios (e.g. average desktop PCs, average laptops, average smartphones, average small/medium servers).

If it needs to do that in a general way, does that mean that you would have to register a custom spawning function for each channel?

I think the default behavior (as mentioned above) should work without any custom spawning function - the scheduler will first check whether the go routine is "eligible" for elasticity (I envision this through the reentrancy flag), then look for every channel (ring buffer) used in the go routine which is also used in another go routine (thus "connecting" them, forming a dataflow programming pattern - imagine e.g. NoFlo or Node-RED), and then mux (multiplex) the channels (inputs & outputs) accordingly.
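
A hedged Go sketch of what that muxing amounts to: because both worker instances read from the same channel, the load is multiplexed between them; in V the scheduler, not the programmer, would decide to start the second instance:

    package main

    import (
        "fmt"
        "sync"
    )

    // worker is "reentrant" in the sense used above: it keeps no instance
    // state, so it is safe to run it any number of times on the same channels.
    func worker(id int, in <-chan int, out chan<- string, wg *sync.WaitGroup) {
        defer wg.Done()
        for job := range in {
            out <- fmt.Sprintf("worker %d handled job %d", id, job)
        }
    }

    func main() {
        in := make(chan int)
        out := make(chan string)
        var wg sync.WaitGroup
        for id := 1; id <= 2; id++ { // two instances share the same input
            wg.Add(1)
            go worker(id, in, out, &wg)
        }
        go func() {
            for j := 0; j < 6; j++ {
                in <- j
            }
            close(in)
            wg.Wait()
            close(out)
        }()
        for line := range out {
            fmt.Println(line)
        }
    }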

Also, what would happen if the system does not have enough resources for the new consumer?

By default I wouldn't do anything (except in debug mode, where I'd log this information if it wasn't logged before in the last minute or so).

@dumblob consider a situation where, due to a temporary peak in load, the consumer cannot keep up, so a channel/buffer/queue fills to the max. The scheduler decides to launch a new worker, which increases the load even further, and so on and so forth => things come to a grinding halt.

Adding a positive feedback loop should not be done without understanding its limits and the behavior of the system in the degraded cases. I personally think that it cannot be done in the general case. I agree that this kind of elastic scaling may be useful for some cases, where you know that the feedback will be limited by external factors, but I think the default should be to do nothing, except maybe log that the buffer was found to be full, and not try to add new consumers automatically.

consider a situation where, due to a temporary peak in load, the consumer cannot keep up, so a channel/buffer/queue fills to the max. The scheduler decides to launch a new worker, which increases the load even further, and so on and so forth => things come to a grinding halt.

This is just one of many situations. The default scheduler should be pretty conservative. There are also many other measures the default (i.e. a simple one) scheduler can (shall) take into account - but in this case it's quite easy, as the scheduler knows everything about the whole graph (unlike in many other technologies, where one doesn't know the whole state) and the connections between nodes. Imagine you know the load of all buffers; thus the min/max flow between any two nodes (preferably a go routine instance acting as a pure source and a go routine instance acting as a pure sink) is easily observable, and thus the whole system is perfectly tunable, as the underlying graph theory and approximation algorithms are out there.

My hint to use full buffers as important information for the scheduler was meant primarily as a trigger for reevaluating the situation, not as the only measure :wink:.

Adding a positive feedback loop should not be done without understanding its limits and the behavior of the system in the degraded cases. I personally think that it cannot be done in the general case. I agree that this kind of elastic scaling may be useful for some cases, where you know that the feedback will be limited by external factors, but I think the default should be to do nothing, except maybe log that the buffer was found to be full, and not try to add new consumers automatically.

That's the sole reason why reentrant go routines exist. Only reentrant go routines will take part in elasticity - any other instance of a go routine (i.e. a non-reentrant instance) will not get any additionally spawned/stopped instances from the scheduler. So it's again 100% in the hands of the programmer, and V in that case doesn't handle the general case you seemed to be afraid of :wink:.

To clarify, I'm not proposing having a cool conservative scheduler supporting elasticity already in V 1.0. The reentrant go routines can be implemented the very same way as non-reentrant ones (i.e. without elasticity) for half a decade and it'll be fine. What I'm proposing though is the semantics, which I find important for now and the upcoming decades and which thus must be part of V 1.0.

discord quote
_how exactly will the scheduler/monitor know what to spawn?
If it needs to do that in a general way, does that mean that you would have to register a custom spawning function for each channel? Also, what would happen if the system does not have enough resources for the new consumer?_
@dumblob
I think that could be done in a way transparent to the coder. The green thread monitor could know which GT (green thread) is fed by which channel. At every preemptive allocation/reallocation the monitor could compute a few smart statistics and check whether the FIFO grows at a higher rate than the GT can process (a toy sketch follows below).
The GT should have an option indicating whether the coder allows the GTM (Green Thread Monitor) to scale the load automatically or not.
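
A toy Go sketch of that statistic (the thresholds and intervals are made up; in V this logic would live inside the GTM, not in user code):

    package main

    import (
        "fmt"
        "time"
    )

    // monitor samples the backlog of a channel and reports when the FIFO
    // grows faster than the consumer drains it.
    func monitor(ch chan int, interval time.Duration, samples int) {
        prev := len(ch)
        for i := 0; i < samples; i++ {
            time.Sleep(interval)
            cur := len(ch)
            if cur > prev { // backlog grew over the whole interval
                fmt.Printf("backlog rising (%d -> %d): consider a 2nd consumer\n", prev, cur)
            }
            prev = cur
        }
    }

    func main() {
        ch := make(chan int, 100)
        go func() { // fast producer
            for i := 0; ; i++ {
                ch <- i
                time.Sleep(time.Millisecond)
            }
        }()
        go func() { // slow consumer
            for range ch {
                time.Sleep(5 * time.Millisecond)
            }
        }()
        monitor(ch, 50*time.Millisecond, 5)
    }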

@cristian-ilies-vasile that's basically how I envision it (though not that simplified :wink:), with the only difference that you've split my concept of a general scheduler into two parts - a scheduler and a separate monitor (I'd rather leave them technically together though, as both work on the very same data, and splitting them would undoubtedly lead to a performance decrease).

@dumblob
No, the GTM (green thread monitor) contains the scheduler. In fact it is/will be the same piece of code with 2 definitions attached.
I placed a sketch of a testing concurrency document in the shared folder
https://drive.google.com/drive/folders/1LwsfHBlEdOEIf2IT7SjijKoJic7mN6Zj

Understanding Real-World Concurrency Bugs in Go
https://songlh.github.io/paper/go-study.pdf

No, the GTM (green thread monitor) contains the scheduler. In fact it is/will be the same piece of code with 2 definitions attached.

Ok, I'm fine with that :wink:. One minor thing though is the naming - I'm deliberately trying not to use any terminology resembling cooperative multitasking (e.g. "green thread"), because it's highly misleading (for the future as well as for the current state of things), as it implies many guarantees which the full preemption I advocate for in V cannot offer.

@cristian-ilies-vasile could you rename the Green Thread Monitor to something way more generic? Maybe V Thread Monitor (and V Thread Scheduler for the scheduler API) or so?

Understanding Real-World Concurrency Bugs in Go
https://songlh.github.io/paper/go-study.pdf

Yeah, that's a good source of issues with the Go semantics (especially the behavior of Go channels, to be precise). I think V can learn from that (though many of their decisions have been made for performance reasons) and improve on it - the full preemptiveness I'm strongly suggesting for V should help with some of them too.

Btw. for Go this paper is no longer fully valid after the introduction of non-cooperative goroutine preemption in Go 1.14.

OK, here are new descriptions:

V Light Thread Monitor
V Light Thread Scheduler - for the scheduler API

It seems that AI could help with subtle deadlock errors.
https://medium.com/deepcode-ai/deepcodes-top-findings-9-deadlocks-836f240ad455

Like I said on Discord:
It could be fun to implement a kind of distributed dataset for multi-thread operations.
In a far future.

https://github.com/microsoft/verona/blob/master/docs/explore.md

Concurrent Ownership
In Project Verona, we are introducing a new model of concurrent programming: concurrent owners, or cowns for short. A cown, pronounced like "cone", encapsulates some set of resources (e.g. regions of memory) and ensures they are accessed by a single thread of execution at a time.

@cristian-ilies-vasile yep, I've already seen Verona and this principle. IMHO it's built upon similar ideas as the nursery concept discussed above (but with a different syntax, and focusing on data - in terms of individual variables - instead of the concurrency graph).

I find the concept neat, though I also have doubts:

  1. it's new and I haven't seen any heavy concurrent code written using this concept to get a better overview of its advantages and disadvantages (it would be cool to port some heavily concurrent Go code bases to Verona to get a better picture)
  2. it seems really verbose (something V tries to avoid)
  3. it can be added to V later (e.g. after 1.0) - either as a library or an intrinsic/syntax - and won't conflict with the other concurrency/parallelism concepts discussed

@dumblob
How much overhead, in terms of kilobytes/megabytes, will a multithreading runtime add to the compiled binary code?

How much overhead, in terms of kilobytes/megabytes, will a multithreading runtime add to the compiled binary code?

Assuming dynamic libraries (DLLs on Windows and pthreads on *nix systems), V itself will be just a few kilobytes bigger than without a multithreading runtime. For a static build, that'll be a different story, as e.g. pthreads is quite big. For a -bare build, it'll depend mostly on the capabilities of the underlying HW (MCU, uC, ...), but that's usually very small, so again at most a few kilobytes.

Generally I don't feel this is of concern (there are way bigger modules/parts of V). Is there anything specific you had in mind, @cristian-ilies-vasile?

I was thinking that the overhead would be larger, a few megabytes, but if it's around 500 KB it is OK from my side.
_"Is there anything specific you had in mind"_
We have this discussion here, but I am not quite sure if this type of concurrency, with a dedicated runtime, will be accepted as part and parcel of the language (like in Go).

We have this discussion here, but I am not quite sure if this type of concurrency, with a dedicated runtime, will be accepted as part and parcel of the language (like in Go).

Well, from my point of view, there is basically no runtime for this. Moreover the scheduler, which will be basically the only runtime, is pluggable, and the default one should fit under 1000 SLOC with flying colors, so it's really tiny. It's similar to Go - there the only big part of the runtime is the garbage collector, but V has none, so no need to worry :wink:.

The idea is to pursue your dream, isn't it?
How Not to Land an Orbital Rocket Booster

@cristian-ilies-vasile I don't follow.

Mind, it's not rocket science (though thanks for the rocket icon under my post above :wink:) - historically there have been thousands of programmers writing more or less advanced schedulers (imagine all those kernels, microkernels, exokernels, etc., network device firmwares, HPC apps, language VMs, etc.).

It's really ubiquitous (and many of these schedulers are pluggable - if you're running Linux, you can try it out immediately and choose among tens of schedulers for different subsystems on the fly back & forth).

@wanton7 cool, even the Java world moves slowly forward ;) Note though, Loom is a nearly perfect copy of what modern Go threads do (see https://github.com/vlang/v/issues/3814#issuecomment-592643754 above) with the same pros & cons.

Pros are mainly high performance and quite good scaling. Cons are mainly that "it works well only in pure Go apps which do not use/call anything from the outside world". That's because both Go and Loom try to get away from time sharing at all costs (their schedulers don't do time-based preemption at all), which is doable if and only if you have 100% control over the source code or byte code of your application. That would mean not using any non-pure-V library in the future, which is insane IMHO.

Thus in my proposal I'm sacrificing a tiny bit of performance just to allow seamless and totally trouble-free use of any non-pure-V library directly from V.

Note also we're talking about V language semantics - the implementation under the hood might change every day if needed, and due to the pluggable scheduler, the programmer is actually free to implement her own scheduler, possibly imitating Go/Loom behavior, if she is confident her app doesn't use any non-pure-V libs.

Fairness in Responsive Parallelism
https://dl.acm.org/doi/pdf/10.1145/3341685

@dumblob
There was a conversation today on the Discord channel related to various V things, and we can start working on coroutines ahead of the scheduled V 0.3 version!

From Folklore to Fact: Comparing Implementations of Stacks and Continuations
https://kavon.farvard.in/papers/pldi20-stacks.pdf

https://github.com/hnes/libaco

https://github.com/baruch/libwire

https://swtch.com/libtask/

https://github.com/halayli/lthread

https://github.com/Tencent/libco

https://byuu.org/library/libco/

libco is a cooperative multithreading library written in C89.
https://byuu.org/projects/libco

Protothreads are extremely lightweight stackless threads designed for severely memory constrained systems
http://dunkels.com/adam/pt/
https://en.wikipedia.org/wiki/Protothread
a C continuation library inspired by Adam Dunkels' ProtoThread (with malloc)
https://github.com/matianfu/FUNK
Protothreads (coroutines) in C99
https://github.com/zserge/pt
c-block are extremely lightweight macros designed to eliminate callback hell, inspired by Duff's device and Protothreads. c-block provides a sequential flow of control similar to the await of C#
https://github.com/huxingyi/c-block

lthread is a multicore/multithread coroutine library written in C
https://github.com/halayli/lthread/blob/master/docs/intro.rst

An Implementation of Coroutines for C
https://github.com/spc476/C-Coroutines

libdill: Structured Concurrency for C
http://libdill.org/documentation.html

Fibers: the Most Elegant Windows API
https://nullprogram.com/blog/2019/03/28/

o debugging / visualization of the Light Threads Graph, events, states and actions.
I remember that the Mars Pathfinder suffered serious glitches due to the so-called "priority inversion" bug [1], [2].

I think that for the testing/validating phases we should be able to obtain these details in a human-readable layout. For a few items the .dot language suffices (https://www.graphviz.org/), but for large graphs it is ineffective.

Other options are
//1 use the railroad diagrams
https://www.bottlecaps.de/rr/ui
https://github.com/GuntherRademacher/rr

//2 use a graph visualization application able to deal with thousands of items.
https://gephi.org/

[1] https://www.rapitasystems.com/blog/what-really-happened-to-the-software-on-the-mars-pathfinder-spacecraft
[2] https://www.slideshare.net/jserv/priority-inversion-30367388

An interesting paper on how to deal with saturated locks.
Generic Concurrency Restriction
https://labs.oracle.com/pls/apex/f?p=LABS:0::APPLICATION_PROCESS%3DGETDOC_INLINE:::DOC_ID:1078

An interesting share by @cristian-ilies-vasile:

http://uu.diva-portal.org/smash/get/diva2:1363822/FULLTEXT01.pdf (backup)

In summary, from their Comparison section (_8.3 Result_):

Strengths:

Encore

  • Subord capability
  • Pass _this_ of an immutable object to an actor
  • Split object into parts
  • Option to use unsafe code

Pony

  • Tag capability
  • Mutable reference allowing only immutable aliases
  • Converting capabilities, including converting any reference to opaque
  • Functions can have restrictions
  • Default capabilities for each data type
  • Same class can be created with different capabilities

Rust

  • Global write aliases
  • Reference which can switch between being mutable and immutable for each call
  • Option to use unsafe code
  • Default capabilities
  • Same data structure can be created with different capabilities

Weaknesses:

Encore

  • Not possible to recover removed traits or functions (in the future some traits and functions can be recovered by _packing_)
  • Only _linear_ objects can become immutable
  • The allowed conversions (with _consume_ or _unpack_) are complicated and confusing; for example _read_+_local_ cannot be split while _read_+_linear_ can
  • The capability _unsafe_ is unsafe and can cause race-conditions, data-races, deadlocks, data corruption, crash of program and memory leaks
  • One cannot unpack an object in all ways one can consume an object, even though not all objects can be repacked (in theory)

Pony

  • One can only pass _this_ as the capability _box_, meaning it cannot be changed for that alias for any capability
  • _This_ of _val_ cannot be sent to actors

Rust

  • _Shared mutable data_ can easily become unsafe if not used correctly and can then cause race-conditions, data-races, deadlocks, data corruption, crash of program and memory leaks

Concurrency Kit
Concurrency primitives, safe memory reclamation mechanisms and non-blocking data structures for the research, design and implementation of high performance concurrent systems.
http://www.concurrencykit.org/

Lockfree Algorithms
Design and implementation of scalable synchronization algorithms
http://www.1024cores.net/home/lock-free-algorithms

Liblfds, a portable, license-free, lock-free data structure library written in C.
https://www.liblfds.org/

Speaking of C/C++ libs, I'd rather recommend newer stuff - IMHO the most performant, and at the same time easiest to use, is Weave. It's even undeniably faster than e.g. multithreading libraries hand-tuned by many Intel devs (who have worked on them for more than a decade). It also scales much better than any other multithreading or HPC lib I know of.

And the best part? Weave has no platform-specific hacks (neither code nor assembly nor any restrictions), so it performs that well on all existing platforms supported by a C compiler.

Because Weave is written in Nim, you can easily get its C source or manually transpile it to V (Weave uses quite a bit of the metaprogramming Nim offers, but because we're talking about built-in preemptive "subthreads" here for V, it should translate to "compiler magic", because that's V's answer to metaprogramming).

Weave uses the newest research findings and has many empirically designated constants fitting current & near future mobile, desktop and HPC systems. Therefore it's the most performant multithreading library and at the same time has probably the smallest code base among all these libs.

I think V could copy the SPSC and MPSC channels' (ring buffers') implementations, as well as some constants and ideas from the work-stealing runtime Weave implements. Those are the core parts determining how the whole system will perform.
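
To make the ring buffer part concrete, here is a minimal Go sketch (Go 1.19+ for the atomic types) of a bounded SPSC queue in the classic Lamport style - an assumption-laden toy for illustration, not Weave's actual implementation:

    package spsc

    import "sync/atomic"

    const size = 1024 // power of two, so index masking works

    // Ring is a single-producer single-consumer ring buffer: exactly one
    // goroutine may call TryPush and exactly one may call TryPop.
    type Ring struct {
        buf  [size]any
        head atomic.Uint64 // next slot to read (advanced by the consumer)
        tail atomic.Uint64 // next slot to write (advanced by the producer)
    }

    // TryPush appends v, reporting false when the buffer is full.
    func (r *Ring) TryPush(v any) bool {
        t := r.tail.Load()
        if t-r.head.Load() == size { // full
            return false
        }
        r.buf[t&(size-1)] = v
        r.tail.Store(t + 1) // publish only after the slot is written
        return true
    }

    // TryPop removes the oldest element, reporting false when empty.
    func (r *Ring) TryPop() (any, bool) {
        h := r.head.Load()
        if h == r.tail.Load() { // empty
            return nil, false
        }
        v := r.buf[h&(size-1)]
        r.head.Store(h + 1)
        return v, true
    }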

V Channels
@krolaw has published the first version of Go-like channels and selects for V.
https://github.com/krolaw/vchan

https://preshing.com/20120612/an-introduction-to-lock-free-programming/
Another introduction to lock-free programming, to get a feel for it.

@dumblob you seem up-to-date with libs in this space. It seems like atomic features are needed for lock-free things to work. I am just doing some proofs of concept for basic stuff like incrementing and decrementing numbers. I think I got most of it covered using https://gist.github.com/nhatminhle/5181506, but I am trying to get a solution for TCC using atomic operations. Can you or someone else point me in the right direction? I'm thinking that if we build a good foundation of these functions, we can build on top of it later.

@helto4real wow, thanks for such effort!

I think for TCC I'd look at MUSL's stdatomic, because it provides fallback functions (using locks, of course) if the compiler doesn't support atomics. MUSL does use Linux futexes, but it might give you a notion of how to implement it for Windows.

For -freestanding (bare metal) this might be yet another challenge - but for now I'd focus on these major platforms.

@helto4real wow, thanks for such effort!

I think for TCC I'd look at MUSL's stdatomic, because it provides fallback functions (using locks, of course) if the compiler doesn't support atomics. MUSL does use Linux futexes, but it might give you a notion of how to implement it for Windows.

For -freestanding (bare metal) this might be yet another challenge - but for now I'd focus on these major platforms.

Thanks, that is a good tip! Will do!

If one goes to Oracle Labs (https://labs.oracle.com/), there is a search field at the top of the page.
Searching for concurrency or locks or threads will bring back a couple of nice papers.

For hard-to-find C bugs (undefined behaviour, use after free, memory leaks, etc.), there is a way based on Oracle's new AST interpreter (Truffle), runtime compiler (Graal) and virtual machine called GraalVM.
There is also an LLVM IR interpreter named Sulong, so it is perfectly reasonable to execute low-level/unsafe languages (C, C++) on GraalVM.

The simplified workflow is:
.v source(s) -> |V compiler| -> .c source(s) -> |clang compiler + linker| -> LLVM IR -> |Sulong| -> |Truffle| -> |Graal compiler| -> machine code

Papers:
Sulong and Thanks For All the Bugs / http://ssw.jku.at/General/Staff/ManuelRigger/ASPLOS18.pdf
LLVM IR IN GRAALVM: MULTI-LEVEL, POLYGLOT DEBUGGING WITH SULONG / https://llvm.org/devmtg/2019-04/slides/TechTalk-Kreindl-LLVM_IR_in_GraalVM.pdf
SULONG An experience report of using the "other end" of LLVM in GraalVM / https://llvm.org/devmtg/2019-04/slides/TechTalk-Schatz-Sulong_an_experience_report.pdf

A short list of C/C++ source code static analyzers, just in case.
http://frama-c.com/ (the software can run the Mthread plug-in - but this plug-in is covered by a proprietary licence)
http://cppcheck.sourceforge.net/
http://svf-tools.github.io/SVF/
https://cpachecker.sosy-lab.org/index.php
https://phasar.org/phasar/
https://github.com/nasa-sw-vnv/ikos

@dumblob
Hello. If you have time please take a look at this presentation related to Threading on the Mill CPU
Talk by Ivan Godard – December 5, 2017, at Facebook
http://millcomputing.com/blog/wp-content/uploads/2017/12/threading.02.pptx

NOTE: you need Microsoft PowerPoint or a good emulator like LibreOffice.
Thank you.

https://discord.com/channels/592103645835821068/592106336838352923/727539427144106015

I want to propose the block keyword; the usage will be block variable_name or block(variable_name), like the existing cast operation, i.e. f64(x_point).
From this point in time, the compiler is aware that a former mut or atomic variable becomes read-only, so it is safe to have n pointers to that object, passed down to threads which in turn just consume the data.

I did propose a way to "cast" a mut capability to a read-only / const one.

mut age := 20
println(age)
age = 21
println(age)
block age or block(age)
// now age is immutable

age = 22 // error raised by compiler: "age is immutable, cannot assign value 22 to age"

Optimistic concurrency with OPTIK

Abstract

We introduce OPTIK, a new practical design pattern for designing and implementing fast and scalable concurrent data structures. OPTIK relies on the commonly-used technique of version numbers for detecting conflicting concurrent operations. We show how to implement the OPTIK pattern using the novel concept of OPTIK locks. These locks enable the use of version numbers for implementing very efficient optimistic concurrent data structures.

https://dl.acm.org/doi/pdf/10.1145/3016078.2851146
https://dcl.epfl.ch/site/optik
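
A hedged Go sketch of the pattern the abstract describes - read optimistically, then validate the version number under the lock and retry on conflict (my own toy reconstruction, not the paper's code):

    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    type optikCell struct {
        version atomic.Uint64 // bumped on every committed write
        mu      sync.Mutex
        value   atomic.Int64
    }

    // update computes the new value outside the lock and commits it only if
    // no concurrent writer invalidated the snapshot in the meantime.
    func (c *optikCell) update(f func(int64) int64) {
        for {
            v := c.version.Load()       // optimistic snapshot of the version
            newVal := f(c.value.Load()) // compute without holding the lock
            c.mu.Lock()
            if c.version.Load() == v { // validate: nobody committed meanwhile
                c.value.Store(newVal)
                c.version.Add(1)
                c.mu.Unlock()
                return
            }
            c.mu.Unlock() // stale snapshot: retry
        }
    }

    func main() {
        var c optikCell
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                c.update(func(x int64) int64 { return x + 1 })
            }()
        }
        wg.Wait()
        fmt.Println(c.value.Load()) // always 100: no lost updates
    }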

This is a very nice talk explaining goroutines and scheduling in Go. It also explains why lock-free was not chosen, and how they solved the problems around syscalls and coroutines.

https://www.youtube.com/watch?v=-K11rY57K7k

I did some research and found an interesting approach for preventing locks from being hammered excessively by threads, improving scalability.

Presentation / Generic Concurrency Restriction
https://labs.oracle.com/pls/apex/f?p=LABS:40150:0::::P40150_PUBLICATION_ID:5413

Paper / Avoiding Scalability Collapse by Restricting Concurrency
https://arxiv.org/abs/1905.10818

In this paper, we introduce GCR (generic concurrency restriction), a mechanism that aims to avoid the scalability collapse. GCR, designed as a generic, lock-agnostic wrapper, intercepts lock acquisition calls, and decides when threads would be allowed to proceed with the acquisition of the underlying lock. Furthermore, we present GCR-NUMA, a non-uniform memory access (NUMA)-aware extension of GCR, that strives to ensure that threads allowed to acquire the lock are those that run on the same socket.

@dumblob
hello,
For the elastic computing / autoscaling capability we need to compute statistics for each thread and channel, at least.

I found this paper:
Circllhist -- A Log-Linear Histogram Data Structure for IT Infrastructure Monitoring
https://arxiv.org/abs/2001.06561

The circllhist histogram is a fast and memory efficient data structure for summarizing large numbers of latency measurements. It is particularly suited for applications in IT infrastructure monitoring, and provides nano-second data insertion, full mergeability, accurate approximation of quantiles with a-priori bounds on the relative error.
Open-source implementations are available for C/lua/python/Go/Java/JavaScript.

Might this circllhist data-structure help keep track of latency over time for each thread?
Thank you.

Circllhist

Wow, that's a cool structure, I like it (I've been looking for something like that for I/O scheduling for block storage offering linear rather than parallel access).

I don't think, though, that it fits anything we're seeking here. It's still quite bulky (https://github.com/circonus-labs/libcircllhist/blob/18f1b97603757d5e65baefa65b90d6ec182e8a30/src/circllhist.c#L152-L155 ) for our purposes, it won't scale enough with the number of go routines (my goal would be tens of millions of go routines on a modern 4-core big.LITTLE ARM SoC for smartphones), and 99% of the information it provides would probably stay completely unused during scheduling.

If we'll ever need anything "advanced" regarding "history", then the first thing I'd look into would be some kind of Kalman filter.

But I'm pretty certain we won't need anything more than what we already have to achieve satisfactory performance in V 1.0 - i.e. time measurement (given by scheduler invocations) and the current filling of channels. Imagine a heuristic like a simple "flood" algorithm, but iteratively deepening from the most stalled places, where one scheduler invocation would do the next flood iteration only if the situation didn't get much better, and would restart the flood if the change seemed to worsen the overall state.

Overall it seems the hundreds of links posted here are directly or indirectly, more or less, converging to the links & ideas in the main thread. Which is generally a good sign, if it doesn't cause too much duplication (e.g. the Vyukov presentation, resources about green threads, etc.).

Hereby let me encourage everybody to carefully look at all the links in the main thread before posting any new links here, to avoid (sort of detrimental) thoughts like "the other thread is old, let's focus only on this new stuff here because there is not much time".

@dumblob
On Discord's V channels I always promoted both tickets, and people have been invited to subscribe and express ideas or concerns on both, so do not worry :)

Going back to the many links: this is a deliberate action from my side, trying not to pollute the original entry while adding more or less valuable resources here.

Bounding data races in space and time – part I
https://blog.acolyer.org/2018/08/09/bounding-data-races-in-space-and-time-part-i/

Bounding data races in space and time – part II
https://blog.acolyer.org/2018/08/10/bounding-data-races-in-space-and-time-part-ii/

I did some research regarding Nim's multithreading engine named Weave
https://github.com/mratsim/weave
Embracing Explicit Communication in Work-Stealing Runtime Systems
Designing an ultra low overhead multithreading runtime for Nim

Weave is not designed for keeping latency extremely low; it is designed for throughput, which is not automatically a bad thing.
Quote: _Picasso is not optimized for real-time or latency: It can be used in latency critical applications like games however there is no fairness or no system like a priority queue to ensure that a job finishes in priority. The goal is to process the total work as fast as possible, fast processing of individual pieces of work is a side-effect._

I know that V fans are working in the gaming industry - so the main coders should decide what type of concurrency is best suited for the language and its applicability domains.
We can pursue Weave's goals, or, if low latency is paramount, then a fully preemptive solution should be designed.

However Weave is freaky fast
_On matrix multiplication, the traditional benchmark to classify the top 500 supercomputers of the world, Weave speedup on an 18-core CPU is 17.5x while the state-of-the-art Intel implementation using OpenMP allows 15.5x-16x speedup._

Posting in a hurry.

Weave is not designed for keeping latency extremely low; it is designed for throughput, which is not automatically a bad thing.
Quote: Picasso is not optimized for real-time or latency: It can be used in latency critical applications like games however there is no fairness or no system like a priority queue to ensure that a job finishes in priority. The goal is to process the total work as fast as possible, fast processing of individual pieces of work is a side-effect.

This is not updated - Weave actually has an optional centralized queue to bring in a certain degree of fairness, more than acceptable for games and similar use cases. @mratsim, could this be rephrased on the Picasso/Weave sites/docs/wikis?

a fully preemptive solution should be designed.

You know me already :wink: - I'm definitely a proponent of a fully preemptive solution, to make V flawlessly interoperable with anything in the world (all those C/C++/FFI/... libraries).

There are "tags" (like [inline] etc.) in V which could later be used as programmer covenant with the compiler, that the given routine doesn't need to be preempted and if all parallel routines will be tagged like that, then a non-preemptive scheduler can be used instead of the default preemptive one. But that's a very specific case and I'm confident it shouldn't be part of V 1.0 (even the tiniest embedded chips have counter with interrupt and thus allow full preemptiveness which should be cheap with the proposal in the last paragraph of https://github.com/vlang/v/issues/1868#issuecomment-634516673 ).

Maybe we can borrow some of Weave's design decisions (channel “plumbing”, work stealing) into the preemptive runtime.
Also, splitting the potential go fn() workload into 2 main partitions:
a) CPU-bound / aka sensitive to throughput
b) IO-bound + 3 levels of priority (Low, Medium & High) / aka sensitive to latency
is not a weird idea anymore. We can address the specific issues of each in the best manner.

For Those About To Rock
The concurrency discussion topic on Discord has been launched!
https://discord.com/channels/592103645835821068/743149495818387476

I don’t do Discord so I have no access to what is being discussed there.

  • a green thread must be reentrant

I presume this is some attempt to achieve data race safety for data that is referenced by more than one green thread? (I found out it is some idea for duplication of a thread presented by @dumblob.)

This is quite onerous in terms of what a type system will need to prove to fulfill this guarantee. Why can't each green thread behave as if it is a sequential green thread, and then the data race safety issue can be handled separately for shared data, such as by making it immutable or implementing some reference capabilities with ownership for the shared data, as in the Pony language?

  • Green threads communicate using message passing (channels) instead of shared variables

No shared memory access between threads? If yes, then that will be very limiting.

With massively multicore arriving now, and the inevitable cost of sharing memory accesses across a communication bridge between clusters of cores, there will be a need for limiting the sharing of memory to a cluster of cores. There may still be a use case for sharing memory within a cluster?

If not sharing memory then what is the reason for the reentrant stipulation?

mux/demux green threads onto real threads - how to?

Golang presumably has a huge investment in this tuning. There are issues such as starvation pressure versus latency. This is very tricky to get right. Should you not be trying to lift much of the implementation from Go instead of reinventing the wheel with your presumably much more meager resources? Google is presumably running Go on thousands of servers, so it has a significant incentive to invest in the runtime.

Here are some links to a collection of the research, thoughts and notes I collected on M:N concurrency:

https://github.com/keean/zenscript/issues/17#issuecomment-416825734

https://github.com/keean/zenscript/issues/17#issuecomment-280900711

https://www.reddit.com/r/scala/comments/ir8ygb/what_is_missing_in_scala_ecosystem/g57axaz/?context=6

  • how to catch program bugs (Heisenbugs) before the code is shipped to prod systems?

Are you asking what the data race safety model should be?

If you are referring to bugs due to multithreaded contention, don’t go there. It’s an insoluble hornet’s nest. Opt for a lockfree model. There will always be lock contention at some higher-level semantic, but then that is not your concern.

Almost all high level requirements are taken from Proper support for distributed computing, parallelism and concurrency, published on GitHub.

I need to read this next. I will post what I have for this comment first.

It also explains why lock-free not was chosen.

I will need to watch that, thanks.

Edit: I found the reference at the ~11:30 juncture in the video. Okay, so it’s apples-to-oranges in terms of what I am thinking should be lock-free. Dmitry is referring to the scheduler for the M:N user-threads (aka green threads). I am referring to the API that the Vlang programmer interfaces with. I am saying that the Vlang programmers should be prevented or strongly discouraged from using low-level thread synchronization primitives, and instead should employ other lockfree means for (avoiding) contention over shared resources. I elaborated in the other thread on this topic.

I also found this:

https://stackoverflow.com/questions/13102874/is-gos-buffered-channel-lockless

Regarding structured concurrency, I would suggest reading https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ first, which explains reasons to use structured concurrency instead of go-style concurrency.

I will need to read that, thanks. Just pointing out what I have not yet factored into my reasoning above.

@shelby3 thanks for reaching out to us!

No rush here, V parallelism is a work in progress, so read slowly and carefully both threads (this one and https://github.com/vlang/v/issues/1868 ) and all referenced resources & links recursively. Then think about it for a few days, come back, clarify or adjust your concerns expressed above, and we'll be more than happy to discuss the open questions (if there are any :wink:) and accept PRs.

There should be enough time (look at the frequency with which both threads progress and correlate with commits in master), so don't hurry, we'll wait. Thanks.

Regarding structured concurrency, I would suggest reading https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ first, which explains reasons to use structured concurrency instead of go-style concurrency.

I will need to read that, thanks. Just pointing out what I have not yet factored into my reasoning above.

Now I remember I already incorporated that into my analysis in 2018:

https://gitlab.com/shelby/Zer0/-/issues/11#spawning-goroutines-posited-to-be-as-harmful-as-goto

https://github.com/keean/zenscript/issues/17#issuecomment-359338947

Structured concurrency is applicable when you want to explicitly express short-lived, concurrent operations which are considered to be children that should complete before the algorithm in the parent thread continues, because in this case there are data race issues to contend with, as the forked paths may return in any timing order.

Whereas for goroutines which are to be long-lived entities, no longer owned by or subservient to the thread which spawned them, structured concurrency does not apply, and only appropriate data race safety restrictions on shared data apply.

IOW, I suggest we need both go and the nursery concept.
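
A small Go sketch of the short-lived, structured case, using the golang.org/x/sync/errgroup package mentioned earlier in the thread (the long-lived case would be a plain go statement with no join):

    package main

    import (
        "context"
        "fmt"

        "golang.org/x/sync/errgroup"
    )

    func main() {
        g, ctx := errgroup.WithContext(context.Background())
        results := make([]int, 3)
        for i := 0; i < 3; i++ {
            i := i // capture the loop variable (pre-Go-1.22 idiom)
            g.Go(func() error {
                select {
                case <-ctx.Done(): // a sibling failed: stop early
                    return ctx.Err()
                default:
                    results[i] = i * 10
                    return nil
                }
            })
        }
        if err := g.Wait(); err != nil { // the structured join point
            fmt.Println("error:", err)
            return
        }
        fmt.Println(results) // children completed before the parent continues
    }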

I read an interesting, down-to-earth paper from 2020 on how to identify rare and hard-to-detect errors & bugs in multithreaded code (locks, unlocks, mutexes, semaphores, atomics, etc.).
It is very fast and effective at catching bugs; see page 4 of the paper.
Sthread: In-Vivo Model Checking of Multithreaded Programs / https://hal.inria.fr/hal-02449080/document
The point is to use the SimGrid software under Linux (pthreads and Windows... not a nice match), which is not so obviously stated in the paper.
SimGrid / https://simgrid.org/
Another paper related to the first one is a presentation; see slide 47 and the following ones / http://www.ccs.neu.edu/home/gene/dmtcp-msr-19.pdf

On V's concurrency front, we started working on a Proof of Concept based on the Weave project.

@cristian-ilies-vasile finding and fixing needle-in-a-haystack thread synchronization heisenbugs is not equivalent to a paradigm which can never create those bugs. The latter would be provably 100% free from these bugs at compile-time. I urge you: please do not repeat the design mistake of encouraging the use of thread synchronization primitives. I urge you to only allow access to these in unsafe{} code. Safe code should be provably safe at compile-time. As I explained herein and in the companion thread #1868, there are shared data paradigms which are provably safe at compile-time, such as immutable shared data. The Vlang compiler could enforce it by, for example, only allowing threads to share references to immutable shared data.

Just as you would not allow pointer arithmetic in safe code, you should not allow in “safe code” a paradigm famous for creating insidious heisenbugs because it is well known to not be safe at all and to _always_ proliferate insidious bugs. I believe Eric Raymond (re-)coined or first used (in 2014) the term “defect attractor” to describe intentional means to create defects in software.

I understand pointer arithmetic is especially dangerous because it can enable security holes, but I would argue that thread synchronization bugs can also become security exploits. It’s impossible to thoroughly reason about thread synchronization because it is a dynamic synchronization problem and Turing-complete programs can’t be proven to halt (the Halting problem). Thus these _unseeable_ heisenbugs are not equivalent in genre to other kinds of bugs which can be in many cases reasoned about and prevented (or at least discovered) by human brains and _eyeballs_ or some formal methods, _without even running the program_.

Additionally, as you presumably know, thread synchronization defeats massively multicore (which has already arrived for servers) due to Amdahl’s law (especially in conjunction with amplified contention due to increasing numbers of contending threads). Really, you want to embrace immutability; it is the only viable direction going into the future. And Golang does not have any capability for expressing immutability at this time, so this can be a significant advantage for Vlang.

Another reason to favor “future-proof” immutability is what I wrote in the companion thread:

I had posited in 2018 that immutable shared data will scale better to the inevitable future massively multicore shift away from universal hardware cache coherency, because the cache invalidation can be unified with the message between threads to update their copy.

I have contemplated other benefits to immutability.

P.S. I have not read about Weave yet.

EDIT: Eric Raymond wrote the following, which supports my warning:

While the ability to manage lots of concurrency fairly elegantly does pull Go in the direction of what goes on on a modern multi-core processor, CSP is a quite abstract and simplified interface to the reality of buzzing threads and lots of test-and-set instructions that has to be going on underneath. Naked, that reality is unmanageable by mere human brains – at least that’s what the defect statistics from the conventional mutex-and-mailbox approach to doing so say to me.

I understand pointer arithmetic is especially dangerous because it can enable security holes, but I would argue that thread synchronization bugs can also become security exploits.

@shelby3 How, specifically?

I understand pointer arithmetic is especially dangerous because it can enable security holes, but I would argue that thread synchronization bugs can also become security exploits.

@shelby3 How, specifically?

Consider that a variable maintaining a value for security is left in an indeterminate state by a data race. Targeted parallel calls cause races, causing the variable to go undetermined, causing access to be granted when it shouldn't be. The attack may need to be sustained, depending on the likelihood of success. Memory is often reused in strange ways; the variable may not even be part of the race directly. Undefined behaviour is undefined, including leaving the vault unlocked.
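
A contrived Go sketch of such a window (a hypothetical scenario for illustration; running it with the -race flag reports the race):

    package main

    import "fmt"

    func main() {
        authorized := false
        go func() {
            // state mutated concurrently, e.g. by a parallel request handler
            authorized = true
        }()
        // This read races with the write above, so its outcome is undefined;
        // a security decision based on it cannot be trusted.
        if authorized {
            fmt.Println("access granted")
        } else {
            fmt.Println("access denied")
        }
    }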

The first 250 simple tasks successfully ran on the R U S H proof of concept (inspired by Andreas Prell's PhD thesis).

fn x_sum(a int, b int, output_ch chan string) { // original function
    x := a + b
    output_ch <- ''
    output_ch <- strconv.v_sprintf("x_sum= %04d", x)
    output_ch <- ''
}

I published first version of R U S H proof of concept here:
https://drive.google.com/drive/folders/1Noh6xEd7hJj4-MxhzBtpRTIdogmGtqX5
It's still in a raw layout, with a lot of println(s) placed all around the code.
Also there is a second file, a demo of returning an integer from a function using channels.
The R U S H PoC shows that it is possible to have a piece of code able to run n functions in parallel.
Core ideas are taken from Andreas Prell's PhD thesis and C code, with a few modifications (improvements).

@cristian-ilies-vasile could you put all the sources to one gist to make it easier to comment on and discuss it (with the nice side effect of not polluting this thread)? Thanks :wink:.

