There is one thing in V I do care a lot about - parallelism & concurrency for seamless local and distributed programming.
Note, I didn't want to write this to https://github.com/vlang/v/issues/561 as the scope I'm describing is broader.
I'm confident every new language must provide first-class support for both parallelism and concurrency: concurrency to allow one physical CPU thread to do cooperative scheduling, and parallelism to allow spawning the same go/v routine on several physical threads. This is actually what go lang does (when GOMAXPROCS is set to 2 or more and the underlying operating system supports threads), except that it always spawns just one instance of a go routine regardless of the number of physical CPU cores (which is where V should improve - see below).
Why both concurrency and parallelism and not either of them?
In the world of big.LITTLE, GPU/FPGA offloading, CPUs with tens/hundreds of physical cores (and hyperthreading on top), mobile devices trying to save as much energy as possible, NUMA supercomputers, mobile & close-vicinity networks versus optical fiber networks, etc., it's inevitable to dynamically schedule computation in the runtime of each application. Scheduling only operating-system-wide is too coarse and thus inefficient, and scheduling only cooperatively would use just one physical CPU core per application, which makes sense only for power-saving scenarios, but nowhere else.
We have instruction-level parallelism covered (it's not yet present in V - see e.g. my rant about restrict - but it's a "solved problem" in today's compilers). The rest of parallelism is just about "how close can we get to pure dataflow programming", which appears to be kind of difficult.
Because instruction-level parallelism is solved, the best approach nowadays seems to be to:
1. write as much serial code as possible - code that instruction-level parallelism can highly optimize - in one go/v routine (typically a tight loop, one tree of nested tight loops, a slow I/O, offloading to GPU/FPGA, or similar); the outermost construct of this go routine will always be an endless loop yielding (calling the cooperative/parallel scheduler) every N iterations
2. then schedule these go/v routines in a cooperative manner on one physical CPU core with a fixed-size queue (single-producer-single-consumer) between these go/v routines (this queue is called a channel in go lang); the queue size might be between the size of L1 and 2*L2 cache as suggested in paragraph 3.3 of Analyzing Efficient Stream Processing on Modern Hardware
3. and if any go/v routine becomes a bottleneck during computation (i.e. the consumer is slower than the producer for some time - the cooperative/parallel scheduler can easily monitor this via the metric "how full are the queues"), a new instance of the slow consumer go/v routine shall be spawned (with respect to reentrancy) in a different operating system thread (i.e. on a different physical CPU core), including all the plumbing work (e.g. spawning an additional go routine acting as a multiplexer taking the producer's outgoing data and multiplexing it to the original as well as the new instance of the consumer, and spawning a demultiplexer go routine to join the outgoing data from both consumers) - all of this would be done by the cooperative/parallel scheduler (see the Go sketch after this list)
4. and if the queues stay almost empty for some time, remove the multiplexers and demultiplexers again.
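To make point 3 concrete, here is a minimal sketch in Go (V's primitives for this don't exist yet, so Go stands in; the 3/4-full threshold and the workload are made up). Note that Go channels are multi-producer/multi-consumer, so simply sharing the channel replaces the explicit multiplexer/demultiplexer routines that strict SPSC queues would require:

```go
package main

import (
	"fmt"
	"sync"
)

// consume drains `in`; several instances may run at once, which is
// exactly the "reentrancy" requirement from point 3.
func consume(in <-chan int, out chan<- int, wg *sync.WaitGroup) {
	defer wg.Done()
	for v := range in {
		out <- v * v // stand-in for the slow per-item work
	}
}

func main() {
	const queueCap = 1024 // tune toward the L1..2*L2 sizing from point 2
	in := make(chan int, queueCap)
	out := make(chan int, queueCap)

	var wg sync.WaitGroup
	wg.Add(1)
	go consume(in, out, &wg)

	go func() {
		spawned := false
		for i := 0; i < 100000; i++ {
			// the "how full are the queues" metric from point 3
			if !spawned && len(in) > 3*cap(in)/4 {
				spawned = true
				wg.Add(1)
				go consume(in, out, &wg) // second instance, another core
			}
			in <- i
		}
		close(in)
	}()

	go func() { wg.Wait(); close(out) }()

	sum := 0
	for v := range out {
		sum += v
	}
	fmt.Println("sum:", sum)
}
```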
The cooperative/parallel scheduler referenced above shall be a user-definable routine (if not present, a default built-in scheduler working similarly to what is described above will be used). This scheduler shall run preemptively (in its own thread?) and get as input arguments at least the following: thread handles, pointers to all go/v routines and the corresponding queues between them, a pointer to the built-in scheduler (in case the user just wanted to wrap it), and a pointer to user-defined metadata (e.g. statistics allowing better scheduling judgement). This would allow for really complex scheduling on embedded systems as well as NUMA systems or any other highly dynamic systems (e.g. a smartphone could leverage being shortly connected to a cluster of other computers or smartphones and offload some slow algorithm there, etc.). See e.g. MultiQueues.
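A rough sketch of how such a user-definable scheduler hook could look (all names here are hypothetical illustrations of the inputs listed above, not an actual V API):

```go
package sched

// SchedStats is hypothetical metadata handed to the user scheduler.
type SchedStats struct {
	ThreadHandles []uintptr   // handles of the spawned OS threads
	QueueFill     []float64   // per-queue fullness, 0.0..1.0
	UserData      interface{} // user-defined metadata (statistics etc.)
}

// Scheduler is the user-overridable hook; the built-in scheduler is
// passed in so a user implementation can simply wrap and delegate.
type Scheduler interface {
	Tick(stats *SchedStats, builtin func(*SchedStats))
}
```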
This way one can write an application once and deploy it to your grandma's wristwatch as well as to a supercomputer (assuming the user-level scheduler is a "bit more" clever than outlined above - imagine complexity similar to an SQL query optimizer for a distributed DB). It's also a highly failure-tolerant system (imagine Erlang, which even supports an online update of the whole application in memory while serving all clients under full load without interruption!). Also, as you may have noticed, go lang does not scale, first because there is no spawning of redundant go routines (those which will be bottlenecks) to other physical CPU cores, and second because go doesn't have a dynamic user-influenceable scheduler (taking into account advanced statistics, power consumption, etc.).
See also lock-free and wait-free data structures (state of the art as of now) which might be handy for an efficient implementation of parallelism.
IMHO we could actually postpone the implementation of concurrency and stick with just parallelism (as it's easier to implement than the interleaving of concurrency and parallelism as outlined above), implement the queue and a basic scheduler, and only then (maybe even after V 1.0) get back to concurrency. As of now we have parallelism without queues and without the scheduler, so we actually already have a good starting point.
Table of parallelism possibilities we have nowadays (just for reference):
parallelism kind | suited execution duration | latency overhead | suited execution startup frequency | influenced by internode bandwidth quality | special parallel SW architecture required | execution & deployment requirements
---| ---| ---| ---| ---| ---| ---
CPU non-JIT vectorization on a CPU core | very short to long | ~none | frequent | none (it's just 1 node) | no | binary for the particular CPU
CPU JIT vectorization on a CPU core | short to long | low | moderate to frequent | none (it's just 1 node) | no | app source code + VM binary for the particular CPU
CPU cores utilization (thread, process, ...) | long | low to moderate | moderate | lower to none (it's just 1 node) | yes (except: pure objects as actors or alike) | binary for the particular CPU
accelerators (GPU, FPGA, ...) | long | low to moderate | moderate | lower to none (it's just 1 node) | yes (except: array/matrix/tensor/...-based languages) | binary for the particular CPU and accelerator (GPU, FPGA, ...)
supercomputer with NUMA topology | long | moderate | sporadic to moderate | moderate (hypercube/... is really fast) | yes (except: pure objects as actors, etc., array/matrix/tensor/...-based languages) | binary for the particular CPU and accelerator (GPU, FPGA, ...)
LAN cluster | long | moderate to high | sporadic to moderate | moderate to high (can be fast, but not always) | yes | binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...)
WAN cluster | long | high | sporadic | high (basically can't be that fast) | yes | binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...)
(in practice these might overlap and/or be used simultaneously)
P.S. The dynamic scaling over the different kinds of parallelism as outlined in the above table is called "fine-grained distributed computing". And if anything from the above proposal sounds crazy to you, I can assure you that the world doesn't sleep and there is at least one seamless (fully dynamic) solution offering first-class fine-grained distributed computing - Unison.
Other things to consider and/or track:
rand https://github.com/vlang/v/commit/a7c84834f4ba4948b8102a05b603bf8d51484760#r39632493
g_str_buf with parallel access
Wow, you should write a full master thesis around this topic one day ;)
For me I would prioritize this topic as follows:
Prio1: provide safe multi-threading/scheduler (no data races, locks, semaphores etc. -> ponylang.io)
Prio2: provide spawning and killing of new processes for SMP, where each can be multithreaded
Prio3: provide processes for non-CPU resources like MMX/SSE, GPU/CUDA, PhysX, ray tracing, you name it
Prio4: provide distributed computing (cluster nodes over networks) -> julialang.org (of course this is something one could code even himself having good standard libs for networking in V)
Wow, you should write a full master thesis around this topic one day ;)
I already did many years ago (wow, the time flies) - but on a (very) different topic :wink:.
Prio1: provide safe multi-threading/scheduler (no data races, locks, semaphores etc. -> ponylang.io)
I'd say let's keep things simple and actually rather provide no locks, no semaphores etc. to "force" programmers to stick with the built-in parallelism. Of course, there will always be the unsafe module providing raw locks etc., but that has nothing to do with parallelism in V.
Prio2: provide spawning and killing of new processes for SMP, where each can be multithreaded
There shall be just one construct - imagine go from golang, but nothing else. The point is to completely forget about SMP etc. - the programmer shall actually be strongly discouraged from thinking that the platform running her app will be capable of SMP, and instead encouraged to think about fully dynamic behavior.
Prio3: provide processes for non-CPU resources like MMX/SSE, GPU/CUDA, PhysX, ray tracing, you name it
There is no need for this - if one wanted to offload stuff, she would just use the unsafe or ffi module for interfacing with e.g. a CUDA library. In other words, instruction-level parallelism shall IMHO not have any explicit support in V (instruction-level parallelism is already supported in V through patterns, the same way as in C99+ to which V compiles - we shall though "nudge" the compiler to do more vectorization and maybe even use smart auto-annotation SW). If utterly needed, explicit instruction-level parallelism could be added the same way as C11/C++17/non-standard intrinsics (`__m128i`, `_mm_or_si128`, `_mm_cmpeq_epi8` and tens/hundreds of others).
Prio4: provide distributed computing (cluster nodes over networks) -> julialang.org (of course this is something one could code even himself having good standard libs for networking in V)
Actually rather not. There shall be no need for this - that's the responsibility of the programmer who will define her own cooperative/parallel scheduler (as I outlined in my first post). This might be a simple scheduler, but also - as you pointed out - something super complex adding tens/hundreds of new nodes, many new multiplexers and demultiplexers to the huge graph of go/v routines with queues scaling over heterogeneous clusters (for a moderately complex scheduler see internals of Python Dask). So this can't be part of V.
@dumblob I see the "go" keyword as a builtin for multi-threading only (lightweight processes) - V could however try to load-balance the threads automatically among available CPU cores?
Fully agree that higher layers of parallelism should be spun off into the standard libs (or even separate lib projects written in V)
I see the "go" keyword as a builtin for multi-threading only (lightweight processes)
My proposal is about changing it to a similar semantics as in go lang - i.e. the go keyword would spawn a go/v routine (aka CSP thread or green thread or fiber or you name it).
V could however try to loadbalance the threads automatically among available CPU cores?
If you refer to operating system threads, then there is no need for V to load-balance them, as the operating system does it. If you refer to go/v routines, then sure, V will need a scheduler which shall be aware of the number of physical CPU cores available (disregarding whether tiny ones or fast ones like e.g. in big.LITTLE), spawn this number of threads, and schedule go/v routines among them.
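In Go terms this is the familiar pattern below; note that runtime.NumCPU() counts logical CPUs and cannot see big.LITTLE asymmetry, which is exactly the limitation discussed further down:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// One OS thread per logical CPU; go routines are then
	// multiplexed onto those threads by the runtime scheduler.
	runtime.GOMAXPROCS(runtime.NumCPU())
	fmt.Println("scheduling on", runtime.NumCPU(), "logical CPUs")
}
```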
As @aprell pointed out in https://github.com/nim-lang/RFCs/issues/160#issuecomment-529201833 , there is a nice and very current material describing the golang scheduler. It's a must read for those trying to implement an efficient scheduler for the mix of parallelism & concurrency. Some downsides of the golang scheduler are the following:
1. it treats all operating system threads uniformly, assuming each thread runs on one physical CPU core (or "hyperthread" on hyperthreading CPUs), but this is not true on big.LITTLE or in cases of power management (see my comment https://github.com/nim-lang/RFCs/issues/160#issuecomment-529371631 ) - golang could solve this in the future, though, by moving running goroutines among threads (there are some tricky corners like thread-local storage, which would need to be atomically moved to the other thread as well, etc., but these are solvable)
2. there is no concept of "reentrant" goroutines in golang and thus it's impossible to spawn new instances (or kill running ones) and do the muxing/demuxing as outlined in my initial post above to accommodate for bottlenecks - this is not so quickly solved in golang compared to issue (1) above (as it would probably require extending golang) and I think this could be the major advantage of V compared to golang
Otherwise golang implements pretty closely what I described in my initial post above - so it would make sense to "port" all the parallelism & concurrency logic (which is not exactly trivial code) to V and start improving on top of it.
Thanks a lot @dumblob
This is the last unresolved thing in V for me, so all this info is very helpful.
I'll start working on this in early October, but if anyone is willing to do this earlier, they are more than welcome.
so it would make sense to "port" all the parallelism & concurrency logic to _V_ and start improving on top of it.
I hope you mean with:
parallelism => preemptive
&
concurrency => cooperative
multitasking.
For me these are two distinct use-cases where in the first one I rely on the OS to "load-balance" my threads among the available (by chance hyper-threaded) CPU cores, and in the second case I want control over which parts of code have higher priority to be executed (co-routines), even if I do not know a priori how many CPU cycles or how much wait-time this particular instruction block will use.
I hope one day we can state about V as follows (taken from the Pony-project):
"It’s data-race free. V doesn’t have locks or atomic operations or anything like that. Instead, the type system ensures at compile time that your concurrent program can never have data races. So you can write highly concurrent code and never get it wrong.
It’s deadlock free. This one is easy, because V has no locks at all! So they definitely don’t deadlock, because they don’t exist."
That's all from my side for now.
I hope you mean with:
parallelism => preemptive
&
concurrency => cooperative
multitasking.
Sure, I do.
For me these are two distinct use-cases where in the first one I rely on the OS to "load-balance" my threads among the available (by chance hyper-threaded) CPU cores, and in the second case I want control over which parts of code have higher priority to be executed (co-routines), even if I do not know a priori how many CPU cycles or how much wait-time this particular instruction block will use.
I don't understand why these 2 should be 2 different use cases. For me it's definitely just one use case - I want things to run purely in parallel (concurrency won't help me at all). But because parallelism is super expensive (memory consumption as well as context switching of operating system threads doesn't scale into the millions), I have to combine it with concurrency, leveraging all the extra knowledge I have from the programmer (unlike an operating system kernel, which doesn't know what the programmer intended and thus has to waste memory and context switches on fully featured operating system threads) to make go/v routines (i.e. concurrent threads) behave like truly parallel operating system threads. This combines the preemptiveness of operating system threads with the memory and context-switching efficiency of go/v routines.
I really don't see any point in having concurrency if we have true parallelism with the same amount of resources and comfort.
Mind, that my proposal gives one full power over scheduling by exposing the scheduler to the programmer.
"It’s data-race free. V doesn’t have locks or atomic operations or anything like that. Instead, the type system ensures at compile time that your concurrent program can never have data races. So you can write highly concurrent code and never get it wrong.
I'm pretty sure this will not be possible (see the latest research). At least not with the current V semantics. Such guarantees can be achieved by constraining the programmer (see e.g. Rust, as already mentioned many times here in V's discussions) or by making the language so incapable that it'll get very impractical :cry:. What can, though, be achieved in V is to make races highly improbable (the syntax & semantics will guide the programmer into writing more or less race-free code) and on top of that provide e.g. a race analyzer as golang does (and run it automatically as part of compilation?). This is the difference between a guarantee and a high probability, and it's enormous.
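For illustration, the kind of bug such an analyzer catches - Go's race detector flags this program when run with `go run -race race.go`:

```go
package main

import "fmt"

func main() {
	n := 0
	done := make(chan struct{})
	go func() {
		n++ // unsynchronized write ...
		close(done)
	}()
	n++ // ... racing with this one
	<-done
	fmt.Println(n)
}
```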
Besides, if one utterly needs plain concurrency, it's easy to do it yourself (there are many implementations on GitHub, as it's really an easy high-school exercise). Even on embedded systems I always try to first use one big event loop, and if that's not enough, I resort directly to interrupts (which are costly, but provide true preemptiveness) instead of using any coroutines, because they won't bring me anything more than an event loop and they're about as complex as true interrupts with preemptiveness.
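A minimal sketch of the "one big event loop" idea in Go (the event sources here are hypothetical stand-ins): a single thread of control multiplexes all sources, so no locks and no coroutines are needed:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Two hypothetical event sources feeding one big loop.
	clicks := make(chan string)
	packets := make(chan int)
	go func() { clicks <- "button" }()
	go func() { packets <- 42 }()

	// The event loop: one thread of control multiplexing all sources;
	// every handler runs to completion before the next event.
	for handled := 0; handled < 2; {
		select {
		case c := <-clicks:
			fmt.Println("ui event:", c)
			handled++
		case p := <-packets:
			fmt.Println("packet:", p)
			handled++
		case <-time.After(time.Second):
			fmt.Println("idle tick")
		}
	}
}
```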
upvote for parallelism + SIMD vectorization as requested here https://github.com/vlang/v/issues/1079
@medvednikov some implementation ideas:
there is a good-looking attempt to improve on a single-producer-single-consumer ring buffer by avoiding the extra empty slot in the buffer, but especially by avoiding any modulo (which is a very slow instruction) - see Memory and Speed efficient ring-buffer design in https://github.com/nim-lang/RFCs/issues/160#issuecomment-549199561 (a minimal sketch of the idea follows after this list);
the (hopefully) final implementation of a lock-less unbounded Multiple Producer Single Consumer queue (ring buffer) is to be found in https://github.com/mratsim/weave/pull/21
follow the yet-incomplete design & implementation of a modern thread-safe object pool in https://github.com/nim-lang/RFCs/issues/160#issuecomment-549199561 (the objectives are really challenging and if delivered without any side effects, it'll be a leap forward)
get inspired by ideas as well as by the actual implementation of mimalloc, the most stable and efficient (and probably fastest as of now) os-threads-friendly malloc() replacement which doesn't sacrifice single-thread performance
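A minimal single-threaded sketch of the ring-buffer idea from the first bullet above: monotonically increasing indices plus a power-of-two mask remove both the sentinel slot and the modulo. A real cross-thread SPSC version would additionally need atomic loads/stores on head and tail:

```go
package main

import "fmt"

// spscRing keeps every slot usable (no empty sentinel) and replaces
// the slow modulo with a mask, assuming a power-of-two capacity.
type spscRing struct {
	buf        []int
	mask       uint64
	head, tail uint64 // grow monotonically; wrap only via the mask
}

func newRing(sizePow2 uint64) *spscRing {
	return &spscRing{buf: make([]int, sizePow2), mask: sizePow2 - 1}
}

func (r *spscRing) push(v int) bool {
	if r.tail-r.head == uint64(len(r.buf)) { // full: all slots in use
		return false
	}
	r.buf[r.tail&r.mask] = v
	r.tail++
	return true
}

func (r *spscRing) pop() (int, bool) {
	if r.head == r.tail { // empty
		return 0, false
	}
	v := r.buf[r.head&r.mask]
	r.head++
	return v, true
}

func main() {
	r := newRing(8)
	for i := 0; i < 5; i++ {
		r.push(i)
	}
	for v, ok := r.pop(); ok; v, ok = r.pop() {
		fmt.Println(v)
	}
}
```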
@medvednikov a possible implementation of a highly efficient preemptive scheduling:
Note also that we can easily analyze at compile time the maximum stack size needed for each V parallel routine, so we don't need dynamic stacks (see also the cactus stack problem and an innovative solution https://github.com/mratsim/weave/pull/24 - worth reading!), but rather a fixed-size array (with a size known at compile time, different for each V parallel routine). This saves a lot of trouble and allows for the best possible performance (yeah, for indirect & direct recursion, for FFI, and for unsafe stuff we actually do need dynamic stacks, which means tricks like the jump used in Go, but that shouldn't be very common in performance-critical places, so we can afford some performance penalty in such cases).
There seems to be an implementation effort having very similar goals (maybe except for the reentrancy which I'd definitely like to see in V as mentioned above) and with extensive survey of existing implementations (Go is mentioned as well as an interesting example).
Thanks @dumblob I'll check these out.
@dumblob Could you please take a look at Chapel a parallel programming language (https://chapel-lang.org/ https://github.com/chapel-lang/ ) designed by CRAY engineers? Maybe you can borrow some design ideas from Chapel.
@cristian-ilies-vasile I already did that. Chapel has a large set of very nice features (e.g. regions, multi-paradigm, etc.), but first they don't seem to match V goals (especially the "there is only one way to do a thing") and second they're not generic enough (read: "Chapel is a pile of best practices & optimized principles for a - moderately large - selected finite number of use cases"). Due to this, Chapel programs do squeeze the maximum out of a given supercomputer, but can't be run efficiently enough on non-supercomputer hardware (which is the majority nowadays - see the table in the initial post above) as well as the hardware of tomorrow (yeah, stuff like several layers of remote memories with different access speeds instead of just local and remote, etc.). Chapel is also a quite complex language from my experience (despite the authors saying the opposite :wink:).
I have to admit I couldn't find in Chapel nearly anything useful for V (the above proposal is generic enough to support quite efficiently nearly any of the Chapel abstractions implemented later in specific V libraries - note especially the scheduler "callback").
If you have anything specific in mind, feel free to discuss it here though. I'll be more than happy to learn something or clarify or modify the proposal.
@dumblob Thank you for the nice words.
//1 - framework to eliminate bugs
On the concurrency front, my understanding is that more or less the Go green threads scheduler model (threads monitor) shall be used, with improvements. However, concurrency is hard.
According to the paper "Understanding Real-World Concurrency Bugs in Go" [1], it is easy to introduce concurrency bugs even using Go.
It would be great if the compiler or a separate tool could check for such bugs and produce clever, human-readable error messages.
//2 - Function as a Service at run-time
Quote
_To run foo() concurrently, just call it with go foo()_
On top of this functionality, can we have `vm foo()`? The V compiler would then create some sort of micro virtual machine/container, so the OS, and in turn the hypervisor, can move foo() to a different machine and execute the function there.
That microVM could be wrapped around the WASM binary format, so the processor of the remote server does not matter.
[1] https://songlh.github.io/paper/go-study.pdf
Thank you.
Regards,
Cristian.
@cristian-ilies-vasile YES to //1 but NOOOO to //2 - V shall have no "runtime" or even VM in my opinion - you are free to build your own VM using V though
@gslicer I feel that you did not get it. It's not like Java and the JVM; my suggestion is related to one specific case. The coder could decide that a long-running function would run automatically on a different server/environment. The V ecosystem should provide tools and mechanisms for such an approach in the year 2020.
Think about C, it is still used after 50 years... so I presume that V's lifetime will be the same. In this case let's prepare the foundations to sustain this type of weight.
Regards,
Cristian.
//1...
Well, my proposal diverges from Go in that the routines won't be cooperatively concurrent but preemptively parallel. This is more in line with the "standard threading" experience one gets on Windows or Unix (pthreads and alike) than with Go. So one can reuse the knowledge, tooling, etc. to reason about the program more easily than one can about Go programs.
My proposal goes further though - I envision built-in support (e.g. a dedicated syntax or even a new keyword) for "reentrancy" (a scheduler can spawn an arbitrary number of instances of a go routine at any time and can kill and remove an arbitrary number of them, also at any time). Second, the programmer shall get full power over scheduling (this might be achieved by many means, so I don't have any particular opinion on implementation - imagine e.g. the most error-prone but widely known: a callback).
//2
This sounds somewhat similar to the reentrancy & scheduler "callback" I'm proposing, but my proposal is about the primitives needed (the language syntax & semantics) rather than about any built-in "high level" stuff like VMs. My goal is to offer the very minimum set of primitives to make it possible to efficiently implement something like you described in custom libraries, but not as part of V.
@dumblob Related to
//1
could you please add a protection mechanism like Time To Live (TTL) when a go routine is called, to set a maximum lifespan of that green thread?
If the TTL is 0 (zero) it is ignored; otherwise, for a strictly positive integer value, V's green thread monitor/scheduler must return an error if that value is exceeded.
@cristian-ilies-vasile sounds like something easily doable in the app itself - in other words, I don't see any obstacles in my proposal which would prevent implementing such TTL functionality efficiently in your app without any cooperation from V's side. It also doesn't sound like default behavior of V's parallelism, so I currently don't see any reason to add explicit support directly in V.
Or am I missing something? Please elaborate :wink:.
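For completeness, a minimal app-level TTL sketch in Go (a V version would look analogous once channels and spawning are in place):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWithTTL runs work in its own routine and reports an error if it
// outlives its TTL - purely app-level, no language support needed.
func runWithTTL(ttl time.Duration, work func(ctx context.Context)) error {
	ctx, cancel := context.WithTimeout(context.Background(), ttl)
	defer cancel()
	done := make(chan struct{})
	go func() { work(ctx); close(done) }()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("TTL of %v exceeded", ttl)
	}
}

func main() {
	err := runWithTTL(50*time.Millisecond, func(ctx context.Context) {
		select {
		case <-time.After(time.Second): // deliberately too slow
		case <-ctx.Done(): // cooperative cancellation
		}
	})
	fmt.Println(err) // TTL of 50ms exceeded
}
```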
@dumblob
It's up to you; however, if we follow your school of thinking ("easily doable in the app itself"), the green thread / fiber logic should be placed at the app level and not as part and parcel of the V runtime, because we shall operate only with low-level primitives.
Indeed, one can implement Timing Wheels logic on their own, but wouldn't it be more helpful for coders to use a rock-solid language abstraction out of the box with TTL included?
Regards,
Cristian.
P.S.
For the goroutines scheduler/monitor I feel that this blog [1] is worth reading; those data structures (Range-Encoded Bitmaps, Range-Encoded Bit-Slice Indexes) are implemented in Go [2]
[1] https://www.pilosa.com/blog/range-encoded-bitmaps/
[2] https://github.com/pilosa/pilosa
@cristian-ilies-vasile thanks for the examples. I have to admit I'm a bit lost here. My proposal in this thread is about a hybrid approach (basically I'm reusing the old concept of user-level preemptive threads - i.e. green threads / fibers, but non-blocking - i.e. working hand in hand with the os-level scheduler and timers to wake them up regularly). So I'm not sure how your concern about placing green threads / fibers on top of my proposal relates to this discussion.
Regarding the Timing Wheel stuff, it's primarily about the underlying operating system interface (especially timer capabilities) and has nothing to do with this discussion about parallelism, in my opinion. Or maybe I'm missing something.
For the goroutines scheduler/monitor I feel that this blog [1] is worth reading; those data structures (Range-Encoded Bitmaps, Range-Encoded Bit-Slice Indexes) are implemented in Go [2]
I'm not aware that Go used this data structure in their go routine implementation. Are you speaking about the official go compiler, or about gccgo (GCC-based), or something else? How does one use range-encoded bitmaps and/or range-encoded bit-slice indexes for a go routine implementation? The blog post you've linked is also 2 years older than the material describing the whole golang scheduler linked above.
It seems to me we're talking at cross purposes. Could you elaborate on how TTL relates to general scheduling?
If you're talking just about parallel computation in LAN clusters and WAN clusters (as mentioned in the initial description above), then I could understand the need of some TTL to prevent routing issues, but that's all supported in my proposal through the "callback" to your own parallel scheduler which'll be aware of the specific needs of your app and will therefore support TTL handling etc.
Of course, the more that is at hand by default, the better for programmers, but TTL handling is so extremely specific (it relates just to a subset of use cases from the last 2 rows of the table in the initial post in this thread) that it should be totally fine if it's part of network libraries etc., but not part of V (by V is usually meant V and its standard library, which is to be very small). Among the main differences between V and its standard library versus additional libs (e.g. network libs) are the pace of development, the guarantees of backwards/forwards compatibility, and the guarantees regarding multiplatform functionality (both operating systems as well as the hardware itself).
Job Queue Scheduling with Bit Vector Indexes
(http://bitmagic.io/bm-optque.html)
@cristian-ilies-vasile oh, job scheduling. That's a bit different topic - I'm not proposing any jobs here, but rather long/infinitely running go routines (imagine each go routine being an endless loop - analogous to a generic stream processor) having at least one input and zero or more outputs. This thread is then about how the "work agents" (as they're called in the paper you've linked; the stream processors, as I call them here) should be implemented, but not about how the work should be scheduled among them (in my proposal I'm deliberately leaving this up to the operating system by default, with the possibility to define your very own scheduler).
There could definitely be a library for scheduling, as I mentioned above, if e.g. the operating system scheduler won't be sufficient, but that's currently not the target of my proposal. Feel free to make a separate proposal for scheduling algorithms in this issue tracker :wink:.
If anyone wants to read the design document for coroutines in Kotlin you can find it here: https://github.com/Kotlin/KEEP/blob/master/proposals/coroutines.md
I've been using https://github.com/dotnet/orleans which implements the virtual actor model from Microsoft, and I think it's a really awesome concept. It's used in the Halo games' backend services. Actors (grains in Orleans) support things like reentrancy and oneway. You can mark methods or a whole grain reentrant. For example, you can mark state-reading methods as reentrant so they can read grain state while the grain is saving something to the database. Oneway works so that the return from the method is not waited on and the called method returns immediately.
This is all possible because of a custom scheduler and async & await. I think https://github.com/AsynkronIT/protoactor-go is trying to implement something similar. But I'm not sure how well that will fit Go's concurrency model.
Whatever concurrency model is chosen for V, I hope it's something that can implement the virtual actor model.
Whatever concurrency model is chosen for V, I hope it's something that can implement the virtual actor model.
Coroutines are more flexible than the actor-based model; you can always use the coroutine model like actors if you use one coroutine per resource - no need to fetter yourself.
Actors, coroutines, etc. of your choice - all of that can be implemented/simulated using the "primitives" I outlined in my proposal.
Btw. async/await or alike (i.e. generally the asynchronous model) is by far not the best solution - it's just fashionable nowadays, but it doesn't solve fundamental problems any better than other models (actually rather vice versa, as there are things which can't be solved asynchronously, and in addition it gets very error-prone and verbose if applied everywhere).
@dumblob Would writing V programs that run on https://github.com/bastion-rs/bastion be good enough? (Very interesting project!) I'm curious how you're thinking about what must be a language feature and what could be part of an underlying platform/distributed runtime/etc.
Would writing V programs that run on https://github.com/bastion-rs/bastion be good enough?
Didn't read the Rust sources (just skimmed through some examples), but it all sounds like yes, it would be good enough (though I'd need to review performance). On the other hand, we can definitely stay smaller in V - the Bastion runtime is not just "mutually compatible & performant tools/language features to easily build such a system", but rather an "existing ecosystem which you can readily deploy and use".
what must be a language feature
What shouldn't be a language feature:
What might be a language feature (i.e. I'm not against having it built in V, but strictly speaking it's not necessary and could violate the V mantra of do one thing exactly one way):
How about async/await comes from rust?
https://blog.rust-lang.org/2019/11/07/Async-await-stable.html
@lygstate please see my comment https://github.com/vlang/v/issues/1868#issuecomment-586968033 above
@dumblob Still don't get the idea; async await may not be the best for performance, but it's easy to understand and easy to code.
but easy to understand and easy to code
@lygstate I'm afraid I can't share this view - have you read the linked article about "everything async"? That's where this coding style leads :cry:.
Another issue is that async/await has nothing to do with parallelism (async/await is just a concurrency abstraction), while parallelism is the inevitable base building block for using more than one physical CPU core. See the initial post in this thread, which explicitly says that both are needed, with emphasis on parallelism.
Second, you're free to create async/await easily in V itself in a highly performant manner and in a very syntactically pleasant way, even without any built-in support (btw. there is already async/await available - see e.g. how V UI is built around an async/await scheduler).
@dumblob Yeap, I am talking about single-threaded async/await, not about parallelism.
Is there anything better than async/await for single-threaded coroutines?
Is there anything better than async/await for single-threaded coroutines?
Of course there is - writing your own abstraction of an async function which uses an explicit, fully dynamic stack and not an implicit fixed-size one (Go has two fixed sizes, but that still can't compete with explicit management). This is then more performant and doesn't waste a single byte of memory, though manual "stack" management is slightly more verbose and the programmer might sometimes feel like a living garbage collector :wink:. This is actually what V UI kind of does.
If you want to see it more explicitly, take a look e.g. at simulation C++ libraries like https://www.fit.vutbr.cz/~peringer/SIMLIB/ (class Store, which uses a priority Queue as its "stack"). Simulation libraries frequently implement something like a "functional object" with their own stacks in quite performant ways.
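A tiny Go illustration of the explicit-stack idea (not V, and not the actual V UI code - just a sketch): each "async function" is an entry in a priority queue whose entire stack is the state it carries, and the loop resumes whichever entry is due next:

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// task is a resumable "async function": its stack is the explicit
// state captured in step, not an implicit goroutine stack.
type task struct {
	wake time.Time
	step func() *task // returns the follow-up task, or nil when done
}

type taskHeap []*task

func (h taskHeap) Len() int            { return len(h) }
func (h taskHeap) Less(i, j int) bool  { return h[i].wake.Before(h[j].wake) }
func (h taskHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x interface{}) { *h = append(*h, x.(*task)) }
func (h *taskHeap) Pop() interface{} {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	return t
}

// countdown's whole explicit "stack" is the single counter i.
func countdown(i int) *task {
	if i == 0 {
		return nil
	}
	return &task{
		wake: time.Now().Add(100 * time.Millisecond),
		step: func() *task { fmt.Println("tick", i); return countdown(i - 1) },
	}
}

func main() {
	h := &taskHeap{countdown(3)}
	heap.Init(h)
	for h.Len() > 0 { // single-threaded "event loop"
		t := heap.Pop(h).(*task)
		time.Sleep(time.Until(t.wake))
		if next := t.step(); next != nil {
			heap.Push(h, next)
		}
	}
}
```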
Is there any demo of that?
For example, how would one do something like this:
```
for (int i = 0; i < 10; i += 1)
{
    await sleep(1000);
    await doIoWork();
}
```
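In the model proposed here (as in Go), that snippet needs no async/await at all - the same logic is plain synchronous code inside a spawned routine, and the runtime parks the routine during the blocking calls (`doIoWork` is a hypothetical stand-in mirroring the snippet above):

```go
package main

import (
	"fmt"
	"time"
)

func doIoWork() { fmt.Println("io done") }

func main() {
	done := make(chan struct{})
	go func() { // the "async function" is just a spawned routine ...
		for i := 0; i < 10; i++ {
			time.Sleep(1000 * time.Millisecond) // ... and every await is a plain blocking call
			doIoWork()
		}
		close(done)
	}()
	<-done
}
```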
@dumblob well, performance is important, but in production, productivity and correctness are more important; we can sacrifice 20% performance for productivity and correctness.
@lygstate You should move this to the official V language chat on Discord, because every subscriber to this issue gets a notification when you post here. These posts start to feel to me like beating a dead horse. There won't be async and await in V; its concurrency model has already been decided by the V language author to be similar to the Go language's, https://github.com/vlang/v/blob/master/doc/docs.md#concurrency .
I'm a C# programmer by profession, and it's the language that mainstreamed async & await; this leads to duplicate implementations in system libraries - for example, in .NET (C#) libraries you have a lot of functions that have another version with an Async ending, like System.IO.File.ReadAllLines and System.IO.File.ReadAllLinesAsync. When using the Go concurrency model, from the programmer's perspective you always write synchronous code and the runtime handles concurrency underneath, so you don't have this code duplication. Async and await are also like a virus: they start to infect your code from the top down.
Entry link for Discord chat channels for V development & more is https://discord.gg/vlang
@dumblob
What do you think about starting a basic implementation?
Yes, I think it'd be good to start with a basic implementation to have a fully working one by 0.3.
I had written some (undoubtedly outdated by now) V code to provide Go-like channels and select functionality, put on hiatus until generic structs arrive. If we can have generic structs, Go-like inter-thread communication can follow, and hopefully Rust-level safety next.
@cristian-ilies-vasile thanks for the ping (I'm too busy to respond quickly, so please bear with me in these times)
@ylluminate @cristian-ilies-vasile @lygstate I saw your comments starting with https://github.com/vlang/v/issues/3814#issuecomment-633721380 and I'm quite puzzled.
Everything mentioned there is purely non-preemptive, though the requirements most of us seemed to agree on have preemptiveness as one of the major goals. Writing something non-preemptive would be a huge step back, even compared to the current state of V.
That's why I also linked the preemptive implementation of extremely lightweight "user-space threads" above (see also https://github.com/krinkinmu/userspace-threads/issues/1 ).
What do you all think?
Just to clarify - I'm not against having "safe points" concurrency on top of a preemptive backend implementation, but I think we first need to implement the preemptive backend and then improve e.g. the scenario when the task is too short to consume the given time slice: instead of interrupting - which is costly - just switch to the next cooperative task, and so on until the time slice is up (this would basically mean cooperative scheduling with real-time guarantees).
@dumblob I mention purely non-preemptive because of also working in bare-metal and Web environments. I mean we need an infrastructure that works in both preemptive and non-preemptive environments - the same set of APIs for both targets.
@dumblob
Hello, thanks for comment.
We shall stick with the initial goal - implement the preemptive backend and then improve - as summarized in V Concurrency Considerations https://github.com/vlang/v/issues/3814
lygstate just put a lot of links to concurrency resources.
So, we are in sync with this task.
One more thing: do you prefer to communicate through a different channel on top of GitHub comments?
Thank you.
@krolaw
I did send a ping on discord yesterday on this matter.
@cristian-ilies-vasile thanks for the clarification. As for communication, I'll probably stay just here (no time to hang out anywhere else). I think you'll update these GitHub issues anyway if there is anything important to discuss/mention, to make it visible to everybody and not just Discord users.
@dumblob
Hello,
there are 4 persons involved in this task on the hard side, i.e. writing code:
you, @helto4real, @krolaw and @UweKrueger - and in order to make progress I feel that we shall have a proper understanding of:
//1 start drafting a technical design based on available ideas and agreements.
Doing so, we can split the main task into small tasks and see who can work on what and how fast.
//2 Another aspect is related to major milestones - what do you think is mandatory for version 0.3 (the short-term vision) and how can it evolve smoothly up to 1.0, the stable version?
//3 We need to make this concurrency interoperable at the syntactic and semantic level with the V language -> we could need to implement new keywords / concepts, which in turn will trigger new PRs. Keep in mind that only 3-4 persons are working on the scanner-parser-checker - C code generator flow. We also need to clarify what reference capabilities V will have (if any) and how light green threads should use these capabilities.
//4 open points: example
_Exactly that's the reason why I'm curious how we would implement the long-running "background" tasks using the nursery concept in V, as there are currently no means in V guaranteeing the existence of a proper defer. And having no guarantees totally defeats the purpose of the whole nursery concept wink._
Thank you.
I think you have a valid point here, Cristian. We should start drafting a technical design and a roadmap/milestones so we can work on the same goals.
Thanks
I know you are all fairly far along in your decisions, and this is very far out of fashion, but I don't see where anyone else has mentioned the Amiga. Yes, it had some flaws, but at a time when every other desktop OS was struggling with even getting a halfway working cooperative multitasking system going, the Amiga had pre-emptive multitasking that (for the most part) just plain worked. You didn't have to do anything, other than run your programs.
https://en.wikipedia.org/wiki/Exec_(Amiga)
And it did it in only 13KB of memory, so definitely no slouch.
That's a really interesting thought @JalonSolov. I'd really like to see comprehensive research in this area in terms of how Amiga not only handled this, but how it might have been set apart from OS/2 and other approaches. I have to think that some of the other research that's been discussed might have considered this, but yeah... not sure.
@JalonSolov @ylluminate yes, that system was cool (btw. designed by Carl Sassenrath, who is also the father of the Rebol/Red language and the VID dialect, as discussed in the V UI threads), but it has some flaws which are not acceptable nowadays (it's inherently unsafe and doesn't provide any guarantees).
Don't worry, today's designs are really better in many ways :wink:.
@cristian-ilies-vasile yep, agreed. Though I won't have the capacity to manage it all - I'd rather stay in the role of a reviewer (and even then sometimes pretty delayed - considering the fast pace of V development).
Experimental locked concurrency has been merged:
https://github.com/vlang/v/pull/5637
I want to simplify it by removing the shared keyword, making V detect which variables are locked and need to be marked as shared.
https://github.com/prismlab/parallel-programming-in-multicore-ocaml/tree/draft
Hm, it's very similar in architecture to Weave as mentioned above (and a bit to zero functional), but less powerful, less performant, and unfortunately with more WTF surprises. On the positive side, it's a good showcase of how to approach it in a functional way.
@cristian-ilies-vasile @medvednikov @helto4real @krolaw @UweKrueger looking at the PR I have the following observations:
V presents itself as a simple & clear language, but all these new keywords with fine-grained differences quite increase the complexity of code comprehension, and I'm still not convinced they're all needed - read further.
There is IMHO a strong overlap between what can be achieved with just channels and what can be achieved with such "shared" constructs/variables, and it seems these new keywords might not provide that much benefit in the end.
Actually, channels should be encouraged and anything else strongly discouraged, as anything shared leads to locks, and locks translate to stalls. The programmer should not think of concurrency in terms of anything shared, but rather in terms of a "laystall", i.e. "one-way pipes" with buffered data recently processed by one processing unit instance and waiting there to be processed by another processing unit instance in one batch, while allowing the first processing unit instance to do some other work in the meantime.
Anything shared (e.g. a variable) is to a go routine instance basically a channel with capacity 0 to be both read & written by either of the go routine instances. The difference from a plain shared variable with locks around it is that channels do not hide the stop-the-world behavior, but make it explicit. That's exactly why Go popularized this approach.
Second, any channel with capacity 0 can be efficiently compiled to a corresponding low-level representation (a bit similar to the table) based on an analysis of the channel's usage (e.g. if there are only top() operations on the channel, then the content is immutable; if the item type is atomic on the given HW platform and at the same time only one process instance writes (push()) to it and arbitrary processes read (top()) from it, then it can be viewed as a volatile memory place with no locks needed; if it's a more complicated scenario, then a standard SPSC/SPMC channel shall be used to avoid locks at all costs).
Third, if someone really, really wanted to have a "shared" variable, then why not use the Erlang approach? Namely, having a dedicated go routine just for this one "shared" variable and communicating with this routine instance only through channel(s) - see the sketch after this list. This makes everything explicit in code, it's clean, and it can still be compiled to anything the compiler wants (e.g. highly performant optimized assembly).
There is an overlap with things like volatile etc. (generally low-level close-to-metal constructs) as already discussed. This stuff should be designed together with concurrency to make it more coherent than it seems now.
Thoughts?
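The Erlang-style alternative from the third point, sketched in Go (names are illustrative): the routine owns the "shared" counter and serves reads and writes purely over channels:

```go
package main

import "fmt"

// counter owns the "shared" variable; all access goes through
// channels, so there are no locks and no hidden stop-the-world.
func counter(inc <-chan int, get chan<- int) {
	n := 0
	for {
		select {
		case d, ok := <-inc:
			if !ok {
				return // owner shuts down when writers are done
			}
			n += d
		case get <- n: // serve the current value to any reader
		}
	}
}

func main() {
	inc := make(chan int)
	get := make(chan int)
	go counter(inc, get)
	for i := 0; i < 5; i++ {
		inc <- 1
	}
	fmt.Println(<-get) // 5
	close(inc)
}
```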
One of the things I have really liked about Go for the past several years is how channels "just work". No complicated locking/unlocking, no extra keywords to declare something shared or not.
Having the extra keywords available doesn't mean they have to be used, but it does give the _appearance_ of V being a more complicated language than the simple one it claims to be. Many will look at it as "Oh, they have all these keywords, you must have to use them. Guess this is harder than I thought."
@dumblob I think you have a valid point. I never thought of the fact that a single capacity-1 channel could be optimized the same way we would optimize lock constructs. And we get "one way of doing things". This is a tough one. I respect the hard work by @UweKrueger to make this work. So I am cool with keeping them around while we continue to explore the channels and how we can optimize them with compiler magic depending on depth, number of consumers/producers, etc.
I think Go's zero-length channels and buffered channels keep things really simple. I'm still waiting on generic structs to finish my basic but fully capable channels/selects implementation. Then hopefully it's just a case of working out the thread-sharing rules to have something safe and functional. The rest then just becomes implementation improvements.
@dumblob and all
I agree with you that channels could be a better alternative to a maze of locks; however, at this point in time we do not have too much code covering concurrency functionality. The lock implementation could be useful in the future - in fact, it did not ask for food.
I shall start writing a technical design in the beginning of August.
@krolaw generic structs are here: check out tests/generics_test.v
Are you experiencing some issues with them?
@dumblob
hello, just a set of questions in no particular order.
//1 for switching between Light Green Threads I think that we shall use the C setjmp() / longjmp() trick - it will give us a dependency on the C backend language, but no issues with CPU model / manufacturer.
//2 I feel that we shall partition the Light Green Threads into 2 main types: a. the CPU bound and b. the I/O bound, with different time slices and maybe more dynamic parameters. The I/O ones could/would block, but on CPU bound tasks we shall not have this issue - a different approach and handling.
//3 each Light Green Thread will have a stack, and the size of the stack will be computed by the compiler at compile time?
//4 are you aware of any tools able to emulate a large number of CPUs? I am asking for testing workflows.
//5 it makes sense to add helper functions to our runtime in order to improve end-user concurrency debugging - see my recommendations to export to .dot grammar or use railway diagrams.
Thank you.
@dumblob
//6 will the Monitor / Scheduler runtime be a. allocated one-to-one for exactly the number of cores on that machine, or b. one scheduler in charge of muxing/demuxing all Light Green Threads, aka an M:N architecture? In case of a. we would avoid a single point of failure - if something goes wrong, the N-1 cores will still work. In 2-3 years, 80..128 cores per CPU will be the new standard.
Thank you.
@medvednikov It looks like I could do with generics on interfaces too:
```
FAIL 30.273 ms channel_test.v
sendrec.v:15:24: error: unexpected `<`, expecting `{`
   15 | interface ClaimReceiver<T> {
      |                        ^
   16 |     claim() bool
   17 |     receive(T)
```
Making a generic field optional also fails:
```
FAIL 42.926 ms channel_test.v
select.v:100:8: error: unexpected `?`, expecting `name`
   98 |
   99 | struct SelectReceiver<T> {
  100 |     do fn(?T)
      |        ^
  101 | mut:
  102 |     Select
```
As does composing multiple structs into one without names (i.e. Select above). To be fair, the composition would only make the code cleaner; it's not actually necessary...
@dumblob
Hello,
// 7 - scheduling strategy and tactical approaches.
I read about a product called THREAD MANAGER five years ago; the application claims up to a 270% improvement in system performance just by intelligently pinning software threads to CPU cores.
I remember from working with matrices that one should keep each core's L2 cache full of data and apply code to data already moved into the L2 & L1 caches.
```c
/* speed 1x: the inner loop walks j, so v[j][i] jumps a whole row
   per iteration - cache-hostile */
for (i = 0; i < LEN; i++)
    for (j = 0; j < N; j++)
        acc[i] += v[j][i] * v[j][i];

/* speed 4..5x: swap for(i=...) with for(j=...) - the inner loop now
   walks i, so acc[i] and v[j][i] are accessed contiguously */
for (j = 0; j < N; j++)
    for (i = 0; i < LEN; i++)
        acc[i] += v[j][i] * v[j][i];
```
I think that in the same way V's scheduler should mux/demux Light Green Threads on cores, accounting first for the ones whose data is already available in the L3/L2 cache.
The ratio of memory access latency to L2 cache latency is ~25x -> it is extremely expensive to invalidate core caches.
Thank you.
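For reference, a minimal Linux-only Go sketch of such pinning (it needs the golang.org/x/sys/unix package; a V runtime would call the equivalent syscalls directly):

```go
//go:build linux

package main

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// pinToCore wires the calling goroutine to one OS thread and that
// thread to one core, so its working set stays in that core's L1/L2.
func pinToCore(core int) error {
	runtime.LockOSThread() // goroutine <-> thread, permanently
	var set unix.CPUSet
	set.Zero()
	set.Set(core)
	return unix.SchedSetaffinity(0, &set) // 0 = the current thread
}

func main() {
	if err := pinToCore(0); err != nil {
		panic(err)
	}
	// hot, cache-resident work would run here
}
```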
Hello,
I was still thinking about partitioning the light green threads in 2 distinct ways:
//1 all CPU bound user space green threads will be muxed/demuxed on dedicated kernel threads, which in turn will be pinned to the first core
//2 all I/O bound user space green threads will be muxed/demuxed on dedicated kernel threads, which in turn will be pinned to the same first core
```
DIAGRAM: CPU - KERNEL THREADS - USER SPACE PREEMPTIVE LIGHT GREEN THREADS
=========================================================================
<-------------> WORK STEALING FLOW:
                reschedule a light green thread (user space) on a
                different kernel thread, but within the same logical
                partition space; even kernel threads only for CPU bound
                tasks, odd kernel threads only for I/O bound tasks

+==========+          +---------------------+         :*********************:
|          |<-------->| KERNEL THREAD#0 CPU |<------->: CPU BOUND           :<------------->+
|          |          +---------------------+         : LIGHT GREEN THREADS :               |
| HARDWARE |                                          :*********************:               |
|  #CORE0  |                                                                                |
|          |          +---------------------+         :*********************:               |
|          |<-------->| KERNEL THREAD#1 I/O |<------->: I/O BOUND           :<----->+       |
+==========+          +---------------------+         : LIGHT GREEN THREADS :       |       |
                                                      :*********************:       |       |
                                                                                    |       |
+==========+          +---------------------+         :*********************:       |       |
|          |<-------->| KERNEL THREAD#2 CPU |<------->: CPU BOUND           :       |       |
|          |          +---------------------+         : LIGHT GREEN THREADS :<------|------>+
| HARDWARE |                                          :*********************:       |
|  #CORE1  |                                                                        |
|          |          +---------------------+         :*********************:       |
|          |<-------->| KERNEL THREAD#3 I/O |<------->: I/O BOUND           :       |
+==========+          +---------------------+         : LIGHT GREEN THREADS :<----->+
                                                      :*********************:
```
Does anyone know if there is such a user thread implementation that uses 2 pinned kernel threads per core and schedules them accordingly?
Thoughts on the above approach?
@cristian-ilies-vasile I'm super busy nowadays, but I have all your questions in mind. Without looking further, I don't perceive the difference between CPU bound and I/O bound work that strongly, so I'm not sure how such a concept would work in practice. I need to read your materials first and put some more thought into it. Thanks for your patience.
Keep in mind though, that even if the current design is not perfect, it should definitely stay "constrained enough" to be able to extend the language later (e.g. if it proves useful to differentiate between CPU bound and I/O bound user threads in the future, it should be easily possible to implement that seamlessly while maintaining full backward compatibility).
The idea of splitting into 2 different thread types is based on the fact that disk / network access generally blocks kernel threads. We will have different parameters for the 2 types of user space threads, under the same scheduler. The work-stealing algorithm would not interfere across the different user space thread types; there will be different queues for each type. CPU bound user space threads will be muxed/demuxed only on the dedicated CPU kernel threads.
Now, my question is how will the scheduler/thread monitor identify which function should be scheduled as CPU bound versus I/O bound?
Could it be something like this?
```v
go_cpu foo()
go_io  bar()
```
Another proposal from Discord was to have something like this, without any concurrent function call:
```v
{
    code
    here
}
```
Indeed, backward compatibility is paramount; this is the reason we need a good design.
I think we should keep the syntax simple for the general case, with I/O bound work as the default. This is just the plain:
```v
go i_do_most_io()
...
fn i_do_most_io() {
}
```
For the CPU bound threads we could use attributes. This will also keep the language backwards compatible with other threading properties in the future, like scheduling optimizations and threading models.
```v
go i_need_processing_power()
...
[threading_cpu_bound]
fn i_need_processing_power() {
}
```
The problem with having 2 (or more) configurations is that it's very difficult in practice to separate the app into such structures - so in the end we might end up with CPU bound threads everywhere :cry: (as the "safer" of the two - thus they should also be the default IMHO). From my experience, CPU bound and I/O bound tasks interleave a lot.
I'll try to read all the materials in the upcoming 7 days - hope it's not too late.
Nice to have for debugging (-debug/-trace compiler flag)
THE Fuzzer of Fuzzy Latency for User Space Threads
V is compiled to C, so this trick could be done. The compiler will add random delays inside all go fn() bodies, partitioning user space threads into 2 (3..4) zones. For 2 zones, 1/2 of the functions (picked randomly) will be delayed at random using a small range of values; the remaining 1/2 will get a larger range, in order to introduce an artificial imbalance in the flow of go fn() / user space threads.
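What the inserted instrumentation might boil down to, sketched in Go (the zone ranges are made-up values):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// fuzzDelay is what the compiler-inserted prologue of each go fn()
// might do: zone 0 routines get a small random delay, zone 1 a much
// larger one, creating the artificial imbalance described above.
func fuzzDelay(zone int) {
	ranges := []time.Duration{100 * time.Microsecond, 10 * time.Millisecond}
	time.Sleep(time.Duration(rand.Int63n(int64(ranges[zone]))))
}

func main() {
	start := time.Now()
	fuzzDelay(rand.Intn(2)) // each routine is assigned a zone at random
	fmt.Println("resumed after", time.Since(start))
}
```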
THE Fuzzer of Fuzzy Latency for User Space Threads
This seems more needed for purely non-preemptive scheduling. Otherwise it seems to defeat the whole point of determinism, especially in latency-critical situations like games, etc. - an area where Go has had (and still has) historical issues. Btw. Weave doesn't have this problem, and if fully strict ordering of processing requests is needed, Weave even offers a "gather it all in one place" queue for such scenarios. I envision something similar in V.
Btw. I'm slowly reading all the newly referenced materials and will hopefully get to this THE Fuzzer of Fuzzy Latency for User Space Threads soon.
@dumblob
THE Fuzzer of Fuzzy Latency for User Space Threads is designed only for debugging purposes. I forgot to mention this detail.
// 8
Nice to have
A way to let the coder adjust 3..5 of the thread scheduler/monitor's runtime parameters.
At compile time we can use a special knob "-trace_threads", which will instruct the thread scheduler to track and collect the necessary internal parameters during "preflight" mode, when the application runs in a canary/preprod environment. When the application exits, those parameters will be written to a text file.
Based on the collected data and maybe a few formulas, the negative feedback loop can be closed and the scheduler's main runtime attributes can be fine-tuned.
I found a paper in which the influence of processor frequency on task completion times is studied.
In essence, in the real world the processor frequency moves up and down, generally between 1 GHz and 3 GHz.
Based on this, one can think of a mechanism for allocating threads, including "overloading" at the processor level and dispatching more user-space threads onto fast CPUs.
Paper: Fork/Wait and Multicore Frequency Scaling: a Generational Clash
Link: https://hal.inria.fr/hal-02349987v2/document
@dumblob
_dumblob commented on Nov 19, 2019_
_Note also, that we can easily analyze in compile time the maximum stack size needed for each V parallel routine_
Well, it's not so easy. If inside a go fn() there is a piece of code using recursive calls to functions, the static measurement will fail.
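For example (a Go sketch, but the same applies to V), the maximum stack depth here depends on a value known only at run time, so no static bound exists:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// depth's maximum stack usage depends on n, which is only known at
// run time - so a static analysis cannot bound the stack size.
func depth(n int) int {
	if n <= 0 {
		return 0
	}
	return 1 + depth(n-1)
}

func main() {
	n := 10 // default depth
	if len(os.Args) > 1 {
		n, _ = strconv.Atoi(os.Args[1]) // runtime input decides recursion depth
	}
	fmt.Println(depth(n))
}
```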
We need to:
@dumblob
Do you know resources describing how to multiplex/demultiplex M user space processes on N kernel threads, where N = number of CPUs?
(see also krinkinmu/userspace-threads#1 ) - here this part is missing.
How to feed a limited number of kernel/OS threads with preemptive tasks?
Thank you.
Answering in a hurry, bear with me.
@dumblob
Do you know resources describing how to multiplex/demultiplex M user space processes on N kernel threads, where N = number of CPUs?
(see also krinkinmu/userspace-threads#1 ) - here this part is missing.
How to feed a limited number of kernel/OS threads with preemptive tasks?
I'm not sure I understand, but it sounds like you're asking about the basic idea behind work-stealing :wink:. If so, then my answer is: "See any good work-stealing implementation (the best I've seen is in Weave; another good implementation is the Go scheduler)."
Nonetheless, there is no work-stealing implementation I know of which handles dynamic removal and addition of physical CPU cores efficiently (i.e. changing N dynamically at runtime - e.g. in response to power saving). I envision the work-stealing scheduler should respond by spawning OS threads or stopping (joining) running OS threads - but first stopping work stealing in the designated OS threads (e.g. those having the highest identification number, because this condition is detectable in each OS thread separately without stop-the-world, or those having the shortest queues) and waiting for their work queues to empty.
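To give the flavor of the M:N idea being discussed, here is a heavily simplified work-stealing sketch in Go (real implementations such as Weave use lock-free deques per worker; this uses mutex-guarded queues purely for brevity):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// Task is a user-space unit of work.
type Task func()

// worker owns a local queue; when it runs dry it steals from a victim.
type worker struct {
	mu    sync.Mutex
	queue []Task
}

func (w *worker) push(t Task) {
	w.mu.Lock()
	w.queue = append(w.queue, t)
	w.mu.Unlock()
}

func (w *worker) pop() (Task, bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.queue) == 0 {
		return nil, false
	}
	t := w.queue[len(w.queue)-1]
	w.queue = w.queue[:len(w.queue)-1]
	return t, true
}

func main() {
	const n = 4 // N kernel threads (one worker per CPU)
	workers := make([]*worker, n)
	for i := range workers {
		workers[i] = &worker{}
	}
	// M user-space tasks distributed over N workers, M >> N.
	for i := 0; i < 32; i++ {
		i := i
		workers[i%n].push(func() { fmt.Println("task", i) })
	}
	var wg sync.WaitGroup
	for id := 0; id < n; id++ {
		wg.Add(1)
		go func(id int) { // each goroutine stands in for a kernel thread
			defer wg.Done()
			for {
				t, ok := workers[id].pop()
				if !ok { // local queue empty: try to steal from a random victim
					t, ok = workers[rand.Intn(n)].pop()
				}
				if !ok {
					return // nothing found locally or at the victim; exit
				}
				t()
			}
		}(id)
	}
	wg.Wait()
}
```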
Btw. https://github.com/krinkinmu/userspace-threads/ (pure preemptive scheduling in user space) could be merged with code from Weave (pure non-preemptive scheduling in user space) and we should be good to go. Actually, rather incorporate preemptiveness (inspired by https://github.com/krinkinmu/userspace-threads/ ) into Weave, since Weave is formally verified to be correct, it's already quite polished, and, as you wrote, hella fast (it scales nearly linearly with the number of physical cores! There is no other library this capable in the world that I know of!).
@dumblob wrote:
3. and if any go/v routine shall become a bottleneck during computation (i.e. consumer is slower than producer for some time - this can be easily monitored by the cooperative/parallel scheduler as metric "how full are the queues"), a new instance of the slow consumer go/v routine shall be spawned (with respect to reentrancy) in a different operating system thread (i.e. on a different physical CPU core) including all the plumbing work
[…]
2. there is no concept of "reentrant" goroutines in golang and thus it's impossible to spawn new instances (or kill running ones) and do the muxing/demuxing as outlined in my initial post above to accommodate for bottle necks - this is not so quickly solved in golang compared to the issue (1) above (as it would probably require extending golang) and I think this could be the major advantage of _V_ compared to golang
As I mentioned in my post in the related issues thread #3814, I think this reentrancy invariant is onerous and should be dropped. Unless I am misunderstanding what you intend “reentrancy” to mean in this context?
2. Actually channels should be encouraged and anything else strongly discouraged as anything shared leads to locks and locks translate to stalls so the programmer should not think of concurrency in terms of anything shared, but rather in terms of a "laystall", i.e. "one-way pipes" with buffered data recently processed by one processing unit instance and waiting there to be processed by another processing unit instance in one batch while allowing the first processing unit instance to do some other work in the mean time.
Contention leads to a hornet’s nest of Heisenbugs. Lockfree is the only way to go, please. Lift the semantics out of the low-level gutter. Or I mean at least strongly discourage the use of low-level thread synchronization primitives.
As I mentioned in my post in the related issues thread #3814, I think this reentrancy invariant is onerous and should be dropped. Unless I am misunderstanding what you intend “reentrancy” to mean in this context?
Reentrancy is a flag the programmer gives a go routine, serving as a covenant with the meaning: "I, the programmer, hereby certify that the following go routine may be spawned 1+ times at any time, and that any running instance, if it's not the last one, may be stopped and discarded at any time." It's just an optimization giving the scheduler a huge amount of freedom to make scheduling fully dynamic in time (i.e. spawning an "unlimited" number of instances of the same go routine, or stopping/discarding running instances if the throughput is no longer a bottleneck) and in general more efficient.
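A minimal sketch of the plumbing around such a reentrant consumer, in Go for illustration (the mux/demux here is wired statically; a real scheduler would add and remove consumer instances dynamically based on how full the queues are):

```go
package main

import (
	"fmt"
	"sync"
)

// consume is the "reentrant" routine: it keeps no shared mutable state,
// so the scheduler may run any number of instances of it concurrently.
func consume(in <-chan int, out chan<- int) {
	for v := range in {
		out <- v * v // stand-in for real work
	}
}

func main() {
	src := make(chan int, 8) // producer -> mux
	out := make(chan int, 8) // demux side: both instances share one output
	in1 := make(chan int, 8) // mux -> consumer instance 1
	in2 := make(chan int, 8) // mux -> consumer instance 2

	// Multiplexer: alternate the producer's data between the two instances.
	go func() {
		to := [2]chan int{in1, in2}
		i := 0
		for v := range src {
			to[i%2] <- v
			i++
		}
		close(in1)
		close(in2)
	}()

	var wg sync.WaitGroup
	for _, in := range []chan int{in1, in2} {
		wg.Add(1)
		go func(in chan int) { defer wg.Done(); consume(in, out) }(in)
	}
	go func() { wg.Wait(); close(out) }() // demultiplexer: close when both finish

	for i := 1; i <= 6; i++ {
		src <- i
	}
	close(src)
	for v := range out {
		fmt.Println(v)
	}
}
```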
Contention leads to a hornet’s nest of Heisenbugs. Lockfree is the only way to go, please. Lift the semantics out of the low-level gutter.
Hm, I hope you don't confuse lock-free and lockless. The proposal above is fully preemptive, and thus what the language (V) itself offers is always lock-free. The programmer, though, can freely build lock-free as well as non-lock-free algorithms (V has unsafe{} to pinpoint usage of structures and approaches leading to potentially non-lock-free behavior). I hope it's clear that V can't guarantee anything if any unsafe{} section is being used and thus I believe your concern is being 100% addressed by not using unsafe{} at all :wink:.
Note though that I'm not sure V is already in a state where this lock-free guarantee holds in all cases - V is still alpha software, so keep that in mind.
But if you meant lockless, then I'd disagree. Please google for lockless variants of efficient lock-based algorithms and you'll be surprised that lockless is only sometimes more efficient in practice (on top of this lockless quite often suffers from other symptoms). It's by far not this easy to say "lockless is best"...
Or I mean at least strongly discourage the use of low-level thread synchronization primitives.
That's what we also advocate for - mainly through unsafe{}, and secondly by promoting the use of channels and other higher-level, though still efficient, primitives for communication among go routines.
As I mentioned in my post in the related issues thread #3814, I think this reentrancy invariant is onerous and should be dropped. Unless I am misunderstanding what you intend “reentrancy” to mean in this context?
Reentrancy is a flag the programmer gives a go routine, serving as a covenant with the meaning: "I, the programmer, hereby certify that the following go routine may be spawned 1+ times at any time, and that any running instance, if it's not the last one, may be stopped and discarded at any time." It's just an optimization giving the scheduler a huge amount of freedom to make scheduling fully dynamic in time (i.e. spawning an "unlimited" number of instances of the same go routine, or stopping/discarding running instances if the throughput is no longer a bottleneck) and in general more efficient.
Oh, if it is optional then that seems more acceptable. The programmer could set the flag if, for example, they don't mutate any resources which are shared between all copies of the thread. It is very difficult otherwise to prove compliance with the invariant - Rust's exclusive write ownership is another means to prove reentrancy, but I personally find that to be an unpleasant paradigm to code in. And I explained why I think Pony's model is more flexible than Rust's.
I hope it's clear that V can't guarantee anything if any unsafe{} section is being used and thus I believe your concern is being 100% addressed by not using unsafe{} at all :wink:.
That seems reasonable to me iff (if and only if) thread synchronization primitives will only be accessible from unsafe{} code.
Hm, I hope you don't confuse lock-free and lockless.
There seems to be no salient distinction between the two when I distill their definitions down, because afaics in theory any code which employs locks will fail to meet the stated requirement of lockfree:
https://stackoverflow.com/questions/20467247/whats-the-difference-between-lockless-and-lockfree
Seems someone has in their mind that the terms should be applied in different contexts?
So per my edit in my post in the #3814 thread, I understand now that Dmitry was referring to the scheduler not being lockfree, which is not what I was concerned with. I am referring to the APIs that the safety-conscious Vlang programmer has access to, not the implementation of the runtime or unsafe{} code.
But if you meant lockless, then I'd disagree. Please google for lockless variants of efficient lock-based algorithms and you'll be surprised that lockless is only sometimes more efficient in practice (on top of this lockless quite often suffers from other symptoms). It's by far not this easy to say "lockless is best"...
I’m already aware of that. I mean to suggest to avoid the need to lock by avoiding contention, e.g. by when accessing shared memory only allowing that be access to immutable data, for the code that is not marked unsafe{}. Because locking synchronization in general can not be proven to be free of live-locks, deadlocks and insidious heisenbugs. Analogously a Turing-complete program can’t be proven to halt. Said heisenbugs can in some cases be impossible to track down.
This is a facet of Go that really disappoints me. Unfortunately there is no construct in the Go language for declaring and enforcing data to be immutable.
Note as you may know, immutable data structures have the advantage that they can’t contain circular references, which is important if employing ARC instead of a tracing GC, because ARC can’t collect circular references (although there exist probabilistic designs which attempt to do some localized tracing along with ARC).
One possible model for mutating shared immutable data is for the exclusive owner to create a new mutated, immutable copy (perhaps via an efficient data structure that only copies the changed part, which afaik is what Clojure employs), then send a message to the other threads telling them to update the copy of the immutable data they are referencing. As I think you all know (because I've seen it referenced already), the Pony language has a reference-capabilities model for tracking which resources can be consumed into an immutable and then shared between threads. And there are other possible models, such as Arena/Actor-like partitions, which I have contemplated.
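A minimal sketch of that update model in Go (illustrative only - the Config type and channel wiring are made up, and Go itself cannot enforce the immutability, which is exactly the complaint above):

```go
package main

import "fmt"

// Config is treated as immutable by convention: never modified after
// creation, only replaced wholesale by a new copy.
type Config struct {
	Threshold int
	Label     string
}

// worker holds its own reference to the current Config and swaps it
// whenever the owner sends a new version - no locks, no shared mutation.
func worker(id int, updates <-chan *Config, done chan<- struct{}) {
	cfg := <-updates // initial version
	for newCfg := range updates {
		cfg = newCfg // adopt the new immutable copy
		fmt.Printf("worker %d now sees threshold=%d\n", id, cfg.Threshold)
	}
	done <- struct{}{}
}

func main() {
	const n = 2
	chans := make([]chan *Config, n)
	done := make(chan struct{})
	for i := range chans {
		chans[i] = make(chan *Config, 1)
		go worker(i, chans[i], done)
	}
	v1 := &Config{Threshold: 10, Label: "v1"}
	v2 := &Config{Threshold: 20, Label: "v2"} // "mutation" = a fresh copy
	for _, ch := range chans {
		ch <- v1
		ch <- v2
		close(ch)
	}
	for i := 0; i < n; i++ {
		<-done
	}
}
```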
I had posited in 2018 that immutable shared data will scale better as hardware inevitably shifts to massively multicore designs without universal cache coherency, because the cache invalidation can be unified with the message between threads telling them to update their copy.
Of course there will still be locking issues at some higher-level semantics layer such as shared access to a shared database from concurrent threads, but afaics that is out-of-scope of Vlang’s design concerns.
You do seem very knowledgeable about concurrency and more so than myself. I found myself upvoting most of your comment posts. Thanks.