Julia: Better default RNG in the future?

Created on 16 Jun 2018  ·  72Comments  ·  Source: JuliaLang/julia

EDIT: "August 2018 Update: xoroshiro128+ fails PractRand very badly. Since this article was published, its authors have supplanted it with xoshiro256. It has essentially the same performance, but better statistical properties. *xoshiro256 is now my preferred PRNG*." I previously quoted: "The clear winner is xoroshiro128+"

https://nullprogram.com/blog/2017/09/21/

"Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. ..

Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising."

RNG

Most helpful comment

I'm running an Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz, i.e. a pretty anemic laptop broadwell.

The vectorization was by hand, the code is in the gist, but small enough to post here:

using SIMD, BenchmarkTools
mutable struct xstate{N}
    s0::Vec{N, UInt64}
    s1::Vec{N, UInt64}
    s2::Vec{N, UInt64}
    s3::Vec{N, UInt64}
end
macro new(T)
    return Expr(:new, T)
end

function init(xs::xstate)
    r=rand(UInt8, sizeof(xs))
    GC.@preserve xs, r
        ccall(:memcpy, Cvoid, (Ptr{Nothing}, Ptr{Nothing}, Csize_t), pointer_from_objref(xs), pointer(r), sizeof(xs))
    nothing
end

@inline rotl(a, k) = (a<<k) | (a>>(64-k))

@inline function next_nostore((s0, s1, s2, s3))
    res = rotl(s0 + s3, 23) + s0
    t = s1 << 17
    s2 ⊻= s0
    s3 ⊻= s1
    s1 ⊻= s2
    s0 ⊻= s3
    s2 ⊻= t
    s3 = rotl(s3, 45)
    return (res, (s0, s1, s2, s3))
end

function XfillSimd_nostore(dst, xs::xstate{N}, num) where N
    lane = VecRange{N}(0)
    s = xs.s0, xs.s1, xs.s2, xs.s3
    @inbounds for i=1:N:num
        dst[lane+i], s = next_nostore(s)
    end
    xs.s0, xs.s1, xs.s2, xs.s3 = s
    nothing
end
N=1024; dst=zeros(UInt64, N);
xs = @new xstate{8}; init(xs)
@btime XfillSimd_nostore($dst, $xs, $N)

Note that this uses 8 independent RNGs for bulk generation: advance 8 states, generate 8 outputs, write 8 outputs to destination, repeat.

This is no issue, just needs some extra seeding logic (seed in bulk). In the julia code, this looks like 1 rng whose state consists of four <8 x UInt64> vectors.

PS. I also implemented the 5 lines of xoshiro256** in the gist. It is much slower, presumably due to the giant latency of the multiplication. The generated inner loop for 4 x xoshiro256++ is a beauty:

L144:
    vpsrlq  $41, %ymm4, %ymm6
    vpsllq  $23, %ymm4, %ymm4
    vpor    %ymm4, %ymm6, %ymm4
    vpaddq  %ymm0, %ymm4, %ymm4
    vmovdqa %ymm5, %ymm0
    movq    (%rdi), %rcx
    vmovdqu %ymm4, -8(%rcx,%rdx,8)
    cmpq    %rdx, %rax
    je  L266
    addq    $4, %rdx
    vpaddq  %ymm0, %ymm2, %ymm4
    vpsllq  $17, %ymm1, %ymm6
    vpxor   %ymm0, %ymm3, %ymm3
    vpxor   %ymm1, %ymm2, %ymm2
    vpxor   %ymm1, %ymm3, %ymm1
    vpxor   %ymm0, %ymm2, %ymm5
    vpxor   %ymm6, %ymm3, %ymm3
    vpsllq  $45, %ymm2, %ymm6
    vpsrlq  $19, %ymm2, %ymm2
    vpor    %ymm6, %ymm2, %ymm2
    jns L144

All 72 comments

Too late for this release but the default RNG stream is explicitly documented to not be stable so we can change it in 1.x versions. Another contender is the PCG family of RNGs.

Ok, I thought last change (people assume stable). It's 7 lines of C code.

I'm on smartphone, I at least can't locate "(un)stable" at

https://docs.julialang.org/en/latest/stdlib/Random/

MT, Julia's current default, is "between 1/4 and 1/5 the speed of xoroshiro128+." and MT has "one statistical failure (dab_filltree2)."

"PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we’d need 128-bit operations, which would need to be implemented in software."

http://xoshiro.di.unimi.it

64-bit Generators
xoshiro256** (XOR/shift/rotate) is our all-purpose, rock-solid generator (not a cryptographically secure generator, though, like all PRNGs in these pages). It has excellent (sub-ns) speed, a state space (256 bits) that is large enough for any parallel application, and it passes all tests we are aware of.

If, however, one has to generate only 64-bit floating-point numbers (by extracting the upper 53 bits) xoshiro256+ is a slightly (≈15%) faster generator with analogous statistical properties. For general usage, one has to consider that its lowest bits have low linear complexity and will fail linearity tests; however, low linear complexity can have hardly any impact in practice, and certainly has no impact at all if you generate floating-point numbers using the upper bits (we computed a precise estimate of the linear complexity of the lowest bits).

If you are tight on space, xoroshiro128** (XOR/rotate/shift/rotate) and xoroshiro128+ have the same speed and use half of the space; the same comments apply. They are suitable only for low-scale parallel applications; moreover, xoroshiro128+ exhibits a mild dependency in Hamming weights that generates a failure after 8 TB of output in our test. We believe this slight bias cannot affect any application.

Finally, if for any reason (which reason?) you need more state, we provide in the same vein xoshiro512* / xoshiro512+ and xoroshiro1024* / xoroshiro1024* (see the paper).

It looks like the timings are based on his own implementation of MT. The one we use is probably quite a bit faster.

Update: The Dieharder test suite is also considered outdated

Also intriguing-just use AES:

Speed
The speed reported in this page is the time required to emit 64 random bits, and the number of clock cycles required to generate a byte (thanks to the PAPI library). If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators. Otherwise, with suitable hardware support we could just use AES in counter mode and get 64 secure bits in 1.12ns.

Extensive testing by others (Daniel Lemire, John D. Cook) favors PCG and notes a few problems with the xor__s. One or the other or another -- it will require Julia-centric testing and benchmarking.

Random is a stdlib so it could have breaking changes and a new release before the next release of the language, no?

Yes, it could have breaking changes before Julia 2.0.

Just for fun, I coded up a prototype implementation:

# A quick test to test a new RNG for Julia
using BenchmarkTools

function xorshift1024(N, s)
    x = Array{UInt64}(undef, N)
    p = 0
    for i = 1:N
        # One-based indexing strikes again
        s0 = s[p + 1]
        p = (p + 1)%16
        s1 = s[p + 1]

        # Make sure you use xor() instead of ^
        s1 = xor(s1, s1 << 31)
        s1 = xor(s1, s1 >> 11)
        s0 = xor(s0, s0 >> 30)
        s[p + 1] = xor(s0, s1)

        # I love magic constants
        x[i] = s[p + 1] * 1181783497276652981
    end
    return x
end

function xoroshiro128plus(N, s)
    x = Array{UInt64}(undef, N)
    @inbounds for i = 1:N
        # copy state to temp variables
        s1, s2 = (s[1], s[2])

        # Calculate result immediately
        x[i] = s1 + s2

        # Mash up state
        s2 = xor(s2, s1)
        s1 = xor(((s1 << 55) | (s1 >> 9)), s2, (s2 << 14))
        s2 = (s2 << 36) | (s2 >> 28)

        # Save new state
        s[1], s[2] = (s1, s2)
    end
    return x
end

# Collect some performance data
for N in round.(Int64, 2 .^ range(12, stop=20, length=3))
    println("Testing generation of $N 64-bit numbers")
    local s = rand(UInt64, 16)

    # compare xorshift against rand()
    @btime xorshift1024($N, $s)
    @btime xoroshiro128plus($N, $s)
    @btime rand(UInt64, $N)
end

This gives surprisingly good timings, considering MT has been optimized for many a year:

Testing generation of 4096 64-bit numbers
  13.889 μs (2 allocations: 32.08 KiB)
  8.136 μs (2 allocations: 32.08 KiB)
  6.733 μs (2 allocations: 32.08 KiB)
Testing generation of 65536 64-bit numbers
  219.095 μs (2 allocations: 512.08 KiB)
  130.786 μs (2 allocations: 512.08 KiB)
  100.082 μs (2 allocations: 512.08 KiB)
Testing generation of 1048576 64-bit numbers
  4.658 ms (2 allocations: 8.00 MiB)
  3.245 ms (2 allocations: 8.00 MiB)
  3.118 ms (2 allocations: 8.00 MiB)

I note that @inbounds actually makes xorshift1024 _slower_, which is why that's not included in the code above.

I don't think it's all that surprising—MT is pretty complex. One of the huge advantages of xoroshiro and PCG is that they're much simpler and easier to generate really complex code for.

they put the _suit_ in pseutdorandom

The package RandomNumbers.jl implements many different RNGs, including xoroshiro128+ which is quite a bit faster than the standard RNG from base according to the benchmarks in the documentation.

Is Float rand more important than Int rand? It seems it may matter:

@JeffreySarnoff thanks for the links; still "favors PCG and notes a few problems with the xor__s." seems not conclusive nor comparing to latest variants.

Java's shiftmix64 is also interesting.

Also Google's
https://github.com/google/randen
as per AES comment above.

https://github.com/sunoru/RandomNumbers.jl/issues/1#issuecomment-233533191

@sunoru "Yes, I have considered accelerating some RNGs with GPU"

I guess AES is not available on GPUs so that may be a factor to NOT choose AES as default on CPUs? I.e. would we want same default on GPU and CPU?

In general, it's fine (and often good) to drop some extra output bits. In PCG, IIRC, you split the internal output into external output bits and permutation bits in a way that can be adjusted, raising the possibility that we could use different output steps based on the same state to generate 64 bits of integer output versus 53 bits of floating-point output. Another approach would be to use different RNG streams for different bit sizes, which is appealing for being able to generate random 128-bit values, for example. (A pair of 128-bit streams may not be as good as a single 128-bit stream, for example, unless the two streams have coprime periods which is hard to arrange, so it can be quite helpful to have different sized streams that take advantage of their increased size to produce better streams.)

"The clear winner is xoroshiro128+" (seems outdated) @staticfloat thanks for coding it. Actually "xoshiro256**" seems to be their best; note dropped letters, I first thought a typo.

I'm sorry that I haven't updated that package for long. But maybe this is worth a look: http://sunoru.github.io/RandomNumbers.jl/latest/man/benchmark/

My personal choice is xoroshiro128+. Since some RNGs in PCG don't pass the big crush as its paper says, I don't know if PCG is as reliable as it sounds.

PCG not passing big crush is indeed an issue. That seems like a strange disconnect.

@sunoru I was just looking at your RandomNumbers.jl page; that is some really nice work you've put together. IMO, discussion about a better RNG should be informed by your work as you've put in the time and effort to actually compare against quite a number of different PRNGs. And yes, it does appear true that xoroshiro128+ is the winner in all aspects so far; however I am not sure what the test for quality is nowadays; I've heard of BigCrush, DieHarder, TestU01, etc..... I'm sure Andreas can answer which is the test to pay attention to these days. :)

The real issue with PCG is that it is not being well-tended. I don't know anything about why it got crushed -- if there were more effort behind that family, a remedy modification may have been made but there is none. So, going with new information: I still like PCG .. to use it for what it does best .. and with the failure, that's not this.

If AES-CTR turns out to be fast enough on most architectures, then I think this would be absolutely fantastic as the default RNG.

This eradicates an entire class of security bugs and relieves people from the mental burden of considering the quality of the random stream and how to use it. And people who need something faster can still switch this out.

The only big isssue is architectures without hardware support for AES (then it would almost surely be too slow). So the bitter pill to swallow is that the random stream would become architecture dependent (e.g. older raspberry pi don't have aes acceleration and should fall back on something else).

I would be ok with having a different one on different architectures.

After looking at http://www.cs.tufts.edu/comp/150FP/archive/john-salmon/parallel-random.pdf, I am more convinced that AES would be a really good default on supporting archs: Fast enough (faster than MT on a lot of CPUs, and does not eat as much L1), practically perfect quality, and no possibility of people abusing rand(UInt64) to generate session cookies and getting pwned for it.

@ViralBShah (on xoroshiro128** or you mean on their older RNG?) "I thought those RNG don't pass the RNG testsuite." I googled, and assume you mean:

https://github.com/andreasnoackjensen/RNGTest.jl/ that I found here: https://github.com/JuliaLang/julia/issues/1391

I think the discussion should continue here, not the startup time issue, while I'm not sure the default RNG needs to pass all tests (I'm not sure it does NOT do that), if you can easily opt into a better RNG.

At http://xoshiro.di.unimi.it/

there's both e.g. xoroshiro128** and xoroshiro128+ to look into.

and I only see a minor flaw on the latter:

Quality

This is probably the more elusive property of a PRNG. Here quality is measured using the powerful BigCrush suite of tests. BigCrush is part of TestU01, a monumental framework for testing PRNGs developed by Pierre L'Ecuyer and Richard Simard (“TestU01: A C library for empirical testing of random number generators”, ACM Trans. Math. Softw. 33(4), Article 22, 2007).
[..]
Beside BigCrush, we analyzed our generators using a test for Hamming-weight dependencies described in our paper. As we already remarked, our only generator failing the test (but only after 5 TB of output) is xoroshiro128+. [..]

An architecture-dependent default PRNG might be a little awkward since one of the primary benefits of PRNG is reproducibility.

An architecture-dependent default PRNG might be a little awkward since one of the primary benefits of PRNG is reproducibility.

Yes, but if you want reproducibility, you should be using an explicit RNG object instead of the default.

It would be nice if @rfourquet can chime in here.

It would be nice if @rfourquet can chime in here.

I have not much knowledge about RNGs, but I will be happy to review and support an initiative changing the default RNG.

One possibly relevant question for a new RNG: can it generate separate streams (for different threads) that we are confident are sufficiently independant. For MT, the idea was to use the "jump" functionality (as the period is so huge that it can be easily shared among different sub-streams), and there was an issue showing that we didn't really know how independant are streams generated with given seeds (e.g. in 1:n). IIRC, for example, it's quite easy to generate independent streams with PCG.

I just have a couple other comments:

and no possibility of people abusing rand(UInt64) to generate session cookies and getting pwned for it.

The problem is that if it's architecture-dependant, some people will naively rely on the default RNG's security guaranties, while these will not hold with code deployed on a different architecture.

Yes, but if you want reproducibility, you should be using an explicit RNG object instead of the default.

I don't totally disagree, but I believe it would be a little nicer to keep reproducibility cross-platform if there are no high costs for that. An example where it could matter is the Primes module: random numbers are involved in factorization methods, but the numbers themselves are not relevant for the user, hence are not exposed to her and there is no way (currently) to specify an RNG. But a user may report a bug for a specific seed, and it would be unfortunate for the maintainers to not be able to reproduce it easily.

The problem is that if it's architecture-dependant, some people will naively rely on the default RNG's security guaranties, while these will not hold with code deployed on a different architecture.

True, if we take a cryptographically secure PRNG then we should have it on all platforms. I'd guess aes-ctr when available, and e.g. chacha20 otherwise. Chacha20 has a reduced-round variant for very weak CPUs, and we might consider following the random123 paper in taking a reduced-round simplified key-schedule aes variant (I'm personally against, but if people complain a lot about speed, well, still more secure and simpler than MT).

Regardless, it is probably better to provide both secure and insecure PRNG, even if both coincide (such that breakage is reduced in case of future changes).

See eg here for an example using aes intrinsics on x86, AArch64 and power8. Are there any other supported archs than ARMv7 and ancient x86 that lack aes acceleration?

Since the entire internal state fits into 3 cache lines (aes, round keys + counter) or one cache line (chacha20), we can simply give each thread an independent PRNG. For reproducibility, we could simply use something like xor(seed, thread_id()) or encrypt(key=seed, block=thread_id()) as aes keys for explicit Random.seed!(seed::Integer), and pull 16 bytes per thread from /dev/random on startup / Random.seed!().

For birthday-related reasons, we need to occasionally rekey. Most protocols conservatively rekey after 2^(L/4) blocks (L=128 for aes). I would be OK with never rekeying by default (simpler code means less opportunity for bugs with respect to reproducibility; after 2^48 blocks we get 1e-10 collision chance for true random stream, 0 collision chance for our flawed approximation of a random stream).

TL;DR Despite what I said earlier on AES, Idon't think we should use it (nor other [hardware] cryptographically secure PRNG), but rather xoshiro256** (possibly only/rather the slightly faster xoshiro256+ is needed if rand for Float64 is most important).

AES/Randen (with hardware support; or ChaCha8 without or PCG) seems simply slower, not coming close to 0.36 cycles per byte that you can get, see the table at xoshiro.di.unimi.it

Also if we go with cryptographically secure PRNG with or without using special hardware, we can't really go back when we've made that "guarantee". I believe it's better to opt into that for those few(?) that need it. We can always reconsider changing again later, if it's actually faster.

If we do later consider cryptographically secure, it seems Google's randen based on AES should be at the top of the list:

https://github.com/google/randen

What if we could default to attack-resistant random generators without excessive CPU cost? We introduce 'Randen', a new generator with security guarantees; it outperforms MT19937, pcg64_c32, Philox, ISAAC and ChaCha8 in real-world benchmarks. This is made possible by AES hardware acceleration and a large Feistel permutation.

Related work

AES-CTR (encrypting a counter) is a well-known and easy to implement generator. It has two known weaknesses:

See also tables there on cycles per byte.

The problem is that if it's architecture-dependant, some people will naively rely on the default RNG's security guaranties, while these will not hold with code deployed on a different architecture.

True, if we take a cryptographically secure PRNG then we should have it on all platforms. I'd guess aes-ctr when available, and e.g. chacha20 otherwise. Chacha20 has a reduced-round variant for very weak CPUs, and we might consider following the random123 paper

I assume Chacha8 is is the reduced-round variant. It's just much slower than Randen (and Chacha20 5 times slower additionally), while

Having it on all platforms would mean very slow for all the cryptically secure I know of when there's no hardware support.

I repeat the interesting links (and some additional):

https://nullprogram.com/blog/2017/09/21/

August 2018 Update: xoroshiro128+ fails PractRand very badly. Since this article was published, its authors have supplanted it with xoshiro256. It has essentially the same performance, but better statistical properties. xoshiro256 is now my preferred PRNG.
[..]
My preference is to BYOPRNG: Bring Your Own Pseudo-random Number Generator. You get reliable, identical output everywhere. Also, in the case of C and C++ — and if you do it right — by embedding the PRNG in your project, it will get inlined and unrolled, making it far more efficient than a slow call into a dynamic library.

xoshiro.di.unimi.it

xoshiro256+ is ≈20% slower than the dSFMT, but it has a doubled range of output values, does not need any extra SSE instruction (can be programmed in Java, etc.), has a much smaller footprint, and does not fail any test.

See however http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/ is you want to consider that MT variant.

xoshiro.di.unimi.it/splitmix64.c

/* This is a fixed-increment version of Java 8's SplittableRandom generator
See http://dx.doi.org/10.1145/2714064.2660195 and
http://docs.oracle.com/javase/8/docs/api/java/util/SplittableRandom.html

It is a very fast generator passing BigCrush, and it can be useful if
for some reason you absolutely want 64 bits of state; otherwise, we
rather suggest to use a xoroshiro128+ (for moderately parallel
computations) or xorshift1024* (for massively parallel computations)
generator. */

https://springframework.guru/random-number-generation-in-java/

The java.util.Random class implements what is generally called a linear congruential generator (LCG). It is designed to be fast but does not meet requirements for real-time use

See also:
https://en.wikipedia.org/wiki/Intel_Cascade_Cipher (based on AES)

https://github.com/sunoru/RandomNumbers.jl/issues/52 (related one is implemented there, and if it matters also Random123).

Thanks to @sunoru he has implemented my preferred choice[s]:
https://github.com/sunoru/RandomNumbers.jl/issues/52#issuecomment-460133618

xoshiro256** and xoshiro256+, as well as the other recommended PRNGs in their page (see c91f948)

I also noticed at:

https://en.wikipedia.org/wiki/Xorshift#Variations

xorwow [..]
This performs well, but fails a few tests in BigCrush.[5] This generator is the default in Nvidia's CUDA toolkit.[6]

I'm not bringing it up because I think it should be our default. I just hadn't thought of GPU use (nor did I know of a GPU default). If we want to generate random numbers on GPUs the code better be short, and not based on AES (I assume they don't have nor any other instructions relating to [real] random).

I can maybe see a point in implementing xorwow if we need to have the same RNG on CPU and GPU. I suppose we want that but could then substitute our default RNG for Nvidia's?!

I took a quick look at what other (well one) interesting language, Rust, have chosen to implement.

https://crates.io/crates/rand

some of the dependencies are interesting, e.g. https://crates.io/crates/rand_jitter and the one implementing HC-128 (was new to me, fastest non-hardware secure one?):

http://www.ecrypt.eu.org/stream/p3ciphers/hc/hc128_p3.pdf

Statement 8. Encryption speed is 3.05 cycles/byte on Pentium M processor. [..]
Stream cipher HC-128 is the simplified version of HC-256 [15] for 128-bit security. HC-128 is a simple, secure, software-efficient cipher and it is freely-available. [..]

Our analysis shows that recovering the key of HC-128 is as difficult as exhaustive key search

NumPy has adopted PCG64 (https://github.com/numpy/numpy/issues/13635)

TensorFlow chose these, or strictly TF for Swift language: https://github.com/tensorflow/swift/blob/master/RELEASES.md (link to both from that link, saying e.g. "10-round Philox4x32 PRNG" and "20-round Threefry2x32 PRNG"):
"Philox and Threefry random number generators and generic random distributions."

What to make of it, I guess they liked the speed:

www.thesalmons.org/john/random123/papers/random123sc11.pdf
"Threefry is the fastest Crush-resistant PRNG, requiring only common bitwise operators and in-teger addition. Threefry can be expected to perform well across a wide range of instruction set architectures."

They are interesting as of this type: https://en.wikipedia.org/wiki/Counter-based_random_number_generator_(CBRNG)

[also described there is ARS-5.]

The performance claim above may be outdated as SplitMix is newer, and so is Permuted Congruential Generator (PCG) also and Xoroshiro128+ and related even newer.

According to https://github.com/numpy/numpy/issues/13635#issuecomment-506614353 PCG64, NumPy chose, is almost twice as fast as Philox, while SFC64 is a bit faster, and I do not see them mention Threefry.

Let's not have the perfect be the enemy of the good. I agree that xoshiro variants are preferable to mersenne, in all circumstances.

With regards to cryptographically secure default RNG, I am totally fine with google's cool new alg. I think Random stdlib should export and initialize both a fast and a secure RNG. Exporting a secure and "relatively fast" RNG is important in order to not give people the excuse of minimizing dependencies.

The only question on which we differ is what should be the default RNG: Secure or fast. On most hardware platforms, secure is fast enough for code that is not super random-heavy. The only exception is old ARMv7 that lack acceleration and would need a fallback.

I am arguing that users who don't explicitly specify an RNG should get a secure one, instead of a fast one. Users who know what they care about can choose an explicit RNG. The only people who would be really hurt by this are people who do semi-heavy randomized compute on ARMv7 (not heavy enough to choose an explicit RNG, but heavy enough for random speed to matter). I am arguing that it is OK to throw these users under the bus (they may be an empty set, anyway).

On the other side, I am arguing that security-unsophisticated users who mistakenly use default-rand in security relevant contexts should be saved from themselves.

Case in point: Literally the first thing I looked at, genie.jl session key generation uses an insecure rand() call. There probably are mitigating factors (Genie.SECRET_TOKEN, hash function application might make sequence prediction impractical). But the fact that I need to even think about mitigating factors and scenarios where the config file is leaked is ridiculous.

If rand() becomes crypto, we should introduce fastrand() imo.

We already have a wonderful system for choosing random number generators. This is rand(rng::Random.AbstractRNG, args...), with a default rand(args...) = rand(Random.GLOBAL_RNG, args...). The words "default RNG" refer to this global constant in the Random module.

In the rand-becomes-crypto-by-default world I'm advocating for, we would have additional constants Random.FAST, Random.MERSENNE and Random.SECURE that are always initialized; and we would set Random.GLOBAL_RNG = Random.SECURE instead of the current Random.MERSENNE.

In other words, the natural definition would be fastrand(args...) = rand(Random.FAST, args...). As such, I don't see any reason for a dedicated fastrand function to exist, just like a dedicated securerand function makes imo no sense today.

I did not know that. Nevermind.

On Sun, Aug 18, 2019 at 1:34 PM chethega notifications@github.com wrote:

We already have a wonderful system for choosing random number generators.
This is rand(rng::Random.AbstractRNG, args...), with a default rand(args...)
= rand(Random.GLOBAL_RNG). The words "default RNG" refer to this global
constant in the Random module.

In the rand-becomes-crypto-by-default world I'm advocating for, we would
have additional constants Random.FAST, Random.MERSENNE and Random.SECURE
that are always initialized; and we would set Random.GLOBAL_RNG =
Random.SECURE instead of the current Random.MERSENNE.

In other words, the natural definition would be fastrand(args...) =
rand(Random.FAST, args...). As such, I don't see any reason for a
dedicated fastrand function to exist, just like a dedicated securerand
function makes imo no sense today.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/JuliaLang/julia/issues/27614?email_source=notifications&email_token=AAM2VRQWJLP2QBAEHWQXWHDQFGB4FA5CNFSM4FFKFZNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4RERFI#issuecomment-522340501,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAM2VRRZ75HS73RDCNRXIALQFGB4FANCNFSM4FFKFZNA
.

Sorry to jump in the conversation, but there's some talk about stuff I did. So my 2€¢ on the topic are:

  • Please benchmark xoshiro256++, too. In some platform/compiler is faster (no multiplications).
  • I wrote a detailed report of why we should stop using the Mersenne Twister—in case you're interested, you can find it on arXiv (https://arxiv.org/abs/1910.06437).
  • "xoroshiro128+ fails PractRand very badly", in all its vagueness, doesn't really seem to be true: there's an obvious and well known failure in binary rank, which involves the lower bits, and won't happen if you don't use, say, the lowest 8 bits or so; and there's Hamming-weight dependencies, which happen after 5TB in our test and after >32TB in PractRand, and frankly for that speed I think it's OK—the Mersenne Twister at 607 bits fails at 800MB, for example. You can use xoroshiro128++ or xoroshiro128** if you don't want those, losing ≈0.20ns (in artificial benchmarks), which seems fine.

@vigna If up to you, what do you suggest as the replacement for MersenneTwister?

I just implemented xoshiro256++ in pure Julia, it's dead simple.

  • For scalar generation, compared to MersenneTwister, it's 40% faster for UInt64 generation, and 10% faster for Float64 generation (using the same technique as MT: generating in [1, 2) and substracting 1.0). It looks like a no-brainer... but:

  • for array generations: for filling up a vector of 10000 UInt64, it's 5% faster, but 40% slower for Float64. But this is using the default array versions, the performance may be improvable with a dedicated array version (but maybe not enough to be on par with MT).

Is one of xoshiro256++ and xoshiro256** more recommended?

This is a separate concern, but did you consider the fact that using the [1..2) technique you get only 52 instead of 53 significant bits ? That's a tradeoff that probably the user should be aware of. Also, in my tests on recent hardware I could measure no real difference (provided you precompute 1 / 2^53 and use exactly 53 bits).

BTW, for floats xoshiro256+ is perfectly fine (using the upper 53 bits).

I'm a bit surprised by your second benchmark—in principle, the conversion to float should have the same cost for everybody, so the speed comparison in an integer test should give approximately the same results of the comparison on doubles. I wouldn't mind having a look at the code if that's fine for you.

Presently I'm in favor of xoshiro256++.

This is a separate concern, but did you consider the fact that using the [1..2) technique you get only 52 instead of 53 significant bits

Indeed, but this was to benchmark equivalent things. There has been quite a number of discussions in Julia's dicourse / github to change that.

BTW, for floats xoshiro256+ is perfectly fine (using the upper 53 bits).

Yes, I would personally not mind including different variants, as they are so simple to implement.

in principle, the conversion to float should have the same cost for everybody

In general yes, but in this case there is a native array generation method for MT, i.e. the "kernel" (dSFMT) writes directly into the array, instead of writing into a cache array within MersenneTwister struct, and then fetching one value at a time from this cache into the destination array (this is the implementation detail of MersenneTwister in Julia!).

But there is good news, by writing naïve array generation for xoshiro256++, the array versions becomes 25% faster for UInt64, and about 15% slower for Float64 (down from 40%). This would be enough for me to adopt xoshiro256++ as the default generator in Julia, and even more so as xoshiro+ might give even better results for Float64.

BTW, thanks for having put you C code for the xoshiro family in the public domain, that makes trying it out very easy!

Ahhh OK you didn't tell me you were using the dSFMT. Yes, I acknowledge that in my shootout page:

xoshiro256+ is ≈20% slower than the dSFMT, but it has a doubled range of output values, does not need any extra SSE instruction (can be programmed in Java, etc.), has a much smaller footprint, and does not fail any test.

If you're using extended instructions and you don't mind being able to generate just half of the possible numbers, that's OK. But remember that the dSFMT fails linearity tests, and that bias can filter through computations (see my note above) in very unexpected ways.

Ahhh OK you didn't tell me you were using the dSFMT.

Ah sorry! I'm used to just writing MersenneTwister in this github to mean the Julia's builtin version, which uses dSFMT.

So yes we are aware that it's not ideal to use the technique used be dSFMT to produce Float64. And regularly a user on discourse will point at this fact and be surprised about it. I don't think it's a big problem for most purposes, but better to improve that if possible.

So I tested xoshiro+, and for the array version, it becomes only about 7% slower than MT (again, with using the same technique, with half-range).

I personally would prefer the "full-range" float generation at the cost of a bit of performance, but I don't know if this is a shared feeling.

Anyway, I vote for making xoshiro256++ the new Julia default RNG :)

And by the way it's really nice to have you chime in @vigna !

Plus, I forgot to mention earlier, the code to implement the jump functionality is also available, and is quite short (should be about 20 lines of code), when you compare with the equivalent for MT (which is about 150 lines of code, including simple implementation of polynomials over GF(2) !)

Does our RNG policy allows us to replace the default generator? I think so, but just checking. If so, this sounds like a good idea.

To be true, I believe the Mersenne Twister comes with a _generic_ jump implementation that makes it possible to jump ahead any number of steps: for that, you need computation of polinomials over Z/2Z modulo the characteristic polynomial, which needs code.

For speed and convenience, I provide a method for long jumps and one for large jumps. The idea is that you use large jumps to get different streams for multiple computational units, but then you can short jump and get a large number of streams in each unit (e.g., threads). Polynomials are precomputed and stored as bit arrays, but the technique is the same.

So I tested xoshiro+, and for the array version, it becomes only about 7% slower than MT (again, with using the same technique, with half-range).

It would be interesting to see whether interleaving carefully computation and conversion can help there. Something like, you first fill the array and then you do a pass to convert to double (but you'll need C-style unions to do that). Or, even, having a temporary variable for the last result and at each iteration convert to double the previous result and store it while you compute the next state, and then store it in the temporary variable. Breaking dependency chains can have very surprising results sometimes.

To be true, I believe the Mersenne Twister comes with a generic jump implementation that makes it possible to jump ahead any number of steps

Ah right, I forgot about that.

It would be interesting to see whether interleaving carefully computation and conversion can help there.

Actually, there is no cost in Julia to convert a UInt64 to Float64. I still tried you suggestion to first do the bit-twidling work for the whole array in a loop, and then to substract 1.0 from all positions, but the performance are quite worse. I also indeed tried a bit to see how to "break dependency chains", but without success in this case. Basically at each iteration, bit-twidling and one store in the array must happen.

But surprisingly for me, it's not slower to use the "full-range" generation, i.e., given a random UInt64 value u, using (u >> 11) * 2.0^-53 instead of reinterpret(Float64, 0x3ff0000000000000 | (u >> 12)) - 1.0.

For me, it's totally fine as is and potential improvements could be addressed later, but let's see what other people think.

Does our RNG policy allows us to replace the default generator?

I would believe so. We explicitly documented that the generated streams of numbers can change, and we don't say anything about the stability of the type of the default rng. The only semi-official thing was to use Random.GLOBAL_RNG in the cases when you implement new functions which need to take an optional RNG, but even that got its type changed in 1.3 (it's now of type Random._GLOBAL_RNG instead of MersenneTwister).

But surprisingly for me, it's not slower to use the "full-range" generation, i.e., given a random UInt64 value u, using (u >> 11) * 2.0^-53 instead of reinterpret(Float64, 0x3ff0000000000000 | (u >> 12)) - 1.0.

Yes, that was what I was telling you. I think that FPUs have a very very fast execution path if you just have the significant bits and you shift them in. It might be slower, for example, if you don't do the >>11 shift and divide by 2.0^-64, because I think that in that case there's some rounding going on.

Actually, there is no cost in Julia to convert a UInt64 to Float64.

This is _kind_ of true: the reinterpret itself is a no-op, but in order to do most integer operations on floating-point values in hardware, it's necessary to move the value from a floating-point register to an integer register (and back if you want to continue using as floating-point), which does take a cycle in each direction, so two cycles to go back and forth. So it's not free in practice.

I don't get how you people got that speed between xoshiro256++ and dsfmt is even comparable. With proper simd / avx2, xoshoro256++ is almost 4x faster than dsfmt for integer bulk generation, on my setup. No way in hell is the int->float conversion going to turn that into a loss.

On my machine I get down to ~0.14 cpb:

julia dsfmt 1024 x UInt64
  2.089 μs (0 allocations: 0 bytes)
ref impl, 1 x interleaved, 1024 x UInt64
  1.579 μs (0 allocations: 0 bytes)
ref impl, 4 x interleaved, 1024 x UInt64
  2.150 μs (0 allocations: 0 bytes)
julia 1 x interleaved 1 x SIMD, 1024 x UInt64
  4.336 μs (0 allocations: 0 bytes)
  1.838 μs (0 allocations: 0 bytes)
julia 1 x interleaved 2 x SIMD, 1024 x UInt64
  2.709 μs (0 allocations: 0 bytes)
  1.384 μs (0 allocations: 0 bytes)
julia 1 x interleaved 4 x SIMD, 1024 x UInt64
  1.487 μs (0 allocations: 0 bytes)
  709.241 ns (0 allocations: 0 bytes)
julia 1 x interleaved 8 x SIMD, 1024 x UInt64
  927.750 ns (0 allocations: 0 bytes)
  604.148 ns (0 allocations: 0 bytes)
julia 1 x interleaved 16 x SIMD, 1024 x UInt64

with beautiful code (allmost no branches, everything 256 bit wide; if I had a AVX512 machine, this would be even better).

CF https://gist.github.com/chethega/344a8fe464a4c1cade20ae101c49360a

Can anyone link what xoshiro code you benchmarked? Maybe I just suck at C?

I don't get how you people got that speed between xoshiro256++ and dsfmt is even comparable. With proper simd / avx2, xoshoro256++ is almost 4x faster than dsfmt for integer bulk generation, on my setup. No way in hell is the int->float conversion going to turn that into a loss.

Wow, now you're making me shiver. 0.14 cycle/B? The vectorization is performed by the compiler or did you handcraft it (sorry, totally not proficient with Julia)?

I'm running an Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz, i.e. a pretty anemic laptop broadwell.

The vectorization was by hand, the code is in the gist, but small enough to post here:

using SIMD, BenchmarkTools
mutable struct xstate{N}
    s0::Vec{N, UInt64}
    s1::Vec{N, UInt64}
    s2::Vec{N, UInt64}
    s3::Vec{N, UInt64}
end
macro new(T)
    return Expr(:new, T)
end

function init(xs::xstate)
    r=rand(UInt8, sizeof(xs))
    GC.@preserve xs, r
        ccall(:memcpy, Cvoid, (Ptr{Nothing}, Ptr{Nothing}, Csize_t), pointer_from_objref(xs), pointer(r), sizeof(xs))
    nothing
end

@inline rotl(a, k) = (a<<k) | (a>>(64-k))

@inline function next_nostore((s0, s1, s2, s3))
    res = rotl(s0 + s3, 23) + s0
    t = s1 << 17
    s2 ⊻= s0
    s3 ⊻= s1
    s1 ⊻= s2
    s0 ⊻= s3
    s2 ⊻= t
    s3 = rotl(s3, 45)
    return (res, (s0, s1, s2, s3))
end

function XfillSimd_nostore(dst, xs::xstate{N}, num) where N
    lane = VecRange{N}(0)
    s = xs.s0, xs.s1, xs.s2, xs.s3
    @inbounds for i=1:N:num
        dst[lane+i], s = next_nostore(s)
    end
    xs.s0, xs.s1, xs.s2, xs.s3 = s
    nothing
end
N=1024; dst=zeros(UInt64, N);
xs = @new xstate{8}; init(xs)
@btime XfillSimd_nostore($dst, $xs, $N)

Note that this uses 8 independent RNGs for bulk generation: advance 8 states, generate 8 outputs, write 8 outputs to destination, repeat.

This is no issue, just needs some extra seeding logic (seed in bulk). In the julia code, this looks like 1 rng whose state consists of four <8 x UInt64> vectors.

PS. I also implemented the 5 lines of xoshiro256** in the gist. It is much slower, presumably due to the giant latency of the multiplication. The generated inner loop for 4 x xoshiro256++ is a beauty:

L144:
    vpsrlq  $41, %ymm4, %ymm6
    vpsllq  $23, %ymm4, %ymm4
    vpor    %ymm4, %ymm6, %ymm4
    vpaddq  %ymm0, %ymm4, %ymm4
    vmovdqa %ymm5, %ymm0
    movq    (%rdi), %rcx
    vmovdqu %ymm4, -8(%rcx,%rdx,8)
    cmpq    %rdx, %rax
    je  L266
    addq    $4, %rdx
    vpaddq  %ymm0, %ymm2, %ymm4
    vpsllq  $17, %ymm1, %ymm6
    vpxor   %ymm0, %ymm3, %ymm3
    vpxor   %ymm1, %ymm2, %ymm2
    vpxor   %ymm1, %ymm3, %ymm1
    vpxor   %ymm0, %ymm2, %ymm5
    vpxor   %ymm6, %ymm3, %ymm3
    vpsllq  $45, %ymm2, %ymm6
    vpsrlq  $19, %ymm2, %ymm2
    vpor    %ymm6, %ymm2, %ymm2
    jns L144

So, after consideration:

The proposed setup, if we wanted to switch to xoshiro256++, would be to reserve 5 cache-lines of RNG state (ach thread has its own completely independent generator). One CL is for sequential generation, and laid out like xstate{2}, with an implementation that only advances the first of the RNGs for 1-8 byte outputs, and advances both for 16-byte outputs. This is used in code that uses rand(some_small_bitstype).

Then, there are 4 cache-lines for bulk generations, laid out like xstate{8}.

These 4 cachelines aren't hot for code that doesn't use bulk generation, so nobody cares about state-size.

There sequential generator uses xstate{2} layout because I have no better idea what to do with the remaining half of the line, so we may well use it to speed up 16byte-wide generation; and there is no reason to compactify the layout of the primary generator (sequential generation of <= 8 byte numbers), since we can't use wide SIMD loads anyway.

The sequential generator only exists because: (1) sequential generation should only fill in a single cache-line of state, and (2) bulk generation wants to use wide vector loads/stores for the state. Code that uses both sequential and bulk generation will therefore fill 5 instead of 4 lines; doesn't matter, who cares.

The sequential generator should be properly inlined (verify that, and possibly add @inline). Unfortunately I see no good way of getting the compiler to optimize out the stores to xstate in tight loops; I assume that this is in order to ensure correct behaviour when the store traps? But we don't deal with segfaults anyway, so maybe there is some magic llvm annotation that tells llvm to not care about correct xstate after trapping? I am not opposed to writing the whole RNG in a giant llvmcall.

There is no reason to implement jump. We only need seed! and output generation.

How do we tell the allocator to give us 64-byte aligned mutable struct?

In total, I think this would be a good move, both for speed and in order to cut out the dependency. One would need to port a variant of my code over to Base-julia (i.e. without SIMD.jl) and verify that the resulting @code_native is OK on all supported archs, and that the 8 x wide SIMD doesn't hurt perf on weaker machines (i.e. check that the compiler doesn't spill registers). Someone with access to AVX512, POWER and various ARM should chime in at some point. Worst case we can stick with 4 x wide simd and save 2 cache lines. Downside is slightly lower perf with AVX2 and presumably massive perf degradation with AVX512. SIMD widths are increasing all the time, so being future-proof without breaking random stream would be nice.

On the other hand, if we change the default RNG at all, I would still prefer a cryptographic choice; if that is rejected and if scientific consensus says that xoshiro gives "good enough" streams, then that's good enough for me.

I don't get how you people got that speed between xoshiro256++ and dsfmt is even comparable.

Hey, I just ported naïvely @vigna's code to Julia, I wanted to have an actualized performance comparison compared to last year and with xoshiro256++ specifically. It was good enough for me, and I currently don't have the skills to write simd-optimized code! :D

What you did is amazing, with your benchmark above, this gets 2.6 times faster than my array generation routine.
If this is what you asked, here is my code:

using Random

mutable struct Xoropp <: AbstractRNG
    s0::UInt64
    s1::UInt64
    s2::UInt64
    s3::UInt64
end

rotl(x::UInt64, k::UInt) = (x << k) | (x >> ((64 % UInt) - k))

function Base.rand(rng::Xoropp, ::Random.SamplerType{UInt64})
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3

    result = rotl(s0 + s3, UInt(23)) + s0

    t = s1 << 17

    s2 ⊻= s0
    s3 ⊻= s1
    s1 ⊻= s2
    s0 ⊻= s3

    s2 ⊻= t

    s3 = rotl(s3, UInt(45))

    rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3

    result
end


# array generation code: nothing magic, but somehow gets quite faster than the default
# code for array, which does `rand(rng, UInt64)` in a loop
function Random.rand!(rng::Xoropp, A::AbstractArray{UInt64},
                      ::Random.SamplerType{UInt64})
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3

    @inbounds for i = eachindex(A)
        A[i] = rotl(s0 + s3, UInt(23)) + s0

        t = s1 << 17

        s2 ⊻= s0
        s3 ⊻= s1
        s1 ⊻= s2
        s0 ⊻= s3

        s2 ⊻= t

        s3 = rotl(s3, UInt(45))
    end
    rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3
    A
end

# for Float64 generation
Random.rng_native_52(::Xoropp) = UInt64

I can't really comment on your "proposed setup" above, as I have not much ideas above cache lines and such.

@chethega, if I'm following your description, it sounds like there would be two separate generator states in that situation—sequential and array. Does that mean that calling rand() would not advance the array generator state and calling rand(1000) would not advance the sequential generator state?

@chethega, if I'm following your description, it sounds like there would be two separate generator states in that situation—sequential and array. Does that mean that calling rand() would not advance the array generator state and calling rand(1000) would not advance the sequential generator state?

That is correct!

In even more detail, each thread would have 10 independent xoshiro instances. 1-64 bit sequential user requests would advance rng_1, 65+ bit sequential user requests would advance rng_1 and rng_2, and all bulk user requests would advance rng_3-rng_10. rng_1 and rng_2 would be laid out together in an xstate{2}, and the bulk generators would be laid out in an xstate{8}.

Hey, I just ported naïvely

Sorry, I didn't want to impugn your code. You do awesome work in keeping julia's random infrastructure together, and your APIs rock!

What you did is amazing

That's too much praise ;) I just read the assembly, noticed that it sucks, had the audacity to propose 10 interleaved xoshiro instances instead of a single one, and passed a Vec{8, UInt64} instead of a single UInt64 into the RNG function. Praise belongs more to @vigna for xoshiro and @eschnett for SIMD.jl. I guess we should also port the bulk generator to C, such that @vigna can use it as well. Would be a fun exercise in intel intrinsics (alternative: copy-paste the generated julia code and make it inline assembly. While at it, we can also clean it up: I see no reason that the hot loop has two conditional branches when we could do with a single one, probably some weird size minimization).

PS:

array generation code: nothing magic, but somehow gets quite faster than the default
code for array, which does rand(rng, UInt64) in a loop

That is because you hoisted writing and reading the state out of the loop. This is a super important optimization, and is the main property of my nostore variant. I feel fine with this for things with a pointer, but am unsure about AbstractArray: If the setindex! calls rand again, then we would generate duplicate random numbers. We don't know the setindex! implementation: Maybe the AbstractArray is really backed by a very complicated sparse array that internally uses randomized algorithms?

That is because you hoisted writing and reading the state out of the loop

I assumed so, but was surprised that the compiler doesn't do this seemingly easy optimization by itself. But your point about setindex! being allowed to interleave a call to the same generator helps me understand why this is not trivial.
(The faster version which assumes no weird setindex! should probably be made available to array implementations in some way).

Also, I realized that in my benchmarks above, I compared xoshiro++ to a local instance of MersenneTwister, not to the global RNG. In 1.3, with threading, its performance got a serious hit (by more than 2x), so it would be interesting to test xoshiro++ as the default global RNG.

So, please help me: why won't this be vectorized by gcc or clang? Here XOSHIRO256_UNROLL=8 and I'm using -O3 -mavx2...

static inline uint64_t next(void) {
    uint64_t t[XOSHIRO256_UNROLL];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) result[i] = rotl(s0[i] + s3[i], R) + s0[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) t[i] = s1[i] >> A;

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s2[i] ^= s0[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s3[i] ^= s1[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s1[i] ^= s2[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s0[i] ^= s3[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s2[i] ^= t[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s3[i] = rotl(s3[i], B);
    // This is just to avoid dead-code elimination
    return t[0] ^ t[XOSHIRO256_UNROLL-1];
}

Try running clang with -Rpass-missed=loop-vectorize or -Rpass-analysis=loop-vectorize. That should print out debugging information showing which loops were not vectorized, which statements stopped vectorization, etc...

Reference: https://llvm.org/docs/Vectorizers.html#diagnostics

I got this from clang:

./harness.c:83:11: remark: loop not vectorized: value that could not be identified as reduction is used outside the loop [-Rpass-analysis=loop-vectorize]
uint64_t e = -1;

But this is the external benchmarking loop. There's never a warning saying it's not optimizing the loops above. :(

From what I've read the problem would be that I'm using memory and not registers (i.e., variables aren't local).

Have you tried with -Rpass=loop-vectorize to see if it actually does think it's vectorizing the loop?

@vigna I don't know why your code didn't vectorize, but I am quite happy with clang's output for the following. The generated assembly is even nicer to read than julia's. Feel free to add this to your reference implementation, no attribution required.

#include <stdint.h>

#define XOSHIRO256_UNROLL 4

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

typedef struct {uint64_t s[4][XOSHIRO256_UNROLL];} xstate;

xstate global_state;

#define next_macro(result, s) { \
    uint64_t t[XOSHIRO256_UNROLL]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) result[i] = rotl(s[0][i] + s[3][i], 23) + s[0][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) t[i] = s[1][i] << 17; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[2][i] ^= s[0][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[3][i] ^= s[1][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[1][i] ^= s[2][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[0][i] ^= s[3][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[2][i] ^= t[i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[3][i] = rotl(s[3][i], 45); \
} \

void rngbulk(uint64_t* restrict dst, xstate* restrict state, unsigned int len){
    xstate slocal = *state;
    uint64_t result[XOSHIRO256_UNROLL];
    unsigned int len_trunc = len - (len % XOSHIRO256_UNROLL);
    for(int j = 0; j < len_trunc; j += XOSHIRO256_UNROLL){
        next_macro( (result) , (slocal.s) );
        for(int k=0; k<XOSHIRO256_UNROLL; k++) { dst[j+k] = result[k]; }
    }

    if(len_trunc != len){
        next_macro( (result) , (slocal.s) );
        for(int k = 0; k < len-len_trunc; k++){ dst[len_trunc + k] = result[k]; }
    }
    *state = slocal;
}

void rngbulk_globstate(uint64_t* restrict dst, unsigned int len){ rngbulk(dst, &global_state, len);}

edit: removed memory corruption if len is not divisble by XOSHIRO256_UNROLL.

Ah OK! Just switching to s[4][XOSHIRO256_UNROLL] instead of s0[XOSHIRO256_UNROLL], etc. made my code vectorize with gcc 💪🏻.

Now I have to understand why clang does yours and not mine, but we're getting there...

Please note that the >> A should be << A. 🤦🏻‍♂️

And, most fundamental, it will _not_ happen with -O3. It has to be -O2 -ftree-vectorize. There must be some other optimization that jumps in and prevents vectorization. Weird.

For me, GCC 8.3 generates fast(er) code with -O2 -ftree-vectorize -mavx2 and bad code with -O3 -ftree-vectorize -mavx2 for @vigna's XOSHIRO256_UNROLL (the bad code is about 33% slower).


With -O2 -ftree-vectorize -mavx2 I get:

4010ea: c5 fd 6f 2d 8e 2f 00    vmovdqa 0x2f8e(%rip),%ymm5        # 404080 <s>
  4010f1:   00 
  4010f2:   48 89 d8                mov    %rbx,%rax
  4010f5:   c5 fd 6f 0d 43 30 00    vmovdqa 0x3043(%rip),%ymm1        # 404140 <s+0xc0>
  4010fc:   00 
  4010fd:   c5 fd 6f 25 9b 2f 00    vmovdqa 0x2f9b(%rip),%ymm4        # 4040a0 <s+0x20>
  401104:   00 
  401105:   c5 fd 6f 05 53 30 00    vmovdqa 0x3053(%rip),%ymm0        # 404160 <s+0xe0>
  40110c:   00 
  40110d:   48 8d 93 00 00 01 00    lea    0x10000(%rbx),%rdx
  401114:   c5 fd 6f 1d a4 2f 00    vmovdqa 0x2fa4(%rip),%ymm3        # 4040c0 <s+0x40>
  40111b:   00 
  40111c:   c5 fd 6f 15 bc 2f 00    vmovdqa 0x2fbc(%rip),%ymm2        # 4040e0 <s+0x60>
  401123:   00 
  401124:   0f 1f 40 00             nopl   0x0(%rax)
  401128:   c5 d5 d4 f1             vpaddq %ymm1,%ymm5,%ymm6
  40112c:   c5 bd 73 f2 11          vpsllq $0x11,%ymm2,%ymm8
  401131:   c5 e5 ef c9             vpxor  %ymm1,%ymm3,%ymm1
  401135:   48 83 c0 40             add    $0x40,%rax
  401139:   c5 fd 7f 70 c0          vmovdqa %ymm6,-0x40(%rax)
  40113e:   c5 dd d4 f0             vpaddq %ymm0,%ymm4,%ymm6
  401142:   c5 ed ef c0             vpxor  %ymm0,%ymm2,%ymm0
  401146:   c5 d5 ef 3d b2 2f 00    vpxor  0x2fb2(%rip),%ymm5,%ymm7        # 404100 <s+0x80>
  40114d:   00 
  40114e:   c5 fd 7f 70 e0          vmovdqa %ymm6,-0x20(%rax)
  401153:   c5 f5 ef ed             vpxor  %ymm5,%ymm1,%ymm5
  401157:   c5 b5 73 f3 11          vpsllq $0x11,%ymm3,%ymm9
  40115c:   c5 dd ef 35 bc 2f 00    vpxor  0x2fbc(%rip),%ymm4,%ymm6        # 404120 <s+0xa0>
  401163:   00 
  401164:   c5 fd ef e4             vpxor  %ymm4,%ymm0,%ymm4
  401168:   c5 c5 ef db             vpxor  %ymm3,%ymm7,%ymm3
  40116c:   c4 c1 45 ef f9          vpxor  %ymm9,%ymm7,%ymm7
  401171:   c5 7d 7f 0c 24          vmovdqa %ymm9,(%rsp)
  401176:   c5 cd ef d2             vpxor  %ymm2,%ymm6,%ymm2
  40117a:   c4 c1 4d ef f0          vpxor  %ymm8,%ymm6,%ymm6
  40117f:   c5 7d 7f 44 24 20       vmovdqa %ymm8,0x20(%rsp)
  401185:   c5 fd 7f 35 93 2f 00    vmovdqa %ymm6,0x2f93(%rip)        # 404120 <s+0xa0>
  40118c:   00 
  40118d:   c5 cd 73 f1 2d          vpsllq $0x2d,%ymm1,%ymm6
  401192:   c5 f5 73 d1 13          vpsrlq $0x13,%ymm1,%ymm1
  401197:   c5 cd eb c9             vpor   %ymm1,%ymm6,%ymm1
  40119b:   c5 cd 73 f0 2d          vpsllq $0x2d,%ymm0,%ymm6
  4011a0:   c5 fd 7f 1d 18 2f 00    vmovdqa %ymm3,0x2f18(%rip)        # 4040c0 <s+0x40>
  4011a7:   00 
  4011a8:   c5 fd 73 d0 13          vpsrlq $0x13,%ymm0,%ymm0
  4011ad:   c5 fd 7f 15 2b 2f 00    vmovdqa %ymm2,0x2f2b(%rip)        # 4040e0 <s+0x60>
  4011b4:   00 
  4011b5:   c5 cd eb c0             vpor   %ymm0,%ymm6,%ymm0
  4011b9:   c5 fd 7f 2d bf 2e 00    vmovdqa %ymm5,0x2ebf(%rip)        # 404080 <s>
  4011c0:   00 
  4011c1:   c5 fd 7f 25 d7 2e 00    vmovdqa %ymm4,0x2ed7(%rip)        # 4040a0 <s+0x20>
  4011c8:   00 
  4011c9:   c5 fd 7f 3d 2f 2f 00    vmovdqa %ymm7,0x2f2f(%rip)        # 404100 <s+0x80>
  4011d0:   00 
  4011d1:   c5 fd 7f 0d 67 2f 00    vmovdqa %ymm1,0x2f67(%rip)        # 404140 <s+0xc0>
  4011d8:   00 
  4011d9:   c5 fd 7f 05 7f 2f 00    vmovdqa %ymm0,0x2f7f(%rip)        # 404160 <s+0xe0>

But with -O3 -ftree-vectorize -mavx2 I get:

 4010bf:    48 8b 1d ea 2f 00 00    mov    0x2fea(%rip),%rbx        # 4040b0 <s+0x30>
  4010c6:   48 8b 05 b3 2f 00 00    mov    0x2fb3(%rip),%rax        # 404080 <s>
  4010cd:   4c 8b 05 6c 30 00 00    mov    0x306c(%rip),%r8        # 404140 <s+0xc0>
  4010d4:   48 8b 3d 6d 30 00 00    mov    0x306d(%rip),%rdi        # 404148 <s+0xc8>
  4010db:   48 89 9c 24 b0 00 00    mov    %rbx,0xb0(%rsp)
  4010e2:   00 
  4010e3:   48 8b 1d 86 30 00 00    mov    0x3086(%rip),%rbx        # 404170 <s+0xf0>
  4010ea:   48 89 84 24 b8 00 00    mov    %rax,0xb8(%rsp)
  4010f1:   00 
  4010f2:   48 8b 05 8f 2f 00 00    mov    0x2f8f(%rip),%rax        # 404088 <s+0x8>
  4010f9:   48 89 9c 24 a8 00 00    mov    %rbx,0xa8(%rsp)
  401100:   00 
  401101:   48 8b 1d b0 2f 00 00    mov    0x2fb0(%rip),%rbx        # 4040b8 <s+0x38>
  401108:   48 89 84 24 80 00 00    mov    %rax,0x80(%rsp)
  40110f:   00 
  401110:   48 8b 05 79 2f 00 00    mov    0x2f79(%rip),%rax        # 404090 <s+0x10>
  401117:   48 89 9c 24 a0 00 00    mov    %rbx,0xa0(%rsp)
  40111e:   00 
  40111f:   48 8b 1d 52 30 00 00    mov    0x3052(%rip),%rbx        # 404178 <s+0xf8>
  401126:   48 89 44 24 78          mov    %rax,0x78(%rsp)
  40112b:   48 8b 05 66 2f 00 00    mov    0x2f66(%rip),%rax        # 404098 <s+0x18>
  401132:   48 89 9c 24 98 00 00    mov    %rbx,0x98(%rsp)
  401139:   00 
  40113a:   48 8b 1d 7f 2f 00 00    mov    0x2f7f(%rip),%rbx        # 4040c0 <s+0x40>
  401141:   48 89 44 24 70          mov    %rax,0x70(%rsp)
  401146:   48 8b 05 53 2f 00 00    mov    0x2f53(%rip),%rax        # 4040a0 <s+0x20>
  40114d:   48 89 9c 24 f8 00 00    mov    %rbx,0xf8(%rsp)
  401154:   00 
  401155:   48 8b 1d 6c 2f 00 00    mov    0x2f6c(%rip),%rbx        # 4040c8 <s+0x48>
  40115c:   48 89 84 24 88 00 00    mov    %rax,0x88(%rsp)
  401163:   00 
  401164:   48 8b 05 3d 2f 00 00    mov    0x2f3d(%rip),%rax        # 4040a8 <s+0x28>
  40116b:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  401172:   00 
  401173:   48 8b 1d 56 2f 00 00    mov    0x2f56(%rip),%rbx        # 4040d0 <s+0x50>
  40117a:   48 89 84 24 90 00 00    mov    %rax,0x90(%rsp)
  401181:   00 
  401182:   48 8b 35 c7 2f 00 00    mov    0x2fc7(%rip),%rsi        # 404150 <s+0xd0>
  401189:   48 8b 05 d8 2f 00 00    mov    0x2fd8(%rip),%rax        # 404168 <s+0xe8>
  401190:   48 8b 0d c1 2f 00 00    mov    0x2fc1(%rip),%rcx        # 404158 <s+0xd8>
  401197:   48 89 9c 24 c8 00 00    mov    %rbx,0xc8(%rsp)
  40119e:   00 
  40119f:   48 8b 15 ba 2f 00 00    mov    0x2fba(%rip),%rdx        # 404160 <s+0xe0>
  4011a6:   48 8b 1d 2b 2f 00 00    mov    0x2f2b(%rip),%rbx        # 4040d8 <s+0x58>
  4011ad:   c7 44 24 1c 00 04 00    movl   $0x400,0x1c(%rsp)
  4011b4:   00 
  4011b5:   48 89 44 24 68          mov    %rax,0x68(%rsp)
  4011ba:   4c 8b 3d 3f 2f 00 00    mov    0x2f3f(%rip),%r15        # 404100 <s+0x80>
  4011c1:   48 89 9c 24 e8 00 00    mov    %rbx,0xe8(%rsp)
  4011c8:   00 
  4011c9:   48 8b 1d 10 2f 00 00    mov    0x2f10(%rip),%rbx        # 4040e0 <s+0x60>
  4011d0:   4c 8b 35 31 2f 00 00    mov    0x2f31(%rip),%r14        # 404108 <s+0x88>
  4011d7:   4c 8b 2d 32 2f 00 00    mov    0x2f32(%rip),%r13        # 404110 <s+0x90>
  4011de:   48 89 9c 24 d0 00 00    mov    %rbx,0xd0(%rsp)
  4011e5:   00 
  4011e6:   48 8b 1d fb 2e 00 00    mov    0x2efb(%rip),%rbx        # 4040e8 <s+0x68>
  4011ed:   4c 8b 25 24 2f 00 00    mov    0x2f24(%rip),%r12        # 404118 <s+0x98>
  4011f4:   4c 8b 1d 2d 2f 00 00    mov    0x2f2d(%rip),%r11        # 404128 <s+0xa8>
  4011fb:   48 89 9c 24 d8 00 00    mov    %rbx,0xd8(%rsp)
  401202:   00 
  401203:   48 8b 1d e6 2e 00 00    mov    0x2ee6(%rip),%rbx        # 4040f0 <s+0x70>
  40120a:   4c 8b 15 1f 2f 00 00    mov    0x2f1f(%rip),%r10        # 404130 <s+0xb0>
  401211:   4c 8b 0d 20 2f 00 00    mov    0x2f20(%rip),%r9        # 404138 <s+0xb8>
  401218:   48 89 9c 24 e0 00 00    mov    %rbx,0xe0(%rsp)
  40121f:   00 
  401220:   48 8b 1d d1 2e 00 00    mov    0x2ed1(%rip),%rbx        # 4040f8 <s+0x78>
  401227:   48 89 9c 24 f0 00 00    mov    %rbx,0xf0(%rsp)
  40122e:   00 
  40122f:   48 8b 1d ea 2e 00 00    mov    0x2eea(%rip),%rbx        # 404120 <s+0xa0>
  401236:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40123d:   00 00 00 
  401240:   48 8b 84 24 f8 00 00    mov    0xf8(%rsp),%rax
  401247:   00 
  401248:   4c 33 bc 24 b8 00 00    xor    0xb8(%rsp),%r15
  40124f:   00 
  401250:   4c 33 b4 24 80 00 00    xor    0x80(%rsp),%r14
  401257:   00 
  401258:   4c 33 6c 24 78          xor    0x78(%rsp),%r13
  40125d:   48 c1 e0 11             shl    $0x11,%rax
  401261:   4c 33 64 24 70          xor    0x70(%rsp),%r12
  401266:   48 33 9c 24 88 00 00    xor    0x88(%rsp),%rbx
  40126d:   00 
  40126e:   48 89 44 24 58          mov    %rax,0x58(%rsp)
  401273:   48 8b 84 24 c0 00 00    mov    0xc0(%rsp),%rax
  40127a:   00 
  40127b:   4c 33 9c 24 90 00 00    xor    0x90(%rsp),%r11
  401282:   00 
  401283:   4c 33 94 24 b0 00 00    xor    0xb0(%rsp),%r10
  40128a:   00 
  40128b:   48 c1 e0 11             shl    $0x11,%rax
  40128f:   4c 33 8c 24 a0 00 00    xor    0xa0(%rsp),%r9
  401296:   00 
  401297:   48 33 bc 24 c0 00 00    xor    0xc0(%rsp),%rdi
  40129e:   00 
  40129f:   48 89 44 24 50          mov    %rax,0x50(%rsp)
  4012a4:   48 8b 84 24 c8 00 00    mov    0xc8(%rsp),%rax
  4012ab:   00 
  4012ac:   48 33 b4 24 c8 00 00    xor    0xc8(%rsp),%rsi
  4012b3:   00 
  4012b4:   48 33 8c 24 e8 00 00    xor    0xe8(%rsp),%rcx
  4012bb:   00 
  4012bc:   48 c1 e0 11             shl    $0x11,%rax
  4012c0:   4c 33 84 24 f8 00 00    xor    0xf8(%rsp),%r8
  4012c7:   00 
  4012c8:   48 89 44 24 48          mov    %rax,0x48(%rsp)
  4012cd:   48 8b 84 24 e8 00 00    mov    0xe8(%rsp),%rax
  4012d4:   00 
  4012d5:   48 c1 e0 11             shl    $0x11,%rax
  4012d9:   48 89 44 24 40          mov    %rax,0x40(%rsp)
  4012de:   48 8b 84 24 d0 00 00    mov    0xd0(%rsp),%rax
  4012e5:   00 
  4012e6:   48 c1 e0 11             shl    $0x11,%rax
  4012ea:   48 89 44 24 38          mov    %rax,0x38(%rsp)
  4012ef:   48 8b 84 24 d8 00 00    mov    0xd8(%rsp),%rax
  4012f6:   00 
  4012f7:   48 c1 e0 11             shl    $0x11,%rax
  4012fb:   48 89 44 24 30          mov    %rax,0x30(%rsp)
  401300:   48 8b 84 24 e0 00 00    mov    0xe0(%rsp),%rax
  401307:   00 
  401308:   48 c1 e0 11             shl    $0x11,%rax
  40130c:   48 89 44 24 28          mov    %rax,0x28(%rsp)
  401311:   48 8b 84 24 f0 00 00    mov    0xf0(%rsp),%rax
  401318:   00 
  401319:   48 c1 e0 11             shl    $0x11,%rax
  40131d:   48 89 44 24 20          mov    %rax,0x20(%rsp)
  401322:   48 8b 84 24 d0 00 00    mov    0xd0(%rsp),%rax
  401329:   00 
  40132a:   48 31 d0                xor    %rdx,%rax
  40132d:   48 89 44 24 60          mov    %rax,0x60(%rsp)
  401332:   48 8b 44 24 68          mov    0x68(%rsp),%rax
  401337:   48 33 84 24 d8 00 00    xor    0xd8(%rsp),%rax
  40133e:   00 
  40133f:   48 8b 94 24 a8 00 00    mov    0xa8(%rsp),%rdx
  401346:   00 
  401347:   48 33 94 24 e0 00 00    xor    0xe0(%rsp),%rdx
  40134e:   00 
  40134f:   4c 31 bc 24 f8 00 00    xor    %r15,0xf8(%rsp)
  401356:   00 
  401357:   48 89 94 24 a8 00 00    mov    %rdx,0xa8(%rsp)
  40135e:   00 
  40135f:   48 8b 94 24 98 00 00    mov    0x98(%rsp),%rdx
  401366:   00 
  401367:   48 33 94 24 f0 00 00    xor    0xf0(%rsp),%rdx
  40136e:   00 
  40136f:   4c 31 b4 24 c0 00 00    xor    %r14,0xc0(%rsp)
  401376:   00 
  401377:   4c 31 ac 24 c8 00 00    xor    %r13,0xc8(%rsp)
  40137e:   00 
  40137f:   4c 31 a4 24 e8 00 00    xor    %r12,0xe8(%rsp)
  401386:   00 
  401387:   48 31 9c 24 d0 00 00    xor    %rbx,0xd0(%rsp)
  40138e:   00 
  40138f:   4c 31 9c 24 d8 00 00    xor    %r11,0xd8(%rsp)
  401396:   00 
  401397:   4c 31 94 24 e0 00 00    xor    %r10,0xe0(%rsp)
  40139e:   00 
  40139f:   4c 31 8c 24 f0 00 00    xor    %r9,0xf0(%rsp)
  4013a6:   00 
  4013a7:   48 89 94 24 98 00 00    mov    %rdx,0x98(%rsp)
  4013ae:   00 
  4013af:   48 8b 54 24 60          mov    0x60(%rsp),%rdx
  4013b4:   4c 31 84 24 b8 00 00    xor    %r8,0xb8(%rsp)
  4013bb:   00 
  4013bc:   49 c1 c8 13             ror    $0x13,%r8
  4013c0:   48 31 bc 24 80 00 00    xor    %rdi,0x80(%rsp)
  4013c7:   00 
  4013c8:   48 c1 cf 13             ror    $0x13,%rdi
  4013cc:   48 31 74 24 78          xor    %rsi,0x78(%rsp)
  4013d1:   48 c1 ce 13             ror    $0x13,%rsi
  4013d5:   48 31 4c 24 70          xor    %rcx,0x70(%rsp)
  4013da:   48 c1 c9 13             ror    $0x13,%rcx
  4013de:   48 31 94 24 88 00 00    xor    %rdx,0x88(%rsp)
  4013e5:   00 
  4013e6:   48 8b 94 24 a8 00 00    mov    0xa8(%rsp),%rdx
  4013ed:   00 
  4013ee:   48 31 84 24 90 00 00    xor    %rax,0x90(%rsp)
  4013f5:   00 
  4013f6:   48 c1 c8 13             ror    $0x13,%rax
  4013fa:   4c 33 7c 24 58          xor    0x58(%rsp),%r15
  4013ff:   4c 33 74 24 50          xor    0x50(%rsp),%r14
  401404:   4c 33 6c 24 48          xor    0x48(%rsp),%r13
  401409:   48 89 44 24 68          mov    %rax,0x68(%rsp)
  40140e:   48 8b 84 24 a8 00 00    mov    0xa8(%rsp),%rax
  401415:   00 
  401416:   48 31 94 24 b0 00 00    xor    %rdx,0xb0(%rsp)
  40141d:   00 
  40141e:   48 8b 94 24 98 00 00    mov    0x98(%rsp),%rdx
  401425:   00 
  401426:   48 31 94 24 a0 00 00    xor    %rdx,0xa0(%rsp)
  40142d:   00 
  40142e:   48 c1 c8 13             ror    $0x13,%rax
  401432:   48 8b 54 24 60          mov    0x60(%rsp),%rdx
  401437:   4c 33 64 24 40          xor    0x40(%rsp),%r12
  40143c:   48 89 84 24 a8 00 00    mov    %rax,0xa8(%rsp)
  401443:   00 
  401444:   48 8b 84 24 98 00 00    mov    0x98(%rsp),%rax
  40144b:   00 
  40144c:   48 33 5c 24 38          xor    0x38(%rsp),%rbx
  401451:   4c 33 5c 24 30          xor    0x30(%rsp),%r11
  401456:   48 c1 ca 13             ror    $0x13,%rdx
  40145a:   48 c1 c8 13             ror    $0x13,%rax
  40145e:   4c 33 54 24 28          xor    0x28(%rsp),%r10
  401463:   4c 33 4c 24 20          xor    0x20(%rsp),%r9
  401468:   ff 4c 24 1c             decl   0x1c(%rsp)
  40146c:   48 89 84 24 98 00 00    mov    %rax,0x98(%rsp)
  401473:   00 
  401474:   0f 85 c6 fd ff ff       jne    401240 <main+0x1d0>
  40147a:   c5 fa 7e 54 24 78       vmovq  0x78(%rsp),%xmm2
  401480:   48 8b 44 24 68          mov    0x68(%rsp),%rax
  401485:   c4 e3 e9 22 4c 24 70    vpinsrq $0x1,0x70(%rsp),%xmm2,%xmm1
  40148c:   01 
  40148d:   c5 fa 7e 9c 24 b8 00    vmovq  0xb8(%rsp),%xmm3
  401494:   00 00 
  401496:   c4 e3 e1 22 84 24 80    vpinsrq $0x1,0x80(%rsp),%xmm3,%xmm0
  40149d:   00 00 00 01 
  4014a1:   c5 fa 7e a4 24 b0 00    vmovq  0xb0(%rsp),%xmm4
  4014a8:   00 00 
  4014aa:   c5 fa 7e ac 24 88 00    vmovq  0x88(%rsp),%xmm5
  4014b1:   00 00 
  4014b3:   c5 fa 7e b4 24 c8 00    vmovq  0xc8(%rsp),%xmm6
  4014ba:   00 00 
  4014bc:   c5 fa 7e bc 24 f8 00    vmovq  0xf8(%rsp),%xmm7
  4014c3:   00 00 
  4014c5:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4014cb:   c4 e3 d9 22 8c 24 a0    vpinsrq $0x1,0xa0(%rsp),%xmm4,%xmm1
  4014d2:   00 00 00 01 
  4014d6:   c4 c1 f9 6e e5          vmovq  %r13,%xmm4
  4014db:   c5 fd 7f 05 9d 2b 00    vmovdqa %ymm0,0x2b9d(%rip)        # 404080 <s>
  4014e2:   00 
  4014e3:   c4 e3 d1 22 84 24 90    vpinsrq $0x1,0x90(%rsp),%xmm5,%xmm0
  4014ea:   00 00 00 01 
  4014ee:   c4 c1 f9 6e ef          vmovq  %r15,%xmm5
  4014f3:   c5 fa 7e 94 24 e0 00    vmovq  0xe0(%rsp),%xmm2
  4014fa:   00 00 
  4014fc:   c5 fa 7e 9c 24 d0 00    vmovq  0xd0(%rsp),%xmm3
  401503:   00 00 
  401505:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  40150b:   c4 e3 c9 22 8c 24 e8    vpinsrq $0x1,0xe8(%rsp),%xmm6,%xmm1
  401512:   00 00 00 01 
  401516:   c4 c1 f9 6e f2          vmovq  %r10,%xmm6
  40151b:   c5 fd 7f 05 7d 2b 00    vmovdqa %ymm0,0x2b7d(%rip)        # 4040a0 <s+0x20>
  401522:   00 
  401523:   c4 e3 c1 22 84 24 c0    vpinsrq $0x1,0xc0(%rsp),%xmm7,%xmm0
  40152a:   00 00 00 01 
  40152e:   c4 e1 f9 6e fb          vmovq  %rbx,%xmm7
  401533:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  401539:   c4 e3 e9 22 8c 24 f0    vpinsrq $0x1,0xf0(%rsp),%xmm2,%xmm1
  401540:   00 00 00 01 
  401544:   c4 e1 f9 6e d6          vmovq  %rsi,%xmm2
  401549:   c5 fd 7f 05 6f 2b 00    vmovdqa %ymm0,0x2b6f(%rip)        # 4040c0 <s+0x40>
  401550:   00 
  401551:   c4 e3 e1 22 84 24 d8    vpinsrq $0x1,0xd8(%rsp),%xmm3,%xmm0
  401558:   00 00 00 01 
  40155c:   c4 c1 f9 6e d8          vmovq  %r8,%xmm3
  401561:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  401567:   c4 c3 d9 22 cc 01       vpinsrq $0x1,%r12,%xmm4,%xmm1
  40156d:   c5 fa 7e a4 24 a8 00    vmovq  0xa8(%rsp),%xmm4
  401574:   00 00 
  401576:   c5 fd 7f 05 62 2b 00    vmovdqa %ymm0,0x2b62(%rip)        # 4040e0 <s+0x60>
  40157d:   00 
  40157e:   c4 c3 d1 22 c6 01       vpinsrq $0x1,%r14,%xmm5,%xmm0
  401584:   c4 e1 f9 6e ea          vmovq  %rdx,%xmm5
  401589:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  40158f:   c4 c3 c9 22 c9 01       vpinsrq $0x1,%r9,%xmm6,%xmm1
  401595:   c5 fd 7f 05 63 2b 00    vmovdqa %ymm0,0x2b63(%rip)        # 404100 <s+0x80>
  40159c:   00 
  40159d:   c4 c3 c1 22 c3 01       vpinsrq $0x1,%r11,%xmm7,%xmm0
  4015a3:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4015a9:   c4 e3 e9 22 c9 01       vpinsrq $0x1,%rcx,%xmm2,%xmm1
  4015af:   c5 fd 7f 05 69 2b 00    vmovdqa %ymm0,0x2b69(%rip)        # 404120 <s+0xa0>
  4015b6:   00 
  4015b7:   c4 e3 e1 22 c7 01       vpinsrq $0x1,%rdi,%xmm3,%xmm0
  4015bd:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4015c3:   c4 e3 d9 22 8c 24 98    vpinsrq $0x1,0x98(%rsp),%xmm4,%xmm1
  4015ca:   00 00 00 01 
  4015ce:   c5 fd 7f 05 6a 2b 00    vmovdqa %ymm0,0x2b6a(%rip)        # 404140 <s+0xc0>
  4015d5:   00 
  4015d6:   c4 e3 d1 22 c0 01       vpinsrq $0x1,%rax,%xmm5,%xmm0
  4015dc:   c4 e3 7d 38 c1 01       vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4015e2:   c5 fd 7f 05 76 2b 00    vmovdqa %ymm0,0x2b76(%rip)        # 404160 <s+0xe0>

And, most fundamental, it will _not_ happen with -O3. It has to be -O2 -ftree-vectorize. There must be some other optimization that jumps in and prevents vectorization. Weird.

Are you using gcc? Most often when I get vectorization with -O2 -ftree-vectorize but not -O3, I do get it with -O3 -fdisable-tree-cunrolli. This happens with manually unrolled code.


In VectorizedRNG.jl I have a vectorized PCG implementation. It is a lot slower than xoshiro. With AVX2, it is also about 50% slower than the default MersenneTwister, but it is faster with AVX512.
Perhaps also of interest is that I implemented a SIMD random normal generator using the Box-Muller algorithm. It is close to 1.5x and 4x faster than the default with AVX2 and AVX512, respectively. I suspect it can be improved further.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dpsanders picture dpsanders  ·  3Comments

TotalVerb picture TotalVerb  ·  3Comments

omus picture omus  ·  3Comments

tkoolen picture tkoolen  ·  3Comments

sbromberger picture sbromberger  ·  3Comments