Rust: SIMD-enabled utf-8 validation

Created on 22 Jan 2020 · 26 comments · Source: rust-lang/rust

Introduction

The "Parsing Gigabytes of JSON per second" post (ArXiv - langdale, lemire) proposes a novel approach for parsing JSON that is fast enough that on many systems it moves the bottleneck to the disk and network instead of the parser. This is done through the clever use of SIMD instructions.

Something that stood out to me from the post is that JSON is required to be valid utf-8, and the authors came up with new algorithms to validate utf-8 using SIMD instructions that run much faster than conventional approaches.

Since rustc does a lot of utf-8 validation (each .rs source file needs to be valid utf-8), this got me curious about what rustc currently does. Validation seems to be done by the following routine:

https://github.com/rust-lang/rust/blob/2f688ac602d50129388bb2a5519942049096cbff/src/libcore/str/mod.rs#L1500-L1618

This doesn't appear to use SIMD anywhere, not even conditionally. But it's run a lot, so it seemed like an interesting candidate for a more efficient algorithm.
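
For context, here is a rough sketch of the shape of a "conventional" byte-at-a-time validator. This is my own simplification for illustration, not the libcore routine linked above; it omits the overlong-encoding and surrogate checks a real validator performs:

fn validate_scalar(bytes: &[u8]) -> bool {
    let mut i = 0;
    while i < bytes.len() {
        // The leading byte determines how long the sequence must be.
        let width = match bytes[i] {
            0x00..=0x7F => 1,  // ASCII
            0xC2..=0xDF => 2,  // two-byte sequence
            0xE0..=0xEF => 3,  // three-byte sequence
            0xF0..=0xF4 => 4,  // four-byte sequence
            _ => return false, // stray continuation byte or invalid lead byte
        };
        if i + width > bytes.len() {
            return false; // truncated sequence at end of input
        }
        // Every continuation byte must be in 0x80..=0xBF.
        if bytes[i + 1..i + width].iter().any(|&b| b & 0xC0 != 0x80) {
            return false;
        }
        i += width;
    }
    true
}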

Performance improvements

The post "Validating UTF-8 strings using as little as 0.7 cycles per byte" shows about an order of magnitude performance improvement on validating utf-8, going from 8 cycles per byte parsed to 0.7 cycles per byte parsed.

When passing Rust's validation code through the Godbolt compiler explorer, from_utf8_unchecked compiles to 7 instructions, while from_utf8 compiles to 57. In the case of from_utf8, most instructions occur inside a loop, which makes it likely we could observe a performance improvement from a SIMD-enabled utf-8 validation algorithm. Economies of scale apply here as well: it's not uncommon for the compiler to parse several million bytes of input in a run, so any improvement would quickly add up.

_All examples linked have been compiled with -O -C target-cpu=native._

Ecosystem libraries such as serde_json also perform utf-8 validation in several places, so they would likely benefit from performance improvements to Rust's utf-8 validation routines as well.

Implementation

There are two known Rust implementations of Lemire's algorithm available today.

The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this). But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.

I'm not overly familiar with rustc's internals, but it seems we would likely want to keep the current algorithm and enable SIMD algorithms through feature detection. The simdjson library has different algorithms for different architectures, but we could probably start with instruction sets that are widely available and supported on tier-1 targets (such as AVX2).
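
To illustrate the shape of that, here is a minimal sketch of runtime feature detection on x86_64. validate_utf8_avx2 is a hypothetical stand-in for a port of Lemire's algorithm, not an existing API; its body below just delegates to the scalar check:

fn validate_utf8(bytes: &[u8]) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: guarded by the runtime AVX2 check above.
            return unsafe { validate_utf8_avx2(bytes) };
        }
    }
    // Portable fallback: the existing scalar validator.
    std::str::from_utf8(bytes).is_ok()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn validate_utf8_avx2(bytes: &[u8]) -> bool {
    // Placeholder body; a real port would use core::arch::x86_64 intrinsics.
    std::str::from_utf8(bytes).is_ok()
}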

These changes wouldn't require an RFC because no APIs would change. The only outcome would be a performance improvement.

Future work

Lemire's post also covers validating ASCII in as little as 0.1 cycles per byte. Rust's current ASCII validation algorithm checks bytes one at a time, and could likely benefit from similar optimizations (a sketch follows the linked code):

https://github.com/rust-lang/rust/blob/2f688ac602d50129388bb2a5519942049096cbff/src/libcore/str/mod.rs#L4136-L4141
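
Even without SIMD intrinsics, a word-at-a-time version hints at the headroom. This is a sketch of one possible approach, not the form an eventual patch would have to take: ASCII bytes never have the high bit set, so eight bytes can be tested per iteration.

use std::convert::TryInto;

fn is_ascii_fast(bytes: &[u8]) -> bool {
    // OR together the high bits of eight bytes at a time; any set high bit
    // means a non-ASCII byte.
    let mut chunks = bytes.chunks_exact(8);
    for chunk in &mut chunks {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        if word & 0x8080_8080_8080_8080 != 0 {
            return false;
        }
    }
    // Check the remaining (fewer than 8) bytes individually.
    chunks.remainder().iter().all(|b| b.is_ascii())
}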

Speeding this up would have ecosystem implications as well. For example, HTTP headers must be valid ASCII and are often performance-sensitive. If the stdlib sped up ASCII validation, the wider ecosystem would likely benefit too.

Conclusion

In this issue I propose using a SIMD-enabled algorithm for utf-8 validation in rustc. This seems like an interesting avenue to explore, since there's a reasonable chance it might yield a performance improvement for many Rust programs.

I'm somewhat excited to have stumbled upon this, but was also surprised no issue had been filed for it yet. I'm a bit self-conscious posting this since I'm not a rustc compiler engineer, but I hope it proves useful!

cc/ @jonas-schievink @nnethercote

Labels: A-simd, A-unicode, C-enhancement, T-libs

All 26 comments

assembly for str::from_utf8 (godbolt) - 57 lines

This does not do any validation; it just calls core::str::from_utf8, which is not part of the assembly.

@CryZe ah, my bad. Yeah I was confused about the exact output, so for good measure I also copied over std's algorithm into godbolt (third link) to see what would happen. Thanks for clarifying what's actually going on!

UTF-8 validation is hopefully not where the compiler spends its time :)

However, I could imagine this having some impact on "smallest possible" compile times (e.g., UI tests, hello world).

My recommendation is to replace the algorithm in core::str::from_utf8 (or wherever it is in core) with direct use of AVX2 or some similar instruction set, and we can then run that by perf.rust-lang.org as a loose benchmark. That's likely not tenable in reality (we would need to gate use of SIMD instructions conditionally, likely at runtime), but I believe it would give the best possible performance wins (since it would apply to all uses of from_utf8).

Here's pretty much a 1:1 port:
https://godbolt.org/z/NSop8w

Going to add the benchmarks from isutf8 (run on my machine with RUSTFLAGS='-C target-cpu=native' cargo +nightly bench):

     Running target/release/deps/validate_utf8-6efda746d5fb1b3a
Gnuplot not found, disabling plotting
random_bytes/libcore    time:   [4.9977 ns 5.0048 ns 5.0117 ns]                                  
                        thrpt:  [ 97428 GiB/s  97562 GiB/s  97702 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
random_bytes/lemire_sse time:   [69.970 us 69.985 us 70.000 us]                                    
                        thrpt:  [6.9755 GiB/s 6.9769 GiB/s 6.9784 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
random_bytes/lemire_avx time:   [62.055 us 62.084 us 62.109 us]                                    
                        thrpt:  [7.8617 GiB/s 7.8649 GiB/s 7.8685 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild
random_bytes/lemire_avx_ascii_path                                                                            
                        time:   [62.549 us 62.582 us 62.615 us]
                        thrpt:  [7.7982 GiB/s 7.8023 GiB/s 7.8064 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
random_bytes/range_sse  time:   [62.763 us 62.772 us 62.782 us]                                   
                        thrpt:  [7.7774 GiB/s 7.7787 GiB/s 7.7798 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
random_bytes/range_avx  time:   [45.737 us 45.746 us 45.755 us]                                    
                        thrpt:  [10.672 GiB/s 10.674 GiB/s 10.676 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

mostly_ascii/libcore    time:   [166.94 ns 167.23 ns 167.54 ns]                                 
                        thrpt:  [17.988 GiB/s 18.022 GiB/s 18.053 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
mostly_ascii/lemire_sse time:   [430.20 ns 430.45 ns 430.68 ns]                                    
                        thrpt:  [6.9977 GiB/s 7.0014 GiB/s 7.0054 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
mostly_ascii/lemire_avx time:   [382.89 ns 383.07 ns 383.25 ns]                                    
                        thrpt:  [7.8637 GiB/s 7.8673 GiB/s 7.8711 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
mostly_ascii/lemire_avx_ascii_path                                                                            
                        time:   [65.208 ns 65.255 ns 65.311 ns]
                        thrpt:  [46.145 GiB/s 46.184 GiB/s 46.218 GiB/s]
Found 24 outliers among 100 measurements (24.00%)
  2 (2.00%) low severe
  12 (12.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe
mostly_ascii/range_sse  time:   [384.43 ns 384.64 ns 384.87 ns]                                   
                        thrpt:  [7.8307 GiB/s 7.8352 GiB/s 7.8395 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
mostly_ascii/range_avx  time:   [286.05 ns 286.26 ns 286.47 ns]                                   
                        thrpt:  [10.521 GiB/s 10.528 GiB/s 10.536 GiB/s]

ascii/libcore           time:   [72.975 ns 73.076 ns 73.189 ns]                          
                        thrpt:  [39.307 GiB/s 39.368 GiB/s 39.422 GiB/s]
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe
ascii/lemire_sse        time:   [423.11 ns 423.35 ns 423.62 ns]                             
                        thrpt:  [6.7912 GiB/s 6.7954 GiB/s 6.7993 GiB/s]
ascii/lemire_avx        time:   [373.82 ns 374.45 ns 375.43 ns]                             
                        thrpt:  [7.6628 GiB/s 7.6830 GiB/s 7.6958 GiB/s]
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) low mild
  1 (1.00%) high severe
ascii/lemire_avx_ascii_path                                                                            
                        time:   [50.353 ns 50.588 ns 50.925 ns]
                        thrpt:  [56.492 GiB/s 56.869 GiB/s 57.133 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
ascii/range_sse         time:   [375.11 ns 375.87 ns 376.96 ns]                            
                        thrpt:  [7.6318 GiB/s 7.6538 GiB/s 7.6695 GiB/s]
Found 35 outliers among 100 measurements (35.00%)
  23 (23.00%) low severe
  8 (8.00%) high mild
  4 (4.00%) high severe
ascii/range_avx         time:   [272.39 ns 272.59 ns 272.82 ns]                            
                        thrpt:  [10.545 GiB/s 10.554 GiB/s 10.562 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

utf8/libcore            time:   [9.0154 us 9.0263 us 9.0389 us]                          
                        thrpt:  [1.7096 GiB/s 1.7119 GiB/s 1.7140 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
utf8/lemire_sse         time:   [2.1554 us 2.1568 us 2.1581 us]                             
                        thrpt:  [7.1601 GiB/s 7.1645 GiB/s 7.1693 GiB/s]
Found 15 outliers among 100 measurements (15.00%)
  11 (11.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
utf8/lemire_avx         time:   [1.9184 us 1.9188 us 1.9192 us]                             
                        thrpt:  [8.0515 GiB/s 8.0530 GiB/s 8.0547 GiB/s]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
utf8/lemire_avx_ascii_path                                                                             
                        time:   [1.4670 us 1.4679 us 1.4691 us]
                        thrpt:  [10.518 GiB/s 10.527 GiB/s 10.534 GiB/s]
utf8/range_sse          time:   [1.9426 us 1.9452 us 1.9491 us]                            
                        thrpt:  [7.9280 GiB/s 7.9439 GiB/s 7.9544 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
utf8/range_avx          time:   [1.4656 us 1.4733 us 1.4833 us]                            
                        thrpt:  [10.418 GiB/s 10.489 GiB/s 10.544 GiB/s]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  9 (9.00%) high severe

Benchmarking all_utf8/libcore: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.7s or reduce sample count to 50
all_utf8/libcore        time:   [1.8757 ms 1.8790 ms 1.8832 ms]                              
                        thrpt:  [2.1672 GiB/s 2.1721 GiB/s 2.1759 GiB/s]
all_utf8/lemire_sse     time:   [586.47 us 586.93 us 587.47 us]                                
                        thrpt:  [6.9474 GiB/s 6.9538 GiB/s 6.9593 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
all_utf8/lemire_avx     time:   [517.80 us 518.97 us 521.39 us]                                
                        thrpt:  [7.8279 GiB/s 7.8644 GiB/s 7.8822 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
all_utf8/lemire_avx_ascii_path                                                                            
                        time:   [523.97 us 524.27 us 524.63 us]
                        thrpt:  [7.7796 GiB/s 7.7849 GiB/s 7.7894 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
all_utf8/range_sse      time:   [525.94 us 527.21 us 528.57 us]                               
                        thrpt:  [7.7216 GiB/s 7.7415 GiB/s 7.7601 GiB/s]
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  8 (8.00%) high mild
  11 (11.00%) high severe
all_utf8/range_avx      time:   [392.25 us 392.91 us 393.80 us]                               
                        thrpt:  [10.364 GiB/s 10.388 GiB/s 10.405 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe

all_utf8_with_garbage/libcore                                                                             
                        time:   [3.6752 ns 3.7353 ns 3.8034 ns]
                        thrpt:  [1137275 GiB/s 1158024 GiB/s 1176952 GiB/s]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe
all_utf8_with_garbage/lemire_sse                                                                            
                        time:   [616.73 us 616.89 us 617.03 us]
                        thrpt:  [7.0103 GiB/s 7.0119 GiB/s 7.0136 GiB/s]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high severe
all_utf8_with_garbage/lemire_avx                                                                            
                        time:   [551.11 us 552.09 us 554.06 us]
                        thrpt:  [7.8070 GiB/s 7.8349 GiB/s 7.8488 GiB/s]
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) low severe
  9 (9.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
all_utf8_with_garbage/lemire_avx_ascii_path                                                                            
                        time:   [557.08 us 557.28 us 557.51 us]
                        thrpt:  [7.7587 GiB/s 7.7618 GiB/s 7.7646 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
all_utf8_with_garbage/range_sse                                                                            
                        time:   [554.79 us 555.10 us 555.42 us]
                        thrpt:  [7.7879 GiB/s 7.7924 GiB/s 7.7967 GiB/s]
all_utf8_with_garbage/range_avx                                                                            
                        time:   [417.05 us 417.38 us 417.74 us]
                        thrpt:  [10.355 GiB/s 10.364 GiB/s 10.372 GiB/s]
Found 13 outliers among 100 measurements (13.00%)
  13 (13.00%) low mild

I re-read some of the Lemire algorithm and there are some key differences which might make it less suitable for general string validation.

The key points are:

  1. It's built to just validate, not to locate errors, so there is no tracking of where in the string an error occurred.
  2. To make the first point worse, it defers testing for errors until it has checked the entire input. This makes a lot of sense in its original context, since it's safe to assume that most input to a JSON parser will be valid JSON and errors are an edge case. It also explains why the stdlib implementation is a lot faster on the benchmarks with invalid data.
  3. It requires the input length to be divisible by 256 bytes; if it is not, it copies the last part into a buffer.
  4. It allocates a struct for each check.

I think™ that might make them slower for small payloads.

I'm at work now, so I will read this thread later in more detail, but here are some things I want to say:

The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this).

Indeed. I created the crate originally to be included in the Rust internals. I still feel that some of the algorithms need a little bit of refactoring to make them more idiomatic, before being included. I also want to add @Zwegner's algorithm (https://github.com/zwegner/faster-utf8-validator), which seems to be the fastest algorithm that has been invented to date.

But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.

If the build fails, can you open an issue? :) I would appreciate that.

argnidagur/rust-isutf8

Also: You misspelled my name ;)

It requires the input length to be divisible by 256 bytes; if it is not, it copies the last part into a buffer.

Is that the AVX version or something? The SSE version just allocates 16 bytes on the stack for the remaining <16 bytes.
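
For illustration, the tail handling described here might look like the following sketch (tail_chunk is a hypothetical helper, not code from either crate):

fn tail_chunk(bytes: &[u8]) -> [u8; 16] {
    // Copy the final partial chunk into a zeroed 16-byte stack buffer so the
    // SIMD loop can always load a full vector. Zero bytes are valid ASCII, so
    // the padding cannot introduce errors, while a sequence truncated at the
    // end of the input still fails its continuation-byte checks.
    let mut buf = [0u8; 16];
    let tail = &bytes[bytes.len() - bytes.len() % 16..];
    buf[..tail.len()].copy_from_slice(tail);
    buf
}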

It allocates a struct for each check.

I don't even know what this means. Are we talking about this one?

struct processed_utf_bytes {
  __m128i rawbytes;               // the 16 input bytes for this chunk
  __m128i high_nibbles;           // high nibble of each input byte
  __m128i carried_continuations;  // continuation counts carried into the next chunk
};

This one gets completely optimized away.

Is that the AVX version or something? The SSE version just allocates 16 bytes.

Sorry, a bit tired, bit not byte :)

This one gets completely optimized away.

Good point

Since this would have to go into core, can core even use runtime checks for target features yet (we need at least SSE 4.1 it seems)?

In reply to @Licenser:

It's built to just validate, not to locate errors, so there is no tracking of where in the string an error occurred.

I have thought about this and we have a few options:

  1. Have, as a compile-time feature, functions which check for errors on every iteration of the loop, so the error location can be narrowed down. This should cost around 5% performance. If we wrapped the check in an unlikely! macro (maybe behind a compile-time feature called optimize_for_correct), this number would probably decrease. The 5% number comes from Zwegner's findings, IIRC.
  2. First go through all the text to see if it's valid; we can plausibly assume that most text is valid UTF-8. Then, in case of errors, search for the error using scalar code or some variation of option 1. (A sketch of this two-pass approach follows below.)

We would need to think more about this.
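
A minimal sketch of option 2, assuming a hypothetical simd_validate function (stubbed here with the scalar check). The point is that the scalar re-scan only runs on the presumably rare invalid inputs, and recovers the exact error offset:

fn from_utf8_fast(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    if simd_validate(bytes) {
        // SAFETY: the SIMD pass confirmed the input is valid UTF-8.
        Ok(unsafe { std::str::from_utf8_unchecked(bytes) })
    } else {
        // Slow path, taken only for invalid input: re-scan with the scalar
        // validator, whose Utf8Error reports the first invalid byte's offset.
        std::str::from_utf8(bytes)
    }
}

// Hypothetical SIMD validator, stubbed here with the scalar check.
fn simd_validate(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}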

Since this would have to go into core, can core even use runtime checks for target features yet (we need at least SSE 4.1 it seems)?

It can be compile-time only for now. Those who self-compile (such as myself) would see an improvement.

Hey, thanks for the shout out. I want to note that I have made several improvements to my code that aren't present in the master branch (but are in the wip branch), and it is significantly faster than when it was published (about 40%). Most importantly, I was able to completely elide the separate check for the first continuation byte.

When not memory-bandwidth bound, my algorithm is now over 2x faster (in my tests, on AVX2) than the one in Daniel Lemire's original repo (the one described in the blog post), and my SSE4 path is faster than the AVX2 path of that repo. The algorithm used in simdjson has made some improvements, but last I checked I think my algorithm is still faster.

I still need to finish writing up documentation for the new algorithm. The code's definitely more hairy, from dealing with more architectures (AVX-512 and NEON), handling pointer alignment, generating lookup tables with a Python script, etc... But I'm happy to help out if anyone wants to use it/port it.

@Licenser Are you still working on this?

I ported the validator (https://github.com/simd-lite/faster-utf8-validator-rs), but from what I understand this all falls apart because the stdlib can't use native CPU features under the hood?

It would need to be gated behind runtime CPUID checks, or left disabled by default behind a cfg(target_feature=...) that is only true if users rebuild libstd with those target features enabled (currently an unstable Cargo feature, but this should become more accessible in the future). The latter is easy but only helps software that can afford to run exclusively on newer processors; the former faces the challenge of not regressing performance, but would help more users if it worked.
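
For illustration only, the compile-time route could look something like this sketch; both bodies are stubbed with the scalar check, and the cfg only selects the first version when libstd itself is rebuilt with AVX2 enabled (e.g. -C target-feature=+avx2):

// Selected at compile time: this version only exists when libstd itself is
// built with AVX2 enabled, e.g. via -C target-feature=+avx2.
#[cfg(target_feature = "avx2")]
fn validate_utf8(bytes: &[u8]) -> bool {
    // Placeholder: a real build would call an AVX2 implementation here.
    std::str::from_utf8(bytes).is_ok()
}

// Fallback for every other build: the existing scalar routine.
#[cfg(not(target_feature = "avx2"))]
fn validate_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}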

ifuncs or similar mechanisms might also be an option on certain platforms, but they're not very portable and have obscure and hard-to-satisfy constraints. Manually emulating them with function pointers initialized on first call might have more overhead, not sure.

My understanding is that benchmarks can be misleading for AVX (and possibly SSE) in general purpose code, because:

  • AVX has a startup time that won't be measured when repeatedly looping in a microbenchmark
  • AVX draws a lot of power and can end up slowing down other cores due to power draw and increased heat

Curious what others think about the suitability? Does it make sense only for sufficiently large strings?

This article goes into detail on when it makes sense to use AVX (AVX-512 especially). The most relevant parts:

Downclocking, when it happens, is per core and for a short time after you have used particular instructions (e.g., ~2ms).

The bar for light AVX-512 is lower. Even if the work is spread on all cores, you may only get a 15% frequency penalty on some chips like a Xeon Gold. So you only have to check that AVX-512 gives you a greater than 15% gain for your overall application on a per-cycle basis.

Some older CPUs downclock all cores when any of them are using AVX2 instructions, but newer ones mostly only downclock the core running them. Also, the SIMD instructions used for string validation would fall into the "light" category, as they don't involve the floating-point unit.

As @bprosnitz mentioned, take them with a grain of salt, but microbenchmarks certainly imply there is something to be gained in using an accelerated validator.

Note that we have an updated validator called lookup 4...

https://github.com/simdjson/simdjson/pull/993

It is going to be really hard to beat.

@bprosnitz

My understanding is that benchmarks can be misleading for AVX (and possibly SSE) in general purpose code, because: AVX has a startup time that won't be measured when repeatedly looping in a microbenchmark; AVX draws a lot of power and can end up slowing down other cores due to power draw and increased heat. Curious what others think about the suitability? Does it make sense only for sufficiently large strings?

Firstly, let us set aside AVX-512 and "heavy" (numerical) AVX2 instructions. They have their uses (e.g., in machine learning, simulation). But that's probably not what you have in mind.

This being said...

Regarding power usage, it is generally true that the faster code is the code that uses less energy. So if you can multiply your speed using NEON, SSE, AVX, go ahead. You'll come up on top. It is a bit like being concerned with climate change and observing that buses and trains use more energy than cars. They use more energy in total, but less energy per work done. So you have to hold the work constant if you are going to make comparisons. Does it take more energy to do 4 additions, or to use one instruction that does 4 additions at once?

So SIMD instructions are the public transportation of computing. They are green and should be used as much as possible. (Again, I am setting aside AVX-512 and numerical AVX2 instructions that are something more controversial.)

Regarding the fear that SIMD instructions are somehow exotic and rare, and that if you ever use it, you will trigger a chain reaction of slowness... You are using AVX all the time... Read this commit where the committer identified that the hot function in his benchmark was __memmove_avx_unaligned_erms. You can bet that this function is AVX-based. The Golang runtime uses AVX, Glibc uses AVX, LLVM uses AVX, Java, and so forth. Even PHP uses SIMD instructions for some string algorithms. And yes, Rust programs use AVX or other SIMD instructions.

@hanna-kruppe

Manually emulating them with function pointers initialized on first call might have more overhead

I don't think so. We have a well tested approach. It is as simple as that...

// The active implementation starts out pointing at a detector...
internal::atomic_ptr<const implementation> active_implementation{&detect_best_supported_implementation_on_first_use_singleton};

// ...which, on its first call, replaces itself with the best supported one.
const implementation *detect_best_supported_implementation_on_first_use::set_best() const noexcept {
  return active_implementation = available_implementations.detect_best_supported();
}

So you just need an atomic function pointer.

Obviously, you do not get inlining, but that's about the only cost. Loading an atomic pointer is no more expensive, really, than loading an ordinary pointer. So this is pretty much free... except for the first time. You pay a price on first use, but that's not a fundamental limitation: you could set the best function at any time, including at startup.
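
In Rust, a sketch of the same trick might look like this (x86_64-only for brevity; validate_avx2 and validate_scalar are hypothetical stand-ins, stubbed here with the scalar check, and the atomic-pointer pattern is similar to what crates like memchr use):

use std::mem;
use std::sync::atomic::{AtomicPtr, Ordering};

type ValidateFn = fn(&[u8]) -> bool;

// Starts out pointing at the detector; after the first call it holds the
// best supported implementation.
static ACTIVE: AtomicPtr<()> = AtomicPtr::new(detect_on_first_use as *mut ());

fn detect_on_first_use(bytes: &[u8]) -> bool {
    let best: ValidateFn = if is_x86_feature_detected!("avx2") {
        validate_avx2
    } else {
        validate_scalar
    };
    ACTIVE.store(best as *mut (), Ordering::Relaxed);
    best(bytes)
}

pub fn validate(bytes: &[u8]) -> bool {
    // Loading the atomic is about as cheap as loading an ordinary pointer.
    let f: ValidateFn = unsafe { mem::transmute(ACTIVE.load(Ordering::Relaxed)) };
    f(bytes)
}

// Hypothetical implementations, both stubbed with the scalar check here.
fn validate_avx2(bytes: &[u8]) -> bool { std::str::from_utf8(bytes).is_ok() }
fn validate_scalar(bytes: &[u8]) -> bool { std::str::from_utf8(bytes).is_ok() }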

Note that we have an updated validator called lookup 4...

simdjson/simdjson#993

It is going to be really hard to beat.

I mentioned this off-hand on Twitter, but to clarify, in my benchmarks of pure UTF-8 validation, my code with scalar continuation byte checks and unaligned loads is still faster than lookup4 (or at least my translation of lookup4 to C). The difference depends on compiler (on my HSW, testing 1M of UTF-8, GCC gives .227 (me) vs .319 (L4) cycles/byte, while LLVM has .240 (me) vs .266 (L4)).

The picture gets more complicated in the context of simdjson, when plenty of other code is competing for the same execution ports (and the best algorithm is less clear), but I think in the case of Rust, the pure UTF-8 microbenchmarks are probably more representative.

@lemire to clarify, my benchmarks are already using the lookup4 implementation in simdjson (from the pull request's branch).

@milkey-mouse Fantastic.

cc @jkeiser

Rust's current ASCII validation algorithm validates bytes one at the time, and could likely benefit from similar optimizations.

This was recently implemented in #74066.

Of some relevance, we are publishing a research paper on the topic:

John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice & Experience (to appear)
