Stl: <chrono>: Consider caching QueryPerformanceFrequency()

Created on 24 Jan 2020  路  9Comments  路  Source: microsoft/STL

steady_clock::now() says:

https://github.com/microsoft/STL/blob/a83d8c0061eca6e22d1de0963dfbbdb916f91267/stl/inc/chrono#L612-L614
https://github.com/microsoft/STL/blob/a83d8c0061eca6e22d1de0963dfbbdb916f91267/stl/src/xtime.cpp#L92-L96

As the comment indicates, QueryPerformanceFrequency documentation says: "The frequency of the performance counter is fixed at system boot and is consistent across all processors. Therefore, the frequency need only be queried upon application initialization, and the result can be cached."

This code originally used magic statics, but TFS checkin 1586419 on March 16, 2016 removed that. My checkin notes claimed (emphasis added):

Remove intentional but unnecessary use of magic statics in steady_clock::now() calling _Query_perf_frequency(). Calling QPC is very fast (when I originally profiled now(), IIRC I could call it 6-7 times before the tick changed). Calling QPF will be comparably fast, so we may as well just do that instead of paying the magic statics cost. I've left the "system boot" comment, since it's helpful to know.

I had no evidence for this performance assumption and it was incorrect. In DevCom-505019, where this issue was originally reported, Damian Zwoli艅ski noted that while QPC is indeed efficient on most platforms (aside: IIRC, on certain VMs it is expensive), QPF is not efficient.

We're avoiding magic statics for a reason (its use of Thread Local Storage is problematic for some users), but we should investigate whether it's possible to restore caching without TLS and without breaking ABI. For example, we could have a static long long initialized to 0 (no magic) and use interlocked operations to cache QPF, since 0 is never a valid value for it.

fixed performance

Most helpful comment

Hi! Agree, with the use of a global long long, but no need to use interlocked operations or any form of hardware memory fences, since you don't need to synchronize access to any other state in relation to the variable. All you need to do is guarantee atomicity of the load and store operations. It's alright if multiple threads end up racing independently in an attempt to initialize it.

So, it should be sufficient to use the equivalent of only the load() and store() operations on an std::atomic<long long> with std::memory_order_relaxed ordering.

_Edit: On 32-bit CPU architectures, the above may end up being implemented in terms of a CAS or LL/SC loop, but on 64-bit architectures, should expect it to be optimized to plain old load/store instructions so long as the architecture guarantees atomicity of such operations._

I believe .NET Core uses this same technique deep in the internals of System.Threading.SpinWait to record a measurement of the time that the x86 PAUSE instruction takes, since this can vary depending on the architecture (for example, Skylake's PAUSE takes much longer than previous Intel microarchitectures). It doesn't use interlocked operations, atomics, or locking since I believe the C# memory model for loads/stores on non-volatile variables of types bool, char, byte, sbyte, short, ushort, uint, int and float is basically the same as std::memory_order_relaxed.

All 9 comments

Hi! Agree, with the use of a global long long, but no need to use interlocked operations or any form of hardware memory fences, since you don't need to synchronize access to any other state in relation to the variable. All you need to do is guarantee atomicity of the load and store operations. It's alright if multiple threads end up racing independently in an attempt to initialize it.

So, it should be sufficient to use the equivalent of only the load() and store() operations on an std::atomic<long long> with std::memory_order_relaxed ordering.

_Edit: On 32-bit CPU architectures, the above may end up being implemented in terms of a CAS or LL/SC loop, but on 64-bit architectures, should expect it to be optimized to plain old load/store instructions so long as the architecture guarantees atomicity of such operations._

I believe .NET Core uses this same technique deep in the internals of System.Threading.SpinWait to record a measurement of the time that the x86 PAUSE instruction takes, since this can vary depending on the architecture (for example, Skylake's PAUSE takes much longer than previous Intel microarchitectures). It doesn't use interlocked operations, atomics, or locking since I believe the C# memory model for loads/stores on non-volatile variables of types bool, char, byte, sbyte, short, ushort, uint, int and float is basically the same as std::memory_order_relaxed.

Other thing to consider for this area:

https://github.com/microsoft/STL/blob/a83d8c0061eca6e22d1de0963dfbbdb916f91267/stl/inc/chrono#L616-L618

There's _mul128 and _div128 for x64.

By replacing highlighted code with the following:

#ifdef _M_X64
            long long _High;
            long long _Rem_unused;
            long long _Low = _mul128(period::den, _Ctr, &_High);
            long long _Res = _div128(_High, _Low, _Freq, &_Rem_unused);
            return time_point(duration(_Res));
#else
          // old code
#endif

Can have one division instead of two.

But on my system QPC takes most of the time.
I'm proposing to consider this, since there are apparently systems where QPC is faster.

It also may be not worth of trouble bringing intrinsic functions to that header.
And digging it into .cpp is API change.

We're avoiding magic statics for a reason (its use of Thread Local Storage is problematic for some users)

Maybe a bit off topic, but why do magic statics need thread local storage?

@MikeGitb The 'Magic Statics' algorithm uses a thread-local read to see 'did this thread already see that this value' to avoid synchronization if the current thread has already seen that the value is initialized.

(If we started this off-topic) what are reasons to avoid TLS ? I recall there are issues with .NET, but this can be ruled out by preprocessor. There was dll in XP problem, but XP is no longer there. Some more exotic, like kernel mode drivers, executable packers / protectors, malware ? This STL is not suited for such scenarios anyway.

Adding a .TLS section to a DLL that previously didn't have one can push a program over the TLS slot limit, so it's a breaking change.

TLS slot limit? I though the algorithm that exists since Vista has no limits, it would reallocate TLS as many times as needed.

(Sure such TLS reallocation may be heap expensive for a program with many threads, and many DLLs already loaded, and may cause reaching heap limit on x86, but then any adding change can cause that)

@AlexGuteniev My understanding is the limit in Vista was changed from ~58 to ~1000ish but that there is still a limit.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

AlexGuteniev picture AlexGuteniev  路  18Comments

StephanTLavavej picture StephanTLavavej  路  10Comments

ohhmm picture ohhmm  路  32Comments

MahmoudGSaleh picture MahmoudGSaleh  路  20Comments

StephanTLavavej picture StephanTLavavej  路  10Comments