Runtime: Proposal: Introducing a JIT intrinsic for acquiring timestamps

Created on 11 Nov 2016 · 9 Comments · Source: dotnet/runtime

Acquiring timestamps (Stopwatch.GetTimestamp) is an important activity in many applications, including modern web services. Breaking down an application's execution time into "scopes" often aids performance analysis, and is achieved by taking timestamps at predetermined code sites, usually interesting logical points in the application's execution.

Furthermore, these timestamps are often taken in BeginActivity/EndActivity pairs to delimit "regions" of code. Each region is therefore represented by at least two timestamps, usually along with a symbolic name that identifies the activity.

An example to illustrate this pattern:

static void Foo()
{
    using (MonitoredScope.Create(nameof(Foo)))
    {
        // ...
    }
}

class MonitoredScope : IDisposable
{
    public static MonitoredScope Create(string activityName)
    {
        // First timestamp of the region, tagged with the activity name.
        LogBegin(Stopwatch.GetTimestamp(), activityName);
        // ... (create and return the scope)
    }

    public void Dispose()
    {
        // Second timestamp of the region.
        LogEnd(Stopwatch.GetTimestamp());
    }
}
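To make the pattern above concrete, here is a minimal sketch that fills in the elided parts. The in-memory `Log` list, the private constructor, and the `Demo` class are assumptions made for illustration; they are not part of the original proposal, and a real implementation would write to a tracing backend instead.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

sealed class MonitoredScope : IDisposable
{
    // Hypothetical sink for illustration only.
    public static readonly List<string> Log = new List<string>();

    private readonly string _activityName;
    private readonly long _begin;

    private MonitoredScope(string activityName)
    {
        _activityName = activityName;
        _begin = Stopwatch.GetTimestamp(); // first timestamp of the region
    }

    public static MonitoredScope Create(string activityName)
    {
        return new MonitoredScope(activityName);
    }

    public void Dispose()
    {
        long end = Stopwatch.GetTimestamp(); // second timestamp of the region
        // Stopwatch ticks are in units of 1 / Stopwatch.Frequency seconds.
        double micros = (end - _begin) * 1_000_000.0 / Stopwatch.Frequency;
        Log.Add($"{_activityName}: {micros:F1} us");
    }
}

static class Demo
{
    public static void Foo()
    {
        using (MonitoredScope.Create(nameof(Foo)))
        {
            // ... work being measured ...
        }
    }
}
```

Each call to `Demo.Foo()` appends one entry to `MonitoredScope.Log`, so every region costs two `Stopwatch.GetTimestamp()` calls, which is why the per-call overhead discussed below matters.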

Today, when the JIT encounters the CIL call to Stopwatch.GetTimestamp and inlines the method body at the call site, the generated code sets up a P/Invoke frame and then calls into the PAL-supplied function. On Unix systems, for example, that is SystemNative_GetTimestamp. This PAL function, defined in the CoreFX repo, then calls the OS-supplied function (clock_gettime).

That is a total of two calls plus the P/Invoke frame. Most of the cost is the P/Invoke frame, which we could remove by making this an FCall in the runtime.

However, we can improve on this further if we teach the JIT, via an intrinsic, to call directly into the OS-supplied function and do away with the extra call into the PAL.

On modern OSes with modern hardware (all recent Windows machines, and I suspect quite a few Linux ones as well), obtaining these timestamps usually comes down to an RDTSC (or equivalent) instruction, plus perhaps some OS code. It is therefore a pretty good win if the only overhead the runtime adds is an indirect call into the OS function, which is what a non-runtime language like C++ would also achieve. One could take this further and provide an intrinsic for just the RDTSC instruction (and maybe that is also a valid issue to file), but I see it being significantly less portable, with a proportionally higher maintenance cost.
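One rough way to see what a single timestamp costs on a given machine is to time a tight loop of calls. The sketch below is an illustration only: the class and method names are hypothetical, and it measures the total cost of `Stopwatch.GetTimestamp()` without separating out the P/Invoke frame from the rest.

```csharp
using System;
using System.Diagnostics;

static class TimestampCost
{
    // Returns the approximate cost of one Stopwatch.GetTimestamp() call, in nanoseconds.
    public static double MeasureNanosecondsPerCall(int iterations)
    {
        long sink = 0;

        // Execute the timestamp path a number of times before measuring.
        for (int i = 0; i < 10_000; i++)
        {
            sink ^= Stopwatch.GetTimestamp();
        }

        long start = Stopwatch.GetTimestamp();
        for (int i = 0; i < iterations; i++)
        {
            // XOR the result into a sink so each result is actually used.
            sink ^= Stopwatch.GetTimestamp();
        }
        long end = Stopwatch.GetTimestamp();

        GC.KeepAlive(sink);

        // Convert the tick delta to nanoseconds, then average over the loop.
        double totalNanoseconds = (end - start) * 1_000_000_000.0 / Stopwatch.Frequency;
        return totalNanoseconds / iterations;
    }
}
```

Calling `TimestampCost.MeasureNanosecondsPerCall(10_000_000)` from a Release build gives a per-call figure that can be compared across runtime versions. For serious measurements a dedicated benchmarking harness is a better choice than a hand-written loop.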

As a final note, this will be a particularly major CPU efficiency improvement (in addition to the wall-clock time improvement) in large datacenter applications like Bing.com.
category:proposal
theme:runtime
skill-level:intermediate
cost:medium

JitUntriaged area-CodeGen-coreclr enhancement

All 9 comments

cc @cmckinsey

This blog post by @AndreyAkinshin has lots of great detail on the overhead of Stopwatch, including this summary:

[image: summary from the blog post]

@tannergooding Thoughts? I'd love it if we could emit RDTSC or RDTSCP from Hardware Intrinsics.

As an aside, it seems this PR would reduce the P/Invoke overhead for fast calls like this:
https://github.com/dotnet/coreclr/pull/26458

I'm worried that people would do the wrong thing with RDTSC and RDTSCP in particular. The underlying OS function handles the hardware specifics and does the fixups, so exposing a mechanism that removes the overhead of calling those OS-specific functions is probably better.

The OS functions are designed for use in games and other high performance scenarios and so they should be more than good enough for us as well, provided that any additional overhead on the call can be removed.

https://docs.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps goes into more detail about some of the things the underlying OS call handles and why directly calling things like RDTSC/RDTSCP is not recommended.

> I'm worried that people would do the wrong thing with RDTSC and RDTSCP in particular.

> goes into more detail about some of the things the underlying OS call handles and why directly calling things like RDTSC/RDTSCP is not recommended.

@tannergooding I don't think that in general intrinsics are exposed for devs who do not know how to handle them correctly.

> I don't think that in general intrinsics are exposed for devs who do not know how to handle them correctly.

RDTSC and RDTSCP in particular are realistically meant for use by the kernel and drivers, not by user-mode applications (hence the OS's additional ability to restrict them so that they can only be executed at privilege level 0).

Applications should be exclusively calling the OS provided functionality which handles the normalization, privilege checks, and other logic. If there is additional overhead that makes that undesirable for .NET applications, we should work on addressing that instead.

For Stopwatch in particular, .NET Core 3 has dropped support for the low-resolution timers and has partially cleaned up the P/Invoke calls to avoid implicit pinning and other overhead, so this should already be marginally better. Other functionality, like function pointers (which will give calli) and the ability to remove a GC transition (assuming it's safe for these calls), will help improve this further.
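As a rough illustration of the "remove a GC transition" idea, the sketch below assumes a runtime that ships `System.Runtime.InteropServices.SuppressGCTransitionAttribute` (.NET 5 or later) and calls the Windows `QueryPerformanceCounter` function directly. The `FastTimestamp` class and its `Now` method are hypothetical names, and the fallback to `Stopwatch.GetTimestamp()` on other platforms is a simplification, not how the runtime itself is structured.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class FastTimestamp
{
    // Assumption: suppressing the GC transition is only appropriate because this
    // native function is very short, does not block, and never calls back into
    // managed code.
    [DllImport("kernel32.dll")]
    [SuppressGCTransition]
    private static extern int QueryPerformanceCounter(out long performanceCount);

    public static long Now()
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            // Direct call into the OS function without the usual GC transition.
            QueryPerformanceCounter(out long ticks);
            return ticks;
        }

        // Simplified fallback for non-Windows platforms.
        return Stopwatch.GetTimestamp();
    }
}
```

This keeps the OS function in charge of normalization and hardware specifics, as argued above, while trimming the managed-to-native call overhead around it.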
