Iot: High precision timer waiting 1 microsecond

Created on 31 Jan 2019 · 26 Comments · Source: dotnet/iot

Hi,
For the implementation of the DHT sensor family, I had to wait for 1 microsecond. The only reliable way I found is a for loop doing nothing.
The values I found to be working on a Raspberry Pi 3 are 99 for release mode and 27 for debug.
So waiting 1 microsecond in release mode would then be:

for (byte wt = 0; wt < 99; wt++)
     ;

It would be great to have the possibility to find those values at execution time for any platform or have a real way to wait for 1 microsecond with a specific high precision timer. This is needed for a lot of sensors and is quite a challenge in managed code.

The mockup code to find those values is below. The core idea is to use Stopwatch. The first step is to calibrate how many ticks Stopwatch takes to start and stop.
The second step is to run the for loop with the specific value to check the behavior.
Please note that in all cases the first and second measurements are not that accurate, so it's better to start after the third one. Tests show that running 100 times is largely enough. Precision is around 5%, which is more than acceptable for most sensors.
The next step for this mockup would be to find the values by iteration. But there may be a much better and much simpler way to do it.
And it would anyway be great to have an inline function or equivalent to do this microsecond, high precision wait.

using System;
using System.Diagnostics;

namespace TestNetCore
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
            Stopwatch stopwatch = Stopwatch.StartNew();
            for (int i = 0; i < 30; i++)
                ;
            stopwatch.Stop();
            Console.WriteLine($"First run: {stopwatch.ElapsedTicks}");

            // Run it 100 times and take a running average of the Stopwatch start/stop overhead
            double origin = 0;
            for (int steps = 0; steps < 100; steps++)
            {
                stopwatch = new Stopwatch();
                stopwatch.Start();
                stopwatch.Stop();
                // Console.WriteLine($"Start and stop: {stopwatch.ElapsedTicks}");
                origin = (origin * steps + stopwatch.ElapsedTicks)/(steps + 1);
            }

            Console.WriteLine($"Origin: {origin}");
            double average = 0;
            for (int steps = 0; steps < 99; steps++)
            {
                stopwatch = new Stopwatch();
                stopwatch.Start();
                for (int i = 0; i < 98; i++)
                    ;
                stopwatch.Stop();
                Console.WriteLine($"Total time: {stopwatch.ElapsedTicks}");
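                // Elapsed time minus the Stopwatch overhead, converted from ticks to microseconds
                double microseconds = (stopwatch.ElapsedTicks - origin) * 1000.0 * 1000.0 / Stopwatch.Frequency;
                average = (average * steps + microseconds) / (steps + 1);
                Console.WriteLine($"Total time-origin: {microseconds} microseconds");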
            }
            Console.WriteLine($"Average microseconds: {average}");
        }
    }
}
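
The "find the values by iteration" step mentioned above could look roughly like the sketch below. It is only an illustration: the `SpinCalibration` class, its method names, the doubling-then-bisecting search and the warm-up count are all assumptions, not code from dotnet/iot. It also assumes the JIT keeps the empty loop, and the result is only valid while the CPU frequency stays the same.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Hypothetical helper, not part of any library.
static class SpinCalibration
{
    // Empty counting loop, kept out of line so its alignment does not depend on the call site.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Spin(int iterations)
    {
        for (int i = 0; i < iterations; i++)
            ;
    }

    // Average duration of Spin(iterations) in microseconds, after three untimed warm-up runs.
    public static double MeasureMicroseconds(int iterations, int runs = 100)
    {
        for (int i = 0; i < 3; i++)
            Spin(iterations);

        long totalTicks = 0;
        for (int r = 0; r < runs; r++)
        {
            long start = Stopwatch.GetTimestamp();
            Spin(iterations);
            totalTicks += Stopwatch.GetTimestamp() - start;
        }
        return totalTicks * 1_000_000.0 / Stopwatch.Frequency / runs;
    }

    // Doubles the iteration count until the target is reached, then bisects.
    // The cost of an empty measurement (timer and call overhead) is subtracted first.
    public static int FindIterations(double targetMicroseconds, int maxIterations = 1 << 24)
    {
        double baseline = MeasureMicroseconds(0);
        int high = 1;
        while (high < maxIterations && MeasureMicroseconds(high) - baseline < targetMicroseconds)
            high *= 2;

        int low = high / 2;
        while (high - low > 1)
        {
            int mid = low + (high - low) / 2;
            if (MeasureMicroseconds(mid) - baseline < targetMicroseconds)
                low = mid;
            else
                high = mid;
        }
        return high;
    }
}

class Program
{
    static void Main()
    {
        int iterations = SpinCalibration.FindIterations(1.0);
        Console.WriteLine($"Iterations for about 1 microsecond: {iterations}");
    }
}
```

Because the measurements are noisy, repeated calls may return slightly different counts; running the search once at startup and caching the result would be one way to use it.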

area-System.Device.Gpio enhancement

Ellerbach

👍1

Most helpful comment

@kouvel is responsible for the SpinWait code.

Ideally if you want an absolute time, you use Stopwatch.GetTimestamp() which is our highest resolution clock (it will not allocate) to watch time go by (it does mean you are spinning). SpinWait was not designed for the precision that was desired here, so I am not a fan of doing this. Its main value is that in hyperthreaded machines, it allows the other hyperthread to use the CPU while it is spinning (but that makes the time delay MORE variable).

What is the Stopwatch.Frequency on your platform? Typically the tick period is well under 1 usec. Now, if you need 5 usec to 5% accuracy (which is 0.25 usec), you may be able to do a spin loop.

Otherwise you are left basically relying on the reproducibility of running some code for a while (some spin loop), which is what @Ellerbach is suggesting above. It has these problems:

It varies from debug / release builds.
It can vary based on the alignment of the loop (thus you better not inline it).

Having the runtime do this (a new API) is reasonable, but we should play around with a non-runtime implementation a bit to understand the need for this API and what our implementation options are. I really would prefer something that was 'closed loop' (that is, we actually measure a clock as opposed to simply relying on the time reproducibility of running instructions).

vancem on 4 Feb 2019

👍3
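
To answer the Stopwatch.Frequency question on a given board, a small check along these lines may help. It is a sketch only: the `TimerResolution` class and its two helper methods are made-up names, and the 5 usec / 5% figures are simply the ones from the comment above.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of any library.
static class TimerResolution
{
    // Length of one Stopwatch tick in nanoseconds for a given frequency (ticks per second).
    public static double NanosecondsPerTick(long frequency) => 1_000_000_000.0 / frequency;

    // Absolute error budget in microseconds for a delay and a relative tolerance (0.05 = 5%).
    public static double AllowedErrorMicroseconds(double delayMicroseconds, double tolerance)
        => delayMicroseconds * tolerance;

    static void Main()
    {
        double nsPerTick = NanosecondsPerTick(Stopwatch.Frequency);
        double budgetNs = AllowedErrorMicroseconds(5, 0.05) * 1000;

        Console.WriteLine($"High resolution timer: {Stopwatch.IsHighResolution}");
        Console.WriteLine($"Frequency: {Stopwatch.Frequency} ticks/s");
        Console.WriteLine($"Tick period: {nsPerTick} ns");
        Console.WriteLine($"Error budget for 5 usec at 5%: {budgetNs} ns");
        Console.WriteLine($"Tick period fits in budget: {nsPerTick <= budgetNs}");
    }
}
```

The tick period is only a lower bound on the achievable precision; the cost of each timestamp read, discussed later in the thread, adds to it.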

All 26 comments

You're looking to wait at _least_ a microsecond, correct? I have a better helper for this in the LCD driver that I'm finishing up. It is hard to guarantee that you don't wait for longer, and you don't want to peg the CPU spinning. The best option we have is to use SpinWait in the loop with the Stopwatch.

JeremyKuhne on 1 Feb 2019

@JeremyKuhne AFAIR it had to be precisely 1uS with a very small margin for error (in many cases more is forgivable but some devices will just not respond if you wait too short or too long). The allocation of Stopwatch was even affecting the time here. @Ellerbach can probably provide more specifics, but one of the examples would be here: https://github.com/dotnet/iot/blob/master/src/devices/Dhtxx/DhtSensor.cs#L203

krwq on 1 Feb 2019

👍1

The allocation of Stopwatch was even affecting the time here.

In my case I keep an instance on my class. Don't want to keep allocating the Stopwatch even if the timing isn't critical after all. :)

JeremyKuhne on 1 Feb 2019

👍1

Also see the DHTxxx example (just edited the post above) :-)

krwq on 1 Feb 2019

@JeremyKuhne, I have to wait exactly (+-5%) 1 microsecond. And there are many examples where very precise timing like this is needed when dealing with sensors.
At this precision, as @krwq mentions, the Stopwatch affects the results significantly, so it can't be used. That's the reason why the most precise way to wait for a microsecond is a for loop with a specific value. This value depends on the platform and the build mode (debug or release).
You can run the code I posted on a Raspberry Pi 3 for example; you'll always find the value 99 (for a Release build) and 27 (for a Debug build). If you run on other hardware, this value will change.
So it would be great to have a function that returns the number of cycles to run in a for loop to wait synchronously for exactly 1 microsecond, and most likely for 2, 3, up to 50 microseconds. That would make life easier when building drivers/bindings for time sensitive sensors.

Ellerbach on 2 Feb 2019

👍1

@Ellerbach I did a ton of measurements looking at the timer loop in #206. 1μs is about what the SpinWait class goes for until it yields, but it isn't that precise. It depends on the runtime measuring timing and that is currently not a public or documented thing.

https://github.com/dotnet/coreclr/blob/a0f81f59a7beb7120d3147c1547ef8ec1f05e0ae/src/System.Private.CoreLib/src/Internal/Runtime/Augments/RuntimeThread.cs#L211
https://github.com/dotnet/coreclr/blob/616fea550548af750b575f3c304d1a9b4b6ef9a6/src/vm/yieldprocessornormalized.cpp#L16

In any case, I think we need to loop in some more folks with more expertise. @vancem, can you chime in on this / loop in others that have the right knowledge here?

JeremyKuhne on 4 Feb 2019

👍1

The main issue with using SpinWait is that the yield/Sleep(0) it does may last a very long time. It also does an exponential backoff in spin-waiting iterations up to a limit, which would decrease the precision. For the highest precision you're probably best off getting the Stopwatch frequency ahead of time, calculating the minimum number of ticks for the delay ahead of time based on that frequency, and spamming Stopwatch.GetTimestamp() in a tight loop without any delays. If some (unspecified but typically short) error is OK, Thread.SpinWait(1) may be acceptable in the loop. It currently targets about a 40 ns delay, but that is up to the system, and the system may delay longer. If you know which hardware you're going to be running on, you can just measure it and see what kind of precision loss you would have from adding Thread.SpinWait(1) in the loop.

kouvel on 4 Feb 2019
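
A minimal sketch of the closed-loop approach described above: convert the delay to ticks once, then poll Stopwatch.GetTimestamp() in a tight loop. The `BusyWait` class and its method names are hypothetical, and the achievable precision is limited by how long each GetTimestamp() call takes on the target hardware.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of any library.
static class BusyWait
{
    // Converts microseconds to Stopwatch ticks, rounding up so the wait is never too short.
    public static long MicrosecondsToTicks(double microseconds)
        => (long)Math.Ceiling(microseconds * Stopwatch.Frequency / 1_000_000.0);

    // Spins until at least `ticks` Stopwatch ticks have elapsed and returns the ticks actually waited.
    public static long DelayTicks(long ticks)
    {
        long start = Stopwatch.GetTimestamp();
        long elapsed;
        do
        {
            elapsed = Stopwatch.GetTimestamp() - start;
        } while (elapsed < ticks);
        return elapsed;
    }

    static void Main()
    {
        long oneMicrosecond = MicrosecondsToTicks(1.0); // computed once, ahead of time
        DelayTicks(oneMicrosecond);                     // first call pays the JIT cost
        long elapsed = DelayTicks(oneMicrosecond);
        Console.WriteLine($"Requested {oneMicrosecond} ticks, waited {elapsed} ticks");
    }
}
```

This gives a lower bound on the delay, never an upper bound: the thread can be preempted at any point in the loop. Adding Thread.SpinWait(1) inside the loop would trade some precision for being friendlier to a sibling hyperthread, as the comment notes.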

It also depends on how fast GetTimestamp() is. I last checked a while back, but I seem to remember it was slower on Linux. Your method would probably have the best precision, but I'm not too fond of it for an API: the measurement is questionable (anything can be running during it), and the processor frequency may be dynamic, so there may not be one iteration count that yields a specific time.

kouvel on 5 Feb 2019

Something @kouvel mentions is worth emphasizing. In a world of dynamic processor frequency, you can't naively rely on the running time of a piece of code being constant from one run to another. You have to go to special lengths (changing various, typically non-portable, settings) to turn off the dynamic frequency logic. Uggh. This is why it is better to use 'true time' (that is, a counter that is DESIGNED to measure time, like Stopwatch.GetTimestamp()) if at all possible.

By the way, if we discover that GetTimestamp is slower than we would like, we should look into why. Conceptually at least, there is no reason why it should not be super fast (modern processors support it directly).

vancem on 5 Feb 2019

IIRC the relative perf difference on Linux in GetTimestamp was mostly on the native side in clock_gettime(CLOCK_MONOTONIC), and it differs depending on whether it's a VM or not. I hadn't looked into why, but the code looked sufficiently different.

kouvel on 5 Feb 2019

@kouvel this is generally needed for hardware so I'd presume VM scenarios are not that interesting (at least as of now).

krwq on 5 Feb 2019

Yea probably not a big deal on a VM, there's not much that can be done about that anyway. There is also the possibility of bypassing the native implementation, if really necessary.

kouvel on 5 Feb 2019

wait for 1 microsecond with a specific high precision timer. This is needed for a lot of sensors and is quite a challenge in managed code.

This is a classical real-time system scheduling problem.

The requirement you described is one of the prerequisites for any type of soft or hard real-time system. Since .NET Core is currently not supported by any hard real-time system, the best that can be achieved is to run .NET Core on a soft real-time system. This is available on the Linux kernel with RT patches (development is active again) and to a much lesser extent on macOS.

Unfortunately .NET Core does not have any mechanism to guarantee 1 us scheduler event resolution or to prevent priority inversion. Solving these two problems would require rewriting the current scheduler and implementing a priority inversion handling algorithm. As this forms part of the requirements for creating an RT framework, it should be handled in the wider context of a real-time specification for .NET, which should cover many other areas such as garbage collection versus explicit memory management (a major culprit in non-deterministic scheduling in virtual machines with GC).

Essentially this is part of a wider problem which IMHO should be handled via systematic design work leading to something like a Real Time .NET Core (like the Real-Time Specification for Java). The idea for creating such a .NET subsystem is already described in the https://github.com/dotnet/csharplang/issues/761 issue. Finally, it does not surprise me that the IoT team is hitting RT related issues while working directly with hardware; at that level it is to be expected.

4creators on 7 Feb 2019

I did a ton of measurements looking at the timer loop in #206. 1μs is about what the SpinWait class goes for until it yields

@Ellerbach @kouvel

None of this is relevant in the context of an inherently non-real-time system. As long as the basic requirements for an RT system are not met, there is no reason to expect that a 1 us medium scheduling interval guarantee can be provided.

4creators on 7 Feb 2019

As long as basic requirements for having RT system are not met there is no reason to expect that 1 us medium scheduling interval guarantee can be provided.

Clearly we cannot provide RT guarantees. We can, however, provide higher fidelity timers than we already have and fully document the limitations. While that won't make us real-time, it will still benefit some of what we're looking at. In my scenario, pulsing data for an enable pin has a minimum requirement but no maximum. That minimum can be as low as 230ns. A number of the timings need accuracy of about 37us. The closer I can get to the timings the better, but the world won't end if I sit high for too long. That is the reason I listed the other numbers: the amount of possible overshoot and the likelihood of said overshoot versus the compute cost is all useful data for picking appropriate solutions for different problems.

We can also look into GC disabling strategies for reducing uncertainty. Exploring what can and can't be done there is useful data. GC.TryStartNoGCRegion takes about 1.5ms on RPi3B+ and about .5ms on my Win10 box.

@kouvel / @vancem : I'm getting ~300ns to hit Stopwatch.GetTimestamp() on a Raspberry Pi 3b+ and ~180ns on Windows. Is that something we can do better on? Would measuring spinning and providing a loop that doesn't hit the timer be practically possible once we're in managed code? Or should we be leveraging the measurements the runtime is already doing and possibly even letting the native code do the looping?

JeremyKuhne on 7 Feb 2019
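
As a rough illustration of the GC-disabling idea above, a timing-critical section could be wrapped as in the sketch below. `GC.TryStartNoGCRegion`, `GC.EndNoGCRegion` and `GCSettings.LatencyMode` are real .NET APIs; the `NoGcSection` wrapper and the 1 MB allocation budget are assumptions chosen for the example. Given the 0.5 to 1.5 ms entry cost quoted above, this only makes sense around a whole sensor transaction, not around a single microsecond wait.

```csharp
using System;
using System.Runtime;

// Hypothetical helper, not part of any library.
static class NoGcSection
{
    // Runs `timingCritical` with the GC suppressed when possible.
    // Returns true if the no-GC region was actually entered.
    public static bool Run(Action timingCritical, long allocationBudgetBytes = 1024 * 1024)
    {
        bool entered = false;
        try
        {
            // Returns false if the runtime could not reserve the requested memory.
            entered = GC.TryStartNoGCRegion(allocationBudgetBytes);
            timingCritical();
        }
        finally
        {
            // The region ends on its own if the budget is exceeded, so check before ending it.
            if (entered && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                GC.EndNoGCRegion();
        }
        return entered;
    }

    static void Main()
    {
        bool entered = Run(() => Console.WriteLine("Timing-critical work goes here."));
        Console.WriteLine($"No-GC region entered: {entered}");
    }
}
```

The work still runs when the region cannot be entered, so the return value is worth checking if the caller wants to retry or flag the reading as less trustworthy.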

@4creators we don't need guarantee, just high confidence :smile: As @Ellerbach mentioned somewhere before he gets some occasional errors but he is able to detect it.

A slightly more reliable solution could perhaps be to write a kernel module which takes a callback, or which just sets a pin to a specific value/pattern for a specific time (i.e. the user specifies period=500ns and passes the array [0,1,1,1,0], which tells what the value should be at each point in time). There is a thread on waiting in kernel mode:
https://raspberrypi.stackexchange.com/questions/8586/arm-timer-in-kernel-module-with-precision-less-than-microsecond
We would perhaps have to solve it for multiple pins at the same time.

A user space solution could perhaps be interrupt based (I've seen some solutions in C, but that could be tricky to do in C#). It seems to me that the simple loop makes sense as a solution, and a kernel module could be an addition to allow ns precision.

krwq on 7 Feb 2019

I'm getting ~300ns to hit Stopwatch.GetTimestamp() on a Raspberry Pi 3b+ and ~180ns on Windows.

@JeremyKuhne I think I'm seeing about 20 ns for Stopwatch.GetTimestamp() on my x64 Windows machine, which is about the same as what I'm seeing from the native side. 180 ns does seem to be a bit high, can you post the code you're using to measure it? Here's the code I used:

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
private static void Main(string[] args)
{
    long ticksPerS = Stopwatch.Frequency;
    long durationTicks = (ticksPerS + 1) / 2;
    while (true)
    {
        uint n = 0;
        long startTicks = Stopwatch.GetTimestamp();
        long nowTicks;
        do
        {
            ++n;
            nowTicks = Stopwatch.GetTimestamp();
        } while (nowTicks - startTicks < durationTicks);
        Console.WriteLine($"{(nowTicks - startTicks) * (1000 * 1000 * 1000) / (double)(ticksPerS * n):0.00}");
    }
}

Would measuring spinning and providing a loop that doesn't hit the timer be practically possible once we're in managed code?

We already measure spinning but the measurement is questionable and can vary a lot. Limits are placed so that a bad measurement doesn't lead to bad spinning behavior. I don't think it's reliable for timing, the processor also does not make any guarantees about how long a pause/yield takes, and it actually varies on some procs at least.

Or should we be leveraging the measurements the runtime is already doing and possibly even letting the native code do the looping?

That would be fine to do but at the moment I don't see that it buys much. If there is overhead that it would avoid it would be interesting to measure that.

kouvel on 7 Feb 2019

@tarekgh is also thinking some more about this space in terms of GPIO

joshfree on 23 Feb 2019

@tarekgh are you planning to work on this? If not, I'll flag it as up-for-grabs.

joperezr on 16 Apr 2019

@tannergooding tweaked Stopwatch to improve the performance characteristics here. We're going to be limited getting any better due to the actual resolution limits on the Pi.

The thing I wonder is if there are other hardware facilities that would allow us to do a single high-fidelity timed pulse to a data line. I haven't had time to look into it, so just wondering out loud.

JeremyKuhne on 16 Apr 2019

I see. @JeremyKuhne, would you then suggest closing this issue, as this would be kind of a no-op for us on the iot side given that the fix has gone into Stopwatch?

joperezr on 16 Apr 2019

@joperezr I'd rather leave this open and add a reference to the corresponding issue so that we can adjust if needed once that is fixed (we can move the issue to Future, though, or add some kind of BLOCKED label).

krwq on 16 Apr 2019

@tannergooding could you edit the first post and add refs to corresponding issues related with Stopwatch?

krwq on 16 Apr 2019

@krwq, what corresponding issue are you referring to? There was just a PR created/merged, and it resolves a very minor perf issue.

The root issue is that the RPi doesn't have a very high precision timer, and there isn't anything tracking that, to my knowledge (it is also unactionable on our end).