Virtual-environments: Time is not monotonic on macOS runner

Created on 1 May 2020 · 12Comments · Source: actions/virtual-environments

I am not sure if this is a right repo to report this kind of issue, so please feel free to close it, but point me into the right direction.

Sometimes, we note that consequence calls to get current time of operating system return values that are not monotonically increasing. The 2nd call, for example, may return value that is less than previous one.

I can reproduce this by a simple Python script that runs on macOS github-hosted runner. See it here: https://github.com/tysonite/test_cpython/blob/master/script.py

The source of interest is:

        t1 = time.time()
        time.sleep(0.001)
        t2 = time.time()

Where t2 - t1 may be negative sometimes which is proved by this run.

I expect t2 - t1 to be always positive, but probably this is something related to VM on which runner runs.

Virtual environments affected

[x] macOS 10.15
[ ] Ubuntu 16.04 LTS
[ ] Ubuntu 18.04 LTS
[ ] Windows Server 2016 R2
[ ] Windows Server 2019

Apple macOS investigate

Source

tysonite

👀1 👍1

Most helpful comment

Hello @tysonite,
I was able to reproduce the issue only in Github Actions. I tried to reproduce the issue in ADO(Azure DevOps) agents or locally, but with no luck. It seems the issue is specific for current platform.
We will continue the investigation and keep you posted.
Also, I will try to find a way how to mitigate the issue

Darleev on 4 May 2020

👍2

All 12 comments

Darleev on 4 May 2020

👍2

@tysonite During the investigation, I found that issue most probably related to MacOS time synchronization(using sntp). I'm trying to find a way how to disable time synchronization for MacOS at all, to test how it works without sync.
Currently, I see only 1-2 failed requests for 1.000.000 attempts. Meanwhile, could you please provide use case for this specific task(t2-t1 ever 0.0001 sec)? Maybe we will try to find a better way to mitigate the issue.

Darleev on 5 May 2020

We use Github Actions to run unit/acceptance tests for Robot Framework project, sometimes, tests fail due to that issue, and it is just weird.

The test verifies that elapsed time is not negative, and the mitigation on Robot Framework side is clear, it is only needed to remove such check. But the removal of check may lead to potential errors, and we would like to avoid that.

Example of test failure: https://condescending-banach-9beeb5.netlify.app/log.html#s1-s12-s7-t25-k1-k4-k1

EDIT: Btw, sometimes, similar issue happens on Windows, but I can't reproduce it with a simple script.

tysonite on 5 May 2020

@tysonite We've found a workaround for the issue on MacOS agent, to reach our goal, it is required to disable time synchronisation using the following commands in runtime:

sudo systemsetup -setusingnetworktime off
sudo rm -rf /etc/ntp.conf

Please try this solution on your side. We are looking forward to your reply.

Darleev on 6 May 2020

Thanks, @Darleev, I will report back if any issues.

tysonite on 6 May 2020

@tysonite could you please provide us with results of your tests? Are they successful?
We are looking forward to your reply.

Darleev on 7 May 2020

I merged PR robotframework/robotframework#3571 by @tysonite with these fixes and results are here:
https://github.com/robotframework/robotframework/runs/652387887

There's some macOS failure but apparently it's something different and @tysonite already reported it as #841. No time related failures this time, but these failures have been flaky so one successful run doesn't guarantee anything. We just made a release and there are no planned code changes right now that would kick a new run. Should we start one manually to see how it goes? You can obviously do that yourself if you can.

pekkaklarck on 7 May 2020

@pekkaklarck Thank you for an update. I have already verified that the workaround works fine . Just in case I will perform additional test, python/c++ to make sure it's not just a coincidence.
Successful run #1
Successful run #2
Successful run #3
In a row. I was able to reproduce the issue every attempt before the workaround.
Let's wait a final test results.

Darleev on 7 May 2020

I also have scheduled tests running once a day here (C++ / Python): https://github.com/tysonite/test_cpython/actions. I suppose if they are fine till Monday, we could consider issue I described in the first post as fixed.

However, this is only a work-around, but not a fix if I understand this right. Despite of the good results, the fix is required globally applied to all runners, probably.

tysonite on 7 May 2020

@tysonite currently, we are not planning to implement fix for the issue globally, and I can only provide the workaround. Synchronisation is out of the box MacOS feature, and it seems 1 failure for 1.000.000 attempts(every 0.0001) is expected behavior in such cases.
I'm going to close the issue for now. If additional assistance or information is required, please do not hesitate to contact us.

Darleev on 7 May 2020

I never used macOS, and you might be right. In Linux, such behavior usually means misconfiguration if either guest or host OS, e. g. https://www.redhat.com/en/blog/avoiding-clock-drift-vms

In any case, thanks again for your support!

tysonite on 7 May 2020

For our usage the workaround is adequate. Thanks for support @Darleev and thanks for investigation @tysonite!

pekkaklarck on 8 May 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings