Mbed-os: STM32L0xx: (tickless using lp_ticker) Becomes unresponsive after indeterminate time

Created on 22 Mar 2018 · 26Comments · Source: ARMmbed/mbed-os

Description

Type: Bug
Priority: Major

Bug

Target
STM32L072, possibly more

mbed-os sha:
5.7.7 16bac101a6b7b4724023dcf86ece1548e3a23cbf

Issue
We are experiencing some problems where the device becomes unresponsive after an indeterminate time. This can be anywhere between several minutes and a couple of hours.

The configuration we are using is the following:

// Configuration parameters
#define MBED_CONF_APP_THREAD_STACK_SIZE                   1024        
#define MBED_CONF_EVENTS_PRESENT                          1           
#define MBED_CONF_TARGET_LOWPOWERTIMER_LPTIM              0           
#define MBED_CONF_APP_MAIN_STACK_SIZE                     2048        
#define MBED_CONF_PLATFORM_STDIO_BAUD_RATE                115200      
#define CLOCK_SOURCE                                      USE_PLL_HSI 
#define MBED_CONF_EVENTS_SHARED_HIGHPRIO_STACKSIZE        1024        
#define MBED_CONF_PLATFORM_DEFAULT_SERIAL_BAUD_RATE       9600        
#define MBED_CONF_RTOS_PRESENT                            1           
#define MBED_CONF_EVENTS_SHARED_DISPATCH_FROM_APPLICATION 1           
#define MBED_CONF_EVENTS_SHARED_EVENTSIZE                 1024        
#define MBED_CONF_EVENTS_SHARED_STACKSIZE                 2048        
#define MBED_CONF_EVENTS_USE_LOWPOWER_TIMER_TICKER        1           
#define MBED_CONF_TARGET_LSE_AVAILABLE                    1           
#define MBED_CONF_EVENTS_SHARED_HIGHPRIO_EVENTSIZE        512         

// Macros
#define MBED_TICKLESS                                                 
#define NO_SYSTICK

After some testing using some gpio pins I have found out that the processor hangs in a continuous loop of the following functions:

rtc_irq
    lp_ticker_irq_handler()
        irq_handler() default -> ticker_irq_handler
            ticker_irq_handler()
                first loop
                    Set systick irq
                second loop
                    schedule_interrupt()
                        rtc_set_wake_up_timer()
exit_irq

systick_irq
    SysTick_Handler()
        osRtxTick_Handler()
            OS_Tick_AcknowledgeIRQ()
                os_timer->schedule_tick() with default delta = 1
                    insert_absolute()
                        ticker_insert_event_us()
                            update_present_time()
                            schedule_interrupt()
                                fire_interrupt()
exit_irq

The rtc_irq sets the systick_irq request and vice versa.

When the processor hangs it seems the in the os_timer->_tick is out of date. Whereas normally it is in sync with the lp_data->queue->present_time. This causes the interrupt to be scheduled in the systick irq handler to fire immediately because the event is requested in the past. The synchronization should happen in the critical section of the default_idle_hook but this seems to have stopped for some reason.

I will continue to investigate the problem but there is an issue here.

closed_in_jira st mirrored bug

Source

marcemmers

Most helpful comment

We have been running 5.9.3 which includes #7365 for a while and I haven't encountered this problem since so this can be closed.

marcemmers on 23 Aug 2018

👍2

All 26 comments

A potential cause might be a ticker value that is smaller than the previous:

ticker_event_queue_t *queue = ticker->queue;
uint32_t ticker_time = ticker->interface->read();
if (ticker_time == ticker->queue->tick_last_read) {
    // No work to do
    return;
}

uint64_t elapsed_ticks = (ticker_time - queue->tick_last_read) & queue->bitmask;
queue->tick_last_read = ticker_time;

If ticker_time reads smaller than that what is stored in tick_last_read for whatever reason the result is a small negative number cast to a uint becoming a very large number which is then added to the present_time.

Edit:
After some more testing it does occur that the ticker_time is smaller than the tick_last_read. It seems the ticker_time is exactly 1s - 1us earlier than the tick_last_read. This might be an issue with the rtc synchronization.

In the datasheet it says we must wait for the RSF flag to be set to verify that the shadow registers have been set.

Edit2:
My previous edit seems to be false. It sometimes is 1s - 1us earlier but it can also be less than 1s off.

I now run into a breakpoint if tick_last_read is bigger than ticker_time. If i then compare the timestructs I have saved after reading out the rtc twice they are equal.

marcemmers on 22 Mar 2018

@c1728p9 Can you review please?

0xc0170 on 22 Mar 2018

cc @ARMmbed/team-st-mcd

0xc0170 on 22 Mar 2018

I have now bypassed the shadow registers and followed the instructions in the datasheet (reading twice and verifying the data) and this seems to solve the issues I am having. I will let this run for the weekend to see if it is indeed a suitable solution.

marcemmers on 23 Mar 2018

After a weekend of running two devices: one is still running without problems, one is stuck in the same problem as we had before.

Most likely it can still happen that the read time is earlier than the stored previous time. Might be an implementation issue on my side. I will see if I can improve on that.

Otherwise it might be an idea to make the update_present_time function a bit more robust to prevent this issue also.

marcemmers on 26 Mar 2018

@marcemmers thanks for sharing your inputs and findings. One question : why are you changing the default STM32L072 configuration like this;

define MBED_CONF_TARGET_LOWPOWERTIMER_LPTIM 0

Is there a specific reason for that ?

LMESTM on 26 Mar 2018

@LMESTM Yes there is, we are using the LPTIM as a PWM output for driving some leds during sleep. Hence we couldn't use it also as base clock for the OS.

marcemmers on 26 Mar 2018

@marcemmers that's interesting input - thanks. And what low power modes are you entering ? Would like to understand if you're using the LP TICKER as a very periodic wake-up source (which period ?) or more like an RTC alarm for waking-up after a rather long sleep period

LMESTM on 26 Mar 2018

@LMESTM We are entering stop mode via the mbed sleep api.

In the configuration we are using the operating system is using the wakeup event from the RTC to go into sleep until the next event needs to be activated. I am not actively controlling the RTC for this. The rtc date/time is used as the base time from which the us time is calculated.

I did not have to change anything except using the previous configuration.

marcemmers on 26 Mar 2018

@marcemmers
would you try out the commit here and let me know if this helps:
LMESTM/mbed@bbc01b8

LMESTM on 3 Apr 2018

@LMESTM I am sorry, I missed your last comment. I will try this in a few days. We have been running using bypassed shadow registers for a while now without random crashes.

There is however another problem that has come up by using the RTC as the main timer. The RTC function rtc_read_us() returns a uint32_t. A uint32_t can only contain 4294 seconds in us accuracy. This causes the result to overflow several times each day which is fine.

However, at the end of the day this does become a problem. 23:59:59 translates to 499154100us or 0x01DC07CB4 where the next second will result in 0us. This causes the same issues in the mbed_ticker_api layer as discussed before in:

ticker_event_queue_t *queue = ticker->queue;
uint32_t ticker_time = ticker->interface->read();
if (ticker_time == ticker->queue->tick_last_read) {
    // No work to do
    return;
}

uint64_t elapsed_ticks = (ticker_time - queue->tick_last_read) & queue->bitmask;
queue->tick_last_read = ticker_time;

The way I see it there are 2 possible fixes for this issue:

Also use the date struct to calculate the time in the rtc_read_us function.
Keep an internal timer in the rtc_api file that can be incremented with how much time has passed. This also fixes the issue that the RTC time cannot be set while using the rtc_read_us as a timebase because setting the time will mess up the readout.

marcemmers on 16 Apr 2018

@marcemmers note that I've pushed a PR to master which includes similar fix as I pointer in previous comment. You'd rather rebase on master to get the latest fix. Anyway from your analysis, this looks like a different matter than the one I fixed

LMESTM on 16 Apr 2018

@LMESTM Yeah I figured. I will make a PR for the fix of the shadow registers, we can then discuss if the fix i have made is acceptable.

Should i make an other issue for the RTC overflow problem or should we continue here?
@0xc0170 ?

marcemmers on 16 Apr 2018

Should i make an other issue for the RTC overflow problem or should we continue here?
@0xc0170 ?

This issue is still present on https://github.com/ARMmbed/mbed-os/tree/feature-hal-spec-rtc ?

cc @c1728p9 you might be interested in the above report

0xc0170 on 16 Apr 2018

This is already based on the changes in commit 9fb865ae576f47aa924e445a800c1113f623a435 that is present in the feature branch and already merged to master.

There are no other changes to the rtc_read_us part so it seems the issue would still be present.

Edit:
A cleaner solution might be to calculate the time using _rtc_maketime() and then multiply it to create a us time in the rtc_read_us function.

marcemmers on 16 Apr 2018

Hi
Interested in the results with the latest implementation.
Thx

jeromecoutant on 25 Apr 2018

I am testing the newest feature-hal-spec-ticker now with #6788. I also had to undo the changes made in #6561 because my serial didn't work anymore.

marcemmers on 2 May 2018

what is the solution to this problem

helmi-haddad on 12 May 2018

I am testing the newest feature-hal-spec-ticker now

Any good news ? :-)

jeromecoutant on 18 May 2018

@jeromecoutant It seems to be running stable however I am still a bit wary about the RTC readout problems I have experienced before that don't seem to be fixed in the ticker branch.

marcemmers on 25 May 2018

ARM Internal Ref: MBOTRIAGE-209