Marlin: [BUG] THERMAL RUNAWAY NOT FIXED FOR MKS SBASE

Created on 27 Apr 2019 · 29 Comments · Source: MarlinFirmware/Marlin

Took a fresh download of the 2.0-bugfix repository today (approx 20:15 CET)

Compiled for MKS SBASE v1.3 with 12864 display.

Compiled the code and uploaded to the board - but as soon as the new firmware is booted, the display shows "BED THERMAL RUNAWAY"

This even happens with THERMAL_PROTECTION_BED commented out

Steps to Reproduce
Configure as with the supplied config files - set target to LPC1768 in platformio.ini
Compile in VScode
Upload firmware to SDcard

#13783

Expected behavior:

Board would boot and show the normal info/status screen

Actual behavior:

BED THERMAL RUNAWAY

displayed on LCD

Potential ?

All 29 comments

That's funny. I am on the current latest bugfix-2.0.x with an MKS SBASE right now and it doesn't show that error. It DID fail as you described a few commits ago (you can see my comment in #13783), but that was fixed and the error is now gone on mine.

Today I upgraded the firmware on an SKR 1.3. When I stop a print and start printing again, that message comes up.

Compiled the latest version (6 hours old). No improvement with THERMAL_PROTECTION_GRACE_PERIOD set to 800; moving on to 1000.

Well, it appears that reflashing the firmware doesn't necessarily mean that it updates the grace period. In frustration I uploaded Smoothieware, then re-uploaded Marlin 2.0, and now 800 works just fine. Odd.

Worked until the first restart, then it did it again! :( So I re-flashed Smoothieware, then re-flashed Marlin: same error. Upped the grace period to 5000 per #13783, no change. I am at my wits' end....

EEPROM clearing, M500 etc.? Maybe? I have an MKS SBASE and it's OK on the defaults... I have not looked at the code, so whether EEPROM settings are involved or not, I dunno. But something is fishy here...

I agree, @FanDjango. I can't connect via serial while in the reset mode, but you gave me an idea: I will reflash with EEPROM off, then reflash with EEPROM on, and see what happens....

Okay, so it is definitely related to the EEPROM being cleared. I set it to 5000 for now... but this is an odd one. Definitely a pain to have to flash twice to use the latest firmware. Wonder what changed?

Hmm, I suspect this is all down to how long it is taking Marlin to start running. It looks like the grace period is defined as a period of time starting from when Marlin first boots. But really what it is trying to do is to allow a certain number of samples to be taken before it applies the checks, and these are not the same thing. For instance, if the grace period is 800 ms and it takes, say, 500 ms before Marlin gets around to starting to sample, then it will have 300 ms worth of samples when the tests start to apply. But if Marlin takes, say, 1000 ms to get started, then it will not have any time to build up samples before the checks are enabled.

So things like establishing a USB connection, initialising the SD card and perhaps handling EEPROM will impact how well this works; even things like displaying splash screens may impact it. At least that is my take on it. Perhaps the grace period should only start when Marlin begins to sample temperatures?

@cinealfa Do you have anything that means Marlin takes longer than normal to start?

@p3p @thinkyhead Any thoughts on this theory?

@gloomyandy Well, I do have an Ender splash screen... but should the setting "stick" until I shut off the EEPROM and turned it back on? Even after a refresh? I am leaving it at 5000 ms for the time being.

@gloomyandy Grace period is set on the first call to manage_heater(), which is done some time during boot (but way after the bootloader slowness). This is also the first time it is supposed to check the thermal runaway conditions, so the grace period kicks in right about when it's needed.
@cinealfa As far as I am aware, grace period is not saved in the eeprom settings, might be something else at play here that is saved there. The need to double flash is indeed weird, I have never seen that. You could try doing a clean build - remove the .pioenvs and .piolibdeps directories and upload the firmware via PlatformIO again. That will make sure you're on the latest deps and platform while building Marlin.

@Idorobots Hmm, are you sure that is when the following will be executed...

     static millis_t grace_period = ms + THERMAL_PROTECTION_GRACE_PERIOD;

I may be wrong but I thought that statics had to be initialised "before any other code in the same compilation unit is called" from what I remember of C++. In this case there are things like the isr that are probably called pretty early on (and which are in the same compilation unit as this code I think). I'm also not talking about any bootloader slowness. More things like setting up the USB stack, the SD card and other Marlin startup stuff.

@gloomyandy I think this would be true of C, where the compiler would scream at you that ms is not a constant expression when used in a static variable initializer, while in C++ this is legal.

If the USB stack setup (performed by the HAL code, right?) is done before the first call to manage_heater() then we should be good, if it's done afterwards, then indeed it needs to be taken into account in the grace period.

I'm not sure why there is such large variability between different LPC1768-based boards though. One would think that the startup time would be about the same regardless of the hardware configuration as long as they are using the same chip; however, some of the boards seem to be happy with 500 ms, while others require extra time. I only have a ReARM, so I can't really compare it to anything else to try to figure this out.

@Idorobots Ah this is a static local variable in which case the rule is.... "Variables declared at block scope with the specifier static have static storage duration but are initialized the first time control passes through their declaration (unless their initialization is zero- or constant-initialization, which can be performed before the block is first entered). On all further calls, the declaration is skipped."

So you are correct this should be OK.
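For anyone following along, here is a minimal sketch of the block-scope static rule quoted above. It is not the Marlin code: the function name in_grace_period and the constant GRACE_PERIOD_MS are made up for illustration, and the time is passed in as a parameter rather than read from a timer.

```cpp
#include <cstdint>

using millis_t = std::uint32_t;  // stand-in for the firmware's millisecond type

// Hypothetical value standing in for THERMAL_PROTECTION_GRACE_PERIOD.
constexpr millis_t GRACE_PERIOD_MS = 500;

// Returns true while the grace period is still running.
// `ms` is the current time in milliseconds, supplied by the caller.
bool in_grace_period(const millis_t ms) {
  // Block-scope static with a non-constant initializer: evaluated once,
  // the first time control passes through this declaration, using the
  // `ms` of that first call. Later calls skip the initialization.
  static const millis_t grace_end = ms + GRACE_PERIOD_MS;
  return ms < grace_end;  // ignores timer wrap-around for simplicity
}
```

If the first call happens at ms = 1200, the end of the grace period is fixed at 1700 no matter how long the code before that first call took to run.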

However, having said all of that, I'm not convinced that the current solution is really a good one. It uses a time period to try and ensure that sufficient samples have been added to the filter to get a valid reading. But it could be that in some circumstances time does not map well to the number of samples. For instance, I seem to remember that the SD card is initialised by the menu code after the system is "running"; this initialisation can take a long time to run and may be reducing the number of samples obtained substantially. There may be other things that also happen at startup that can have a similar impact. Wouldn't it be simpler just to have some minimum number of samples rather than a time value?
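A rough sketch of what a "minimum number of samples" gate could look like, as opposed to a time-based one. Everything here is hypothetical: SampleGate, MIN_SAMPLES and the threshold of 32 are illustrative choices and do not come from the Marlin source.

```cpp
#include <cstdint>

// Hypothetical threshold: how many raw ADC samples the filter must have
// consumed before the protection checks trust its output.
constexpr std::uint16_t MIN_SAMPLES = 32;

struct SampleGate {
  std::uint16_t count = 0;

  // Call once for every raw ADC sample fed into the filter.
  void on_sample() {
    if (count < MIN_SAMPLES) ++count;  // saturate instead of wrapping around
  }

  // True once enough samples have been seen, however long that took.
  bool ready() const { return count >= MIN_SAMPLES; }
};
```

With a gate like this, a slow SD card init or splash screen only delays the moment the checks start; it cannot cause them to start on an empty filter.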

@gloomyandy Yup, that would be much better. Alternatively, one could lower the K-value of the LPF applied to the samples as well as the median filter size. In my case it does make it work even without the grace period, while still providing proper values. I hear that it might not be the case for some boards though.

I hear that it might not be the case for some boards though.

It's more a matter of how noisy the ground/signals are on the specific setup than the board. The high level of default filtering is what was required to stop getting reports. I was incredibly surprised at the time how much I needed to add for some people, when on the same boards I didn't need any filtering (ReArm or SBase).

SDcard init is the only thing I can see causing such variability in boot times.

@p3p @Idorobots Perhaps the BLTouch may also introduce some sort of delay into things....
https://github.com/MarlinFirmware/Marlin/issues/13835
Any suggestions as to what they can do?

An enabled BLTouch makes marlin.cpp call its init() routine, which currently consists of a RESET and a STOW command, for a total of BLTOUCH_DELAY * 2 = 375 * 2 = 750 ms. That value can be configured to be longer and the trend is towards higher values, potentially up to 2-3 seconds or even more in the future.
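To put numbers on that, here is a small sketch of the arithmetic. It assumes the BLTouch init blocks after the grace period has already started, which this thread does not confirm. The helper names and the _MS constants are illustrative only.

```cpp
// Values quoted in the comment above. The names are illustrative and
// are not taken from the Marlin source.
constexpr long BLTOUCH_DELAY_MS = 375;     // per-command delay
constexpr long BLTOUCH_INIT_COMMANDS = 2;  // RESET + STOW

// Total time the BLTouch init blocks for.
constexpr long bltouch_init_ms(long delay_ms, long commands) {
  return delay_ms * commands;
}

// Grace period left after a blocking startup step. A negative result
// means the grace period ran out while the step was still blocking.
constexpr long remaining_grace_ms(long grace_ms, long blocked_ms) {
  return grace_ms - blocked_ms;
}

static_assert(bltouch_init_ms(BLTOUCH_DELAY_MS, BLTOUCH_INIT_COMMANDS) == 750,
              "375 ms per command * 2 commands");
```

Under that assumption, a 500 ms grace period would be used up 250 ms before the init returns, while a 5000 ms one would still have 4250 ms left.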

However, having said all of that, I'm not convinced that the current solution is really a good one. It uses a time period to try and ensure that sufficient samples have been added to the filter to get a valid reading. But it could be that in some circumstances time does not map well to the number of samples. For instance, I seem to remember that the SD card is initialised by the menu code after the system is "running"; this initialisation can take a long time to run and may be reducing the number of samples obtained substantially. There may be other things that also happen at startup that can have a similar impact. Wouldn't it be simpler just to have some minimum number of samples rather than a time value?

I agree with this.

Any suggestions as to what they can do?

As a workaround, large values for THERMAL_PROTECTION_GRACE_PERIOD should work, but a better system is needed. Adding a HAL API method for adc_reading_stabilised seems a little overkill, as LPC176x is the only platform that has filtered ADCs.

Is there any danger if I increase the THERMAL_PROTECTION_GRACE_PERIOD from 500 to 5000? Even if I have a thermal runaway, wouldn't a 5 second grace period still give plenty of time to catch it before damage is done?

Given that it only applies when you initially boot the printer, it is hard to imagine how a 5 second delay could cause any problems at all!

Should I keep this open? It sounds like the ball is still in play. I will try a clean build again this next weekend, @Idorobots, even though I started with a completely fresh download, save for the config files. But I will try for the good of SCIENCE!!

I ended up setting THERMAL_PROTECTION_GRACE_PERIOD to 10000 (10 seconds) to help ensure I don't run into this again with a future firmware update.

@cinealfa yes I think you should keep this open. We may have a workaround for the problem, but it is not really fixed.

Agreed. Does anyone know what changed that caused this?

The thermal runaway / faulty sensor detection was changed so that it is always active, not just when the heaters are active. As a result, on the LPC176x platforms the ADC filters don't get time to stabilise on boot before the protection detects that they are not in the expected range. The THERMAL_PROTECTION_GRACE_PERIOD setting just gives them some time to get enough samples before the protection kicks in.
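To illustrate why a filtered ADC reading needs a number of samples before it is usable, here is a sketch of a generic first-order low-pass filter starting from zero. This is not the LPC176x HAL filter: the smoothing factor alpha, the function name and the tolerance are assumptions for illustration, and alpha is not the same parameterisation as the K-value mentioned earlier in the thread.

```cpp
#include <cmath>

// Counts how many samples a first-order low-pass filter
//   y += alpha * (x - y)
// needs, starting from y = 0, before its output is within `tolerance`
// (a fraction of the input) of a constant input `x`.
// Returns `max_samples` if the filter has not settled by then.
int samples_to_settle(double alpha, double x, double tolerance,
                      int max_samples = 100000) {
  double y = 0.0;  // filter state right after boot
  int n = 0;
  while (n < max_samples && std::fabs(x - y) > tolerance * std::fabs(x)) {
    y += alpha * (x - y);  // one filter update per sample
    ++n;
  }
  return n;
}
```

The heavier the filtering (smaller alpha), the more samples pass before the output gets near the real reading, so any check that runs earlier sees a value that is still far too low.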

A problem at boot-up in the temperature code has been found which @gloomyandy and @Roxy-3D helped find. Hopefully it will fix these problems once and for all. I'm able to disable the grace period on my SKR board completely (though I prefer to keep it) with the fixed code.

Hopefully they will fix the code for you very shortly.

If you want to try it yourself, let them know if it works for you. Have a look at https://github.com/MarlinFirmware/Marlin/pull/13888, the source code fix is near the end of the thread.

Lack of Activity
This issue is being closed due to lack of activity. If you have solved the
issue, please let us know how you solved it. If you haven't, please tell us
what else you've tried in the meantime, and possibly this issue will be
reopened.

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
