Marlin: bugfix 2.0.x: Occasional hotend temperature glitch when printing

Created on 20 Jul 2018 · 58Comments · Source: MarlinFirmware/Marlin

Description

I've recently been seeing drops in temperature during printing. My hotend temperature will be set at say 200 and will be being maintained at that +/- a few degrees. However from time to time I will see a reading of around 176.

I've spent some time digging into this by adding some debug code into the temperature.cpp file. What I see is that normally over a period of a second or so the min and max raw ADC readings for the temperature sensor are 86/87 however from time to time I get a range of 86/937. Digging into this further I added code to record the min and max time from the call to start the ADC to reading it. Normally these values are around 944/1140 uS. However when I get the above "odd" reading I get a minimum period of only 28uS. My understanding is that the Atmega 2560 ADC conversion time minimum is around 100uS so I suspect this is what is causing the glitch. My guess is that at times other interrupt routines are delaying the start of the ADC conversion resulting in the short conversion time.

This will almost certainly be Arduino specific and probably depends upon what options you have enabled (and hence how long the other interrupt calls run for).

I'm going to add code to ensure that there is always a minimum delta of 100uS between successive calls to the temperature interrupt routine. But is there a better way to fix this?

Steps to Reproduce

Run a print job and closely monitor the hotend temperature reports.

Expected behavior:

Temperature should remain constant within a few degrees of the set point

Actual behavior:

Temperature occasionally drops by 25 or so degrees.

Additional Information

Configuration files are here:
https://github.com/gloomyandy/Marlin/tree/bugfix-2.0.x/Marlin

Source

gloomyandy

All 58 comments

There aren't so many things what could delay/interrupt the temperature interrupt. About the only one coming to my mind is the stepper-IRQ Try to move slower in z or go down with steps/mm.

A bit more challenging would be to investigate/change the temp-IRQ. Are the new endstop tests inserted before or after the ADC handling. Moving ADC handling before everything else would give the stepper-IRQ less chance to interrupt the temp-IRQ before the ADCs are handled - should result in less jitter - on cost of more jitter in the soft-PWM, endstops, ... .

AnHardt on 21 Jul 2018

All ADC peripherals have CONVERSION_DONE bit. This bit should ALWAYS be polled to ensure the read value is the result of a full conversion.
Yes. using the timer ISR as "delay" between phases has the risk of something delaying the execution of one temperature ISR (for example, the temperature ISR is preempted by the stepper ISR and the UART isr) and then the delay btween starting the ADC conversion and the read of the result will not be respected. There is an easy fix: As the ADC is being driven by a state machine, if the ADC result is not ready, just instead of reading the value, keep the state machine at the read state, and let the next ISR read the value

ejtagle on 21 Jul 2018

I added a simple check on the time since the last call, if it was less than 100uS then I simply did a return right at the start of the isr and in effect waited for the next isr tick. What was surprising was that I think this is happening far more often than I originally thought. Before the above change I was seeing a fluctuation in hotend temperature during printing of around 5 degrees or so on a reasonably frequent basis. After adding the above change not only do I not see the occasional big dip, but I now see hardly any fluctuation in temperature at all, so I suspect that "missing" the conversion time must be happening reasonably often though probably resulting in smaller dips in temperature readings. These dips are probably triggering an incorrect increase in the hotend temperature which is not actually required.

@ejtagle & @thinkyhead is the ADC complete bit currently exposed via the HAL in some way? If so I can try testing for this rather than my rather crude time based approach. If the complete bit is not exposed then I guess we will either need to expose it, or perhaps define the "ADC" macro in some way such that it can return a not ready indication?

gloomyandy on 21 Jul 2018

There is currently no HAL exposed method for retrieving the conversion state of the ADC peripherals, but it can easily be added, The MACROS are HAL_READ_ADC and HAL_START_ADC in HAL.h, HAL_READY_ADC (or something) would work.

edit: Although I did add the HAL_adc_finished() function to the LPC176x platform it appears, so this must have crossed my mind at the time.

p3p on 21 Jul 2018

Hmm, I was reasonably happy to produce a PR for a crude time based fix for this, but I'm not sure I fancy creating one that requires adding new functionality to all of the HALs. What is the process for such a change? It seems unlikely that anyone could easily test them all?

I'm also a little concerned about some of the other code in this isr, will all of the other functionality (like the soft PWM code) be OK when you can have a call to call delta that is as low as 28uS? Does anything that hangs off that code require any sort of minimum pulse width?

I suspect that a lot of people will hit this issue as I really don't have anything very extreme in my configuration or machine setup. I'm just using junction deviation, linear advance and UBL with pretty standard step rates (only 16 microsteps, standard belts for X/Y and a standard screw thread on Z), operated at pretty modest speeds. I've had a few thermal runaway halts that I'm pretty sure have been caused by this.

gloomyandy on 21 Jul 2018

@gloomyandy: Just add for each hal a dummy function HAL_IS_ADC_CONVERSION_DONE that returns true, and properly implement it for AVR. I can implement the DUE version. Do not skip the whole timer interrupt, just the temperature measurement part :+1:

ejtagle on 21 Jul 2018

👍1

i think i'm hit by this issue too

negspikes2
negspikes3

what i tried to "fix" it:

PID tune
New wires for thermistor all the way from Ramps to Thermistor
New thermistor cartridge from [e3d-online.com]
New heater cartridge also from [e3d-online.com]
rechecked that all wires in screw therminal are secured in place and no wires are sticking out.
rechecked that all screws in screw terminals are tight
tried to switch arround the thermistor to plug with no noise on
tried new mega board
tried another PSU
tried with a noname 100K thermistor directly to ramps board, wires are just long enough
checked that thermistor pin headers make good contact by bending them just slightly apart
tried to drive extruder heater from bed output to rule out a bad FET

boelle on 24 Jul 2018

@gloomyandy

to test this am i correct that the only 2 files i have to overwrite are

Marlin/src/HAL/HAL_AVR/HAL.h
Marlin/src/module/temperature.cpp

boelle on 24 Jul 2018

Yes that should do it!

gloomyandy on 24 Jul 2018

Assuming you are using AVR of course!

gloomyandy on 24 Jul 2018

oki... you will know shortly if it did the trick for me

i just need to move wires for nozzle and bed back to where they belong

and btw the 2 graphs i put in are not the worst ones :-P

boelle on 24 Jul 2018

i'm.... RAmps+Mega

boelle on 24 Jul 2018

it helped a LOT for the extruder heater, but printing in PLA i have the bed off and now that gets crazy amount of positive spikes, and extruder 2 only have a temp sensor connected for now and that also gets spikes even thou its just hanging outside far away

pos_spikes

both blue and orange should be flat at this point, blue will rise slowly in the start as the hot plastic is put down

boelle on 24 Jul 2018

Perhaps you should try without my changes but exactly the same hardware setup and see if you still get the spikes on the other sensors, they may not be anything to do with the firmware changes.

gloomyandy on 24 Jul 2018

without your changes but the same setup i dont get spikes on the bed, look at my graph just a few post up

what is your setup?

boelle on 24 Jul 2018

would it make sense if i had the excat same setup but just did the same print without your changes?

boelle on 24 Jul 2018

But that graph was produced before you moved around various wires/connections (you said above that you needed to change some things back before you tested the updated firmware), perhaps in making these changes you are now picking up noise. Best to only change one thing at a time if possible, so worth doing the test with and without the firmware changes with exactly the same hardware setup.

gloomyandy on 24 Jul 2018

yep i had the extruder heater on the bed output to test if the FET's was going bad and making noise

that was all

but yes... i will let this print run to end, then undo ONLY your changes and do it again with the same gcode

boelle on 24 Jul 2018

👍1

neg_spikes

same gcode and no wires touched - without your changes

boelle on 24 Jul 2018

with your changes

pos_spikes

boelle on 24 Jul 2018

and still with no changes to hardware but i printed a fitment/tolrance test and turned the bed on
with bed

what is strange is that the orange line has spikes and its just a dummy sensor i have to get ambient temp

boelle on 24 Jul 2018

Thanks for trying that I've just fixed a couple of stupid copy and paste errors one of which may have been causing the problems you are seeing. Could you give the updated version a try?

gloomyandy on 24 Jul 2018

yep sure

i'm trying to get my tolerances right (square box with circle hole and a matching disc) so i will be doing a lot of those small prints anyway

boelle on 24 Jul 2018

still about the same
with bed1

boelle on 24 Jul 2018

I'm surprised at that can you check to make sure that you have the following lines

if (HAL_IS_CONVERSION_DONE()) \
       var += HAL_READ_ADC; \
else \
       adc_sensor_state = (ADCSensorState)(int(adc_sensor_state) - 1);

and in particular that the last line is not

       adc_sensor_state = (ADCSensorState)(int(SensorsReady) - 1);

Thanks

gloomyandy on 24 Jul 2018

  /*
   * This gives each ADC 0.9765ms to charge up.
   */
  #define ADD_ADC_SAMPLE_IF_READY(var) \
    if (HAL_IS_CONVERSION_DONE()) \
      var += HAL_READ_ADC; \
    else \
      adc_sensor_state = (ADCSensorState)(int(adc_sensor_state) - 1);

  switch (adc_sensor_state) {

    case SensorsReady: {

i copied the whole file so everything is ok...
what i do on github is viewing the whole file in raw and i just select everything and copy... and in notepad++ i select everything and hit insert

but yeah its a strange one

boelle on 24 Jul 2018

whole file as is here: https://pastebin.com/0XdCdaWg

boelle on 24 Jul 2018

Hmm that is strange. I'll see if I can reproduce the temp glitches on the bed. Could you post the section from your configuration file that has the various temp sensor configurations...

gloomyandy on 24 Jul 2018

the whole configuration.h: https://pastebin.com/SXvUbXLW

partner is home and asked if i could go help with shopping bags hence the link to whole file

boelle on 24 Jul 2018

No problem!

I've just fixed a couple of other mistakes (that I missed in the last change), I doubt they will make any difference. But it is probably worth updating. I'm going to try and set things up to match how you have things and see if I get the same glitches.

gloomyandy on 24 Jul 2018

might be worth noting that i have a dummy extruder in my setup to allow me to see ambient/chamber temp directly in octoprint

the chamber temp feature only sends the temp on serial and i cant see that in octoprint, only if i look at the terminal..

chamber is the orange line

what i did was adding a stepper with free pins and then i changed settings to say i have 2 extruders

so yeah i used free pins to create another extruder stepper and there was also free pins for a 2nd heater and fan.... my goal was to be able to use that and have an electrical heater at some point for ABS prints, but that would prob not work as it would be to slow in heating up. so for now i just use it to check temp

boelle on 24 Jul 2018

for the fun of it my configuration_adv.h: https://pastebin.com/SL3nMPBi

boelle on 24 Jul 2018

Thanks! I don't think I can recreate your setup as I have my extruder cooling fan on the mosfet that would be used for E2. But I think I've been able to reproduce the issue with just a single extruder. Will let you know!

gloomyandy on 24 Jul 2018

I dont have a 2nd extruder heater or fan connected, only the temp sensor for a 2nd extruder,
the rest are just dummies created with free pin numbers

:-D

boelle on 24 Jul 2018

OK I've just pushed a couple of changes that I think may help things. You will need temperature.cpp and temperature.h. If you get chance could you give it a try?

gloomyandy on 24 Jul 2018

I will do the first thing in the morning while its still reasonble cold
all this heat makes my brain go very slow :-(

boelle on 24 Jul 2018

👍1

i had a cold drink and did a test
late_test
tomorrow i will go back to using my 48A ATX psu but i expect no changes other than psu config :-D

boelle on 24 Jul 2018

👍1

someone on reprap.org suggested that temp readings should be an avg of the last 10 readings
i dont know if that would have any negative effects?

but lets say that we have 60 readings every sec (maybe overkill) so instead of reporting 60 readings we report the avg of every 10 and only report back 6 readings... just an example as i dont know how many readings are actually taken

boelle on 29 Jul 2018

The reading you see is already an average of 16 readings a reading is taken every 10mS and a new averaged sample is produced every 160mS, so you do indeed get approx 6 readings per second. There may be additional processing on this reading before you see it, but I've not looked at that.

gloomyandy on 29 Jul 2018

my thought was just that do we need that much?

boelle on 29 Jul 2018

Well I guess you need a reasonably high frequency for the PID controller to be able to maintain the temperature. The averaging will help with any noise issues. Perhaps you could sample less often, but I suspect the cpu saving would not be that great.

gloomyandy on 29 Jul 2018

I think I'm seeing the same problem. Most of the time the hotend temp is very stable, but then I get times where the temperature becomes very erratic and sometimes causes a runaway thermal reset. Like @boelle I've replaced the entire heater block with the latest from e3d which has the thermistor cartridge with the nice plastic connector that hooks to a single run of wire to the RAMPS board. I'm using a version of bugfix-2.0.x by tcm0116 that fixes another bug involving retraction. I'll probably try this fix and see if I see things being more stable.

brucehvn on 9 Oct 2018

@thinkyhead, has this fix been merged into the main Marlin bugfix-2.0.x branch?

brucehvn on 16 Oct 2018

I assume with all this gabbing that someone actually did something. The last comment was July 29. If it hasn't been addressed by now then someone who has a clue needs to get on it.

thinkyhead on 17 Oct 2018

@gloomyandy, @boelle can you confirm that this fix got into the main bugfix branch? Today, I had the temp jump to about 450C in a couple seconds. I was really lucky it came down before the thermal protection kicked in.

brucehvn on 20 Oct 2018

Yes the code has been in for some time, I don't think the problem you are seeing is this one. I think you need to open a new issue and provide a lot more details.

gloomyandy on 20 Oct 2018

It could still be the same problem. I've been on a branch from around the time you started this thread due to another issue that was fixed by that fork. Once I verify that other fix has also been merged, then I can go back on the main bugfix-2.0.x branch. But either way, if the problem still exists after I go back on the main branch, I'll put in another ticket.

brucehvn on 20 Oct 2018

@brucehvn

at least provide a temp graph like i did, it will tell a lot more than words :-D

boelle on 20 Oct 2018

Ok, luckily I was able to just grab one when I noticed the temp was fluctuating again. I'm using Repetier Server and it doesn't show a large time frame in the graph. This is a fairly small fluctuation compared to some I get.
selection_173

brucehvn on 20 Oct 2018

Another example during the same print after about 20 minutes of pretty stable temps.
selection_174

brucehvn on 20 Oct 2018

this is a little more than a "glitch" or spike

sure its not a thermistor that has one leg in the coffin?

boelle on 20 Oct 2018

@brucehvn
What processor do you use?
The above 'patch' does only work for the AVR-processors.
If i'v seen it right the other platforms all do use analogRead() storing in a variable to get the value later and fake the conversion ready test. Because analogRead() is busy waiting for the result this should work nicely, but is poisoning the timing.

Got surprising results.
Testet the duration of analogRead()

// MEGA   time per ADC sample = 112.0000000000µs =   8.9285717010 Khz
// DUE    time per ADC sample =   4.0000000000µs = 250.0000000000 Khz 30 times faster
// F429ZI time per ADC sample =  53.0000000000µs =  18.8679245283 Khz 2,113207547169811 times faster That's disappointing

For the MEGA this is the expected bad value. For good reason we don't use anologRead() at the AVRs, but start the conversion and pick up the result later.
At the DUE analoRead() is about 30 times faster. So we can probably use it.
The F429ZI is an 180MHz ARM and for that surprisingly slow in analogeRead() with stm32duino/Arduino_Core_STM32 kernel.

How does your board perform?

const int analogInPin = A0;
int sensorValue = 0;
void setup() {
  Serial.begin(9600);
  while (!Serial) {}; // wait for serial port to connect. Needed for native USB port only
}

#define NSAMPLES 100000

void loop() {
  unsigned long time1 = micros();
  for (long long i = NSAMPLES; i >= 0; i--)
    sensorValue += analogRead(analogInPin);
  unsigned long time2 = micros();
  Serial.print("time per ADC sample = ");
  double msps = (time2 - time1) / (NSAMPLES);
  Serial.print(msps ,10);
  Serial.print("µs = ");
  Serial.print(1000.0/msps,10);
  Serial.println(" Khz");
  delay(2);
}

// MEGA   time per ADC sample = 112.0000000000µs =   8.9285717010 Khz
// DUE    time per ADC sample =   4.0000000000µs = 250.0000000000 Khz 30 times faster
// F429ZI time per ADC sample =  53.0000000000µs =  18.8679245283 Khz 2,113207547169811 times faster That's disapointing

AnHardt on 21 Oct 2018

@AnHardt : Not all 32bit processors have integrated fast ADC converters. Some of them could be using a ramp based (delta) ADC, that are quite precise, but very slow...

ejtagle on 21 Oct 2018

@AnHardt I'm using the typical Mega 2560 AVR with Ramps 1.4. I've just installed the latest bugfix-2.0.x code and will give it a try first before pursuing this any further.

brucehvn on 21 Oct 2018

@boelle I would tend to agree with you except I've completely replaced the heater block with the latest e3d v6 heater block with the thermistor cartridge assembly. It could be a problem with that, but I saw the same behavior with the old heater block and thermistor, so if it's hardware related, it would probably be something with the analog functionality of the ramps or mega board. I had another issue where the z motors were turning a little bit when a retract/unretract was executed. I spent 2 days swapping out all the hardware, ramps, mega, lcd display, etc., only to find out it was a firmware issue that was being worked on. So nowadays, I check the firmware first, then move on to the hardware if needed.

brucehvn on 21 Oct 2018

@ejtagle
Got some more numbers.

// MEGA/UNO/NANO time per ADC sample = 112.0000000000µs =   8.9285717010 Khz
// DUE           time per ADC sample =   4.0000000000µs = 250.0000000000 Khz
// ESP32         time per ADC sample =   9.0000000000µs = 111.1111111111 Khz
// Teensy++ 2.0  time per ADC sample =  60.0000000000µs =  16.6666666667 Khz
// Teensy 3.5    time per ADC sample =   8.0000000000µs = 125.0000000000 Khz
// Teensy 3.6    time per ADC sample =   6.0000000000µs = 166.6666666667 Khz
// STM32F401RE   time per ADC sample =  56.0000000000µs =  17.8571428571 Khz
// STM32F411RE   time per ADC sample =  53.0000000000µs =  18.8679245283 Khz
// STM32F429ZI   time per ADC sample =  53.0000000000µs =  18.8679245283 Khz

The STM32s ADCs are fast - 1Msample/s should be reachable with the right use. It's the current implementation of analogRead() what takes that long (stm32duino/Arduino_Core_STM32 kernel). For every analogRead() the ADC is - completely set up, start a calibration, wait for calibration, start a single measurement, wait for measurement, shut down ADC.

AnHardt on 25 Oct 2018

@AnHardt : That is the problem with Arduino core--- It is a general purpose framework, and when trying to get maximum performance from the hw, it starts to interfere.
All the digitalXXX() and analogXXX() functions reinitialize everything, just in case, so they are pretty slow. And that is the reason of the fastio.h macros.

I guess that if analogXXX() functions are getting in the way of performance, they should be reimplemented (hey!, performing an ADC conversion is not that hard at all ... ;)

(And BTW, yes, i HATE polling on embeded systems... on STM we should be doing it the same way than in AVR, using an state machine...)

ejtagle on 25 Oct 2018

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.