Marlin: "ok" getting lost in M155 auto-temperature report

Created on 9 Jan 2018 · 51Comments · Source: MarlinFirmware/Marlin

Bug Report

Description: One of our users is reporting an issue with OctoPrint where the "ok" is getting embedded in the temperature auto-report line. This causes OctoPrint to eventually stall (look for the "N7531ok N7531 P0 B3" below):

2017-12-26 16:06:21,805 - Send: N7530 G1 X141.376 Y130.063 E0.3773*98
2017-12-26 16:06:21,816 - Recv: ok N7530 P1 B3
2017-12-26 16:06:21,820 - Send: N7531 G1 X138.650 Y127.336 E0.4529*99
2017-12-26 16:06:21,830 - Recv: T:190.26 /190.00 B:59.98 /60.00 T0:190.10 /190.00 T1:190.26 /190.00 @:64 B@:18 @0:65 @N7531ok N7531 P0 B3
2017-12-26 16:06:23,825 - Recv: T:190.23 /190.00 B:59.94 /60.00 T0:190.07 /190.00 T1:190.23 /190.00 @:66 B@:25 @0:66 @1:66

Our configuration:

BAUDRATE 250000
TX_BUFFER_SIZE 32
AUTO_REPORT_TEMPERATURES
ADVANCED_OK

Cura 2 uses this to re-synchronize itself, so Cura 2 has no problem completing a print, but OctoPrint does. This behavior is present in Marlin 1.1.5, unfortunately I had not had the chance to test it in later versions as our FW is a few versions behind right now.

Expected behaviour: There should be a carriage return before the "ok"
Actual behaviour: The temperature auto-report appears truncated, is immediately followed by a line number, and then the advanced okay string.
Steps to reproduce:
- Start a print
- Enable auto temperature reporting using M115 S4
- Inspect serial log for overrun temperature reports.

Question

Source

marcio-ao

All 51 comments

M105 has always included ok at the start of the output line.
It sounds like something else is going on there.

thinkyhead on 9 Jan 2018

@thinkyhead: Not M105, M155. This is an automatic report. Does that make a difference?

marcio-ao on 9 Jan 2018

@thinkyhead: If I were to make a guess, the "ok" actually belongs to another command. This happens while a print is in progress, so I think while OctoPrint is sending commands, a temperature autoreport gets intermixed with the acknowledgement from a print command. I suppose that would make sense if this was multi-threaded code, but somehow I suspect it isn't, so I'm not sure I understand what is happening.

marcio-ao on 9 Jan 2018

These outputs all happen in the main thread, so there's no chance of an interrupt breaking into the middle of a serial print to print something else. I've just checked all instances of calls to print_heaterstates and they all have SERIAL_EOL() following on the next line, including the auto-temp report.

    void Temperature::auto_report_temperatures() {
      if (auto_report_temp_interval && ELAPSED(millis(), next_temp_report_ms)) {
        next_temp_report_ms = millis() + 1000UL * auto_report_temp_interval;
        print_heaterstates();
        SERIAL_EOL();
      }
    }

Perhaps a communication error…? I recommend first changing the USB cable, then removing all sources of EMF, and reducing serial connection speed to 115K or lower.

thinkyhead on 9 Jan 2018

Would TX_BUFFER_SIZE have anything to do with this? I currently have it set to 32 bytes, which is less than a single autoreport line. That said, I also compiled some custom firmware with that set to 255 and the user reported no differences.

marcio-ao on 9 Jan 2018

I don't think the size of TX_BUFFER_SIZE should matter much. But if you set it to 0 then it uses the non-TX version of serial send.

I'm not too versed on serial comms, so see if this thread offers any additional insight.

thinkyhead on 9 Jan 2018

❤1

@thinkyhead: Thanks. Sounds like the only thing I would see if the buffer got exhausted was more busy-waiting in Marlin. That and the fact that the code is not multi-threaded suggests that the issue is elsewhere, perhaps even an Octoprint bug.

marcio-ao on 9 Jan 2018

I think it's more likely dropped characters due to the 250K serial speed. This line is a dead giveaway:

2017-12-26 16:06:21,830 - Recv: T:190.26 /190.00 B:59.98 /60.00 T0:190.10 /190.00 T1:190.26 /190.00 @:64 B@:18 @0:65 @N7531ok N7531 P0 B3

See how the last value gets cut off? The "1:66\n" never arrives at the host.

Another thing to try is set RX_BUFFER_SIZE to 1024 and enable SERIAL_XON_XOFF handshaking.

thinkyhead on 9 Jan 2018

Wait... RX_BUFFER_SIZE? Why would this make a difference?

marcio-ao on 9 Jan 2018

It depends on what the cause of dropped characters is. Being able to tell the host that the receive buffer is getting full may have a mitigating effect on the dropped characters on the host side. But, yeah, the effect would be indirect. A 115K baud rate is far more likely to stop the errors cold.

thinkyhead on 9 Jan 2018

@thinkyhead Actually, we see a lot more serial errors in 115k than in 250k. 115k is a constant problem on some of our older printers that use that rate. That is in fact why I modified Cura to use ADVANCED_OK, so I could deal with all the lost acknowledgments. I do not know the details, but it was explained to me as mismatch between 115K and the Arduino clock rate which leads to an occasional loss of sync. I suspect the XON/XOFF handshaking may help in the particular problem we are seeing here, except that Cura 2 does not support that yet and I am not sure about OctoPrint.

Anyhow, it sounds like this may not be a Marlin issue. Will close this ticket down and reopen if I find otherwise.

marcio-ao on 9 Jan 2018

A 115K baud rate is far more likely to stop the errors cold.
Actually, we see a lot more serial errors in 115k than in 250k. 115k is a constant problem on some of our older printers that use that rate.

It's true that 250000 allows exact clock-sync while 115200 results in 1.4% guaranteed error rate (which can exacerbate printers with already-marginal error rates). 250000 is the only perfect serial rate for 8-bit AVR. Next one slower is 4800baud [cring] :

http://wormfood.net/avrbaudcalc.php

XON/XOFF handshaking may help in the particular problem we are seeing here, except that Cura 2 does not support that yet and I am not sure about OctoPrint.

@marcio-ao - I submitted FR https://github.com/foosel/OctoPrint/issues/2261 and someone has submitted followup PR https://github.com/foosel/OctoPrint/pull/2346 to implement it. Due to the lack of activity from Gina as of late, I suspect she is heads-down in her famed "_Complete Comms Layer Rewrite_" she's been talking about forever, which I suspect she's hoping will remove the need for flow control. And if it doesn't, I suspect she will request the submitter of the PR to instead "Write a comms layer plugin" :)

-=dave

fiveangle on 10 Jan 2018

👍1

we see a lot more serial errors in 115k than in 250k

Interesting! I always use the SD card because I lost all faith in USB on my first printer build. 😉

It surprises me that there isn't some kind of slewing in the serial library to keep the baud rate closer to the sync. In other words, always run the serial clock at 250K but grab characters from the UART using a Bresenham method.

thinkyhead on 10 Jan 2018

@thinkyhead @fiveangle
I reported something similar with M155 (see #7021) but was less important as I found a workaround for MY problem...
I GUESS that since M155 is totally assynchronous to the Gcode flow, the probability for M155 to issue an ok at the wrong time is NOT null.

To interlock the output on the console would, _I think_, help with the problem, but it would need a major change with a possibly to high a overhead...

Note, I finally reverted to using only M105.
Note, some host software do increase the probability of the issue by setting the M155 delay to 1 second...
Note, in 7021, I wrongly talked about autotemp in reference to M155.

lrpirlet on 11 Jan 2018

@fiveangle

Due to the lack of activity from Gina as of late

It's called a vacation. It's something that sometimes is extremely necessary to stay sane and healthy, and end-of-year is a common time for that.

I suspect she is heads-down in her famed "Complete Comms Layer Rewrite" she's been talking about forever

...

which I suspect she's hoping will remove the need for flow control.

No, she isn't, but she's counting on that being easier to implement then than it is in the current organic wild growth from five years of trying to support every broken firmware variant under the sun.

And if it doesn't, I suspect she will request the submitter of the PR to instead "Write a comms layer plugin" :)

Seriously? Do you intentionally try to antagonize me? Because based on this and past comments from you in other tickets I'm starting to feel you have a problem with me and what and how I do what I do, and I'd like to understand what that is and why. This kind of passive aggressive kicks against my shin are something I can't explain otherwise, and it's frankly starting to become a bit annoying.

foosel on 11 Jan 2018

I'm so sorry Gina. It was not meant as a personal dig. The smiley was intended to lighten the reality of the situation, and that is you don't have much choice given the single-handed nature of your project, and the lack of hours in the day. It, unfortunately, pits you in the seat to have to squelch only the most important-at-the-time fires du jour, and with your plug-in architecture, you quite often use it easy scapegoat do direct others to "help themselves" to the point that I guess I found humorous. It's not any different than other projects like Marlin where a feature request (even a good one) that isn't something anyone is interested in working on at the time, is so often met with a "We'll gladly accept a PR with this improvement !" response.

This is the nature of open source and if we can't laugh about it, we'll cry 😃

I think my suspicions of what you were up to were more of selfish wishful thinking based on your previous expressions of wanting to go "heads-down" to get some of these large features banged out, but a vacation is certainly a better use of time. I'm glad you enjoyed some time away and sorry to trigger you on the negative side on your return with my comments. That wasn't the intention. I'm surprised that this is something you feel is ongoing. I'll try to be more cognizant of your sensitivity to this and try to err on the side of caution in the future.

fiveangle on 11 Jan 2018

Thank you for clarifying and considering. Written communication truly has its pitfalls.

foosel on 11 Jan 2018

I GUESS that since M155 is totally assynchronous to the Gcode flow, the probability for M155 to issue an ok at the wrong time is NOT null.

It is done in the main program flow. And all serial output emanates from there. So there's no chance of an interrupt by M155 (or anything else) in the _middle_ of a line of output.

The probability for M155 to issue an ok at the wrong time is NIL.

thinkyhead on 12 Jan 2018

@thinkyhead,

If I understand you correctly, it is impossible for Marlin to have generated the following sequence of characters (the temperature report is truncated so the "ok" is in the middle of it instead of on a line by itself).

T:190.26 /190.00 B:59.98 /60.00 T0:190.10 /190.00 T1:190.26 /190.00 @:64 B@:18 @0:65 @N7531ok N7531 P0 B3

b-morgan on 12 Jan 2018

@b-morgan Correct. The temperature report comes from one point in the main thread, and the ok is emitted at another point, later, in the main thread. So that output indicates the host never received any characters output by print_heaterstates after the 86th. If this were a case of interruption, we would expect to see the remainder of the temperature report line following the ok line.

thinkyhead on 12 Jan 2018

@b-morgan: The error could also be caused at the receiving end. The characters in the auto-report are being sent as one continuous burst. If the serial library that is listening does not have a big enough receive buffer for the entire line, it could truncate a few characters before those characters were processed by OctoPrint. A few miliseconds later, if Marlin issued the "ok", it would appear at the end of a truncated temperature report, which I suspect is what is happening here. If Marlin were indeed putting the "ok" in the middle of the temperature report, you would see the final characters of the temperature report after the "ok". This is not happening.

marcio-ao on 12 Jan 2018

@thinkyhead
I fully agree with you that serial output is an atomic operation... So a serial output CANNOT be interrupted... Now what about a GCODE issuing SEVERAL serial outputs... Obviously an atomic serial output from a M155 GCODE can be inserted in the middle of a stream of atomic outputs issued by any GCODE and in particular by G29 T1...

When I stumbled across this one, I was under the impression that many GCODEs issuing several atomic serial output before the string "OK" could receive an OK from M155 too early, possibly confusing the host... No I only issued manually and then I never tried G29 T1 in a stream of GCODES... so I have no idea of the result if the host is told too early that the command is done.

Is it host responsibility to handle unrelated atomic output???

Sorry, I may be wrong, if that is the case please excuse me.

lrpirlet on 12 Jan 2018

I built some test FW for @b-morgan that inserts a "safe_delay(100);" in the middle of the temperature auto-report. He has confirmed that this solves the problem. I believe this confirms the theory thatwhen using a dual-extruder, the temperature auto-report line is too big to fit into the RX buffers the host. Adding a delay gives OctoPrint a chance to empty the buffer and resolves the issue. I will experiment with shorter delays.

Anyhow, I realize that this is sort of a kludge, but since the temp auto-report line is the longest line that Marlin issues, just breaking that line up may actually be sufficient.

marcio-ao on 12 Jan 2018

Funny you should go that way! Today I added a patch for serial output to get more time, realizing that might help with the small TX buffer. I don't know if safe_delay(100) is needed. I counted on my fingers the number of milliseconds needed and estimated that a safe_delay(2) between each temperature segment would be enough to get 12 characters out unobstructed. Have a look and see how well the current code works.

thinkyhead on 15 Jan 2018

@thinkyhead Where can I find this patch?

b-morgan on 15 Jan 2018

@thinkyhead: I put in 100ms just to make sure it would solve the problem. I was going to reduce it if it made a difference. If you tell me what commit your patch was, I can backport it into our FW and build some new FW for @b-morgan

marcio-ao on 15 Jan 2018

I can also build the firmware.

b-morgan on 15 Jan 2018

I have opened an issue against OctoPrint (https://github.com/foosel/OctoPrint/issues/2370)

b-morgan on 16 Jan 2018

Where can I find this patch?

The bugfix-1.1.x branch contains it.

thinkyhead on 16 Jan 2018

To catch up on this…. The line delay(2) was added to print_heater_state which is called several times by print_heaterstates. Even this small delay should give the output TX buffer sufficient time to get some characters out and prevent it from overflowing. This might not be true for very low baud rates, so maybe it should be something like delay(250000 / BAUDRATE) to deal with slower connections.

thinkyhead on 22 Jan 2018

@thinkyhead : Can you point out the commit? I can't seem to find it in the bugfix branch. I would like to backport whatever you did to our FW so we stay on the same page.

marcio-ao on 22 Jan 2018

@marcio-ao Let me show you the "Blame" interface… since that's what I will now have to use to find the commit you're looking for…

Aha, here it is!

https://github.com/MarlinFirmware/Marlin/blame/bugfix-1.1.x/Marlin/temperature.cpp#L2203

thinkyhead on 22 Jan 2018

@thinkyhead: I am aware of "git blame", but apparently this isn't in the bugfix checkout I am looking at. Odd. Anyhow, the commit you linked to gives me the information I needed. Thank you!

marcio-ao on 22 Jan 2018

this isn't in the bugfix checkout I am looking at

These aren't the code-droids you're looking for?

thinkyhead on 22 Jan 2018

@thinkyhead : It looks like I was looking at a fork I made on Github to do a PR a while back, rather than one that was pointing to the main Marlin repo. I found the droids I was looking for when I looked in the right place :)

marcio-ao on 22 Jan 2018

This delay does not make any sense to me.
Before i introduced the TX-Buffer, printing just waited actively fir a free tx register before it sended the next char. In my implementation of the TX-buffer, printing just actively waited for a free space in the TX-buffer. So if the buffer is big and empty enough, for the string to print, there was no active waiting. If the buffer was to full, or the string too long, the waiting was the same as before, but on the buffer . not the register. There was no chance for a char to get lost.

This unconditional waitings for 2ms, where at 250000bd 62.5 chars could be send make things much worse than it was without TX-buffer.

Due to the interrupt overhead using the TX-buffer is a bit slower (needs more execution time) than actively waiting for space in the tx-register, but this time is much better distributed. Waiting 2ms every few chars is catastrophic compared to that.

AnHardt on 22 Jan 2018

@AnHardt : The issue is with the other end. Marlin sends faster than the receiving end can process. Since there's no flow control for transmitting, a delay is the easiest way to slow Marlin down. Since the delay is only added when sending the auto temperature report, everything else is not effected. @thinkyhead suggests above that the delay should probably be a function of the baud rate and I agree.

b-morgan on 22 Jan 2018

I am having the same issue

Phits on 3 Feb 2018

@Phits With the current bugfix-1.1.x branch? Darn, thought we patched it.

thinkyhead on 5 Feb 2018

I was able to get it to work by changing the temperature interval (auto report) to 0 in the Octoprint settings

Phits on 5 Feb 2018

@Phits — What baud rate are you using? Try increasing the delay value at the end of print_heater_state (near line 2124 in temperature.cpp). It's currently just 2. Try values from 4 to 10 and see if it improves things.

You can also try disabling AUTO_REPORT_TEMPERATURES and see how this affects it.

thinkyhead on 5 Feb 2018

You can also try disabling AUTO_REPORT_TEMPERATURES and see how this affects it.

This is pretty much what setting the auto report temperature interval to 0 in OctoPrint does - it won't switch to auto reporting in that case, effectively ignoring the auto report capabilities.

foosel on 5 Feb 2018

👍1

I just got the same error with the temperature interval set to 0. The baud rate is set to auto. Any suggestions?

Phits on 10 Feb 2018

I am going to try this setting for terminal filters and see how it goes for now.

Origina:
(Send: (N\d+\s+)?M105)|(Recv:\s+(ok\s+)?(B|T\d*):)

New:
(Send: (N\d+\s+)?M155 S0)|(Recv:\s+(ok\s+)?(B|T\d*):)

Phits on 10 Feb 2018

No luck with those settings
heating-error

Phits on 10 Feb 2018

@Phits : What printer, what extruder, what version of Marlin? What is the source of your Marlin firmware? As @thinkyhead pointed out, there is a "fix" in the Github bugfix-1.1.x branch, but you haven't verified that you are using that branch. What model of Raspberry Pi are you running OctoPrint on, what versions of OctoPi and OctoPrint?

I'm not surprised that your change to the terminal filter didn't work. The "M155 S0" command should be sent by OctoPrint when the temperature interval (autoreport) is set to 0 and it should only happen once. When the autoreport interval is set to 0, "M105" commands are sent by OctoPrint at (one of) the temperature interval (polling) rate(s).

b-morgan on 10 Feb 2018

I have a Lulzbot Taz 6 with the v3 dual nozzle. Marlin 1.1.5. I installed Marlin with the latest version of Cura for the Lulzbot. Raspberry Pi is BCM2708, 0010.

Phits on 10 Feb 2018

@Phits : Marlin 1.1.5.? (the last part is important). In https://code.alephobjects.com/T1598, @marcio-ao provided http://devel.lulzbot.com/software/Marlin/1.1.5.71/ which fixed the problem for me. You might try that version.

b-morgan on 10 Feb 2018

@b-morgan Thanks, I just upgraded my Marlin to 1.1.5.71 and will let you know how it goes.

Phits on 10 Feb 2018

Still failing. I posted a message at https://code.alephobjects.com/T1598
heating-error

Phits on 11 Feb 2018

I posted a message at https://code.alephobjects.com/T1598

Excellent. They're the good people you must contact for support. With our limited resources we can only keep up with current issues in our own latest release.

thinkyhead on 14 Feb 2018

Was this page helpful?

0 / 5 - 0 ratings