At the end of the print, the printer cools its bed, the command of which takes a while to complete.
OctoPrint reports "Communication timeout" in the log while this is happening.
Since the M190 is a long running command, I had expected no communication errors.
Errors:
Send: N6721 G1 E-1 F300*54
Recv: ok N6721 P14 B3
Send: N6722 G1 Z20 E-5 X-20 Y-20 F3000*120
Recv: ok P14 B3
Send: N6723 G1 E6*75
Recv: ok N6722 P13 B3
Send: N6724 M117 Cooling...*82
Recv: ok N6723 P12 B3
Send: N6725 G90*38
Recv: ok N6724 P12 B2
Send: N6726 G1 Y0 F3000*49
Recv: ok N6725 P12 B2
Send: N6727 M190 R45*108
Recv: ok N6726 P11 B2
Send: N6728 G1 Y280 F3000*53
[...]
Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
[...]
Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
[...]
Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
[...]
Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
[...]
Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
[...]
Recv: ok N6727 P15 B0
Send: N6734 M140 S45*98
Recv: ok N6728 P14 B0
[...]
Send: N6735 M84*40
[...]
Send: N6736 G90*36
Send: N6737 M117 Print complete*73
[...]
Changing monitoring state from "Printing" to "Operational"
[...]
Recv: ok N6734 P14 B3
[...]
Recv: ok N6735 P15 B1
Recv: ok N6736 P15 B2
Recv: ok N6737 P15 B3
[...]
OS X High Sierra, Safari 11.1
octoprint.log is here https://gist.github.com/keybuk/a5d0d07cbfc55a6534ed7b04738e2e31
Included above
N/A
N/A
I have read the FAQ.
Can you please provide an unfiltered Terminal output (or alternatively a serial.log)? It's sadly not possible to determine from what you provided whether the printer behaved as it should during this blocking cooldown that is apparently the cause of the timeout, or not.
Sure thing, I tried to grab a serial.log for you this morning before work, but I learned something interesting in the process.
Enabling serial.log fixed the issue!
With serial.log on, OctoPrint sends the M190 and waits there:
2018-04-20 09:47:13,901 - Send: N1190 M190 R45*97
2018-04-20 09:47:13,931 - Recv: T:204.87 /0.00 B:60.02 /45.00 @:0 B@:0 W:?
2018-04-20 09:47:14,665 - Recv: T:204.82 /0.00 B:60.00 /45.00 @:0 B@:0
2018-04-20 09:47:14,930 - Recv: T:204.91 /0.00 B:60.01 /45.00 @:0 B@:0 W:?
But with it turned off (how it was before), it actually sent the following G1 before waiting:
Send: N6727 M190 R45*108
Recv: ok N6726 P11 B2
Send: N6728 G1 Y280 F3000*53
Notice how the ok there is for N6726 rather than the M190's N6727 command?
While the terminal snippet didn't have timestamps, I can confirm that the Y280 was the point at which it paused, not the M190 before it.
This leads me to two questions:
ok for?This might make sense if serial.log is slowing up transmission/receipt enough for the extra line to be not sent.
ok is for the appropriate command? ie. is it seeing that N6726 ok as the response to the N6727 M190 ?If not, that would explain why it carries on past the M190 and doesn't treat it as a long-running command. I'll keep an eye out for this too, especially if it always goes out of sync - it seems normal for OctoPrint to send lots of lines and wait for the printer to catch up.
One thing I noticed, and it's actually in there Terminal snippet above, is that there's an un-line-numbered ok in there:
Send: N6721 G1 E-1 F300*54
Recv: ok N6721 P14 B3
Send: N6722 G1 Z20 E-5 X-20 Y-20 F3000*120
**Recv: ok P14 B3**
Send: N6723 G1 E6*75
Recv: ok N6722 P13 B3
I've no idea where this comes from? It's not the ok to N6722 which happens later…
Maybe this is where OctoPrint is getting confused? Seeing an un-numbered ok causes it to lose track, even when all lines are numbered.
Logs attached, but I stress these are of OctoPrint working as intended, will try to keep seeing if I can replicate the problem with serial.log on:
https://gist.github.com/keybuk/e0f3aa8bfa8da2299c16825330610074
https://gist.github.com/keybuk/2d0b6832a9907e2665da8e7d9e182a07
I just realized I didn't reply here which I thought I did -_- Sorry.
Is OctoPrint actually checking that a received ok is for the appropriate command? ie. is it seeing that N6726 ok as the response to the N6727 M190 ?
It can't. A lot of firmware variants out there have a bug that makes commands be acknowledge in a different order than they were sent in. And since most firmware still doesn't send a line number with the ok, OctoPrint can't match up commands sent with received oks (or other output). That is also the reason why there's no way to retrieve the output of a command through OctoPrint's API.
That also leads to the tracking of long running commands being tricky to do, as one ok for an earlier sent line (or an extra one as possibly observed here) can throw things off.
In your log I notice that your firmware behaves differently in both cases. Your firmware is the party responsible of a) waiting and b) outputting continuous temperature reports (and ideally echo:busy:processing messages - but that's something most vendor forks don't support yet). In the failure case it looks like it's only doing a) and not b). b) would allow OctoPrint to detect that something's going on here and the printer is still alive, but your firmware simply stops doing anything, which is highly unusual for an M190.
It does send temperature reports in all cases - the only reason it's snipped from the earlier log is that OctoPrint's "suppress temperature reports" checkbox was checked, and that was copy and pasted from the web UI
@foosel - I am experiencing this same problem. Marlin 1.1.8.4, Lulzbot Mini, Octoprint 1.3.9 stable on Octopi 0.13.0.
What appears to happen is this:
Seems like Octoprint is not interpreting the auto-temp reports as "communication"? Maybe only resetting the communications timeout when an OK is received?
Here's an excerpt from the serial.log -- note that there is no lack of communication, auto-temperature reports are being received every 1-2 seconds, but Octoprint still complains of a "communication timeout" 25s after the last "OK" was received (comm timeout in settings is 25s):
2018-10-25 09:44:23,711 - Recv: ok N6659 P15 B3
2018-10-25 09:44:23,717 - Send: N6661 G0 E4 F200*27
2018-10-25 09:44:23,720 - Recv: ok N6660 P13 B3
2018-10-25 09:44:23,725 - Send: N6662 M190 R50*104
2018-10-25 09:44:23,731 - Recv: ok N6661 P12 B3
2018-10-25 09:44:23,735 - Send: N6663 G1 Y80 F1000*11
2018-10-25 09:44:23,738 - Recv: T:240.10 /0.00 B:70.00 /50.00 @:75 B@:32 W:?
2018-10-25 09:44:24,736 - Recv: T:239.22 /0.00 B:70.05 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:25,502 - Recv: T:240.31 /0.00 B:69.95 /50.00 @:0 B@:0
2018-10-25 09:44:25,736 - Recv: T:239.84 /0.00 B:70.01 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:26,736 - Recv: T:240.10 /0.00 B:69.91 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:27,502 - Recv: T:240.42 /0.00 B:69.85 /50.00 @:0 B@:0
2018-10-25 09:44:27,739 - Recv: T:240.10 /0.00 B:69.81 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:28,739 - Recv: T:239.69 /0.00 B:69.75 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:29,501 - Recv: T:239.22 /0.00 B:69.66 /50.00 @:0 B@:0
2018-10-25 09:44:29,738 - Recv: T:239.45 /0.00 B:69.62 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:30,737 - Recv: T:238.91 /0.00 B:69.51 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:31,503 - Recv: T:238.75 /0.00 B:69.49 /50.00 @:0 B@:0
2018-10-25 09:44:31,736 - Recv: T:238.05 /0.00 B:69.47 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:32,736 - Recv: T:237.27 /0.00 B:69.44 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:33,502 - Recv: T:236.80 /0.00 B:69.31 /50.00 @:0 B@:0
2018-10-25 09:44:33,737 - Recv: T:236.41 /0.00 B:69.30 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:34,739 - Recv: T:235.86 /0.00 B:69.19 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:35,502 - Recv: T:235.23 /0.00 B:69.19 /50.00 @:0 B@:0
2018-10-25 09:44:35,739 - Recv: T:234.77 /0.00 B:69.15 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:36,739 - Recv: T:233.91 /0.00 B:69.04 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:37,505 - Recv: T:233.44 /0.00 B:68.96 /50.00 @:0 B@:0
2018-10-25 09:44:37,739 - Recv: T:233.05 /0.00 B:68.95 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:38,737 - Recv: T:232.50 /0.00 B:68.86 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:39,503 - Recv: T:231.64 /0.00 B:68.84 /50.00 @:0 B@:0
2018-10-25 09:44:39,737 - Recv: T:231.17 /0.00 B:68.79 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:40,740 - Recv: T:229.94 /0.00 B:68.76 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:41,503 - Recv: T:230.16 /0.00 B:68.70 /50.00 @:0 B@:0
2018-10-25 09:44:41,741 - Recv: T:229.25 /0.00 B:68.60 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:42,740 - Recv: T:228.50 /0.00 B:68.57 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:43,501 - Recv: T:227.81 /0.00 B:68.48 /50.00 @:0 B@:0
2018-10-25 09:44:43,739 - Recv: T:227.81 /0.00 B:68.46 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:44,738 - Recv: T:227.13 /0.00 B:68.37 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:45,501 - Recv: T:226.50 /0.00 B:68.31 /50.00 @:0 B@:0
2018-10-25 09:44:45,738 - Recv: T:226.19 /0.00 B:68.31 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:46,737 - Recv: T:225.19 /0.00 B:68.22 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:47,503 - Recv: T:224.69 /0.00 B:68.14 /50.00 @:0 B@:0
2018-10-25 09:44:47,737 - Recv: T:224.38 /0.00 B:68.27 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:48,736 - Recv: T:223.13 /0.00 B:68.08 /50.00 @:0 B@:0 W:?
2018-10-25 09:44:48,739 - Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
2018-10-25 09:44:48,750 - Send: N6664 M105*21
2018-10-25 09:44:49,502 - Recv: T:223.06 /0.00 B:68.08 /50.00 @:0 B@:0
2018-10-25 09:44:49,739 - Recv: T:222.25 /0.00 B:68.02 /50.00 @:0 B@:0 W:?
This eventually leads to some other problems -- when communications are reset and sequence numbers are out-of-sync.
Full unfiltered serial.log and octoprint.log are attached.
serial.log
octoprint.log
So, there's a problem with solving this.
The thing is that we have a slighly off-by-one issue here with the communication - for some reason the ok that immediately follows the M190 still belongs to the command before that. Looks like this communication timeout very early in the print is to blame:
2018-10-24 17:20:00,536 - Send: N149 G1 X35.000 Y37.500 F702*113
2018-10-24 17:20:00,545 - Recv: ok N149 P5 B3
2018-10-24 17:20:00,547 - Send: N150 G92 E0*67
2018-10-24 17:20:01,661 - Recv: T:240.63 /240.00 B:70.10 /70.00 @:38 B@:2
2018-10-24 17:20:03,663 - Recv: T:240.10 /240.00 B:69.96 /70.00 @:52 B@:42
2018-10-24 17:20:05,662 - Recv: T:240.52 /240.00 B:70.00 /70.00 @:42 B@:31
2018-10-24 17:20:07,661 - Recv: T:240.83 /240.00 B:70.02 /70.00 @:35 B@:25
2018-10-24 17:20:09,664 - Recv: T:240.42 /240.00 B:69.97 /70.00 @:43 B@:39
2018-10-24 17:20:11,662 - Recv: T:240.31 /240.00 B:70.05 /70.00 @:43 B@:16
2018-10-24 17:20:13,662 - Recv: T:239.84 /240.00 B:70.00 /70.00 @:51 B@:31
2018-10-24 17:20:15,661 - Recv: T:239.45 /240.00 B:70.01 /70.00 @:58 B@:28
2018-10-24 17:20:17,664 - Recv: T:239.77 /240.00 B:70.06 /70.00 @:50 B@:13
2018-10-24 17:20:19,662 - Recv: T:240.10 /240.00 B:69.99 /70.00 @:42 B@:32
2018-10-24 17:20:21,661 - Recv: T:239.45 /240.00 B:70.00 /70.00 @:57 B@:30
2018-10-24 17:20:23,665 - Recv: T:239.92 /240.00 B:70.02 /70.00 @:47 B@:24
2018-10-24 17:20:25,663 - Recv: T:240.00 /240.00 B:69.97 /70.00 @:46 B@:39
2018-10-24 17:20:25,670 - Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
2018-10-24 17:20:25,679 - Send: N151 M105*34
2018-10-24 17:20:27,663 - Recv: T:239.92 /240.00 B:70.04 /70.00 @:48 B@:18
2018-10-24 17:20:29,612 - Recv: X:35.00 Y:37.50 Z:0.30 E:0.00 Count X:3518 Y:3769 Z:771
2018-10-24 17:20:29,619 - Recv: ok N150 P15 B2
2018-10-24 17:20:29,628 - Recv: ok T:239.92 /240.00 B:69.96 /70.00 @:48 B@:40
The G92 there blocked and took a long time to complete for some reason.
Since not all firmwares yet support attaching line numbers to the ok (and they also aren't there in temperature responses to an M105) OctoPrint can't rely on that to associate command with output/acknowledgement (which also has an issue with some firmwares out there that still acknowledge in a different order than the commands were sent - fun fun fun) and just has to hope that send-receive works. So it thinks that ok there right after belongs to the M190 and considers the blocking cool down to be over. It sends the next command, but now the block actually starts. The printer sends temperature reports for both the blocking cooldown and also for the temperature auto reporting. The only difference between the two is the as of now undocumented W parameter.
While this is going on OctoPrint is waiting for an ok. That doesn't arrive and OctoPrint runs into a communication timeout. Why? Because if a command or an ok had actually been lost, unless OctoPrint also has a specific timeout between "command send - ok received" and just looks for traffic on the serial line, things could get stuck but never be detected thanks to autoreporting causing regular traffic while the printer desperately waits for input. Hence the timeout here, even though there's data coming from the printer - it's not an ok and OctoPrint doesn't know it's in a blocking cooldown as the ok it received right after sending that command made it think that one is over.
This is a really tricky issue to solve. If I just consider any kind of traffic fair game to prevent communication timeouts, prints on printers with temperature auto reporting will get stuck. I can't rely on the line numbers being there (and the current architecture inside the comm layer also wouldn't allow tracking which line was just acknowledged). I can't rely on the ok order matching the sent command order.
I could monitor the target temperature before a blocking heatup/cooldown and consider that in progress until I receive a temperature report that looks like things are there now, instead of considering a heatup over once an ok is received. But I have to think very hard on what possible implications that might have in other cases than the one we are currently looking at.
@foosel - Thanks for the analysis.
I understand what you are saying about prints getting "stuck" if auto-temp (alone) is considered communication. There obviously has to be some point where an unacknowledged command is considered an "error", even if temp autoreports are still being received.
The G92 there blocked and took a long time to complete for some reason.
What you note about the G92 is interesting. The firmware does seem to be holding acknowledgement for G92 until it is actually complete. Just prior to that G92, a series of first-layer extrudes was queued (extra slow because ninjaflex), so the 29 seconds until the G92 acknowledgement was just the time it took for all those queued G1's to complete. The G1's were immediately acknowledged, but it appears the G92 isn't being acknowledged until it is actually processed (i.e., meaning G92 acknowledgement waits for all those G1's to physically complete movement).
I am positive this communications timeout did not used to occur with this printer, but I have upgraded both firmware and Octoprint (1.3.8 to 1.3.9) since that time, so could be either or combination of both. There were substantial changes to G92 in the Marlin/Lulzbot 1.1.8, so that may be where that behavior started.
During that first timeout (the wait for G92 acknowledgement), Octoprint sent "N151 M105*34" and printing eventually proceeded without any problems as that M105 was buffered and eventually acknowledged. During the longer wait (during cooldown), something different happened. Because there were so many timeouts and M105 "wakeups" sent during the lengthy cooldown, some commands were lost (assume firmware RX buffer was full), therefore never acknowledged, and that led to sequence number problems. See timestamp 2018-10-24 20:36:36 in that same serial.log.
For now, I have increased my comm timeout value to 60s and increased "max consecutive timeouts while printing" value. That 60s should cover the "G92" condition (waiting for queued moves to complete), and hopefully lower the number of "M105" wakeups being sent during the long cooldown. But that's not really a good solution, so I hope a better method can be engineered.
@foosel -- Minor update to this... I compile my own firmware using Lulzbot's Marlin 1.1.8.4 as a starting point, making minor changes for some bed modifications and personal preferences. So I opened that up, enabled HOST_KEEPALIVE_FEATURE, compiled and flashed.
With that feature enabled, the firmware issues "busy: processing" if it hasn't otherwise sent an OK within the last 2 seconds. As expected, that satisfies Octoprint that everything is fine, and eliminates the "communication timeout" errors discussed above.,
It appears Lulzbot intentionally disabled HOST_KEEPALIVE_FEATURE in their releases. No idea why, but my guess would be perhaps it caused problems with serial comms in "Cura Lulzbot Edition".
I haven't followed the Lulzbot Marlin development since February, but just checked and the latest development version (1.1.9.26) still disables HOST_KEEPALIVE_FEATURE. I will attempt to communicate with their developer and ask if it can be enabled in a future version -- and/or get an explanation as to why it is disabled.
Edit: Question submitted to Lulzbot (https://code.alephobjects.com/T4356)
Update - Lulzbot Marlin development build 1.1.9.27 has been updated to include HOST_KEEPALIVE_FEATURE.
See rMARLIN31127616eb7f: Enabled HOST_KEEPALIVE_FEATURE (T4356)
Another update...
Lulzbot has finished their testing of Marlin's Host_Keepalive_Feature in conjunction with their formally supported print controller (Cura Lulzbot Edition) and found no issues. They have stated that Host_Keepalive_Feature will be included in their next public firmware release.
So at least for Lulzbot owners (like the original poster and myself), a good solution to these timeouts will be to pick up a Lulzbot firmware with version 1.1.9.27 or later.
@ScottWell1 thank you for the update!
I think the recent change in the form of 2263ebde72a864792f0c9fb0cb10f0e731438e64 might have actually cleared up the core reason for the initial getting-out-of-sync here. Tentatively marking this as closed in the next release.
1.4.0 has been released.
Most helpful comment
Another update...
Lulzbot has finished their testing of Marlin's Host_Keepalive_Feature in conjunction with their formally supported print controller (Cura Lulzbot Edition) and found no issues. They have stated that Host_Keepalive_Feature will be included in their next public firmware release.
So at least for Lulzbot owners (like the original poster and myself), a good solution to these timeouts will be to pick up a Lulzbot firmware with version 1.1.9.27 or later.