Prusa-firmware: [BUG]Very long pauses "Full RX Buffer & invalid T code" (30 secs) for gcode commands containing the letter T

Created on 26 Dec 2019  Â·  9Comments  Â·  Source: prusa3d/Prusa-Firmware

MK3S
3.8.1

MMU2S
1.0.6

Describe the bug
Short summary: Late in an octoprint driven print of a multicolor model (the color salamander from the prusa site) the print begins to pause for many seconds at a time (30ish seconds). This behavior doesn't start until a few hours into the print (hour 5 of a 7 hr print). Once the pauses start the occur for about 1 in 5 tool changes.

More details:

At first I thought it might be #1180, but then I investigated further. During this investigation I think I found a clue for the root cause and a workaround.

Just before the error occurs I also saw the message "Invalid T code" in the serial stream from the printer. Which made me think about the GCODE being sent down from octoprint to the printer. My printer is in an enclosure and I use a plugin to control the LED lighting via "OCTOx" commands. These are gcode only understood by octoprint but they are also getting sent to the printer. These OCTO6 or OCTO2 or whatever commands are on their own line and the printer should ignore them because they are gcode not understood by prusa. In my case that accidentally revealed this bug, I was using them to change the LED lighting during a tool change (octoprint has a settings box to let you add gcode when tool changes occur and I was having it add OCTO4 to do something with my LEDs). However, I think the printer is incorrectly trying to parse these commands as a "TO6" command and generating the error message that is printed here:

https://github.com/prusa3d/Prusa-Firmware/blob/MK3/Firmware/Marlin_main.cpp#L7813

I also think that (though I haven't looked through the marlin code in depth) because we entered the block for handling T codes:
https://github.com/prusa3d/Prusa-Firmware/blob/MK3/Firmware/Marlin_main.cpp#L7805-L7986
but because we only printed that (somewhat incorrect) error message we will not run the standard error handling code in the catch all else clause here:
https://github.com/prusa3d/Prusa-Firmware/blob/MK3/Firmware/Marlin_main.cpp#L8204-L8209

After looking at the code, I tried an experiment of not adding these OCTO commands and the print proceeded perfectly, with no error messages and no pause events.

I think some layer higher up either in Marlin or Octoprint is getting confused (in some TBD fashion - I have not investigated) when it sent down this command didn't receive either an okay or the standard error response. The error messages for "Invalid T code" appears since the beginning of the print, but it isn't until hour 5 when they also change to often cause "Full RX Buffer" messages and they very long pauses. Do ya'll know of a plausible reason for this? (I've never read the docs on gcode or even really looked at the marlin code for anything other than quick fixes)

I think the correct fix for this bug is to change the code_seen() function to require either some form of whitespace immediately before the character we are searching for or it being the first character in CMDBUFFER_CURRENT_STRING. See current implementation here.
https://github.com/prusa3d/Prusa-Firmware/blob/MK3/Firmware/cmdqueue.h#L76

It seems to me the current implementation will misparse commands that happen to contain but do not start with the letter T (for instance). true? If I fix and test this, does this project take PRs? If you do I'll send one in.

To Reproduce
Run a multihour print that includes OCTOx commands. But to see the (IMO necessary precondition) bug on command parsing paste an OCTO7 command into the gcode serial stream (no need to use octoprint).

You should see something like this:

Send: OCTO6
Recv: Invalid T code.
Recv: ok

But if instead you send an unrecognized command that doesn't contain the letter T you see

Send: FISH
Recv: echo:Unknown command: "FISH"(2)
Recv: ok

Something in octoprint seems to be confused when it doesn't get back that echo. I bet because the MK3 should have instead sent "echo: Invalid T code. (2)" (even if bug in code_seen() is fixed). I'll note that the reprap site doesn't list "invalid t code" as one of the legal gcode error responses: https://reprap.org/wiki/G-code#Replies_from_the_RepRap_machine_to_the_host_computer

Expected behavior
Prints should not pause for many seconds at a time because the octoprint rx machine is confused. I presume they based their parser on the reprap documentation of gcode error responses, so changing the prusa "invalid t code" to one of the standard error forms would probably be a good fix. Another good fix is to not misparse commands that happen to contain the letter T. I can send in two PRs for this change (and if you have unit tests in source control I can update them also). Do you want them?

G-code
Currently my test case occurs with commands added to the gcode stream on the fly by octoprint (by using their 'on tool change' script). But if it would help, let me know and I'll hand code a test that shows it. I thought I'd ask for initial feedback on the comments above first, because it will take a bit of time to craft a test and then I'll need to run it for a 7ish hr print. ;-)

Video
If ya'll want one let me know and I can work on a more optimized test case.

*octoprint log when the error occurs *

typical octoprint log when the problem occurs:

2019-12-25 04:18:27,730 - octoprint.util.comm - INFO - Ignoring resend request for line 355063, that still originates from lines we sent before we got the first resend request
2019-12-25 04:22:52,717 - octoprint.util.comm - INFO - Got a resend request from the printer: requested line = 357010, current line = 357015
| Last lines in terminal:
| Send: N357006 G1 X122.636 Y110.451 E-0.10125*125
| Recv: ok
| Recv: ok
| Send: N357007 G1 X122.964 Y110.503 E-0.07657*118
| Send: N357008 G1 X123.073 Y110.544 E-0.02695*127
| Recv: ok
| Send: N357009 G1 X123.292 Y110.678 E-0.05909*114
| Recv: ok
| Send: N357010 G1 X123.524 Y110.910 E-0.07590*127
| Recv: ok
| Send: N357011 G1 X124.624 Y109.162 E-0.47666*113
| Recv: ok
| Send: N357012 G1 X124.430 Y109.180 E-0.04494*113
| Recv: ok
| Recv: Full RX Buffer
| Send: N357013 G1 E-0.07000 F2100.00000*1
| Recv: ok
| Send: N357014 G1 X122.080 Y111.826 F10800.000*88
| Recv: Error:Line Number is not Last Line Number+1, Last Line: 357009
| Recv: Resend: 357010
2019-12-25 04:22:52,736 - octoprint.util.comm - INFO - Ignoring resend request for line 357010, that still originates from lines we sent before we got the first resend request
2019-12-25 04:22:52,747 - octoprint.util.comm - INFO - Got a resend request from the printer: requested line = 357010, current line = 357015
| Last lines in terminal:
| Send: N357011 G1 X124.624 Y109.162 E-0.47666*113
| Recv: ok
| Send: N357012 G1 X124.430 Y109.180 E-0.04494*113
| Recv: ok
| Recv: Full RX Buffer
| Send: N357013 G1 E-0.07000 F2100.00000*1
| Recv: ok
| Send: N357014 G1 X122.080 Y111.826 F10800.000*88
| Recv: Error:Line Number is not Last Line Number+1, Last Line: 357009
| Recv: Resend: 357010
| Recv: ok
| Send: N357010 G1 X123.524 Y110.910 E-0.07590*127
| Recv: ok
| Send: N357011 G1 X124.624 Y109.162 E-0.47666*113
| Recv: Error:Line Number is not Last Line Number+1, Last Line: 357009
| Recv: Resend: 357010
| Recv: ok
| Recv: Error:No Line Number with checksum, Last Line: 357009
| Send: N357012 G1 X124.430 Y109.180 E-0.04494*113
| Recv: Resend: 357010
bug

All 9 comments

@geeksville Hi. Thanks for filing this report. I'll have to do some more testing after I get back from holiday, but I think you're close to figuring out the cause. I need to figure out one more thing: from your report it seems that octoprint is NOT expecting an ok after the OCTOx command. I'm saying this since Full RX buffer triggers when octoprint sends more commands in advance than the printer can handle and usually happens when ok is sent incorrectly (it happened before with temperatre polling) and octoprint sends the next command when it shouldn't. The serial buffer can handle more commands in advance, but at some point the buffer fills and the error is triggered.

Also, are the OCTOx commands included in the gcode file or are inserted on the fly by the plugin? They might not be handled correctly by octoprint and it might not expect the printer to reply to these commands.

As a workaround we might also try checking for such commands before the else if(code_seen('T')) with something like:

else if(code_seen("OCTO"))
{
  KEEPALIVE_STATE(NOT_BUSY);
  return;
}

No error mesage needs to be returned, no ok is sent back. Please answer to the questions mentioned above so I can determine the cause and also send me a log from octoprint from the serial terminal on how it handles OCTOx while printing (~10 commands before and after OCTOx)

@leptun thank you for your note.

re: serial logs
I totally agree - I'll get a log of the failing case before making any changes. I'll attach it to this bug. I'll also try to make a smaller test case than a 7 hr print - I bet if I pump in a lot of extra OCTOx cmds for each tool change it will happen much faster)

re: workaround for checking for OCTO cmds
I'm not sure that is really needed (though a fine idea) - I bet octoprint is just getting confused by the response being neither ok nor echo, but rather a "invalid t code" string.

In the next couple of days I'll send you two PRs:

  • One to change the "invalid t code" response to conform to the reprap gcode response syntax (i.e. send "echo:Invalid T code(2)" like the other errors)

  • make a field_seen() to require a match to either be at the start of CMDBUFFER_CURRENT_STRING or be immediately preceded by a whitespace character. (i.e. matching the gcode definition for the beginning of a field: https://reprap.org/wiki/G-code#Fields )

It would be used for the toplevel calls (currently calling code_seen) in process_commands(), but not for the non command usages of code_seen. Because those usages are looking for 'arguments' of a command - i.e. M681? in this example: https://github.com/prusa3d/Prusa-Firmware/blob/MK3/Firmware/Marlin_main.cpp#L7346

This seems to be good hygiene because it will prevent future bugs (i.e. running some other gcode because someone sends SUM or AGO) and will let the proper "unknown command" error handling always fire the same way - not be missed just because a bogus command happened to contain a letter used by a real command.

Cool?

As for the strings sent by the octoprint enclosure plugin. I think a more elegant fix is to have that plugin not send these commands to the device at all (after it has consumed/used them for its purpose). Rather than ya'll have to check for an ignore "OCTOx" commands.

The octoprint API has an atcommand hook which would service this nicely. In the next couple of days I'll send that plugin a PR to handle atcommands. This small change would be backwards compatible with all existing usage and would also not be sending the commands to the printer at all. I'll reference this bug in my PR for that project so you can track it.

Also seem reasonable to you?

@geeksville Hmmmm. I'm not sure that the field_seen() fix would work in practice. Anyways, let's first determine what is actually causing the confusion of the cmdqueue/octoprint in the first place. Let's not assume anything before we have the logs at hand as the conclusion might be entirely different.
As for filtering the OCTOx command, that should be a last resort since doing so for every single command every plugin implements is impossible.
If it's determined that the plugin mishandles the sent gcode, then filing a PR to that plugin seems reasonable to me.

Btw, since you seem to play around with the firmware and you have the MMU2, have you ever tried the cutter functionality in the firmware?

agree on all counts (especially that adding a filter to marlin for OCTOx
should definitely be avoided). I'll look into these things tomorrow.

re: mmu cutter
nope. I only got my MMU2 a couple of weeks ago. I purchased during the
black friday sale because it sounded like most of the early problems have
been ironed out. I assembled it per plans except for a few mods that the
reddit crowd said help a lot (mainly 2.5mm ID PTFE tubes) and I'm
basically happy with it and it seems to work okay. I had read that the
cutter wasn't being used so for the time being I didn't install the blade.
Does it approximately work? If so I wouldn't mind turning it on and trying
it. What is the expected behavior?

On Thu, Dec 26, 2019 at 9:45 AM leptun notifications@github.com wrote:

@geeksville https://github.com/geeksville Hmmmm. I'm not sure that the
field_seen() fix would work in practice. Anyways, let's first determine
what is actually causing the confusion of the cmdqueue/octoprint in the
first place. Let's not assume anything before we have the logs at hand as
the conclusion might be entirely different.
As for filtering the OCTOx command, that should be a last resort since
doing so for every single command every plugin implements is impossible.
If it's determined that the plugin mishandles the sent gcode, then filing
a PR to that plugin seems reasonable to me.

Btw, since you seem to play around with the firmware and you have the
MMU2, have you ever tried the cutter functionality in the firmware?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/prusa3d/Prusa-Firmware/issues/2388?email_source=notifications&email_token=AABXB2JBN5VMXFHZJZPJRA3Q2TUTJA5CNFSM4J7G5DNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHV4SIA#issuecomment-569100576,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AABXB2IXAG3EKNUT4FO2VCLQ2TUTJANCNFSM4J7G5DNA
.

The cutter cuts the tip of the filament on the last filament load try before giving up and giving an error. It worked ok for me with pla. Petg might be more tricky. You can also cut manually in the main menu.

btw - I just tried a quick test with more OCTOx commands added and yep it makes the problem much more reproducible and the serial log is super interesting. I can now cause the (incorrect) "Full RX Buffer" error in cmdqueue.cpp at will - I'll write more fully in the next couple of days (and attach annotated logs and proposed fix).

The command parser does need a revise. I met a lot of problems while I was trying to implement some custom changes to the M600 command.

In summary, the code_sheen functions are not working as they supposed. If I send the M600 G1 X100, the printer will execute the G1 X100 part ignoring the filament change request. The solution to check for white-space before the command would not work either. Since the G1 is not at the beginning of the statement, it should not be executed.

I've pushed the https://github.com/prusa3d/Prusa-Firmware/pull/2380, which already contains a proposal for using double quotes to enclose strings. Without it, if I executed a M600 C PRUSA PLA Lila, nothing was executed because of the PRUSA keyword. Now, I can work with M600 C "PRUSA PLA Lila".

BTW, is there a page/doc about the command syntax of the PRUSA firmware?

@leptun

I tried three cases (all data files and serial logs in the attached zip).

  1. two-color-no-fail-without-OCTOcmds - This test confirmed that if I didn't add OCTOx commands into the gcode, I never received "Full RX Buffer" messages.

  2. two-color-fail - This test confirmed that if I did add many OCTOx commands (10) before each filament change command, the Fail RX Buffer message was far more easily reproduced than in my original 7 hour print test case.

  3. color-test-octofish. Since I wanted to take as much of 'octoprint' variability out of the problem as possible, in this case I just used PrusaSlicer to add FISHTOx commands to the gcode directly. I chose to call this FISHTO because the T would trigger the same parsing bug, but the rest would make it clear that it wasn't the octoprint cmd. This case failed in the same way as the version above, but it is more usable for others (you?) who want to debug.

A typical failure in color-test-octofish (see comments in square brackets):

...
2019-12-27 22:40:39,587 - Recv: NORMAL MODE: Percent done: 3; print time remaining in mins: 110
2019-12-27 22:40:39,591 - Recv: SILENT MODE: Percent done: 3; print time remaining in mins: 111
2019-12-27 22:40:39,593 - Recv: ok
2019-12-27 22:40:39,597 - Send: N579 G1 E-35.0000 F2000*39
2019-12-27 22:40:40,059 - Recv: ok
2019-12-27 22:40:40,066 - Send: N580 G1  Y110.286 F2400*62
2019-12-27 22:40:40,227 - Recv: ok
2019-12-27 22:40:40,244 - Send: N581 G4 S0*98
2019-12-27 22:40:40,267 - Recv: ok
2019-12-27 22:40:40,275 - Send: FISHTO1
2019-12-27 22:40:40,281 - Send: FISHTO1
2019-12-27 22:40:40,287 - Send: FISHTO1
2019-12-27 22:40:40,294 - Send: FISHTO1
2019-12-27 22:40:40,301 - Send: FISHTO1
2019-12-27 22:40:40,306 - Send: FISHTO1
2019-12-27 22:40:40,314 - Send: FISHTO1
2019-12-27 22:40:40,320 - Send: FISHTO1
2019-12-27 22:40:40,329 - Send: FISHTO1
2019-12-27 22:40:40,335 - Send: FISHTO4
2019-12-27 22:40:40,345 - Send: N582 T3*54
2019-12-27 22:40:42,419 - Recv: echo:busy: processing
2019-12-27 22:40:44,483 - Recv: echo:busy: processing
2019-12-27 22:40:46,547 - Recv: echo:busy: processing
2019-12-27 22:40:48,611 - Recv: echo:busy: processing
2019-12-27 22:40:50,676 - Recv: echo:busy: processing
2019-12-27 22:40:52,740 - Recv: echo:busy: processing
2019-12-27 22:40:54,806 - Recv: echo:busy: processing
2019-12-27 22:40:56,869 - Recv: echo:busy: processing
2019-12-27 22:40:58,934 - Recv: echo:busy: processing
2019-12-27 22:41:01,015 - Recv: echo:busy: processing
2019-12-27 22:41:03,079 - Recv: echo:busy: processing
2019-12-27 22:41:05,144 - Recv: echo:busy: processing
2019-12-27 22:41:07,207 - Recv: echo:busy: processing
2019-12-27 22:41:09,273 - Recv: echo:busy: processing
2019-12-27 22:41:11,337 - Recv: echo:busy: processing
2019-12-27 22:41:13,401 - Recv: echo:busy: processing
2019-12-27 22:41:14,950 - Recv: ok
2019-12-27 22:41:14,964 - Recv: Invalid T code.
2019-12-27 22:41:14,971 - Send: N583 M105*41
2019-12-27 22:41:14,973 - Recv: ok
2019-12-27 22:41:14,982 - Send: N584 G1 X58.749992 Y11.500001*33
2019-12-27 22:41:14,984 - Recv: Invalid T code.
2019-12-27 22:41:14,988 - Recv: ok
2019-12-27 22:41:14,996 - Send: N585 M900 K60*79
2019-12-27 22:41:15,000 - Recv: Invalid T code.
2019-12-27 22:41:15,002 - Recv: ok
2019-12-27 22:41:15,009 - Recv: Invalid T code.
2019-12-27 22:41:15,025 - Recv: ok
2019-12-27 22:41:15,033 - Send: N586 G4 S0*101 [ this line gets lost and needs resend, see below ]
2019-12-27 22:41:15,036 - Recv: Invalid T code.
2019-12-27 22:41:15,039 - Send: N587 G1 E12.0000 F1140*8
2019-12-27 22:41:15,043 - Recv: ok
2019-12-27 22:41:15,051 - Recv: Invalid T code.
2019-12-27 22:41:15,053 - Send: N588 G1  X138.879 E42.0000 F1162*120
2019-12-27 22:41:15,056 - Recv: ok
2019-12-27 22:41:15,066 - Send: N589 G1  X197.004 E6.0000 F814*119
2019-12-27 22:41:15,069 - Recv: Invalid T code.
2019-12-27 22:41:15,072 - Recv: ok
2019-12-27 22:41:15,080 - Send: N590 M73 Q3 S110*26
2019-12-27 22:41:15,081 - Recv: Invalid T code.
2019-12-27 22:41:15,085 - Recv: ok
2019-12-27 22:41:15,092 - Recv: Invalid T code.
2019-12-27 22:41:15,094 - Send: N591 M73 P3 R109*19
2019-12-27 22:41:15,096 - Recv: ok
2019-12-27 22:41:15,103 - Send: N592 G1  Y110.786*88
2019-12-27 22:41:15,103 - Recv: Invalid T code.
2019-12-27 22:41:15,106 - Recv: ok
2019-12-27 22:41:15,110 - Send: N593 G1  X197.504 Y110.286*4
2019-12-27 22:41:15,111 - Recv: mmu_get_response - begin move: T-code [ see theory below ]
2019-12-27 22:41:15,113 - Recv: MMU <= 'T3'
2019-12-27 22:41:15,116 - Recv: Unloading finished 2
2019-12-27 22:41:17,060 - Recv: echo:busy: processing
2019-12-27 22:41:17,133 - Recv: mmu_get_response - begin move: load
2019-12-27 22:41:19,124 - Recv: echo:busy: processing
2019-12-27 22:41:21,188 - Recv: echo:busy: processing
2019-12-27 22:41:23,257 - Recv: echo:busy: processing
2019-12-27 22:41:25,341 - Recv: echo:busy: processing
2019-12-27 22:41:27,426 - Recv: echo:busy: processing
2019-12-27 22:41:29,491 - Recv: echo:busy: processing
2019-12-27 22:41:31,560 - Recv: echo:busy: processing
2019-12-27 22:41:33,624 - Recv: echo:busy: processing
2019-12-27 22:41:35,688 - Recv: echo:busy: processing
2019-12-27 22:41:36,532 - Recv: MMU <= 'A'
2019-12-27 22:41:37,118 - Recv: MMU => 'ok'
2019-12-27 22:41:37,120 - Recv: mmu_get_response() returning: 1
2019-12-27 22:41:37,839 - Recv: echo:busy: processing
2019-12-27 22:41:39,904 - Recv: echo:busy: processing
2019-12-27 22:41:41,968 - Recv: echo:busy: processing
2019-12-27 22:41:43,053 - Recv: MMU can_load:
2019-12-27 22:41:43,676 - Recv: OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO succeeded.
2019-12-27 22:41:43,679 - Recv: ok
2019-12-27 22:41:43,690 - Send: N594 M105*47
2019-12-27 22:41:43,693 - Recv: Full RX Buffer [ This marks that we probably lost some inbound chars (and also the first time we've called get_command in a while?) ]
2019-12-27 22:41:43,697 - Recv: ok T:234.6 /235.0 B:90.2 /90.0 T0:234.6 /235.0 @:40 B@:93 P:47.3 A:36.9
2019-12-27 22:41:43,718 - Send: N595 G1  X138.504 E2.2424 F1600*68
2019-12-27 22:41:43,722 - Recv: ok
2019-12-27 22:41:43,731 - Send: N596 G1  X139.254 Y109.786  E0.0343*85
2019-12-27 22:41:43,734 - Recv: Error:Line Number is not Last Line Number+1, Last Line: 585 [ last good line was 585 ]
2019-12-27 22:41:43,737 - Recv: Resend: 586
2019-12-27 22:41:43,748 - Recv: ok
2019-12-27 22:41:43,752 - Recv: Error:Line Number is not Last Line Number+1, Last Line: 585
2019-12-27 22:41:43,755 - Recv: Resend: 586
2019-12-27 22:41:43,756 - Send: N586 G4 S0*101
2019-12-27 22:41:43,764 - Recv: ok
2019-12-27 22:41:43,771 - Recv: Error:Line Number is not Last Line Number+1, Last Line: 585
2019-12-27 22:41:43,772 - Send: N587 G1 E12.0000 F1140*8
2019-12-27 22:41:43,775 - Recv: Resend: 586
2019-12-27 22:41:43,783 - Recv: ok
2019-12-27 22:41:43,788 - Send: N588 G1  X138.879 E42.0000 F1162*120
2019-12-27 22:41:45,868 - Recv: echo:busy: processing
2019-12-27 22:41:47,932 - Recv: echo:busy: processing
2019-12-27 22:41:47,993 - Recv: echo:Advance K=60.00

Theory 1: serial printing from Marlin can stall serial reads long enough for the 128 byte rx FIFO to fill

After I had these logs, I looked at the Marlin code a bit...

Theory: printing "invalid t code" (extra messages printed to host) + the MMU debug messages over the marked 20ms was enough to fill the AVR UART TX queue to the serial->usb converter and stall the main thread when it tried to print more characters.

Marlin seems to be using only the standard 2 bytes of TX fifo built into the AVR (true? - this is just based on me looking at uart2.c) so depending on the implementation of the code in the Atmega32u2 serial->usb code possibly that could happen?

If the main thread stalls waiting to TX a character, possibly the 128 byte RX_BUFFER_SIZE could fill up? From looking at the log the number of characters octoprint has sent since line N582 seems about like 128 (though I haven't counted carefully).

How big is the internal software ring buffer in the Atmega32u2 code from Prusa? I presume the implementation in that micro is guaranteed to almost immediately consume a character when the ATmega provides one (and I assume they are using hardware flow control)? If that software fifo is short, presumably it could pushback on the hw flow control from the atmega and cause a stall until the USB client (octoprint running on a rpi3) has consumed the bytes.

Proposed next steps:
1) remove the debug messages for "invalid t code" and confirm the rx buffer overflow (which seems a necessary precondition to whatever is causing the many second stall in octoprint) goes away. I bet it it will.
2) change the mmu_get_response debug message to be much longer (100 chars?) to stress UART TX in a similar way, if this theory is correct such a change will also cause the "Full RX Buffer" error to occur (without even sending FISHTOx commands). And it means that uart printing can cause a stall long enough for 128 bytes of RX chars to get buffered in the RX fifo (bad).

Based on the outcome of step 2, it seems like the following steps should be considered:

3) Look at/test the implementation of the prusa Atmega32u2 serial->USB bridge. It seems to me (as a marlin novice) that if that implementation is up-to-snuff it should be able to buffer _many_ characters of TX data from the atmega and it should be able to pull out a TX character as soon as it is available. But if it is not up to snuff it could cause this problem. (but OMG - so many deployed devices out in the wild with no way to update that firmware).
4) Punt and turn off the debug strings sent for mmu_get_response and can_load to hide the problem.

Btw: None of this explains why Octoprint eventually starts stalling for a very long time when it sees the Resend messages, and yes octoprint should be able to cope with an infinite number of these faults without that happening - but such is life ;-).

Theory 2: MMU debug string printing corrupts parsing of ek/echo/resend responses

Another possible theory: It is worth noting that if you look at the full log you'll see lines like this:

2019-12-27 22:41:43,053 - Recv: MMU can_load:
2019-12-27 22:41:43,676 - Recv: OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO succeeded.
2019-12-27 22:41:43,679 - Recv: ok

Or even lines where the debug Os are split with "ok" strings sent in the middle. Which makes me think there something is a bit non-atomic wrt how these debug messages inserted into the serial output stream. Which might pose risks of breaking serial parsing on the octoprint side when it is looking for "ok", "echo", "error" and "resend". Possibly if you are unlucky enough (say in my 7 hr print case), eventually a 'resend' message gets slightly stomped on and therefore never responded to by octoprint.

Other theories are welcome - this is simply the only things I could come up with without being able to run the code under a debugger.

Note: On my previous printer (a hand rolled thing with a RAMPS board) I used to build and install marlin to debug/fix issues with serial or jtagish debugging. And normally I'd be happy to offer to do the same here, alas - I only have one printer now (my mk3) and I kinda need to use it for the next week to print some parts. So I'm a bit loathe to start putting in debug strings or hooking up atmel ICE stuff and taking my printer offline.

An aside

btw - Three unrelated possible problems I found while investigating this. I'll open bugs if/when I'm able to reproduce with a clean test case: < mentioning only in case someone else has reported something similar, no need to reply >

  1. While commenting out my OCTOx commands in the octoprint settings screen I prefixed the line with a semicolon and added some text about why it was commented out. When I next tried to print, at every attempt the printer would reboot when it reached that line. Alas, I hadn't yet turned on serial logging and since I was looking for other things I figured "just delete that commented line and go look at it later". Well - when I went to go look at it later it didn't reboot, so possibly something about the length of the comment or its contents was relevant. However, I took a quick look at cmdqueue.cpp and everything looked fine to me.

  2. Octoprint seems somewhat fickle on which lines it opts to add Nxxx prefixes to. Some lines (such as the FISHTO1 lines in my sample gcode that shows the bug) it does not prefix, but most lines it does. This doesn't seem to cause a problem but it is odd.

  3. In order to get cancelled print jobs to properly unload the filament, I had added this code (from prusaslicer) to the octoprint "extra gcode to run after cancelling":

G1 X0 Y210 F7200
G1 E2 F5000
G1 E2 F5500
G1 E2 F6000
G1 E-15.0000 F5800
G1 E-20.0000 F5500
G1 E10.0000 F3000
G1 E-10.0000 F3100
G1 E10.0000 F3150
G1 E-10.0000 F3250
G1 E10.0000 F3300
M702 C
G4 ; wait

Which had the funny effect of _also_ attempting an unload of filament 1 (which wasn't even installed in my printer - I had been using filament 2 & 4 for the print).
bug.zip

@geeksville Now that I see the log with many FISHTO1 commands, the issue is very clear. Octoprint is not parsing these commands as regular gcode for some odd reason. It sends them without queuing and as a result without expecting a response from the printer (very important). Because the printer replies with "ok" messages for FISHTO1, octoprint assumes that it can send another queued command (another correct gcode command).

When the FISHTO1 command gets processed by the printer/octoprint, the cmdqueue becomes somewhat out of sync with octoprint's queue (fast forwarded) and will stay like that until a "Resend" request re-synchronizes everything. One FISHTO1 command is not really enough to trigger the issue, but when more of them add up over time, the out-of-syncness of the queues will eventually overflow the cmdqueue and force a resend.

Another apparent thing from the FISHTO1 log is that FISHTO1 is treated as if it wasn't gcode (which is somewhat true) and gets no N+checksum. If there were even more FISHTO1 commands you might have been able to cause the "Full RX buffer" by just spamming these in the toolchange gcode.

JTAG debugging is probably impossible since the required pins are used.

The OOOOokOOO... should be ok I think. I had resend issues at some point even without the MMU and it was because at some point between fw releases M105 was sending back two "ok" and was causing the same behavior as with OCTOx before toolchange (but was far more frequent because temperature polling is sprinkled throughout the print, not only specific parts of it).

Sorry for having to stop here, I'll come back to this issue after the new Year. Happy holidays!

Was this page helpful?
0 / 5 - 0 ratings