Marlin: MKS sbase pc printing broken

Created on 1 Jul 2018 · 79Comments · Source: MarlinFirmware/Marlin

Description

On recent platformio the pc communication and printing broken again, the print suddenly stops. Last lines from communication log:

19:58:36.808 : Printing layer 23 of 94
19:58:36.813 : X:20.61 Y:12.44 Z:4.60 E:0.00 Count X:6771 Y:1795 Z:3520
19:58:37.848 : X:13.16 Y:11.14 Z:4.60 E:0.00 Count X:7065 Y:2976 Z:3680
.....
19:58:48.096 : X:16.58 Y:11.17 Z:4.60 E:0.00 Count X:3263 Y:2476 Z:3680
19:58:48.450 : X:16.55 Y:11.57 Z:4.60 E:0.00 Count X:3200 Y:2466 Z:3680ook

It seems to me the text from motion.cpp's report_current_position() somehow mixed with ok response and host was unable to recognize this response.

The host then waits and after configured time out continue to print adding to log "Comminication time out, resetting buffer" leaving huge blob of plastic where pause were

I have copy of firmware build 21 june 2018 and this copy have no such issues

Steps to Reproduce

Build current version of marlin for MKS sbase using current platformio with Atom
Start print normally
After some minutes printing stops

Expected behavior: [What you expect to happen]

Normal completion of print

Actual behavior: [What actually happens]

Print stops, last lines of communication:

19:58:36.808 : Printing layer 23 of 94
19:58:36.813 : X:20.61 Y:12.44 Z:4.60 E:0.00 Count X:6771 Y:1795 Z:3520
19:58:37.848 : X:13.16 Y:11.14 Z:4.60 E:0.00 Count X:7065 Y:2976 Z:3680
.....
19:58:48.096 : X:16.58 Y:11.17 Z:4.60 E:0.00 Count X:3263 Y:2476 Z:3680
19:58:48.450 : X:16.55 Y:11.57 Z:4.60 E:0.00 Count X:3200 Y:2466 Z:3680ook

Source

CageFox

Most helpful comment

It works the same as the sdcard eeprom emulation (or a real EEPROM on a board with one), just without the complication of sharing the storage medium with the host computer.

p3p on 6 Oct 2018

🎉1 👍1

All 79 comments

Configs in attach

Marlin.zip

By the way, why the report_current_position() called so often? It just flooding the host log with unnecessary information. In previous versions there no such flood, host just printed like:

20:29:41.477 : Printing layer 17 of 94
20:30:13.907 : Printing layer 18 of 94
20:30:49.524 : Printing layer 19 of 94
20:31:25.414 : Printing layer 20 of 94
20:32:02.222 : Printing layer 21 of 94
20:32:39.824 : Printing layer 22 of 94

For now I just commented contents of report_current_position() in motion.cpp and started new print. Let's see...

CageFox on 1 Jul 2018

The print with commented report_current_position() in motion.cpp finished without problems

CageFox on 1 Jul 2018

why the report_current_position() called so often?

No idea. It should only be called in response to commands that alter the current position in a manner the host can't see, or in response to the host sending M114 requests.

Try some other host software and see if the problem is present on all of them.

It seems to me the text from motion.cpp's report_current_position() somehow mixed with ok response…
19:58:48.450 : X:16.55 Y:11.57 Z:4.60 E:0.00 Count X:3200 Y:2466 Z:3680ook

If that's happening, it's a regression. There used to be an issue like this affecting all boards, but it was fixed a while ago.

@ejtagle — Could something like this (dropped bytes) occur as a result of recent changes to the serial code?

thinkyhead on 3 Jul 2018

@thinkyhead : This is using MKS, so it is using the LPC1768 serial class, that we never touched ... Maybe the class has some sort of race condition... I don´t have an LPC board to test... Maybe it has a non-atomic setting of the serial head and tail queues...

ejtagle on 3 Jul 2018

👍1

@thinkyhead I tested the modifications I did recently quite thoroughly (the changes were made before the reported working commit, but that isn't spamming position reports) and had no loss of data in the data structures, there could be losses on the host end, in the usb stack, or some interaction I'm not testing ..

p3p on 3 Jul 2018

👍1

there could be losses on the host end

I have two firmware binaries: one I build 21 july and new built. The old one works without problems, and new one spamming position report and periodically stalls. The new one build using same config files.

if I comment function report_current_position() contents new one built works without problems as well.

I have mks and can do any checks and tests by your request

There were platformio update recently, maybe it broke something again?

CageFox on 3 Jul 2018

@CageFox I have no idea why you are getting the report_current_position() spam, I was only referring to the apparent serial stream corruption due to the large amount of data it was creating, I don't understand how the spam could be related to the platform / board you are using and I can't reproduce it on my MKS SBase.

p3p on 3 Jul 2018

Since last staurday my CoreXy with MKS SBASE 1.3 is not work with new firmware release. Try with marlin from last begin of june and the result is the same. It stop after homing with the error.

"Error reported by printer
Your printer's firmware reported an error. Due to that OctoPrint will disconnect. Reported error: Printer halted. kill() called!"

Because i tried some old released that was working very good and now they fail, i'm pointing my suspect to some problem with ATOM that was upgrade last saturday, but i don't know.
After display teh error All led in motherboard are switched ON

maar1201gh on 3 Jul 2018

I confirm @maar1201gh suspections, I rolled back commits to month ago and tryed to build the firmware - and it shows the same errors and problems, thou restored from backup binary firmware works excellent

I think it's the same problem that was in near past when platformio builds became broken

CageFox on 3 Jul 2018

What else can readily compile marlin for 32 bit boards? Atom and platform-ide are awful spyware ridden junk. I just got my build environment going with git pulls of updates and its sad to find out the binaries will be broken after fighting with the above set of tools.

forkoz on 4 Jul 2018

@maar1201gh — It may help to tell OctoPrint not to disconnect on errors. That option can prevent seeing further messages from the firmware or allowing recovery from non-fatal errors.

thinkyhead on 4 Jul 2018

Atom and platform-ide are awful spyware ridden junk

@forkoz I think you might be a being a tad harsh on PlatformIO IDE (Atom), it's not the best editor (I hesitate to call it an IDE) in the world but it is considerably better than Arduino. If you don't want to use one of their supplied editors just use the cli and your preferred editor.

There was a day or so when binaries were being built incorrectly due to a change in the build chain, this was fixed quickly and is no longer an issue, that error caused binaries to be completely unusable and can not cause any other problem.

I don't know of any current issue caused by the tool chain, other than UI annoyances.

p3p on 4 Jul 2018

@thinkyhead-OctoPrint tunned to not disconnect on errors but the same problem.
Error:
State: Offline (Error: SerialException: 'device reports readiness to read but returned no data (device disconnected or multiple access on port?)' @ comm.py:_readline:2417)
Tried with repetier and the same problem

maar1201gh on 4 Jul 2018

Could someone please make a binary from last Marlin 2.0 and send to me the binary? Thanks

Marlin.zip

maar1201gh on 4 Jul 2018

@thinkyhead -- I don't know where my machine have problem and everything seem to be with USB PORT. After trying and trying i check endstops with M119 gcode
I have dual Z ENDSTOP optical switch.
One of them give me open 0.4V, TRIGGERED 4.3V
The other give me open 1V, TRIGGERED 4.3V - M119 always show this endstop is TRIGGERED.
Because everything was working so good with this endstop, i ask you to check ENDSTOP code to verify if something was different from begin of june and now.
Regards

maar1201gh on 5 Jul 2018

@maar1201gh It seems to me this problem have no relations with issue opened by me. May be you open new issue or check your hardware?

CageFox on 5 Jul 2018

👍1

@maar1201gh : One of your optical endstops is obviously broken. ANY VOLTAGE ABOVE 0,7v will read as TRIGGERED. So, check your build, and your hw, because that endstop that is giving 1v when not triggered WILL NEVER WORK

ejtagle on 5 Jul 2018

@ejtable: i changed endstop, tested, m119 show everything ok., i did new compilation but it still continue the same problem and i have no clue. I opened a new issue.

maar1201gh on 7 Jul 2018

@CageFox, could you please compile using my marlin.zip abouve and put here the binary in order to let me test.

maar1201gh on 7 Jul 2018

@maar1201gh No problem, here it is
marlin_firmware.bin.zip

Built with my modifications, and this is not latest commit, it what I using for now

CageFox on 7 Jul 2018

@CageFox, Thank you. #11224 report my issue
I changed #define Z_DUAL_ENDSTOPS_ADJUSTMENT 0.025 and after 2 weeks i finally have my printer live again.

maar1201gh on 10 Jul 2018

Just tried latest commit with atom and platrofmio - nothing changed, position spamming

marlin_position_spam

Screenshot with spam

I noticed the spamming of position happens in g-code like this:
G1 X60.222 Y63.079 E2.6718
G1 X60.538 Y63.326 F5700
G92 E0
G1 X59.751 Y64.283 E0.0206 F4560
G1 X58.920 Y65.200 E0.0412

The whole g-code generated by s3d

I attaching the g-code file
marlin_YetAnotherUtahTeapot-lid.gcode.zip

CageFox on 16 Jul 2018

Removing lines of

G92 E0

from g-code stops position spam. Of course the g-code in whole became unusable, but it seems the promblem lives near the extruder length zero command

CageFox on 16 Jul 2018

G92 has always reported the current position as far as I'm aware, this is a feature not a bug ;) (not that it wont be changed if it is necessary), it's interesting Simplify3D is using it so often for no apparent reason?

p3p on 16 Jul 2018

I dunno, i see no reason G92 should report current position at all. I think the current position should be reported only by codes directly asking or indirectly requires it

The problem is not reporting of position, the problem is sometimes this reports mergind with OK response and as result becoming halt of printing

CageFox on 16 Jul 2018

The problem is not reporting of position, the problem is sometimes this reports mergind with OK response and as result becoming halt of printing

I know, that is what you originally reported and I have been looking into it, if there is a problem in the serial transmit buffer I haven't found it yet, I mentioned that the position report is not a new change because you stated that in the previous build of Marlin you used you didn't get the output.

p3p on 16 Jul 2018

I've seen this merging of codes on the re-arm board as well. mostly in the M503 output.

cdedwards on 16 Jul 2018

@cdedwards recently and on the usb serial?

p3p on 16 Jul 2018

This is happening to me every once in a while now. The printer just stops. Printer doesn't get the OK back and I can force continue. Just get last position and no OK. Also using S3d.

so to bootleg patch this:

motion.cpp
/*
SERIAL_PROTOCOLPGM("X:");
SERIAL_PROTOCOL(LOGICAL_X_POSITION(current_position[X_AXIS]));
SERIAL_PROTOCOLPGM(" Y:");
SERIAL_PROTOCOL(LOGICAL_Y_POSITION(current_position[Y_AXIS]));
SERIAL_PROTOCOLPGM(" Z:");
SERIAL_PROTOCOL(LOGICAL_Z_POSITION(current_position[Z_AXIS]));
SERIAL_PROTOCOLPGM(" E:");
SERIAL_PROTOCOL(current_position[E_AXIS]);

stepper.report_positions();
*/

Despite commenting out this code, I still had a freeze:

SENT: G92 E0
READ: ok
READ: ok
SENT: G1 X115.560 Y117.782 E0.0569 F3900
READ: ok

forkoz on 18 Jul 2018

@p3p define recently. right now my printer is in pieces and will be for awhile yet.. busy rebuilding a 14x40 metal lathe and working on getting pathpilot working for my milling machine.

this was occurring all during my serial disconnect issues, so up until 2 weeks ago when I disassembled the printer.

cdedwards on 20 Jul 2018

Checked current version - no problems, this issue can be closed. Thanx!

CageFox on 5 Oct 2018

O, I was too hasty closing - the position spam still there
issue

Is there plans to fix it?

CageFox on 5 Oct 2018

Do you have an SD card inserted into the reader on the board and mounted on your PC? If you do then you may be seeing an occasional reset of the USB stack, this can cause loss of data in the USB serial stream. Try running with the on-board SD card not present and see if things are better or worse.

gloomyandy on 5 Oct 2018

Yes, this board requires sd card inserted, as far as i know. Sd card mounted on pc. I have no plans to print from sd - I always printing from PC.

The position spam appers not always and not randomly, but when some gcode (generated by S3D) executed, for example:

G1 X60.222 Y63.079 E2.6718
G1 X60.538 Y63.326 F5700
G92 E0
G1 X59.751 Y64.283 E0.0206 F4560
G1 X58.920 Y65.200 E0.0412

I posted larger fragment Jul 16

CageFox on 5 Oct 2018

The board only needs the SD card to be present when you update the firmware (or if you are using the SD card to store the eeprom settings - you can use flash for that instead if you have a recent build). So it should happily boot and work without the card. Give it a try without the card and see if you still get hangs.

The position reporting may not be what is causing your problems. If you really don't want it then @forkoz provided details of how to disable the output above. Have you tried doing this? Did it stop the hangs?

gloomyandy on 5 Oct 2018

Give it a try without the card and see if you still get hangs

Ok, I will try

The position reporting may not be what is causing your problems

When I disable reporting by commenting function code the freezes disappears totally.

The forkoz's patch actually my patch - I started this issue and found workaround - I just wonder noone except me have no problem with position spamming?

Have you tried doing this? Did it stop the hangs?

Of course, since I found workaround I can print only with it. Without patch the print freezes too often

It also need to be noticed - the need of patch appeared after some point of time - before 1 Jul there was no such problems and no position spam, while report_current_position() function code remains the same long long time.

I think first thing we need to understand, why the position reported, while there no command to request it. If I just remove G92 E0 commands from instructions the position reporting disappears, but the G92 E0 command itself not requesting position report.

CageFox on 5 Oct 2018

As far as I can see G92 has been reporting the position for a long time (at least a year looking at the sources). You can see that it calls the report_current_position function here in the code from June 2017: https://github.com/MarlinFirmware/Marlin/blob/6c45d0fd818d41c386237e1edd58107a2c7414a8/Marlin/Marlin_main.cpp#L5638. It would probable make more sense to comment out this call rather than remove the prints. The current location is here: https://github.com/MarlinFirmware/Marlin/blob/bugfix-2.0.x/Marlin/src/gcode/geometry/G92.cpp#L98

gloomyandy on 5 Oct 2018

G92 has been reporting the position

Is this intended behaviour? What the reason to report position right after set position?

I remember interesting thing: I using tx buffer:

define TX_BUFFER_SIZE 256

while by default this buffer turned off, and I have second port enabled:

define SERIAL_PORT_2 0

Can it be with some combinations of enabled ports report position writing to port directly, whil ok_send thru buffer?

The response with freezed print looks like:

19:58:48.450 : X:16.55 Y:11.57 Z:4.60 E:0.00 Count X:3200 Y:2466 Z:3680ook

So it seems to me sometimes position report text "mixes" with ok response. The report position function writes not atomically:
SERIAL_PROTOCOLPGM("X:");
SERIAL_PROTOCOL(LOGICAL_X_POSITION(current_position[X_AXIS]));
SERIAL_PROTOCOLPGM(" Y:");

And Stepper::report_positions() also writes not atimically

Maybe there is possibilities the ok response writen in between those other writes?

CageFox on 5 Oct 2018

Sorry I have no idea why it reports the position for G92, only that it has been doing so for a long time. I don't think that the output should get mixed as in theory they should both be being written by the same thread. I don't think either the "ok" or the position report is being written from any sort of interrupt context (but I could be wrong as I'm not that familiar with this code).

What do you have on the second serial port? Do you still get the hangs if you only have a single port enabled?

gloomyandy on 5 Oct 2018

What do you have on the second serial port?

For the sbase LCD

BTW I tryed to print from LCD and have same freezes

Do you still get the hangs if you only have a single port enabled?

Yes, I enabled second port and tried to print from LCD when I started to take freezes

I don't think either the "ok" or the position report is being written from any sort of interrupt context

The enabled buffer makes me think there is such interrupt or timer context to flush this buffer. Maybe it will be safer to modify this functions so them will build response string and then write it as whole?

CageFox on 5 Oct 2018

I think I just disable tx buffer. With all this asynchronous non atomic writes it will be much safier

After disable buffer I will uncommit function and will print like this

Maybe just add comment about enabling tx buffer dangerous and don't do it, guyz?

CageFox on 5 Oct 2018

The board only needs the SD card to be present when you update the firmware (or if you are using the SD card to store the eeprom settings - you can use flash for that instead if you have a recent build). So it should happily boot and work without the card. Give it a try

I checked - the board not booting without sd card at all

CageFox on 5 Oct 2018

The main question here is... Does this problem only happen in LPC based boards ? ... Maybe it is a problem of the serial class implementation of that board...

ejtagle on 5 Oct 2018

The main question here is... Does this problem only happen in LPC based boards ? ... Maybe it is a problem of the serial class implementation of that board...

Option 1. Taking into account the very small number of victims of this problem, I assume that the other platform is not affected.

Option 2. Or maybe no one, except me, has ever tried to turn on the tx-buffer.

I personally bet on option 2

CageFox on 5 Oct 2018

BTW can the behaviour of reporting position after G92 code became configurable? Something like

define DISABLE_POSITION_REPORT_ON_G92

option? Even without freezes it just useless:
issue1

CageFox on 5 Oct 2018

I checked - the board not booting without sd card at all

Really? Mine boots fine, though I'm not using the current build (I've been working with p3p on the USB side of things for a future update). Are you using the SD card for eeprom emulation? If so you could try switching to using flash memory instead.

Oh and I don't think setting TX_BUFFER_SIZE (on an LPC1768 based board) will make any difference to the USB based serial port it will always use a 128 byte ring buffer. It will however impact how the other serial port works.

Have you tried not having the LCD enabled so only using the USB serial port?

When you used the LCD to print did you have the printer connected to a PC via USB? If not can you try this?

gloomyandy on 5 Oct 2018

Are you using the SD card for eeprom emulation?

Have you tried not having the LCD enabled so only using the USB serial port?

Yes, I enabled second port to check freeze source

When you used the LCD to print did you have the printer connected to a PC

When I tried to print from LCD I disconnected printer from PC to minimize influences, but it also freezed and I understood the problem is in another place

CageFox on 5 Oct 2018

That is very strange. If you remove the call to report_current_position() from the G92 code can you print OK from the LCD or does it still freeze?

gloomyandy on 5 Oct 2018

If you remove the call to report_current_position() from the G92 code can you print OK from the LCD or does it still freeze?

For now I not tried to comment code in G92. If I comment code in report_current_position() (whole function) I can print both from PC (and I didnt' tried from LCD)

Now I disabled TX Buffer and all printing OK, except dirty printing log because report position spam

CageFox on 5 Oct 2018

I'm hearing reports that LPC1768 is broken for ReARM. I have this board, so I'll try installing on it to see if I can figure out what's broken.

thinkyhead on 6 Oct 2018

I'm hearing reports that LPC1768 is broken for ReARM

I compiled today many times and using it on my sbase - all except this issue seems ok

CageFox on 6 Oct 2018

@thinkyhead I watch for LPC176x issues quite closely, anything specific you're getting reports about? admittedly I'm running on the new framework now so wouldn't pick up issues caused by Marlin commits as quickly.

p3p on 6 Oct 2018

thinkyhead on 6 Oct 2018

This is a known issue with the usb stack resetting itself, depending on hardware, OS and the sdcard it seems to vary in severity, upto locking up the board.

The first mitigation is enabling flash EEPROM which should probably be the default at this point, the verbose logging needs cleaned up but it seems stable, the USB stack doesn't seem to report the disconnection to the CDC Serial so it locks up the mcu in its transmit loop, this could be probably be fixed with a timeout in case the disconnect flag is never set, but the root cause really should be found.

The USB stack and other parts of the framework are being actively worked on in the new external framework, but I'l see if I can at least put some mitigation in for the problem until that gets merged.

p3p on 6 Oct 2018

The first mitigation is enabling flash EEPROM

Is this board still requires #define FLASH_EEPROM when using #define EEPROM_SETTINGS ?

CageFox on 6 Oct 2018

EEPROM_SETTINGS turns on the saving settings to EEPROM, by default the LPC176x boards use the sdcard to save these settings, the FLASH_EEPROM setting makes it use the new flash eeprom emulation instead. This resolves any conflict between the usb drive and the sdcard eeprom emulation during boot and saving.

p3p on 6 Oct 2018

FLASH_EEPROM setting makes it use the new flash eeprom emulation instead

Emulation? And where the settings stored actually?

CageFox on 6 Oct 2018

The program flash memory in the MCU.

p3p on 6 Oct 2018

The flash memory in the MCU

Will reflashing board with new firmware reset this stored settings?

CageFox on 6 Oct 2018

No, the bootloader only overwrites the required blocks, the settings are stored at the end.

p3p on 6 Oct 2018

And if I want to reset to firmware settings I just need M502 and then M500?

CageFox on 6 Oct 2018

It works the same as the sdcard eeprom emulation (or a real EEPROM on a board with one), just without the complication of sharing the storage medium with the host computer.

p3p on 6 Oct 2018

🎉1 👍1

There are a few problems with the current LPC1768 USB stack. One is that the serial port does not always detect the host side closing the port (including a USB suspend/resume), another is that in some situations the disk access code will cause the USB host to reset the entire connection, which can result in serial packets being lost. There are fixes for both of these problems (and a few others) in p3p's updated framework. Using the flash for eeprom definitely helps a lot and If needed I could probably put together a PR for the easier fixes to be applied to the current branch, but I won't be able to do that for a week or so as I'm away from home. But really it would be better to switch to the new framework.

gloomyandy on 6 Oct 2018

👍1

@CageFox The underlying issue from your original post (the single byte corruption) has now been fixed and will be applied to bugfix2.0.x soon, was a hard problem to diagnose.

p3p on 12 Oct 2018

👍1

was a hard problem to diagnose

It's like any rare sporadic error, just to find it requires much hard work, thanx

What's about make G92 report position behaviour configurable?

CageFox on 12 Oct 2018

What's about make G92 report position behaviour configurable?

I tend to work on the behind the scenes things that's more of a @thinkyhead discussion, though without the newlines going missing it shouldn't be as much of a problem for you after the patch.

p3p on 12 Oct 2018

that's more of a @thinkyhead discussion

maybe create one more issue about this?

though without the newlines going missing it shouldn't be as much of a problem for you

Configuration option wich makes print log more clear is not a bad option, as for me

CageFox on 12 Oct 2018

the single byte corruption

And a question: this problem happened only when tx buffer activated? Or what combination of settings required to catch this problem?

CageFox on 12 Oct 2018

The problem was always there on the usb serial transmission no matter the configuration, the compiler was inlining then optimising some code that was designed to be interrupt safe, but only as ordered, so depending on timing of interrupts and the amount of data transferred it would be more prevalent.

p3p on 12 Oct 2018

The problem was always there

Then it's a bit strange the problem has been reported so few times.

CageFox on 12 Oct 2018

G92 reports the position because it isn’t always the host who sends it, and we need to keep the host apprised of the current position whenever it changes. If the behavior were to be changed we would need to coordinate with host developers and get them to adopt new G92 sub-codes or parameters.

thinkyhead on 13 Oct 2018

And so I propose not to disable this behavior at all, but to make it configurable, enabled by default. Users who do not need it, like me, will simply turn off this reporting, instead of commenting on the code every time they compile a new version.

CageFox on 13 Oct 2018

Is this still a problem with sBase at all? On both andy's builds and now bugfix usb issues were gone. I left the printer sitting for multiple days and maybe once it reset itself. Disconnects stopped, etc. I've since reverted the G92 removal as I wasn't using it on andy's repo anyway.

Furthermore, I've found a better workflow by enabling sd-card lockout only while printing and just sending the files to the SD card. This way, if anything happens with the host or someone pulls out the USB, a 12 hour print won't be ruined.

All the benefits of USB and all the benefits of SD printing with only a slight risk of SD corruption that can be fixed with scandisk easily and can still happen anyway if your printer crashes.
in main.cpp

     if (IS_SD_PRINTING() || IS_SD_FILE_OPEN())
  //  if (card.cardOK)
      MSC_Aquire_Lock();
    else
      MSC_Release_Lock();
  #endif

If you name your files right, you can pre heat through USB and fire off the print using a command then disconnect and let it do its thing.

forkoz on 21 Oct 2018

Indeed the usb fixes have all been merged into 2.0.x at this point or rather the framework is no longer part of Marlin and is maintained elsewhere as a 'more' general Arduino framework with Marlin now building with that.

There is no need to risk the unrecoverable FAT corruption that could occur when a simple M21 would switch the sd card over to Marlin from the Host, (M22 for the reverse) but that's your choice.

p3p on 21 Oct 2018

That's working now?

forkoz on 21 Oct 2018

@forkoz sorry I don't understand your last question. Is what working now?

gloomyandy on 21 Oct 2018

@CageFox still having this issue?