Marlin: [2.0.x] A lot of resend errors

Created on 3 Jun 2018  路  118Comments  路  Source: MarlinFirmware/Marlin

With last 2.0, using repetier, I see so many USB errors:

mega2560 and ramps 1.4 + XON/XOFF

print seems to be working due to resend action

Confirmed ! Solved

Most helpful comment

Ok, I'm testing this latest 2.0 bugfix (RX buffer=1024) but I'm almost sure it will solve the issue.
Time 2 hour and I'll have the answer

All 118 comments

Done some test, issue appeared after #10911 (serial files changes) I'll try to figure out why

Note: my cnf has XON/XOFF, RX = 1024 and TX = 0

I think it can be related to missing 'writeNoHandshake' function by I'll have to check (No xoff char sent)

@GMagician ; The XOFF char should be sent inside the store_rxd_char() function.

if (rx_count >= (RX_BUFFER_SIZE) / 8) {

Maybe i missed something...

the writeNoHandshake() was removed because, after the cleanup, it ended calling the write() function directly...

The logic behind XON/XOFF is quite simple: Read buffer fills only in store_rxd_char(), so that is the only place that can cause an XOFF char to be sent. And read buffer only empties in read(), so that is the only place where XON should be sent.

Slic3r sees that line number error all the time on connect

I tried XON/XOFF and the codes are being sent as expected... Maybe something has changed in the timing?

  • But i don麓t see it...

@ejtagle I'm investigating and I have to ask you why have you added

              // clear the TXC bit -- "can be cleared by writing a one to its bit
              // location". This makes sure flush() won't return until the bytes
              // actually got written
              SBI(M_UCSRxA, M_TXCx);

when Tx interrupt is disabled? since interrupts are not used this should be useless (and remove it save some cycles)

well, there is nothing wrong with this code..just a question. when cleaning receive interrupt bit, if some char will arrive, interrupt will be pending and generated after bit is re-set?

@ejtagle I think about a race condition. What about MarlinSerial::write after data register empty test if rx interrupt occurs with buffer full? inside interrupt will be sent XOFF but immediately after returning from it is written a new byte

@GMagician : On your first question... Probably, if not using TX interrupts the SBI(M_UCSRxA, M_TXCx); can be removed, as it is unneeded 馃憤
On your 2nd question, yes: The interrupt reception bit is just a mask. It uses the RXCn bit of the UCSRnA register to determine if an interrupt should be generated or not. The RXCn is readonly, the only way to clear it is by reading the received data from the serial port

Race conditions is what could be happening... If the RX buffer is full and you get an RX interrupt, then the XOFF char was already sent by store_rxd_char() when the buffer was becoming full

   // for high speed transfers, we can use XON/XOFF protocol to do
      // software handshake and avoid overruns.
      if ((xon_xoff_state & XON_XOFF_CHAR_MASK) == XON_CHAR) {

        // calculate count of bytes stored into the RX buffer
        ring_buffer_pos_t rx_count = (ring_buffer_pos_t)(rx_buffer.head - rx_buffer.tail) & (ring_buffer_pos_t)(RX_BUFFER_SIZE - 1);

        // if we are above 12.5% of RX buffer capacity, send XOFF before
        // we run out of RX buffer space .. We need 325 bytes @ 250kbits/s to
        // let the host react and stop sending bytes. This translates to 13mS
        // propagation time.
        if (rx_count >= (RX_BUFFER_SIZE) / 8) {

The first 'if' clause makes sure once an XOFF is sent, it will be never sent again unless an XON was previously sent.

The conditions to send an XOFF is if the count of unread chars is
rx_count >= (RX_BUFFER_SIZE) / 8
, in your case, 128 bytes. And the buffer has margin for 896 extra bytes

The only way to send an XON is if the count of queued chars is at the read() function

if (rx_count < (RX_BUFFER_SIZE) / 10) {

You could try 2 things: Changing the limits (for example, use an XOFF limit of RX_BUFFER_SIZE) / 16 instead of RX_BUFFER_SIZE) / 8

And use an XON limit of 0, instead of if (rx_count < (RX_BUFFER_SIZE) / 10) {

Also, could try to decrease speed ... The idea is to find out the culprit... I don麓t know if there is a way to log the communication from the PC side.
On Windows, i have successfully used Realterm with XON/XOFF control to send files,..

Getting a raw log would be useful, as you would get the XON/XOFF chars (coded 0x11 and 0x13) to analyze...

Also, the Marlin code already has counters for dropped chars, with that you can ensure the problem is dropped chars (=the host either not receiving the XON/XOFF chars or sometinh else that causes delay in the processing of those chars...)

I've done some test and restoring some "CRITICAL_SECTION" calls will solve...continuing

@ejtagle what i'm supposing about race is this:
when 'write' function is sending something, it test for data register free, if after this test and char send an rx interrupt is received and rx buffer is full, rx interrupt will send XOFF but as soon as interrupt exits, 'write' will overwrite xoff with 'c' and xoff will be lost (and maybe also 'c').
This was already present in previous version, but since this may be race issue, different execution times may raise new pending bugs

it seems that CBI/SBI(M_UCSRxB, M_RXCIEx); are not doing what expected. Replaced them with CRITICAL_SECTION_?!? will solve issue (in 'available' and 'read' functions is enough to see a big difference)

Humm... That is very strange... But maybe you are giving the answer... ;)

I have to admit that it does not make sense, unless the interrupt is somehow already flagged at the interrupt controller...
I really would like to avoid critical sections, because they add latency to the Stepper ISR...

I disassembled the code itself, and there is an interesting race condition here...

 ring_buffer_pos_t MarlinSerial::available(void) {
    #if RX_BUFFER_SIZE > 256
      const bool isr_enabled = TEST(M_UCSRxB, M_RXCIEx);
      CBI(M_UCSRxB, M_RXCIEx);
    #endif

      const ring_buffer_pos_t h = rx_buffer.head, t = rx_buffer.tail;

    #if RX_BUFFER_SIZE > 256
      if (isr_enabled) SBI(M_UCSRxB, M_RXCIEx);
    #endif

    return (ring_buffer_pos_t)(RX_BUFFER_SIZE + h - t) & (RX_BUFFER_SIZE - 1);
  }

This is assembled to the following:

 ring_buffer_pos_t MarlinSerial::available(void) {
    #if RX_BUFFER_SIZE > 256
      const bool isr_enabled = TEST(M_UCSRxB, M_RXCIEx);
   1a938:   20 91 c1 00     lds r18, 0x00C1 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
      CBI(M_UCSRxB, M_RXCIEx);
   1a93c:   80 91 c1 00     lds r24, 0x00C1 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
   1a940:   8f 77           andi    r24, 0x7F   ; 127
   1a942:   80 93 c1 00     sts 0x00C1, r24 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
    #endif

      const ring_buffer_pos_t h = rx_buffer.head, t = rx_buffer.tail;
   1a946:   80 91 44 11     lds r24, 0x1144 ; 0x801144 <rx_buffer+0x400>
   1a94a:   90 91 45 11     lds r25, 0x1145 ; 0x801145 <rx_buffer+0x401>
   1a94e:   40 91 46 11     lds r20, 0x1146 ; 0x801146 <rx_buffer+0x402>
   1a952:   50 91 47 11     lds r21, 0x1147 ; 0x801147 <rx_buffer+0x403>

    #if RX_BUFFER_SIZE > 256
      if (isr_enabled) SBI(M_UCSRxB, M_RXCIEx);
   1a956:   27 ff           sbrs    r18, 7
   1a958:   05 c0           rjmp    .+10        ; 0x1a964 <_Z22get_available_commandsv+0x66>
   1a95a:   20 91 c1 00     lds r18, 0x00C1 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
   1a95e:   20 68           ori r18, 0x80   ; 128
   1a960:   20 93 c1 00     sts 0x00C1, r18 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
    #endif

    return (ring_buffer_pos_t)(RX_BUFFER_SIZE + h - t) & (RX_BUFFER_SIZE - 1);
   1a964:   84 1b           sub r24, r20
   1a966:   95 0b           sbc r25, r21
   1a968:   93 70           andi    r25, 0x03   ; 3
   1a96a:   89 2b           or  r24, r25
   1a96c:   09 f4           brne    .+2         ; 0x1a970 <_Z22get_available_commandsv+0x72>
   1a96e:   21 c1           rjmp    .+578       ; 0x1abb2 <_Z22get_available_commandsv+0x2b4>
    false);
}

You will notice that the instruction CBI(M_UCSRxB, M_RXCIEx); is implemented as a

-Read register
-Clear bit7
-Write register.

The same register also holds the TX interrupt, so, if for some reason a TX interrupt happens between the read and the write, and that was the last TX interrupt, and it is disabled in the ISR, then TX interrupts will be reenabled and the TX isr will be called when no char is available. That is a big problem.... So, there is a problem here...

@GMagician : After careful analysis: You are absolutely right!! ... There is a hidden race condition in the code... Unfortunately CBI/SBI(M_UCSRxB, M_RXCIEx); are not atomic operations on AVR :(
That means that in MarlinSerial::available() / MarlinSerial::read(), there is a chance the RX isr will take an incorrect state, because the M_UCSRxB is in the middle of being altered.
If that happens, then the XON/XOFF could perhaps end not being sent..
This is a very unfortunate event. I am rethinking if there something that could be done...

AVR does have ops to set/clear individual bits in I/O ports atomically and judging by the macro names marlin code did assume that we'll end up with those instructions. I wonder why compiler didn't generate them.
https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_SBI.html
https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_CBI.html

@Artem-B that can be done only with I/O registera 0-1f....this one is c1 (if correctly understood chip DS)
@ejtagle how did you get disassembler? this could be useful instead of trying code changes....
@ejtagle also peek has the same issue....

@Artem-B Because, unfortunately, AVR has no instructions to set the variables or the UCSRnA register bits atomically, and that is what we need. Don麓t worry: After a lot of thought, the race conditions can be safely handled (i think!)

Attached is my attempt to handle and fix them. If anyone @GMagician or you want to test them, it would be great! (you will see i am trying to avoid critical sections, to improve pulse jitter of the stepper drivers)
MarlinSerial.zip

@GMagician : If you have the Marlin.ino..elf file, you can use avr-objdump -S -d Marlin.ino.elf and you will get the disassembled version with source code intermixed. It麓s very handy sometimes.

@ejtagle I'll give a try this evening but in my configuration TX interrupt should not be enabled (TX buffer is 0).

It should work also for TX buffer = 0. Unfortunately, the race condition is still there. The idea is to learn to live with it. The only doubt i have right now is if there is any latency between disabling the RX interrupt and no RX isr being called...

(if there is any latency, then this file will show the same problems as before...) Right now, the idea is never to touch the RX ISR enabling at the ISR itself. And handle disabling of TX isrs with an implicit retry if it fails

The idea is to learn to live with it

can't agree, sorry.
When host will get 'wrong line error' will resend line and that's ok
but when marlin says 'echo:unknown command' reporting some parts of a real G1 line maybe this will lead to something wrong...
If I have to choice about jitter on stepper (some distortion on printed part) or missing printed parts (missed commands) I prefer first, what if missed command is T1?

Let me correct myself: The idea is not to learn to live with the serial errors. My idea is to implement the serial class in such a way that, even if there is a race condition, it perfectly works and sends the XON/XOFF chars reliably and does not lose any chars.
I also think that is unacceptable to have unreliable communication, and if it can麓t be solved without the critical sections, then be absolutely sure we will use them! - I repeat, unreliability of the Xon/Xoff algorithm is not acceptable at all. I am just trying to figure out if there is a way to avoid the critical sections, if possible, without comprimising integrity of the communication 馃憤 -
Sorry for not being as clear as i should. English is not my native language... ;)

@ejtagle

Sorry for not being as clear as i should. English is not my native language...

we have the same problem, so forgive me if I ever will be rude with my sentences

I still have a doubt about this race condition, already present before:

void MarlinSerial::write(const uint8_t c) {
  while (!TEST(M_UCSRxA, M_UDREx)) sw_barrier();
  --> rx interrupt here that will send xoff
  M_UDRx = c;
}

even if less probable, if h/w can't shift out 'XOFF' byte new char 'c' will corrupt sent data

@GMagician: You are absolutely right on that race condition... Sending Xoff from the RX isr is a big problem...I think i have a solution for that...
Excellent analysis !!

@ejtagle I tested attached zip but it doesn't solve, I expected that since TX_BUFFER = 0 means no tx interrupt.

@GMagician : I am working on a fix for the race condition. I will stress test it to avoid wasting your precious time... 馃憤

@ejtagle

use avr-objdump -S -d Marlin.ino.elf

I'm using platformio where do I find avr?

@ejtagle

I will stress test it to avoid wasting your precious time...

don't worry for my time, since I'm the only one to have this issue i'm glad to test it but I'll not be available for next 2 days.

@GMagician

MarlinSerial.zip
This is my latest version. I wrote an sketch to test it, and XON/XOFF are being sent as neccesary. I even tried with a 512 rx buffer with no lost bytes when using XON/XOFF

I've tested code not solved, still get checksum errors

Note that _written and flushtx are not used when TX_BUFFER = 0 so you can restore that parts

I even tried with a 512 rx buffer with no lost bytes when using XON/XOFF

no xon/xoff with 512 bytes...look below

#if RX_BUFFER_SIZE >= 1024
  // Enable to have the controller send XON/XOFF control characters to
  // the host to signal the RX buffer is becoming full.
  #define SERIAL_XON_XOFF
#endif

@ejtagle where did you get avr.exe? which app install it?

Let麓s go in order...
_written and flushTX are not used in the class itself, but there is a macro referencing them.. After close inspection, seems the macro (at core.h) is only defined when a TX buffer is used... In that case, you are right, and we can remove them
-I tried by also changing that conditional #if RX_BUFFER_SIZE >= 1024..., just to stress a little bit more the library...

I use Arduino to compile it, and it already installs the AVR compiler. But, you can get the avr-objdump by installing
http://www.microchip.com/mplab/avr-support/avr-and-arm-toolchains-(c-compilers)
or
http://blog.zakkemble.co.uk/avr-gcc-builds/

Attached an standalone Arduino sketch that i used to test the MarlinSerial hal.. It just counts characters received ...
sketch_jun05a.zip

So, i can just test the serial class with no Marlin or other interrupts running...

I can see errors only when printer is printing...With no marlin, repetier is not able to proceed

I know, i know. Right now the problem is how to proceed to determine the cause of the problems...
Is it Marlin (not sending the XON/XOFF chars) ? ... Is it repetier, not recognizing the characters on time ?

I also tried with buffer = 512 (so no xon/xoff) and I got again crc error. So is not an issue related to xon/xoff

A solution for this bug would be most welcome, since it's one of the critical issues holding back the 1.1.9 release. Reverting to CRITICAL_SECTION_START is also acceptable, because serial handshaking is not a major bottleneck in achieving reasonable print speeds.

But all the race conditions you pointed out were right.
Does repetier use XON/XOFF even if the printer does not support it ?
Maybe they are adding to the CRC the XOFF/XON chars...

@thinkyhead : My main concern here is not if reverting or not will fix it. It is not understanding the cause of the problem. Critical sections could be masking a more subtle problem..
All the tests i have done, transferring HUGE amounts of information on purpose, just to make it fail, did not fail at all. That麓s something to worry about...

The thing is, we have an obligation to the public and we cannot wait forever for a solution, allowing one issue to restrain all the other fixes and improvements that we need to get out to users. If the previous code works better in the real world, we will simply have to accept it and go back to the old code, whether we understand why it works or not.

The same will have to go for the endstop code now leading to inexplicable errors in dual-axis homing. If we can't figure out why the new and "better" code is broken for these users, then we must go back to the old and "worse" code that has the virtue of working.

Maybe they are adding to the CRC the XOFF/XON chars...

Very god point.
What makes you think RepetierHost would work in XON/XOFF mode at all?
XON/XOFF is relative new to Marlin.
The "ok"-protocol always worked without XON/XOFF. Regardless of using ping-pong or the advanced buffersize (default) method. (Important is to tell RepetierHost about Marlins RX buffer size and it seems to help is you tell it only about 3/4 of Marlins buffer size (not tested in RH versions above 2.0).)

@repetier
Can RH handle XON/XOFF?

I redid all my tests. Using a Windows Serial terminal called YAT (https://sourceforge.net/projects/y-a-terminal/), i sent a full gcode file to the SD card (with line numbers and checksums). 2megabytes. Not a single error or retransmission or anything. If disabling XON/XOFF i get lots of overruns.
So, i suspect @AnHardt could be right here...

No, we do not support XON/XOFF handling. We only rely on our own traffic counting a sthis is compatible with all versions.

@ejtagle

But all the race conditions you pointed out were right.

yes, they were really presents

@repetier

No, we do not support XON/XOFF handling

that's bad, XON/XOFF gives safer communication (but it's also ok since, maybe, only Marlin support it)

@thinkyhead

we have an obligation to the public and we cannot wait forever for a solution

The same will have to go for the endstop code now leading to inexplicable errors in dual-axis homing

I agree for 1.1.8 stable branch, but stopping this kind of work for "debug" branches will fossilize Marlin

@thinkyhead, @AnHardt, @ejtagle I suspect that, since I still have crc error when XON/XOFF is disabled, that my mega may have reached its limits (I get these error only when printing). ejtagle fix has reduced such errors (now I saw only crc, before more and different ones)
When we enable critical sections we stop every interrupt this, I suspect, will give to serial communication a little bit of breath, enough to "not miss" chars.
I'll continue to investigate to see if I'm able to identify cause.

Why should XON/XOFF be safer? Safety is only by checksums to protect against communication errors. XON/XOFF is maybe even more error prone as a com error hitting the control char would cause com errors. All I see is that communication might be faster, if you get no errors. On the other side it uses special chars and therefor it is not implementable in firmwares using binary transfers (repetier-firmware).

@repetier is safer because it can detect when receiver has its own buffer full and stop sending data.

EDIT XON/XOFF may be also used in binary protocols (using escape codes). Maybe safer is not correct "word" but I home concept is passed

We have no buffer overrun problem as we count the bytes send and bytes acknowledged with "ok". That is what disbaling ping pong is for. It is essentially the same speed gain without need of XON/XOFF chars just by doing the data counting on our side.

@GMagician

is safer because ...

No. The used protocol ensures not to overfill the buffer.

PingPong

  • The host sends a line of code and stops.
  • Marlin reads the buffer executes/queues the command and then sends an 'ok'.
  • The host receives the 'ok' and knows from that the buffer is free again and sends the next line of code.

As long the host never sends a line longer than the rx-buffer all is well and can't go wrong.
This has large latencies between the commands.

To improve the latencies you can tell the host about the rx-buffer size.

  • The host sends a few command lines - until the buffer is full - and keeps track for every command line how long it has been.
  • Marlin executes one command line and sends an 'ok'.
  • The host receives the 'ok' and knows from that, he can send that many bytes the first command was long.
    That is about what RepetierHost uses by default. (grbl's g-code sender application has the details.)
    Here the command to command latency is much better. Usually Marlin can parse the next command immediately. On the other hand the latency until a 'manual' command arrives is much worse.

Against that I invented the "Emergence Command Parser".
A few important commands are parser directly in the RX-interrupt bypassing the RX-buffer.
So now a "stop" can overtake the queue in the buffer. To make that work the host has to support that. He has to violate the above protocols. He has to know that he can send, this few commands, even if not have received an 'ok'. (As far as i know, up to now, no host implemented this. But it's relatively easy to implement - compared to)

XON/XOFF
is usually implemented in the serial driver (on the host side). Application programmers usually use the ready made drivers for a device. Violating the XON/XOFF protocol by purpose, (for the host developer) would mean to make his own device drivers. That will never happen.

@AnHardt thanks, detailed explanation

Disabled xon/xoff and increased buffer in repetier to 768 (3/4 of 1024) but same errors:
'checksum' and 'last line number+1'

I'm trying to analyze traffic but I can't find a valid "free" sniffer for win10

@GMagician: For some reason, you are still losing chars. May i ask if your Ramps board has a CH340 USB to serial converter, or if it has an Atmega8u2 acting as a USB to serial converter ?
I seem to recall that there was a problem with older firmwares used for the USB to serial conversion, and there were reports of lost characters...
(https://www.arduino.cc/en/Hacking/DFUProgramming8U2)...

You could try the following sketch:
test_serial_sketch.zip

This sketch implements a basic Gcode parser, that only verifies checksums (and will also print the received line if an error is detected)

The following file is just a GCode file with proper line numbers and checksums
gcode with checksums.zip

You must use a Serial terminal that supports the XON/XOFF protocol to send it: I use YAT
https://sourceforge.net/projects/y-a-terminal/

The idea is the following:
-Write the sketch to the Mega2560 board (powering it from USB is fine, you don麓t need or want to power motors, as ports are not being initialized)
-Open YAT, select port, set port speed to 250000, Protocol to Xon/Xoff software flow control and in "text settings" you must select as EOL sequence
-In YAT, select the given Gcode file (decompress it to a suitable location) and send it
-You should get tons of "ok" but no errors.

Alternatively, you can use Marlin to test this: In that case you will need an SD card plugged in the SD slot (if you don麓t have any, then you cannot use this method, i really doubt you would actually want to execute the Gcode)
-Just flash marlin, but configured for XON/XOFF support
-Open YAT, select port, set port speed to 250000, Protocol to Xon/Xoff software flow control and in "text settings" you must select as EOL sequence
-Send the next command to start saving all the received commands to the SD card in a file called gtest.gco:
M28 gtest,gco
-Finally send the supplied file. You should get a bunch of "ok" but no errors.

The 3rd alternative that you can try is a dry run:
-Just flash marlin, but configured for XON/XOFF support
-Open YAT, select port, set port speed to 250000, Protocol to Xon/Xoff software flow control and in "text settings" you must select as EOL sequence
-Send the next command to disable extruder
M111 S255
-Finally send the supplied file. You should get a bunch of "ok" but no errors. - The printer will actually move, but no heating of the bed or extruder and no extrusion. You should get no errors.

In all cases, YAT allows to LOG all the sent and received data (Enable it in the LOG menu), and that can help to find out in what case or what kind of corruption is happening...

Thanks @ejtagle, on my mega I don't have for sure ch340 chip but I don't know what is used.

Sending file I get:
error.txt
I also get errors...

I don't think it's a USB bridge bug. Replacing serial code with CRITICAL_SECTION it works perfectly

N32149 G1 X96.172 Y104.346 E67
Error:checksum m
ismatch: SENT:32 COMPUTED:9
RXD line: 897 G1 X1*099760 7 3  5Y1 X4Y3 13Y1E1303163 G1 X11G1500 E679.Y202
D:77 Q:1023
9.41064*95
N32150 G1 X96.43 Y104.4 E679.41503*101

This specific piece is telling that there were dropped characters (D:77 means 77 bytes were dropped), and Q:1023 means all the RX buffer was used at that time.

The main question here is why...

This means either the XON/XOFF was not sent, was not received, or the terminal is ignoring it, or something is filtering it out...

In YAT: New terminal->
yat-1
Then Terminal->Settings->Text Settings->
yat-2

Seems there is something in the middle that is buffering the data... The Atmega2u8 that is used in the Arduino Mega2560 has a 128 byte buffer for both TX and RX, and a polling interval of 1mS, so, going up to 250Kbits per second, it could introduce a latency of up to 25 * 2 = 50bytes. Having a 1024 bytes buffer should be more than enough to handle them

The other thing would be to know where exactly a critical section "solves" it -- Maybe it is a workaround for something more serious...

Image 1 is what I have..when terminal open it says XON.
I have to set EOL to LF otherwise I see ok < LF >

but really I never see XON/XOFF during file sending and this is really misterious

But what about same errors when XON/XOFF protocol is off? repetier has its own logic to limit buffer....

I have to check one thing....
I hope my mega don't have FTDI chip...I recently read something weird about company behaviors
https://hackaday.com/tag/ftdi-gate/

second attack is maybe what I could facing....

Yes, Repetier has its own protocols, i agree... It should not happen, but something is happening, i have no idea what.
On the YAT log i do not have XON/XOFF chars. The XON/XOFF chars are probably handled by the windows serial driver, that is why i dont see them.
But if I disable Xon/Xoff on YAT, and keep exactly the same sketch running, then the transfer goes way faster and gives tons of error and invalid checksums.
YAT seems not to be displaying those XON/XOFF chars. But, if you enable raw logging,
yat-3
And open the log with a hex editor, you can look for the 0x11 and 0x13 chars and you should see them: In my case, if disabling Xon/Xoff in YAT, i get:
yat-5

You can also switch YAT to Manual Software XON/XOFF. The difference is that instead of trusting the windows driver, YAT implements its own XON/XOFF parsing. And in that case you WILL see the XON/XOFF chars. This is an screenshot on how it looks here:

yat-6

This could also be a driver issue...

Wow....I don't understand.... my yat setup:
i1
i2

log got:
YAT-Log-20180609-082505.zip

it seems that XOFF doesn't stop anything

With "manual xon/xoff" it works. Then from xon/xoff point of view it seems that my arduino drivers don't handle xon/xoff.
With this we can close xon/xoff issue part. But Why disabling xon/xoff (define commented) I still get errors using repetier but with critital section not? (Yesterdeay I had a 2 hours print with critical section active with no error at all)

Kind of sounds like the same discussion:
https://github.com/grbl/grbl/issues/50#issuecomment-4143431

@Sineos in my case is not only an xon/xoff issue. I also programmed printer with no XON/XOFF and my errors are still there. If I restore CRITICAL_SECTION in serial.c everything works (Yesterday with current 2.0.x branch and such "fix" I made a 2 hours print with no issues)

Now another test.
@ejtagle I used your sketch modified to let me run repetier. I modified your gcode file removing crc and I printed it...
I don't use XON/XOFF in your skecth so I use repetier buffer handling (I set cache to 256)

This is what I get:
a.txt

Wait Wait Wait...maybe I got it
I modified your sketch to echo acknowledged line number, this is what I get when idle

10:21:15.386 : N50 M105*18
10:21:15.402 : echo:50
10:21:15.402 : ok
10:21:15.402 : echo:0
10:21:15.402 : ok

double 'ok' ack, this can destroy repetier buffer logic handling

EDIT: Ok, your parser don't handle 0x0D in line feed. adjusting parser to handle it and everything works... Now I'll check Marlin Parser (just to be sure)

EDIT2: Marlin parser is immune to that

Ok, let's resume current status:
1) XON/XOFF is working correctly with these fix
2) XON/XOFF is not handled by repetier not by my drivers
3) checksum error/line error are not seen using "test" software, even with repetier
4) checksum error are seen using "marlin" only when RX interrupt are disabled, when CRITICAL_SECTION is used everything works as expected

-The fact that XON/XOFF only works in manual mode, means your (windows/macos/linux) drivers are non compliant and do not handle XON/XOFF (linux is known to have that issue, Windows should work unless the driver itself is not handling them (incomplete driver)!)
-Critical section disables not only RX interrupts, all interrupts, including stepper and temperature isrs...
-The test software (sketch) is using exactly the same code as Marlin ...

The conclusion to this could be that perhaps the problem is caused by something else inside Marlin... Either an ISR altering something, or memory corruption, or....

You can try writing to the SD card using Marlin and YAT.

Alternatively, you can use Marlin to test this: In that case you will need an SD card plugged in the SD slot (if you don麓t have any, then you cannot use this method, i really doubt you would actually want to execute the Gcode)
-Just flash marlin, but configured for XON/XOFF support
-Open YAT, select port, set port speed to 250000, Protocol to Manual Software Xon/Xoff software flow control and in "text settings" you must select as EOL sequence
-Send the next command to start saving all the received commands to the SD card in a file called gtest.gco:
M28 gtest,gco
-Finally send the supplied file. You should get a bunch of "ok" but no errors.

But i guess it will show the same problems.

Next steps would be to disable the stepper ISR and repeat the above. Or disable the temperature ISR, disable watchdog and repeat the above.
We must find out the cause, that is the main issue here..

We could also add the supplied sketch inside Marlin, trigger its execution with M111 and see if it works. The idea is to find out if there is something else causing these problems....

Hi,
a couple more data points for you....
I've been testing the bugfix-2.0.x branch from a couple of days ago (the last commit I have is eb1026d). So this version will not have the changes in @ejtagle private build. My printer uses an mega2560 based MKS base 1.3/1.4 clone control board. The PC side windows drivers are the standard Arduino USB serial port setup.

I've been using the latest release of Repetier-Host (version V2.1.2) and a sample gcode file of approx 30000 lines (takes about 2 hours to print using dry run mode). With the default settings of baud rate 250000, TX_BUFFER 0, RX_BUFFER not defined and XON turned off. Then I can run the print with zero errors. However if I change the RX_BUFFER setting to be 1024 (still with XON turned off) then I get 8 errors while running the same print. I've only performed this test once with each setting so far, so perhaps not conclusive in any way, but I thought the result may be of interest as I noticed that the code seems to follow a different path depending upon the buffer size setting.

I'm happy to run more tests if you think this will help.

@ejtagle I'm using arduino windows drivers to communicate and I confirm that they don't manage correctly XON/XOFF.

RX_BUFFER_SIZE > 256 check shouldn't be RX_BUFFER_SIZE >= 256?

@GMagician : Basically, the condition RX_BUFFER_SIZE should be > 256
The reason is that we are using a circular FIFO, so an 8bit index is able to represent up to 256 different values.

My Arduino Mega is a chinese clone that is using the CH340 chip as a USB to serial bridge. And the windows driver works perfectly with XON/XOFF handshake

@ejtagle but since you test and array size, 256 items in array are 2 bytes

@ejtagle also mine mega is chinese but I'm sure it doesn't use ch340 but something else. Looking at COM dirvers windows says "copyright arduino.ccc"

@GMagician : You will always find (if you find something different, its a bug)

value & (ring_buffer_pos_t)(RX_BUFFER_SIZE - 1);

This always translates to a number < 256 if RX_BUFFER_SIZE <= 256, always fitting in 8bits

I think that the problem we are facing here is that for some reason, RX interrupts are not being serviced as fast as needed under some "undetermined" situation.

If it was related to RX_BUFFER_SIZE, defining RX_BUFFER_SIZE to 128 bytes should solve all problems... Specially with the "ok" polling protocol used by Repetier...

@ejtagle perhaps you missed my comments above. I see no errors with a default buffer size (which I think ends up being 128). Perhaps this is just a coincidence but maybe worth taking a closer look at?
@GMagician have you tried testing with a RX_BUFFER_SIZE undefined (which is the default setting I think).

@gloomyandy : Quite to the contrary: Your report could reinforce a small theory i have about this problem: If the RX_BUFFER is 256 bytes or less, then there is no need for the critical sections at all.
This could mean either that the "critical sections" = disabling the RX interrupts is not working, or that there is some kind of latency between disabling the RX interrupt and the actual disabling of the interrupt.
I麓ve been looking for any kind of documentation on this kind of latency, but didn麓t find anything yet...
But, if there is some latency, it could mean reading or writing a half inconsistent value of the rx buffer index head or tail, and that would explain a lot of things

@ejtagle I also added a "nop" after CBI to try to enforce your theory...but none.
Now I'm testing SD transfer. I get error when normal transter. I created now this:

#define DISABLE_TEST bool tisr = TEMPERATURE_ISR_ENABLED(); \
                       DISABLE_TEMPERATURE_INTERRUPT(); \
                       bool sisr = STEPPER_ISR_ENABLED(); \
                       DISABLE_STEPPER_DRIVER_INTERRUPT(); \
                       bool isr_enabled = TEST(M_UCSRxB, M_RXCIEx); \
                       CBI(M_UCSRxB, M_RXCIEx)
  #define ENABLE_TEST  if (tisr) ENABLE_TEMPERATURE_INTERRUPT(); \
                       if (sisr) ENABLE_STEPPER_DRIVER_INTERRUPT(); \
                       if (isr_enabled) SBI(M_UCSRxB, M_RXCIEx)

and i'm going to test it

I was surprised at the relatively low error count I got with a buffer of 1K. It could just be that getting no errors with the default setting is a coincidence? I'm happy to test further if it will help. If you would rather I test with your modified build I can do that, or with a different gcode file or Marlin configuration if you think that would make a difference.

Basically, i麓ve had success even with a 512byte buffer and XON/XOFF enabled...

@ejtagle SD tests done....when XON/XOFF is used and with YAT I send gcode file, everything is ok, even with interrupts enabled (your original file is working).
Before these test I noticed some errors, don't know if normal.
I had power off (only 5V via USB) and since I have bltouch that is not powered (external 5V on servo) printer send an error. Now I have line number errors!
a.txt

Either we are losing chars, or noise, or something... There is an small modification that can tell the cause: Reading the UCSR0A.DORn . If there is a buffer overrun, then this means something is blocking the RX interrupts...

I麓ll add support for such debug feature... :+1:

Well @GMagician ... Me again, abusing the poor M111 command to print serial statistics...

serial-debug.zip

You need to add to your configuration_adv.h file the following:

#define SERIAL_STATS_RX_BUFFER_OVERRUNS 1
#define SERIAL_STATS_RX_FRAMING_ERRORS 1
#define SERIAL_STATS_MAX_RX_QUEUED
#define SERIAL_STATS_DROPPED_RX

Probably the latter 2 are already enabled.

Then, by issuing M111 you will be able to know if there was a buffer overrun (that means the RX interrupt was disabled for too much time an a RX char was lost.
You also will get the Framing Error counter (noise can cause RX problems), and also RX queue maximum size and if some chars were lost due to RX being full at the time of adding an extra char to it...

Hope this allows to know the cause of the CRC errors/line errors...

The serial changes are now isolated in the PRs #10982 and #10983.

Yes, the previously attached files were exactly the ones you separated into those PRs. We are still trying to figure out the problem.
The problem seems only to happen when printing, but not when transferring files to the SD card. There is no such huge difference between scenarios, except that the stepper ISR rate, when not printing, is reduced to 1000 interrupts per second, as compared with printing, that probably forces an increased rate...
I have written an independant sketch that uses the Marlin serial classes and also creates a timer running at 12khz with a delay inside that emulates the same load. And i am unable to make it fail...
That is why i added those debug options. We need to know if it is the RX FIFO becoming full, or an internal UART buffer overrun (that will indicate the RX isr is not being run fast enough...)

More testing using @ejtagle updated code with debug stats. Same gcode as above (approx 2 hour print), RX_BUFFER_SIZE 1024, XON/XOFF not enabled. M111 reports the following:
12:02:20.315 : echo:DEBUG:offBuffer Overruns: 2
12:02:20.315 : Framing Errors: 0
12:02:20.315 : Dropped bytes: 2
12:02:20.315 : Max RX Queue Size: 201

I had two checksum errors reported by Repetier. Going to try the same test again with the default buffer size.

@ejtagle I think you got it....
15:27:57.381 : echo:DEBUG:offBuffer Overruns: 3
15:27:57.381 : Framing Errors: 0
15:27:57.381 : Dropped bytes: 9
15:27:57.381 : Max RX Queue Size: 1023

EDIT: RX queus size = 1023 but repetier has 768 as cache...how this is possible?

Completed a run with the default RX_BUFFER_SIZE (and obviously no XON/OFF). No errors reported by Repetier. M111 gives:
17:23:00.744 : echo:DEBUG:offBuffer Overruns: 0
17:23:00.748 : Framing Errors: 0
17:23:00.748 : Dropped bytes: 0
17:23:00.749 : Max RX Queue Size: 126

Interesting that the max queue size is lower in this mode as well as no errors.

So I assume this means that the RX interrupts are off for too long in some situation? Any idea what that situation may be? Is it just a combination of all of the other interrupts happening and making the off period exceptionally long? Or do you think it is just one of the other interrupts is taking an especially long time? As I said before I'm using the main bugfix-2.0.x from a few days ago. I have S curve enabled and also junction deviation. Is there any way that some combination of events could be leaving RX ints turned off after the call has completed?

@gloomyandy : In your case, on the first try, you got 2 overrun errors - Those are counted both as Overrun errors and as Dropped bytes - So, the problem is just that some RX isrs were not serviced in time, and the received character was lost - Going at 250000 bits per second means a 25Khz serial ISR rate... Pretty high...

On your 2nd try, there are not errors at all. No buffer overruns (means all RX interrupts were serviced in time), No framing errors (no noise in the communication), No dropped bytes (Means that the FIFO queue never became full)... As you mention, there is something strange: Your 2nd try, even using an smaller FIFO, is using less entries ... 126 bytes maximum, and you get no overruns at all..

@GMagician : Your case is even more strange! ... There were 3 overruns (RX isrs not being serviced fast enough, and 9 -3 = 6 dropped bytes just because the FIFO became full and there was no space to store the received bytes...

What is the difference between things ? ... If the RX FIFO size is less than or equal to 256, no disabling of the RX interrupt is done... Maybe that helps a bit with the RX isr ?

I have 2 different theories here: One of them is that disabling the RX interrupt and reenabling it sometimes loses RX interrupts... It should not happen, but the documentation on this is missing.. And Dropped chars could be explained by that - And that would also explain why a critical section solves those issues...

The other is that something is just taking too long to complete with RX interrupts disabled.

@GMagician Can you try setting your RX fifo size to 256 or less and see if it solves the problem ?

We all agree that "ok" protocol, as used by Repetier should prevent complete buffer overruns. And it should be bomb-proof .. The only thing that could fool that protocol is if the serial (windows) driver is either repeating received lines, or, if it delays the answer... and Repetier fails to drain it, so essentially, Repetier could be taking a previous "ok" and associating it to an incorrect line...

RX buffer = 256 1hr print and no issues...I have to wait 1 more to finish but never happened without critical sections

@ejtagle

The other is that something is just taking too long to complete with RX interrupts disabled.

I will try to disable also temperature and isr interrupt (code posted some post above) and test.
Don't know how many other interrupts are enabled but this test should check if your 2nd theory is right.

At the moment it really seems that disabling RX interrupt sometime miss pending interrupts (answer given by 256 buffer size)

Believe it or not, there IS a way to perform atomic reads of 16 and/or 32bit values without disabling ISRs. The implementation consists on reading until 2 consecutive reads return the same value... I'll do the implementation for the case where RX_BUFFER > 256, just for completness sake and XON/XOFF handling... ;)

Yep I was thinking this might be a way forward. The code is not pretty (and you need to be careful that the compiler does not optimize it in some unexpected way!). I'm currently testing disabling the RX int with a buffer size of 256 just to see if I still get errors. This will hopefully confirm that it is not simply a case of the 256 byte buffer code being more efficient.

The 256 byte buffer is marginally mor efficient, but has the added benefit that under AVR, setting an 8bit memory variable, or reading an 8bit variable is an atomic operation. That is why there is no need to disable the RX isr when reading or writing an 8bit index,
The 'volatile' keyword and some clever usage of memory barriers can solve and ensure the "atomic" read of 16bit variables by reading more than once until you get the same value does not get optimized out...

@ejtagle this
rx_buffer.tail = t;
will be a problem and can't be solved "atomically" since is done outside interrupt and interrupt can occurs in between 8 bits update.
Only read (outside isr) can be done twice

Yes I understand that (I've worked a lot with the PIC range of devices in the past), I was just curious as to what exactly was going on here and wanted to see if disabling the RX ints on the working 256byte buffer case will make it fail as well. If it does then that hopefully means that using a multiple read for the 16 bit case will be a good fix. I assume that you will still need to disable ints when updating the tail pointer?

There is a way also to solve that problem ;) - Two copies of the index, update one, set a flag so the ISR uses that copy, update the other, clear the flag... 馃憤

Final confirm.. 2h:15 print without any issue with RX=256

I am removing all the RX isr disabling code from the Marlin Custom Serial ... With the explained tricks, there is no need for them and that will solve the RX_BUFFER > 256 case also

@GMagician : But as you have already seen, a 256 byte RX buffer is more than enough for Repetier.

Nevertheless, let麓s fix the general case, just in case someone in the future implements XON/XOFF (that certainly helps transferring data to the SD card... ) .. ;)

256 is enough, I just needed to set cache to 192, I was using 1024 to let me enable XON/XOFF. Now that I know I can't I can live with 256.

Just as a note, I used test with disable stepper, temperature and RX interrupt at once with code:

#define DISABLE_TEST bool tisr = TEMPERATURE_ISR_ENABLED(); \
                       DISABLE_TEMPERATURE_INTERRUPT(); \
                       bool sisr = STEPPER_ISR_ENABLED(); \
                       DISABLE_STEPPER_DRIVER_INTERRUPT(); \
                       bool isr_enabled = TEST(M_UCSRxB, M_RXCIEx); \
                       CBI(M_UCSRxB, M_RXCIEx)
  #define ENABLE_TEST  if (tisr) ENABLE_TEMPERATURE_INTERRUPT(); \
                       if (sisr) ENABLE_STEPPER_DRIVER_INTERRUPT(); \
                       if (isr_enabled) SBI(M_UCSRxB, M_RXCIEx)

and errors are still present. So the issue is missed RX interrupt when interrupt is disabled.

I already have a fix for the case where RX_BUFFER > 256 ... ;) -- But i do agree with you: It is pointless to use more than 256 bytes as a buffer, except if transferring contents to the SD card with XON/XOFF enabled ..

This is the latest version of the serial class, with all the fixes, even for the case with RX_BUFFER>256
MarlinSerial-Latest.zip

Sounds good. I've just finished my test and can confirm that disabling RX interrupts causes data overruns even when using buffers <=256 bytes.

I wonder what exactly is happening perhaps the other interrupts can occasionally all happen one after another during the period the RX int is turned off and this total period is long enough such that we get the overrun? But I would have thought that the test above by @GMagician would have stopped this from happening? Thought actually it may have been better to enable the RX interrupt first rather then last? With it being last that code would ensure that all of the other interrupts will run before the RX one gets a chance (assuming they are all triggered of course). But it probably doesn't matter now!

I will test the new version of the code tomorrow and report back. Thanks @GMagician and @ejtagle for all of your good work on this!

@gloomyandy : To be honest, I didn麓t expect this behaviour of the atmega. RX interrupts, according to the manual, "seem" to be level sensitive, so as soon as reenabling the RX isr, its interrupt should be attended. Both the stepper and the temperature ISRs run with interrupts enabled, just to prevent missing an RX interrupt.
But all the experiments made so far lead to the idea that RX isrs are edge triggered, so, if the RX interrupt is masked, then reenabling it is not enough...
I am cleaning the code a little to prepare a PR for it. Of course, all testing to this new code (its the same that is attached, just comments are a little cleaner) is welcome.
I just want to get the comm issues sorted out. This MUST work flawlessly under all circumpstances!

Yes the documentation I have read does seem to indicate that the RX ints are level triggered, but as you say it seems that this may not be he case (or some combination of events means that it is not always so).

Hey @thinkyhead ... This is not a potential bug. It is confirmed!! 馃憤 -- If RX buffer is set to >256, there are characters lost and buffer overruns. Even if XON/XOFF is disabled...

Ok, I'm testing this latest 2.0 bugfix (RX buffer=1024) but I'm almost sure it will solve the issue.
Time 2 hour and I'll have the answer

Also testing the latest bugfix-2.0.x build. So far I've tested the default configuration and that worked fine (as expected). Now running a test with the buffer at 1024 and no XON/OFF.

It seems it is really working.. Thank you very much @ejtagle for all the efforts you put into this.
Also thanks to @gloomyandy for all testing/reports done.

I'll close this after @gloomyandy confirmation

Test completed, no errors, no data overruns. Works fine for me with the default and 1024 buffer sizes! Thanks again @ejtagle and @GMagician .

Hurrah!

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Was this page helpful?
0 / 5 - 0 ratings