Marlin: [BUG + FIX] USE_USB_COMPOSITE on STM32F103RC while using serial causes firmware to freeze

Created on 9 Jan 2020  ·  40 Comments  ·  Source: MarlinFirmware/Marlin

Bug Description

Using SKR 1.2 Mini E3 (STM32F103RC) on Ender 3 Pro.
When using STM32F103RC_bigtree_512K, everything is OK.
When using STM32F103RC_bigtree_512K_USB, the firmware can freeze and cause overheating.
Using the bugfix-2.0.x branch.

config files attached.

Steps to Reproduce

  1. Flash Marlin bugfix-2.0.x using environment STM32F103RC_bigtree_512K_USB
  2. Connect USB (tried with and without 5v supply from USB host)
  3. Connect to Serial Port (I used Pronterface)
  4. Issue a command (I used M303 S240 C10 to start PID autotune)
  5. Serial will perform normally for a time (60sec perhaps)
  6. LCD will freeze, Serial traffic will stop. Hotend continues to heat.
  7. If you reconnect to the serial port the LCD display will start working and serial will respond again. It will freeze up again around 60 sec later. You can repeat this process to keep the firmware running

I have overtemp protection on which is still functional and causes alarm on the printer.
Switching to non USE_USB_COMPOSITE environment appears to be OK

Loafdude Config.zip

STM32 Confirmed ! Serial Comms Fix Included

All 40 comments

can confirm, serial disconnects while pidtuning, but the pidtuning completes.

but other commands work fine.

Tried disabling emulated EEPROM (EEPROM.DAT) but still get freezing when running M303 S240 C10.

Here @Loafdude, I included those PIDs in the compiled firmware. It's the latest bugfix. If it still freezes on your setup then I guess it might be some random hardware issue.
Marlin_conf_plus_bin.zip

For me PID tuning seems to work alright, also in Octoprint. I tried 240C and 200C with 10 iterations. Is the issue happening to you every time you try PID tuning?

Fails with your firmware.bin as well.
Yes it fails consistently every time.

I think I have narrowed it down though!
I believe it is a conflict between the storage and serial devices on the STM32F103RC.
If I disable the storage device in device manager in windows and only use serial it works!

I suspect you are not seeing this error because you are using Octoprint.
That could be for a few reasons:

  • Octoprint is not mounting the fat32 volume
  • Windows is polling the filesystem and Linux is not, and therefore it does not cause a crash
  • Linux USB stack is not triggering the error

I bet if you used a windows machine and did the same command it would freeze.

I have also tested it on Windows using Repetier and Printrun (Pronterface), and it works with both of them. From what I found about similar issues, the recommended solutions were mostly to try different baud rates and different USB cables. It seems like quite a wicked problem, as it's difficult to tell whether it's Marlin, hardware, drivers or a combination of them all.

I'll try a linux machine too. I've tried two windows machines with different platforms and different cables so it's def not hub or host USB controller related. I really don't think it is a host / host hardware issue.

The SD card becomes unresponsive when serial is in use too. Perhaps it is a multitasking issue on the STM32. It runs serial for a while, then goes to run mass storage, and the lack of serial CPU time causes Marlin to freeze because serial is part of its program loop. Just speculating though.

Have you tried PID tuning without an SD card, or with NO_SD_HOST_DRIVE enabled? That does kind of negate the USB composite, but it might help narrow down the issue.

USE_USB_COMPOSITE is not Marlin related, but an STM32 (experimental) framework feature...

@tpruvot so this is not marlin related after all?

It is difficult for me to provide a bug report upstream to the framework as I am not familiar enough with Marlin codebase to provide accurate information. I'm happy to work with someone to test and provide the needed feedback in a timely fashion.
Regardless I can still reproduce freezing consistently and it has been confirmed by @reloxx13 but does not affect all SKR 1.2 E3 Mini users apparently.
This does have the potential for thermal runaway if protections are not turned on. (My unit gets heating error while frozen).
At this point I cannot reliably use serial interface while USE_USB_COMPOSITE is in place.

Can confirm this issue for SKR E3 v1.2 with the latest Marlin 2.0.1. During PID tuning USB version freezes but the machine keeps(!) heating resulting in runaway error on next boot if I delay with reset! This is how it works under Win10 in Cura & Simplify3D.

However in linux environment things may run different. Only serial connection freezes, but PID tuning finishes ok. It seems the USB device gets reset every ~30 seconds....
```
...
Jan 17 18:31:13 octopi kernel: usb 1-1.2: reset full-speed USB device number 12 using xhci_hcd
Jan 17 18:31:14 octopi kernel: cdc_acm 1-1.2:1.1: ttyACM0: USB ACM device
Jan 17 18:31:44 octopi kernel: usb 1-1.2: reset full-speed USB device number 12 using xhci_hcd
Jan 17 18:31:44 octopi kernel: cdc_acm 1-1.2:1.1: ttyACM0: USB ACM device
Jan 17 18:32:15 octopi kernel: usb 1-1.2: reset full-speed USB device number 12 using xhci_hcd
Jan 17 18:32:15 octopi kernel: cdc_acm 1-1.2:1.1: ttyACM0: USB ACM device
Jan 17 18:32:46 octopi kernel: usb 1-1.2: reset full-speed USB device number 12 using xhci_hcd
Jan 17 18:32:46 octopi kernel: cdc_acm 1-1.2:1.1: ttyACM0: USB ACM device
...
```

A good start would be to find a way to toggle it by software, but without resetting the whole composite stack (which cuts the serial connection too)... maybe a simple "ignore USB SD calls" boolean could help... I believe Win10 is doing something weird on disk detection... all fast USB keys are slow to become available in Win10...

However in linux environment things may run different. Only serial connection freezes, but PID tuning finishes ok. It seems the USB device gets reset every ~30 seconds....

Actually it doesn't freeze - it just gets reset. I'm not sure where the reset is getting initiated (by the firmware or the Linux OS?).
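
For anyone who wants to check the ~30 second figure, here is a small host-side sketch (not part of Marlin or the kernel) that pulls the time of day out of syslog-style lines like the ones quoted above. The function name `seconds_of_day` is made up for this example, and it assumes all compared lines fall on the same day.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: extract seconds since midnight from a syslog-style line
// such as "Jan 17 18:31:13 octopi kernel: ...". Returns -1 if it does not match.
int seconds_of_day(const std::string& line) {
    char month[4];          // three-letter month plus terminator
    int day, h, m, s;
    if (std::sscanf(line.c_str(), "%3s %d %d:%d:%d", month, &day, &h, &m, &s) != 5)
        return -1;
    return h * 3600 + m * 60 + s;
}
```

Subtracting the values for consecutive "reset full-speed USB device" lines in the log above gives 31 seconds each time.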

Very Interesting, I won't be able to confirm for a couple weeks I'm away on business.
There appears to be some USB debug functionality baked into the kernel. Maybe we can glean some more info there. https://wiki.kubuntu.org/Kernel/Debugging/USB

I'm coming from #16653, having exactly the same issue.

@Loafdude have you tried a pid autotune from Pronterface using BTT stock firmware? It works from there, probably because it is a NOUSB env.

Looked at this a bit more. It appears that Linux must be issuing the USB device reset because the SCSI device is either unresponsive, or producing invalid responses, while the PID autotune is in progress (for some reason). I deleted the SCSI device (echo 1 > /sys/block/sda/device/delete) and PID autotune was able to proceed with no resets (and no serial connection drops).

@Deses No I have not tried the BTT fw. With USB_COMPOSITE disabled serial and PID autotune seems to work fine for me so I would expect the same if BTT does not have USB COMPOSITE enabled

@dseven On windows 10 if you disable the storage device in device manager it also allows PID autotune to proceed without freezing with no serial drops.

It appears that Marlin firmware is not giving up compute time to other services (serial and usb) often enough? It could easily be a platform level issue too.

Interesting. I wonder why it hangs the display/encoder for some and not others.

AFAICT, it's an issue with the storage part of composite only - it seems that the serial part continues to work. I don't know enough about the STM32 platform architecture to guess what might be going on. I wonder how we can find an expert to look at this......

It seems I've found a solution. This needs more testing, but I was able to run PID autotune with no problems on Win10 + Simplify3D while copying a large file onto the SD card simultaneously. AFAIU, the STM32F1 SDK expects USBMassStorage.loop() to be called for idle processing, but Marlin is missing that call. I put it into ui.update(). Maybe there is a better place, please advise & test. Here's a diff:

diff --git a/Marlin/src/lcd/ultralcd.cpp b/Marlin/src/lcd/ultralcd.cpp
index e8176c1..b266351 100644
--- a/Marlin/src/lcd/ultralcd.cpp
+++ b/Marlin/src/lcd/ultralcd.cpp
@@ -827,7 +827,11 @@ void MarlinUI::update() {

   #endif // HAS_LCD_MENU

-  #if ENABLED(INIT_SDCARD_ON_BOOT)
+    #ifdef __STM32F1__
+        MarlinMSC.loop();
+    #endif
+
+    #if ENABLED(INIT_SDCARD_ON_BOOT)
     //
     // SPI SD Card detection (and first card init when the LCD is present)
     //

OK, I've done some more digging into the problem. Loafdude pointed me to the HAL_idletask() function, whose only purpose for now is to process the USB mass storage device class loop (see src/HAL/HAL_STM32F1/HAL.cpp). It turns out that Marlin itself executes HAL_idletask() in its own idle() routine in src/Marlin.cpp.

So the problem as I see it now is in the M303 loop itself. The function void Temperature::PID_autotune() in src/module/temperature.cpp has its own while (wait_for_heatup) loop that blocks everything else from executing. It calls ui.update() at the end of the loop, but never idle(), so HAL_idletask() is not reached. In the STM32F1 case this breaks the USB card reader.

The proper solution would be something like replacing ui.update() with idle() in temperature.cpp. But that breaks PID_autotune() completely, since the heater is constantly being switched off by the thermalManager.manage_heater() call in the idle() routine (PID_autotune() does not affect the global temperature state and manages the heater internally).

So the main problem is the internal blocking loop in void Temperature::PID_autotune(), which does not call HAL_idletask().

A kludgy solution would be adding HAL_idletask() directly to PID_autotune(), as in the example below. But IMHO it should be done in a more general way, since there could be other services that get broken by the autotune internal loop.

The current solution:

diff --git a/Marlin/src/module/temperature.cpp b/Marlin/src/module/temperature.cpp
index daaa008..008998f 100644
--- a/Marlin/src/module/temperature.cpp
+++ b/Marlin/src/module/temperature.cpp
@@ -605,7 +605,12 @@ volatile bool Temperature::temp_meas_ready = false;

         goto EXIT_M303;
       }
+
       ui.update();
+
+      #ifdef HAL_IDLETASK
+        HAL_idletask();
+      #endif
     }

     disable_all_heaters();
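
To make the effect of that extra call concrete, here is a minimal host-side sketch. It is not Marlin code: `FakeMsc` and `blocking_loop` are hypothetical stand-ins for the mass storage object and for a loop shaped like the one in PID_autotune(). It only shows that a background service gets serviced if, and only if, the blocking loop calls an idle hook on every pass.

```cpp
#include <functional>

// Hypothetical stand-in for the USB mass storage object (not Marlin's MarlinMSC).
struct FakeMsc {
    int serviced = 0;            // number of times loop() has run
    void loop() { ++serviced; }
};

// Hypothetical blocking loop: spins for `iterations` passes and calls the
// idle hook once per pass, but only if a hook was supplied.
void blocking_loop(int iterations, const std::function<void()>& idle_hook) {
    for (int i = 0; i < iterations; ++i) {
        // ... temperature sampling and heater control would happen here ...
        if (idle_hook) idle_hook();   // corresponds to the added HAL_idletask() call
    }
}
```

Calling `blocking_loop(100, nullptr)` leaves `serviced` at 0, while passing a hook that calls `msc.loop()` services it on every pass.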

Thanks for digging into this, @Phisik (and @Loafdude)! If I'm reading this right (too tired to look at the code tonight), it sounds like there needs to be a way to suspend the turning off of the heater while PID autotune is in progress - some sort of "leave it alone" flag, that would get turned off again when autotune has finished... then call the normal idle routine(s)....

Well, here is a solution with is_pid_autotune_running flag pid_patch.zip. But I'm not really fond of introducing extra flags, since it's easy to forget to clear them one day.

On the other hand we can add an extra parameter to idle() routine, e.g. idle(bool skipThermalManagement = false), and call it from PID_autotune() like idle(true).

Don't know what will be less painful for future updates.
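
A rough sketch of the extra-parameter idea, under the assumption that idle() does two separable things: thermal management and HAL housekeeping. The `Counters` struct is hypothetical and only stands in for those subsystems; the real idle() does much more.

```cpp
// Hypothetical counters standing in for the real subsystems.
struct Counters {
    int heater_calls = 0;   // stands in for thermalManager.manage_heater()
    int hal_calls = 0;      // stands in for HAL_idletask()
};

// Sketch of the proposed signature. The default value keeps existing callers
// unchanged; only PID_autotune() would pass true.
void idle(Counters& c, bool skipThermalManagement = false) {
    if (!skipThermalManagement) ++c.heater_calls;
    ++c.hal_calls;          // USB and other HAL work is always serviced
}
```

The appeal of this variant is that there is no state to clear afterwards: the behaviour is chosen per call rather than stored in a flag.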

So I found this when debugging timeouts in the serial communication with Octoprint. The symptoms look like the following when doing an active print:

Send: N14955 G1 F3600 X114.083 Y115.768 E230.61067*9
Recv: ok N14955 P0 B63
Send: N14956 G1 F1500 E226.61067*14
Recv:  T:210.16 /210.00 B:49.59 /50.00 @:68 B@:29
Recv: ok N14956 P0 B62
Send: N14957 G1 F300 Z7.1*33
Recv:  T:210.04 /210.00 B:49.87 /50.00 @:71 B@:16
Recv: ok N14957 P30 B63
Send: N14958 G0 F5400 X117.796 Y110.997 Z7.1*20
Recv: ok N14958 P29 B63
Send: N14959 G1 F300 Z6.8*39
Recv: ok N14959 P28 B63

The freeze is at or before the T: autoreport line. You can see the planner buffer start to empty - get to 30 free slots, then things start working again and the planner buffer starts filling again.
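
If you want to watch for this in a saved serial log, a small parser for the acknowledgement lines can help. This is a sketch, not OctoPrint or Marlin code: `OkStatus` and `parse_ok` are made-up names, P is taken to be the number of free planner slots as read above, and treating B as a free command buffer count is an assumption.

```cpp
#include <cstdio>
#include <string>

// Fields of an acknowledgement such as "ok N14957 P30 B63".
struct OkStatus {
    long line;         // N: acknowledged line number
    int planner_free;  // P: free planner slots, as interpreted in the comment above
    int buffer_free;   // B: assumed to be free command buffer slots
};

// Hypothetical helper: fills `out` and returns true if `s` contains an
// acknowledgement line; returns false for anything else (e.g. temperature reports).
bool parse_ok(const std::string& s, OkStatus& out) {
    const std::string::size_type pos = s.find("ok N");
    if (pos == std::string::npos) return false;
    return std::sscanf(s.c_str() + pos, "ok N%ld P%d B%d",
                       &out.line, &out.planner_free, &out.buffer_free) == 3;
}
```

Tracking `planner_free` across consecutive acknowledgements makes the jump from 0 to 30 in the log above easy to spot.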

This seems to happen on multiple occasions.

I'm wondering, for the sake of debugging: I'm currently building against the STM32F103RC_bigtree_512K_USB target. Would switching to STM32F103RC_bigtree_512K give any indication as to whether it's related to this?

Just to add another data point, I don't have serial timeout issues when compiling against STM32F103RC_bigtree_512K - so saying this only affects a PID Autotune is probably wrong.

I tested with a ~25 minute print using STM32F103RC_bigtree_512K and not a single freeze. The same print using STM32F103RC_bigtree_512K_USB gave the same problem over and over and over again.

As such, this needs to be a wider fix than just isolated to the PID Autotune routines...

EDIT: As another data point, this is using the latest OctoPi image on a Pi 4B 4GB... so it's based on Buster...

So I think I have unravelled the problem here.. (and for octoprint bug report #16036 )

The platform uses a composite USB object.
This object takes plugins which in our case are USBCompositeSerial and USBMassStorage (see msc_sd.cpp)
It appears that USBCompositeSerial will break if USBMassStorage does not have its loop() called on a regular basis.

Marlin currently calls USBMassStorage.loop() during idle() via HAL_idletask().
Marlin also does not appear to guarantee that idle() will be called on a regular basis.
PID Autotune is one example of it not being called, but it appears others are running into the issue when printing from Octoprint as well.

The solution in my eyes is one of the following.

  • Guarantee idle() will be called on a regular basis (but there are implications with PID autotune and temperature management, perhaps a temperature_management_enabled flag is required)
  • Change/Add HAL_maintenance() and have it called on a regular interval somewhere other than idle()
  • Change/Add HAL_serial_maintenance() and have it called just prior to the serial buffer being read.
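
As a rough illustration of the second option, here is a sketch of an interval-based maintenance hook. The class name `Maintenance` and its `poll` method are hypothetical, and the caller is assumed to supply a millis()-style timestamp. It does not solve the question of where to call it from; it only shows that the hook can be polled from many places without running the task more often than intended.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper: runs `task` whenever at least `interval_ms` has elapsed
// since the last run. The caller supplies the current time in milliseconds.
class Maintenance {
public:
    Maintenance(std::uint32_t interval_ms, std::function<void()> task)
        : interval_ms_(interval_ms), task_(std::move(task)) {}

    // Safe to call from any loop; returns true if the task ran on this call.
    bool poll(std::uint32_t now_ms) {
        // Unsigned subtraction stays correct when the counter wraps around.
        if (static_cast<std::uint32_t>(now_ms - last_ms_) < interval_ms_) return false;
        last_ms_ = now_ms;
        task_();
        return true;
    }

private:
    std::uint32_t interval_ms_;
    std::uint32_t last_ms_ = 0;
    std::function<void()> task_;
};
```

In this sketch the task would wrap something like the mass storage loop() call, and `poll` would be added to blocking loops such as the one in PID autotune.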

I have yet to code anything up to test.

I hit this (or something very very similar) when not tuning, by using wildly improper junction deviation settings. https://github.com/foosel/OctoPrint/issues/2647#issuecomment-575386271 has details, but it manifested as task hang reports and USB resets.

It did cause runaway heating. I measured 120C on the bed (set to 60C) with non-contact after unplugging the printer. (I did not think to check the extruder.)

I suspect this is another case where idle() is not being called regularly.
This is a core architecture issue and is difficult for us to patch without guidance from a core dev.
Where should USBMassStorage.loop() be called that it guarantees it is run on a regular basis?

Any further progress on this? I realize a core arch issue like this can take a long time to get resolved, but might it be possible to put in something quick and janky to get folks unblocked?

@Loafdude

  1. Serial will perform normally for a time (60sec perhaps)
  2. LCD will freeze, Serial traffic will stop. Hotend continues to heat.

Thank you! This is exactly where the problem is.

If you check the write function
https://github.com/rogerclarkmelbourne/Arduino_STM32/blob/b5cd37696b7057f8489c0a17801420756b44a4e7/STM32F1/libraries/USBComposite/USBCompositeSerial.cpp#L74

you will find that
this function checks the USB connection with !this->isConnected(), but this is always true for composite.

Second, while (txed < len) { makes this function blocking, in contrast to
https://github.com/rogerclarkmelbourne/Arduino_STM32/blob/3db3bccf006aeb8123a05194094ff125645ad959/STM32F1/cores/maple/usb_serial.cpp#L130

With these changes it works fine:

--- /tmp/USBCompositeSerial.cpp 2020-05-30 21:55:15.317833776 +0300
+++ ./STM32F1/libraries/USBComposite/USBCompositeSerial.cpp     2020-05-30 21:55:19.693820340 +0300
@@ -80,11 +80,11 @@
     }

     uint32 txed = 0;
-    while (txed < len) {
+    //~ while (txed < len) {
         txed += composite_cdcacm_tx((const uint8*)buf + txed, len - txed);
-    }
+    //~ }

-       return n;
+       return txed;
 }

 int USBCompositeSerial::available(void) {

and

--- /tmp/usb_composite_serial.c 2020-05-30 21:58:03.493265125 +0300
+++ ./STM32F1/libraries/USBComposite/usb_composite_serial.c     2020-05-30 21:58:25.397184052 +0300
@@ -327,7 +327,7 @@
        }
        vcom_tx_head = head; // store volatile variable

-       while(usbGenericTransmitting >= 0);
+       //~ while(usbGenericTransmitting >= 0);

        if (usbGenericTransmitting<0) {
                vcomDataTxCb(); // initiate data transmission
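
The difference between the blocking and non-blocking write can be sketched on the host with a fake endpoint. Everything here is hypothetical (`FakeEndpoint`, `write_once`, `write_bounded`) and does not call the real composite_cdcacm_tx(). The bounded-retry variant is not part of the patch above; it is included only as one possible middle ground between looping forever and trying once.

```cpp
#include <cstddef>

// Hypothetical endpoint: accepts at most `room` bytes in total and then
// returns 0, which models a host that has stopped reading.
struct FakeEndpoint {
    std::size_t room;
    std::size_t tx(const unsigned char*, std::size_t len) {
        const std::size_t n = len < room ? len : room;
        room -= n;
        return n;
    }
};

// Non-blocking write, like the patched function: one attempt, and the return
// value tells the caller how many bytes were actually queued.
std::size_t write_once(FakeEndpoint& ep, const unsigned char* buf, std::size_t len) {
    return ep.tx(buf, len);
}

// Bounded retry: same loop as the original, but it gives up after `max_tries`
// attempts so a stalled host cannot hang the caller.
std::size_t write_bounded(FakeEndpoint& ep, const unsigned char* buf,
                          std::size_t len, int max_tries) {
    std::size_t txed = 0;
    while (txed < len && max_tries-- > 0)
        txed += ep.tx(buf + txed, len - txed);
    return txed;
}
```

With an endpoint that never accepts data, the original unbounded `while (txed < len)` would spin forever, whereas both functions here return to the caller.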

@linvinus Sorry but this did not fix the issue on my end. Same symptoms.
After issuing M303 S240 C10 via pronterface
Temps raise
Serial reporting stops
LCD freezes
Temp Watchdog triggers. Heater failure.

I edited the following files with your diff
C:\Users\User\.platformio\packages\framework-arduinoststm32-maple\STM32F1\libraries\USBComposite\USBCompositeSerial.cpp and usb_composite_serial.c
This was built against today's branch of bugfix-2.0.x

@Loafdude thank you for testing! It looks like my bug is a different issue, related only to USBComposite serial.

Could you check the following scenario with unmodified Marlin 2:
1) Connect USB serial and run M155 S2 in a terminal (report temperature every 2 seconds)
2) Close the terminal application (don't disconnect USB)
3) Wait for 30 seconds
4) Check the LCD menu. Is it working? If not, run the terminal application again and check the LCD menu. Is it working now?

As for PID_autotune, you are probably right. I believe Marlin should implement the main loop with a tasks paradigm, so that while the loop in PID_autotune is running it disables the manage_heater task (and some others if needed) but still calls the main loop idle() to run the other necessary tasks. I believe simple task state flags in a uint32 would be enough.

Without heavy modification, this problem could also be resolved inside thermalManager: add a bool flag, PID_autotune_running, set it to true at the beginning of PID_autotune(), and to false at the return points of that function.

Instead of ui.update(); at the end of the PID loop in the PID_autotune() function, call idle().

Then, in the manage_heater() function, after the line if (!raw_temps_ready) return;
add a new line if (PID_autotune_running) return; // simple return from manage_heater function
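
A minimal sketch of that flag approach, written as standalone code rather than against the real Temperature class. The flag name follows the comment above; `AutotuneGuard`, `heater_updates` and `autotune` are hypothetical. The scope guard is one way to address the earlier worry about forgetting to clear the flag, since it is cleared on every return path automatically.

```cpp
// Mirrors the proposed flag; here it is a plain global for illustration.
static bool PID_autotune_running = false;
static int heater_updates = 0;   // counts how often normal heater management ran

// Scope guard: sets the flag on entry and clears it on every exit path.
struct AutotuneGuard {
    AutotuneGuard()  { PID_autotune_running = true; }
    ~AutotuneGuard() { PID_autotune_running = false; }
};

// Stand-in for manage_heater() with the proposed early return.
void manage_heater() {
    if (PID_autotune_running) return;   // leave the heater to the autotune loop
    ++heater_updates;
}

// Stand-in for PID_autotune(): in the proposal, idle() would be called in the
// loop and would reach manage_heater(), which now returns immediately.
void autotune(int passes) {
    AutotuneGuard guard;
    for (int i = 0; i < passes; ++i) manage_heater();
}
```

Because the guard's destructor runs on any return from `autotune`, the individual return points do not each need to reset the flag.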

I have not had time to test your exact scenario but previously I tested the same scenario with PID autotune (M303 S240 C10).
If the serial port is not connected (but USB is still connected) it works correctly and LCD does not freeze. Once serial is connected it will freeze after a short time. If I disconnect and reconnect serial it starts working again.

@Loafdude
Please test the bugfix-2.0.x branch to see where it stands.

I'm having a similar problem with the new Bigtree tech E3 mini 1.3 v2, though only when I try to do a PID autotune. I updated to the bugfix branch as I was seeing random freezes.

Downloaded and compiled bugfix branch last night 23/06/2020 and still have the issue.

If I compile using STM32F103RC_btt_512K, the PID autotune works as expected.
If I compile using STM32F103RC_btt_512K_USB, the printer freezes about 40-60 seconds into the PID autotune.

Just noticed though that with the USB option the com port is one up from the non USB option i.e. COM9 for non USB, and COM10 for USB.

To see if this is in any way related to idle not getting called properly, please disable the USE_WATCHDOG option and do the same things that led to a disconnection in the past. The watchdog will trigger a shutdown any time there's a delay longer than 4 seconds between calls to the watchdog reset.
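
For readers unfamiliar with how a watchdog behaves, here is a tiny simulation of the rule described above. `FakeWatchdog` is hypothetical and works on caller-supplied millisecond timestamps; the 4000 ms value comes from the 4 second figure in this comment, and the real hardware watchdog is configured elsewhere.

```cpp
#include <cstdint>

// Hypothetical software model of a watchdog timer.
struct FakeWatchdog {
    std::uint32_t timeout_ms;      // maximum allowed gap between resets
    std::uint32_t last_reset_ms;   // time of the most recent reset

    // Called regularly by healthy code ("feeding" the watchdog).
    void reset(std::uint32_t now_ms) { last_reset_ms = now_ms; }

    // True once more than timeout_ms has passed since the last reset.
    bool expired(std::uint32_t now_ms) const {
        return static_cast<std::uint32_t>(now_ms - last_reset_ms) > timeout_ms;
    }
};
```

A loop that stops resetting the watchdog for longer than the timeout would trip it, which is why disabling USE_WATCHDOG helps separate a watchdog shutdown from a genuine hang.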

If it is not specifically watchdog-related then we should look deeper. Of course, as always, make sure to download and use the current bugfix-2.0.x code.

I'm still observing this problem on the latest bugfix-2.0.x, on the SKR Mini E3 v1.2. I compiled for the STM32F103RC_btt_512K_USB env, and when running PID auto tuning over the USB-serial port the board freezes, and the behavior exactly matches what's described in the thread above.

This issue needs to stay alive until it is fixed. Commenting to prevent it from going stale.

It looks like some good research has happened into this? Is anyone involved still actively pursuing fixing it?

I think it's fixed on bugfix now. PR #19671

I can confirm that it's fixed in that PR/latest bugfix-2.0.x
