Marlin: FPU usage?

Created on 2 Nov 2017 · 29 Comments · Source: MarlinFirmware/Marlin

Hi all!

Does Marlin benefit from hw FPU on SOC?

Build / Toolchain Question

alexxy

All 29 comments

An FPU is normally part of the CPU, rather than the SoC. Do you have a CPU in mind?

bobc on 2 Nov 2017

@bobc there are a lot of SoCs that have FPUs; you can get every Cortex M4 and up with or without an FPU...

Spawn32 on 2 Nov 2017

I'm well aware there are Cortex M4 with an FPU, but Cortex M4 is not a SOC. Generally they are referred to as MCU.

Anyway, it's quite irrelevant how the chip is packaged as far as Marlin is concerned, so let's assume the question is "Does Marlin benefit from a HW FPU?"

I think the ESP32 and Teensy 3.5/3.6 have a hardware FPU, but AFAIK the performance-critical code in Marlin avoids float, so the answer is basically "not very much".

bobc on 2 Nov 2017

👍2

It's probably just a confusion of terms. In my head, an MCU with an SD card interface, Ethernet interface, display port, etc. is an SoC :) but I think we mean the same thing :)

Spawn32 on 2 Nov 2017

There are 2 points:

1) Marlin has a lot of FP calculations, more than I think necessary. So yes, an FPU will speed them up.

2) Do you need faster processing? I mean, a modern 32-bit CPU is more than enough for Marlin. Don't pay more for an FPU.

Anyway, you can enable the Cortex M4 FPU just by passing a compilation flag to the compiler.

Bear in mind that using the FPU on the M4 will increase stack usage a lot (if I remember correctly, about 16 bytes per call).

Cheers.

Alex.

alexborro on 2 Nov 2017
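
The comment above does not name the compilation flag. As an assumption about a GCC-based ARM toolchain, a Cortex-M4F build usually passes `-mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard` (or `softfp`). A minimal sketch of how firmware could check at compile time whether those flags took effect, using the `__ARM_FP` macro from the ARM C Language Extensions; the enum and function names are hypothetical and not part of Marlin:

```cpp
#include <cassert>

// Hypothetical names for illustration; only __ARM_FP is a real predefined macro.
enum class FpuSupport { NotReported, SingleOnly, SingleAndDouble };

// __ARM_FP is a bitmask: bit 2 (0x04) = single precision in hardware,
// bit 3 (0x08) = double precision in hardware.
constexpr FpuSupport detect_fpu() {
#if defined(__ARM_FP)
  return (__ARM_FP & 0x08) ? FpuSupport::SingleAndDouble
       : (__ARM_FP & 0x04) ? FpuSupport::SingleOnly
                           : FpuSupport::NotReported;
#else
  // Not an ARM target, or the build uses software floating point.
  return FpuSupport::NotReported;
#endif
}

const char* fpu_description(FpuSupport s) {
  switch (s) {
    case FpuSupport::NotReported:     return "no hardware FP reported by the compiler";
    case FpuSupport::SingleOnly:      return "hardware float, software double";
    case FpuSupport::SingleAndDouble: return "hardware float and double";
  }
  assert(false && "unhandled FpuSupport value");
  return "unknown";
}
```

On a desktop build this reports that no ARM hardware FP was detected, which is expected; the check is only meaningful when cross-compiling for the MCU.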

By the way, SoC means System on a Chip. "On a Chip", not "on a Board" :)

alexborro on 2 Nov 2017

Leaving aside the SoC discussion, would Marlin benefit from an FPU? Likely not. Float calculations would be faster, but Marlin currently spends enough time idle on 32-bit processors.
Can the FPU be used? Sure, why not. I have started testing on an STM32F4 MCU with the FPU enabled.
I don't think I need it; it's just enabled by default in the core I use.
Would it carry a penalty? The MCU will consume more power, but again that is not a concern for Marlin, since any stepper will eat more power than the FPU.

So to answer the original question "Does Marlin benefit from hw FPU on SOC?"
Not currently. Perhaps in the future, if Marlin has to do more things to the point where the CPU is busy enough, but not now. It doesn't hurt either.

victorpv on 3 Nov 2017

👍1

It doesn't hurt either.

If I remember correctly, there are a few downsides to enabling the M4 FPU: stack overhead for the FPU registers, extra overhead on interrupts, and only the float type is supported.

p3p on 3 Nov 2017
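
Because only `float` is handled in hardware, an accidental promotion to `double` quietly falls back to software routines. A small sketch of how that promotion happens in C++ and how the `f` suffix avoids it; `scale_feedrate` is a hypothetical helper, not a Marlin function. With GCC, the `-Wdouble-promotion` warning can help find such spots.

```cpp
#include <cassert>
#include <type_traits>

// An unsuffixed literal is a double, so the whole expression becomes double.
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float * double literal is promoted to double");

// With the f suffix the expression stays in single precision.
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float * float literal stays float");

// Hypothetical helper: scale a feedrate by a percentage without leaving float.
inline float scale_feedrate(float mm_per_s, float percent) {
  assert(percent >= 0.0f);
  return mm_per_s * (percent * 0.01f);  // 0.01f, not 0.01
}
```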

From a totally different perspective... When we started the 32-bit effort, one of the design constraints was if you have to do something sub-optimally, penalize the 32-bit processor, not the 8-bit one. If it runs on the 8-bit one, it won't have a problem running on the 32-bit one.

Roxy-3D on 3 Nov 2017

Note that on many of these processors, you can't use the FPU from interrupt context. So our best optimizations in the Stepper ISR come from pre-calculating in the planner and using integers as much as possible in the stepping logic.

thinkyhead on 3 Nov 2017
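
To illustrate the "integers in the stepping logic" idea, here is a minimal Bresenham-style sketch: the planner supplies step counts up front, and each timer tick only does an integer add and compare. The struct and function names are hypothetical and this is not Marlin's actual stepper code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-axis state for one planned block.
struct AxisCounter {
  int32_t error;        // running Bresenham error term
  uint32_t steps_done;  // steps issued so far
};

// Called once per step event (one timer tick). Returns true when this axis
// should pulse its step pin. Integer arithmetic only, so it is ISR friendly.
inline bool tick_axis(AxisCounter& axis, uint32_t axis_steps, uint32_t step_events) {
  axis.error += static_cast<int32_t>(axis_steps);
  if (axis.error > 0) {
    axis.error -= static_cast<int32_t>(step_events);
    ++axis.steps_done;
    return true;
  }
  return false;
}

// Runs a whole block for one axis and returns how many steps were issued.
// step_events is the step count of the dominant (longest) axis.
uint32_t run_block(uint32_t axis_steps, uint32_t step_events) {
  assert(axis_steps <= step_events);
  AxisCounter axis{ -static_cast<int32_t>(step_events / 2), 0 };
  for (uint32_t i = 0; i < step_events; ++i) {
    tick_axis(axis, axis_steps, step_events);
  }
  return axis.steps_done;
}
```

For example, `run_block(3, 10)` spreads 3 steps of a minor axis across 10 step events of the dominant axis.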

1) Marlin has a lot of FP calc, more than I think necessary. So yes, a FPU will fasten them up.

I expect the fixed-point planner/stepper (coming soon) will help to reduce load significantly. There are a few forks that use this approach. Non-contributing forks…

thinkyhead on 3 Nov 2017
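
For readers unfamiliar with fixed-point math, a generic Q16.16 sketch follows. It only illustrates the technique; the format, type name, and helpers are assumptions and say nothing about how the planned Marlin fixed-point planner is written.

```cpp
#include <cassert>
#include <cstdint>

// Q16.16: upper 16 bits are the integer part, lower 16 bits the fraction.
using fix16 = int32_t;
constexpr fix16 kOne = 1 << 16;

// Convert a small whole number; overflows if |whole| >= 32768.
constexpr fix16 to_fix(int32_t whole) { return whole * kOne; }

// Multiply in 64 bits, then drop the extra 16 fractional bits.
// (Right-shifting a negative value is arithmetic on common compilers.)
inline fix16 fix_mul(fix16 a, fix16 b) {
  return static_cast<fix16>((static_cast<int64_t>(a) * b) >> 16);
}

// Pre-scale the dividend so the quotient keeps 16 fractional bits.
inline fix16 fix_div(fix16 a, fix16 b) {
  assert(b != 0);
  return static_cast<fix16>((static_cast<int64_t>(a) * kOne) / b);
}
```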

I think you are all right about some of the limitations and overhead, but I haven't looked much at RAM usage differences.
I can run some tests on the F4, with and without the FPU, once I have it a bit more advanced, if anyone wants to compare any code.
On stm32duino.com there is a thread with Dhrystone and Whetstone results for different STM32 MCUs. Starting on this page we got the hardware FPU working and showing differences:
http://www.stm32duino.com/viewtopic.php?f=3&t=76&hilit=drystone&start=130#p26641
Obviously the FPU makes quite a bit of difference, but only for single-precision (SP) calculations. Double precision (DP) still works, but entirely in software, so DP performance is exactly the same as without the FPU.
With the FPU enabled, under similar test conditions, Whetstone went from:
C Converted Single Precision Whetstones:14.22 mflops
to:
C Converted Single Precision Whetstones:64.17 mflops
We did a lot more testing: overclocking, optimization flags, splitting code across multiple files to reduce compiler optimizations...
So yes, SP performance is roughly 4.5x with the FPU, but I don't think Marlin really spends that much time in FP calculations.
But like I said, once the F4 HAL is functional to some degree (it may be already, but I haven't tested anything other than compiling), I can run tests.

Does anyone have a test to suggest? I can toggle a pin up and down to measure time with a logic analyzer.
I haven't seen any loop in Marlin doing FP calculations repeatedly, though. It seems to be mostly one operation here and one there, so the time difference may not be large enough to measure.

victorpv on 3 Nov 2017
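
As one possible shape for such a test, here is a host-side sketch that times a tight single-precision multiply-add loop with `std::chrono`. On the MCU the two clock reads would be replaced by the pin toggles mentioned above; the function and struct names are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

struct BenchResult {
  double microseconds;  // elapsed wall time for the loop
  float checksum;       // returned so the compiler cannot drop the loop
};

BenchResult bench_float_mac(uint32_t iterations) {
  assert(iterations > 0);
  volatile float seed = 1.0001f;  // volatile read defeats constant folding
  float acc = 0.0f;
  const auto start = std::chrono::steady_clock::now();
  for (uint32_t i = 0; i < iterations; ++i) {
    acc = acc * 0.5f + seed;  // one float multiply and one float add per pass
  }
  const auto stop = std::chrono::steady_clock::now();
  return { std::chrono::duration<double, std::micro>(stop - start).count(), acc };
}
```

Building the same function once with and once without the FPU flags, and comparing the timings, would show how much a float-heavy loop gains; a single isolated operation is unlikely to show a measurable difference.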

What if higher-order polynomials are used for velocity ramp generation, I mean a jerk-limited velocity ramp or an S-shaped velocity ramp? In that case an FPU might actually be necessary.

alfredanil on 6 Feb 2018

One could also use a relatively small cosine table to generate an S shaped curve, if that was needed, and save on realtime computation, hence electrons and heat.

thinkyhead on 14 Feb 2018
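
A minimal sketch of the cosine-table idea, assuming a 17-entry table of (1 - cos(pi * i / 16)) / 2 scaled to 0..65535 and integer linear interpolation between entries. The table size, scaling, and function names are assumptions chosen for illustration; the table values are rounded.

```cpp
#include <cassert>
#include <cstdint>

// Rounded values of 65535 * (1 - cos(pi * i / 16)) / 2 for i = 0..16.
static const uint16_t kSCurveTable[17] = {
      0,   630,  2494,  5522,  9597, 14563, 20228, 26375, 32768,
  39160, 45307, 50972, 55938, 60013, 63041, 64905, 65535
};

// Map ramp progress t (0..65535) to an eased fraction (0..65535).
// No floating point and no trig calls at run time.
uint16_t s_curve(uint16_t t) {
  const uint32_t idx  = t >> 12;     // table segment, 0..15
  const uint32_t frac = t & 0x0FFF;  // position inside the segment, 0..4095
  const uint32_t a = kSCurveTable[idx];
  const uint32_t b = kSCurveTable[idx + 1];
  return static_cast<uint16_t>(a + (((b - a) * frac) >> 12));
}

// Blend between two velocities (for example in steps/s) along the S-curve.
uint32_t ramp_velocity(uint32_t v_start, uint32_t v_end, uint16_t t) {
  assert(v_end >= v_start);
  const uint64_t span = v_end - v_start;
  return v_start + static_cast<uint32_t>((span * s_curve(t)) >> 16);
}
```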

@thinkyhead
Well, if that is the case, even a simple six-point velocity ramp can do much better than a small cosine table.
[Image: sixpoint_ramping]

alfredanil on 22 Feb 2018

Nice drawing - except A1/D1 is larger than AMAX/DMAX.

AnHardt on 22 Feb 2018

😄1

@AnHardt
That's not my art; it's the ramp generated by the Trinamic ramp generator chips TMC5130 and TMC5072.

alfredanil on 22 Feb 2018

Nice drawing - except A1/D1 is larger than AMAX/DMAX.

Yes. But isn't it possible that the Jerk is higher than the max allowed acceleration? Could A1 and D1 be the Jerk?

Roxy-3D on 22 Feb 2018

What we call Jerk is the sudden jump in speed from 0 to VSTART/VSTOP with (in theory) infinite acceleration. A1/D1 is our max acceleration. AMAX/DMAX is a somewhat lower acceleration that takes into account the lower torque the steppers have at higher speed. VMAX is higher than you could reach without the additional flatter ramp.
It's really just the naming that confused me for a while.

Whether planning is easier/faster with a 5 point trapezoid or a cos() is questionable to me.
Even more frightening is the suggestion to have a 7 point trapezoid for linear advance V3.

AnHardt on 22 Feb 2018
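
The acceleration half of the ramp described above can be sketched as follows: jump to VSTART, accelerate at A1 up to a threshold velocity, then at the lower AMAX until VMAX. The threshold (called V1 here, after the TMC5130 parameter of that name) is not mentioned in the thread, and the struct, units, and per-tick update are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter block; velocities and accelerations are in
// arbitrary integer units per tick.
struct SixPointRamp {
  uint32_t v_start;  // VSTART: speed reached by the initial jump ("jerk")
  uint32_t v1;       // assumed threshold where the ramp switches from A1 to AMAX
  uint32_t v_max;    // VMAX: target cruise speed
  uint32_t a1;       // A1: steep acceleration at low speed
  uint32_t a_max;    // AMAX: flatter acceleration at high speed
};

// Advance the velocity by one tick of the acceleration phase.
uint32_t next_velocity(const SixPointRamp& r, uint32_t v) {
  assert(r.v_start <= r.v1 && r.v1 <= r.v_max);
  if (v < r.v_start) return r.v_start;  // instantaneous jump from standstill
  const uint32_t accel = (v < r.v1) ? r.a1 : r.a_max;
  const uint32_t next = v + accel;
  return next > r.v_max ? r.v_max : next;  // clamp at cruise speed
}
```

The deceleration half would mirror this with DMAX, D1, and VSTOP.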

@AnHardt Can you comment on the artifacts seen in this picture:
https://github.com/MarlinFirmware/Marlin/issues/9529#issuecomment-367496720

Are these artifacts caused by the ramping up and down of the stepper motors?

Roxy-3D on 22 Feb 2018

@Roxy-3D
Sorry. No, I can't comment. Too annoyed by the z-wobble amplifiers. ;-)

AnHardt on 22 Feb 2018

It's possible that changes in speed can lead to artifacts, especially if the extrusion is a little high. Built-up pressure in the nozzle extrudes (oozes) in regular time, but the axis is speeding up and slowing down. Linear Advance is supposed to help with this.

thinkyhead on 23 Feb 2018

👍1

Linear Advance is supposed to help with this.

That's exactly what I'm trying to convey. The extrusion width tends to be thicker at the beginning of the stroke and narrows down towards the end of the stroke. The velocity profile I mentioned earlier might be a solution to this without much complex computation. The extrusion rate should synchronize with each segment of the velocity ramp.

alfredanil on 23 Feb 2018

The extrusion rate should synchronize with each segment of the velocity ramp.

The E stepper movement is directly proportional to the XY steps and does follow the acceleration curve exactly. But under these conditions, only when the melt chamber reaches a certain pressure does E movement lead to a controlled extrusion, and if you stop moving E (and don't retract) there will still be ooze.

Marlin is basically a controlled ooze machine. Retraction at the end of a line plus "extra recover" at start of the next are the crude tools it uses normally. With LIN_ADVANCE Marlin scales the E movement to overcome what you're seeing and give a more regular line. It works very well.

Of course, it also sounds like you might be under-extruding a little.

thinkyhead on 23 Feb 2018
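
The scaling of E movement can be pictured with a simplified pressure model in which the extruder runs ahead of its nominal position by an amount proportional to the current extrusion speed. This is only a sketch of that common model, not Marlin's LIN_ADVANCE implementation; the factor `k` and both function names are hypothetical.

```cpp
#include <cassert>

// Extra filament (mm) pushed ahead to build nozzle pressure at a given
// extrusion speed. k has units of seconds: mm of advance per mm/s.
inline float advance_offset_mm(float k, float e_speed_mm_s) {
  assert(k >= 0.0f);
  return k * e_speed_mm_s;
}

// Commanded E position: the nominal position that follows the XY motion,
// plus the speed-dependent advance. With k = 0 this reduces to the plain
// proportional behaviour described above.
inline float e_position_with_advance(float e_nominal_mm, float k, float e_speed_mm_s) {
  return e_nominal_mm + advance_offset_mm(k, e_speed_mm_s);
}
```

As the axis accelerates the offset grows, and as it decelerates the offset shrinks again, which counteracts the lag between E movement and actual flow.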

The extrusion width tends to be thicker at the beginning of the stroke and narrows down towards the end of the stroke.

Are you sure about that? The "normal" behaviour is the other way around. Under-extrusion at the beginning, over- at the end.

With more speed-dependent acceleration phases and different ramps per axis, this gets complex very quickly. Do you have an idea for an algorithm that handles more than one axis?

AnHardt on 23 Feb 2018

Well, if the viscoelastic behaviour of the mechanical power transmission element (the 'belt drive') is considered, then there is a difference between the motion that the controller generates and the actual motion of the driven member. The longer the belt, the lower its stiffness, and that is what causes lag and harmonics in the actual motion. As the stroke length increases, the dynamic forces on the belt also increase, and this results in a different motion profile than the one the controller feeds to the motor. Marlin generates a ramp that would probably work exactly as expected on a linear motor.

Do you have an idea for an algorithm handling more than one axis?

Not sure exactly. There is a flavor of Marlin for the TRAMS board, which uses an integrated ramp generator and driver. The TMC5130 generates ramps with both trapezoidal and so-called 'six point' profiles, but I'm not sure which one is implemented in the Marlin firmware for TRAMS.

alfredanil on 23 Feb 2018

a difference between the motion that the controller generates and actual motion of the driven member

If you have rubber bands for belts. But generally-speaking with typical and even cheap belts the transferred motion is faithful within a fraction of a mm at all times.

thinkyhead on 23 Feb 2018

😄1