Marlin: [FR] Marlin ARM binary size reduction (with proposed solution)

Created on 17 Sep 2020 · 5 Comments · Source: MarlinFirmware/Marlin

Description

ARM-compiled binaries tend to be fairly large, especially for a full-featured Marlin build.

Marlin 2.0.6.1 for an SKR Mini E3 builds to roughly 261K (with additional features enabled) which in itself doesn't fit into the official 256K of supported flash the STM32F103RC has, let alone the 28K of bootloader at the start, and 4K of EEPROM emulation at the end.

While there is a 512K mode for the STM32F103RC chip, which mitigates this a little, it essentially means we're using an area of the flash that the vendor does not warrant to be functional, even if it seems to be. In practical terms, even if there are problems in that second 256K block of flash, most people won't have issues with Marlin as long as the last 4K and the first few tens of K are fine. Still, it's not a particularly great solution.

So initially I was looking for compiler flags to reduce the binary size, but quickly found out that there were few gains to be made. At the bottom of this article, however, there was a little gem:
https://thborges.github.io/blog/marlin/2019/01/07/reducing-marlin-binary-size.html

Proposed Solution

I've tested this specifically for my SKR Mini E3 V1.2, like so:

echo 'Import("env")'                                                      > buildroot/share/PlatformIO/scripts/nanolib.py
echo 'env.Append(LINKFLAGS=["--specs=nano.specs"])'                      >> buildroot/share/PlatformIO/scripts/nanolib.py
sed -i 's@  buildroot/share/PlatformIO/scripts/STM32F103RC_SKR_MINI.py@&\n  buildroot/share/PlatformIO/scripts/nanolib.py@' platformio.ini

This reduces the resulting binary by several tens of K, so that a feature-rich Marlin build can easily fit in 256K of flash with bootloader and EEPROM emulation accounted for.

And it seems to work just fine: I've completed a few prints already. Broader testing is still needed, though.
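For anyone who would rather not run sed against platformio.ini, the same two steps (create the extra script, then list it right after the existing board script) can be sketched in Python. This is only an illustration: the helper name `add_extra_script` and the sample ini text are made up for the example, and only the two script paths and the script contents come from the commands above.

```python
# Contents of the extra script created by the two echo commands above.
NANOLIB_SCRIPT = (
    'Import("env")\n'
    'env.Append(LINKFLAGS=["--specs=nano.specs"])\n'
)

ANCHOR = "buildroot/share/PlatformIO/scripts/STM32F103RC_SKR_MINI.py"
NEW_SCRIPT = "buildroot/share/PlatformIO/scripts/nanolib.py"


def add_extra_script(ini_text: str, anchor: str, new_script: str) -> str:
    """Return ini_text with new_script listed on the line after anchor.

    Hypothetical helper: it mirrors the sed one-liner, but does nothing
    if new_script is already present, so it is safe to run twice.
    """
    if new_script in ini_text:
        return ini_text
    out = []
    for line in ini_text.splitlines():
        out.append(line)
        if line.strip() == anchor:
            # Reuse the anchor line's indentation for the new entry.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + new_script)
    return "\n".join(out) + "\n"


# Made-up stand-in for the relevant part of platformio.ini.
sample_ini = (
    "[env:demo]\n"
    "extra_scripts =\n"
    "  buildroot/share/PlatformIO/scripts/STM32F103RC_SKR_MINI.py\n"
    "build_flags = -Os\n"
)

patched = add_extra_script(sample_ini, ANCHOR, NEW_SCRIPT)
print(patched)
```

To apply it for real, one would write `NANOLIB_SCRIPT` to the nanolib.py path and pass the actual contents of platformio.ini through `add_extra_script` before saving it back.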

Additional Info

Looking a bit further, it may not be that surprising that it works as seemingly flawlessly as it does, given that newlib-nano is described as a blend of two libraries that Marlin already uses separately, namely newlib and avr-libc:

https://keithp.com/newlib-nano/

Here's a library you can use when developing a system using a 32-bit processor with only a few kB of memory. You don't need an allocator, and you can still have stdio to a console and even other devices. This is a fork of newlib, with the stdio bits replaced with the stdio bits from avr-libc.

I would love to hear your thoughts on newlib-nano, and more specifically on considering it as the new (future) standard libc for ARM boards, as it may benefit boards other than the SKR Mini E3 as well. And if all goes well (fingers crossed), it might have few if any tangible disadvantages.

Build / Toolchain Feature Request

Most helpful comment

Marlin is already very good at sorting out unused code. But sometimes you can squeeze out a few more kB if you remove some (unused) object files from the linker statement. You may also gain some bytes if you tell the linker to remove debugging information. Put this in the linker script:

/* Remove information from the standard libraries */
   /DISCARD/ :
{
    libc.a ( * )
    libm.a ( * )
    libgcc.a ( * )
}

A few bytes could also be gained by removing arguments from function calls and using global variables instead. But this may make the code less readable and more difficult to maintain.

I found that -mthumb and -flto (link-time optimization) need to be passed to both the compile step and the link step.
Optimizing for size (-Os) is best for both code size and RAM usage. Link with gcc rather than calling ld directly; this affects both optimization and command-line syntax. When I tried to optimize code size I ended up with the following set of flags.

CFLAGS
-ffunction-sections -fdata-sections -mthumb -fsingle-precision-constant -fmerge-all-constants --specs=nano.specs --specs=nosys.specs -falign-labels=4 -falign-jumps=4 -falign-functions=4 -mtune=cortex-m3 -fno-non-call-exceptions -ffreestanding -finline-small-functions -findirect-inlining -Os -mcpu=cortex-m3

CXXFLAGS
-std=gnu++17 -fno-rtti -fno-exceptions -fno-use-cxa-atexit -fno-common -fno-threadsafe-statics

LDFLAGS
-mthumb -flto -u_printf_float -Wl,--gc-sections,-Map,Marlin.map,--cref,--check-sections,--unresolved-symbols=report-all,--warn-common,--relax -TLPC1768.ld --specs=nano.specs --specs=nosys.specs -static -Wl,--start-group -lstdc++ -lgcc -lc -lm -Wl,--end-group
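
Since flags such as -mthumb and -flto are easy to add on one side and forget on the other, a small check like the following could catch the mismatch before a build. It is a sketch only: `missing_flags` is a made-up helper, and the two flag strings are shortened examples rather than the full sets listed above.

```python
import shlex

# Flags that, per the comment above, must be given when compiling AND linking.
REQUIRED_IN_BOTH = ("-mthumb", "-flto")


def missing_flags(compile_flags: str, link_flags: str) -> dict:
    """Map each required flag to the stages ("compile", "link") lacking it."""
    stages = {
        "compile": set(shlex.split(compile_flags)),
        "link": set(shlex.split(link_flags)),
    }
    report = {}
    for flag in REQUIRED_IN_BOTH:
        absent = [name for name, flags in stages.items() if flag not in flags]
        if absent:
            report[flag] = absent
    return report


# Shortened, hypothetical flag sets for illustration.
example_cflags = "-Os -mthumb -mcpu=cortex-m3 -ffunction-sections -fdata-sections"
example_ldflags = "-mthumb -flto -Wl,--gc-sections --specs=nano.specs"

print(missing_flags(example_cflags, example_ldflags))
```

An empty dictionary means both stages carry every required flag; otherwise the result names the stage that needs fixing.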

IMPORTANT

  • Avoid -funwind-tables and -mpoke-function-name; those create more debugging symbols.
  • Using -fshort-enums or -funsigned-bitfields, or aligning to something smaller than 32 bits, will _not_ help reduce the code size.

  • Do not use -mint8 or -fsigned-char; those can render the binary unusable. A full review of all used _int_ types is needed to avoid overflows or comparison mismatches.

  • Be careful not to optimize away the constructors of statically instantiated classes when you play with the linker script. GCC strictly removes anything that is not referenced. The result will be smaller, but nothing will work anymore. I had a hard time figuring out what went wrong.

I was using gcc version 9.3.1 and building for BigTreeTech SKR v1.4 Turbo with LPC1769.

All 5 comments

Thanks for adding that...

To be more specific regarding newlib vs. newlib-nano: with my custom but otherwise identical Marlin configuration, the switch results in a size reduction of 261316 - 199072 = 62244 bytes.
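
To put those byte counts next to the flash layout from the issue description, here is a quick back-of-the-envelope check. It assumes that the 256K, 28K and 4K figures are multiples of 1024 bytes and that both binary sizes are plain byte counts; the helper names are made up for the example.

```python
KIB = 1024  # assumption: the "K" flash figures are multiples of 1024 bytes

FLASH_TOTAL = 256 * KIB      # officially supported flash on the STM32F103RC
BOOTLOADER = 28 * KIB        # reserved at the start of flash
EEPROM_EMULATION = 4 * KIB   # reserved at the end of flash

SIZE_NEWLIB = 261316  # bytes, build linked against regular newlib
SIZE_NANO = 199072    # bytes, same configuration with --specs=nano.specs


def usable_flash() -> int:
    """Bytes left for the firmware after bootloader and EEPROM emulation."""
    return FLASH_TOTAL - BOOTLOADER - EEPROM_EMULATION


def headroom(binary_size: int) -> int:
    """Free bytes after flashing; a negative value means it does not fit."""
    return usable_flash() - binary_size


saved = SIZE_NEWLIB - SIZE_NANO
print("usable flash:", usable_flash())                     # 229376
print("saved:", saved, "bytes,",
      round(100 * saved / SIZE_NEWLIB, 1), "percent")      # 62244 bytes, 23.8 percent
print("headroom with newlib:", headroom(SIZE_NEWLIB))      # -31940
print("headroom with newlib-nano:", headroom(SIZE_NANO))   # 30304
```

Under these assumptions the newlib build overshoots the usable area by roughly 31 KiB, while the newlib-nano build leaves about 29 KiB free.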

It's not really a Marlin issue but a libmaple issue (libmaple is used for HAL STM32F1).
Switch to HAL STM32 and you'll get about a 30% size reduction.
I was able to fit Delta math with graphics UI into 128KB of STM32F103RBT6.

@jmz52 unless I'm looking at the wrong thing, it's the same difference: it seems HAL STM32 uses newlib-nano by default.

We are beginning to transition STM32F1 boards over to the STM32 HAL. We are very near feature parity between the two, and two MKS Robin boards have environments available for both HALs.

It likely won't be worth investing effort into improving the Maple-based STM32F1 builds: we intend to discontinue use of that framework because it is deprecated by PlatformIO.
