Zephyr: mempool is expensive for cyclic use

Created on 16 Jul 2019 · 3 comments · Source: zephyrproject-rtos/zephyr

Is your enhancement proposal related to a problem? Please describe.
There is one issue with the Zephyr heap allocator which we will be fighting soon. If your application iteratively allocates and deallocates multiple small chunks of data, the process of breaking and recombining blocks is REALLY expensive.
E.g.

/* heap is free - no allocations */
while (true) {
   /* wait for signal */
   chunk = k_malloc(16); /* makes heap break down through all levels to give this chunk */
   process_chunk(chunk); /* simplified processing - you can imagine N events created at this moment that will be freed before cycle ends */
   k_free(chunk); /* makes heap recombine itself up to the top */
   /* no allocations on heap at this moment */
}

My understanding is that the idea behind implementing it this way was to avoid fragmentation. That could be a problem if some chunks are allocated for long periods of time. However, I would expect something like that in a PC application, not in embedded, where all long-lived buffers and data are statically allocated in RAM.
The real problem, however, is that for iterative allocation of data (e.g. on an incoming event) Zephyr has to break down the entire heap before handing out the chunk. Later, when the event has been served, it recombines the heap, leading to a drastic waste of CPU time.
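To see where the CPU time goes, a bit of arithmetic helps: with a four-way buddy scheme, every level of descent from the fully merged top block costs one split, and the matching free costs one merge. A standalone sketch of that count (plain C; `splits_needed` is a hypothetical helper written for this illustration, not a Zephyr API):

```c
#include <stddef.h>

/* Count the split operations a 4-way buddy allocator needs to carve a
 * `want`-byte block out of a fully merged heap of `heap` bytes. Each
 * split divides one block into four children, so each level of descent
 * costs one split; the free() path mirrors this with merges.
 * Illustrative arithmetic only -- not Zephyr's implementation. */
static int splits_needed(size_t heap, size_t want)
{
	int splits = 0;

	while (heap / 4 >= want) {
		heap /= 4;
		splits++;
	}
	return splits;
}
```

For a 64 KiB heap handing out 16-byte chunks (64 Ki → 16 Ki → 4 Ki → 1 Ki → 256 → 64 → 16), that is 6 splits on every k_malloc() and 6 merges on every k_free() of the loop above, even though the heap is otherwise idle.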

Describe the solution you'd like
Provide an option to disable block recombination unless it is explicitly needed.

Describe alternatives you've considered
One can grab a small chunk of memory and never release it. This way you can force some blocks to remain broken and ready for allocation.
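That workaround can be sketched as a boot-time "priming" step that pins one chunk of each hot size forever. The helper name `heap_prime` is invented here, and `k_malloc` is aliased to libc `malloc` purely so the fragment compiles outside Zephyr:

```c
#include <stdlib.h>

/* Sketch of the workaround: at init, grab one chunk of each size the
 * hot path will request and never free it. The pinned chunks keep the
 * corresponding pool levels permanently broken down, so later
 * allocations of the same size find a ready block instead of splitting
 * all the way from the top. k_malloc is a Zephyr call; it is aliased
 * to libc malloc here only so this sketch builds standalone. */
#define k_malloc malloc

static void *pin16;
static void *pin32;

void heap_prime(void)
{
	pin16 = k_malloc(16); /* never freed: 16-byte level stays split */
	pin32 = k_malloc(32); /* never freed: 32-byte level stays split */
}
```

The cost is a few permanently lost chunks; the benefit is that the cyclic alloc/free loop above no longer walks the full split/merge path.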

Additional context
N/A

Labels: Enhancement, Memory Management


All 3 comments

Oh man, do I have the patch for you. Trying to get a prototype up today, though it might stretch to tomorrow.

The splitting and merging isn't really "slow", but it's definitely a drawback of buddy allocators[1] when used in a "mostly empty" situation. These things tend to work best in large heaps with mixed usages where memory is fragmented and you can "usually" find a block of the size you want already in the free list. In the circumstances where all the blocks are maximal size, this sees pessimal performance.

Fancy heaps definitely use buddy allocators, but usually with optimizations around the edges to cache blocks for this case, or to make use of wasted space at the end of buddy blocks, etc... We can't afford that complexity.

Instead, I have a comparatively simple, very conventional segregated fit allocator almost ready. It's somewhat smaller, benchmarking as MUCH faster (precisely because of the issue you identify here), is constant time and can be used without elaborate (and historically error-prone) log relaxation in an ISR, works with arbitrary blobs of memory into which it places all its metadata (i.e. no crazy static allocation macros: "here's 50181 bytes at 0x0301ba90, put a heap in it"). And it comes with an elaborate test and benchmark rig which can exhaustively test all the edge cases and fragmentation behavior.

[1] We have a four-way buddy allocator, which is... just weird. It takes the space efficiency that is normally considered a feature of buddy allocators and throws it out. Normally, power-of-two allocation is employed. I don't know why we chose four originally, nor why I didn't try to change it before.
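For readers unfamiliar with the term, the segregated-fit idea boils down to one free list per size class, with O(1) push/pop and no splitting or merging on the hot path. The toy below only illustrates that general technique; the names (`sf_alloc`, `sf_free`), the fixed arena, and the caller-supplied size on free are assumptions of this sketch, not the actual sys_heap design:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy segregated-fit allocator: one singly linked free list per
 * power-of-two size class. sf_alloc() pops from the matching list and
 * sf_free() pushes back -- both O(1), with no split or merge work.
 * A list miss carves fresh bytes from a bump-pointer arena. The
 * caller must pass the original size to sf_free(). Illustration of
 * the general technique only, not Zephyr's sys_heap. */
#define NCLASS 8   /* size classes: 16, 32, ..., 2048 bytes */
#define MINSZ  16

struct blk {
	struct blk *next;
};

static struct blk *freelist[NCLASS];
static uint8_t arena[16 * 1024];
static size_t arena_used;

static int class_of(size_t sz)
{
	int c = 0;
	size_t s = MINSZ;

	while (s < sz) {
		s <<= 1;
		c++;
	}
	return c;
}

static void *sf_alloc(size_t sz)
{
	int c = class_of(sz);

	if (freelist[c] != NULL) {		/* O(1) hit: pop the list */
		struct blk *b = freelist[c];

		freelist[c] = b->next;
		return b;
	}
	size_t real = (size_t)MINSZ << c;	/* miss: carve from arena */

	if (arena_used + real > sizeof(arena)) {
		return NULL;
	}
	void *p = &arena[arena_used];

	arena_used += real;
	return p;
}

static void sf_free(void *p, size_t sz)
{
	int c = class_of(sz);
	struct blk *b = p;			/* O(1): push onto the list */

	b->next = freelist[c];
	freelist[c] = b;
}
```

In the cyclic pattern from the issue, the same block just bounces on and off one free list forever, which is exactly why this shape benchmarks so much faster for the "mostly empty" case.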

Ideally, the recombine operation would not be performed in block_free()
but rather in block_alloc(), and only up to the required level, and only
if no blocks of the requested level are available. This way you'd get a
nice hysteresis in the readily available block sizes, and recombine
would then be a fairly rare event. I'm not sure how to keep it from
being overly costly when it does happen, though, given the need to scan
entire levels.

I'm really looking forward to the replacement implementation from @andyross.

Hi @andyross, I would definitely like to try it, and it comes not a moment too soon :)
So far I have been applying a workaround and thought we would tweak the current design. I always felt, however, that in the end it would be better to create something new (e.g. an optionally selected, simpler allocator).
So thanks in advance :)
