I'm doing some tests so see if my extruder speed calculation for LIN_ADVANCE is right. Therefore I'm printing the initial and final rate of a block to serial when the stepper ISR is poping the block from the buffer.
During the first moves I noted there is something wrong with the calculation of the two values. I execute single movements over Pronterface along one axis only. So there is only one block in the buffer every time and no complicated math about multiple axis.
I expect to see an inital and final rate of Jerk setting * steps/mm according to the new jerk calculation. This would be 8mm/s*100.5 = 804 steps/s.
This is always true for block->initial_rate, but not for final_rate: Sometimes I get 804, sometimes 120. There is nothing I do different when 120 rises, I even move the axis in the same direction.
I checked if it might have something todo with MINIMUM_PLANNER_SPEED, but this would be 0.05*100.5=5 steps/s.
I can also say it has nothing todo with the planner reverse pas, as it is only called if movesplanned() > 3. I thought the forward pass might be the issue:
block_t* block[3] = { NULL, NULL, NULL };
for (uint8_t b = block_buffer_tail; b != block_buffer_head; b = next_block_index(b)) {
block[0] = block[1];
block[1] = block[2];
block[2] = &block_buffer[b];
forward_pass_kernel(block[0], block[1]);
}
forward_pass_kernel(block[1], block[2]);
It's not checking if there are at least 3 blocks in the buffer, is this done intentionaly? I'm afraid it will point somewhere in the memory if only one move is issued, but up to now I'm not deep enough inside the forward/reverse pas thing to say that for sure.
But the problem stays the same if I instert an if movesplanned() > 3 into this block. So it's not related to this bug.
@thinkyhead you imported the jerk code from the MK2, do you have an idea where to look first?
Edit: It's due to #define MINIMAL_STEP_RATE 120 inside planner.cpp. Now the question is, why final rate is sometimes below 120..
I guess I have the answer now:
There is a flag if the stepper ISR is using a block, but not one if a block is ready to be executed by the stepper ISR.
At the end of the planner, we are manually triggering a calculate_trapezoid_for_block()
for the new block just added. Inside this call, initial rate and final rate are set to jerk speed.
A few lines later, we are calling recalculate()
, which in the end will also recalculate the trapezoid for our new block - but this time Marlin decides to set the final rate to MINIMAL_STEP_RATE
:
// Last/newest block in buffer. Exit speed is set with MINIMUM_PLANNER_SPEED
. Always recalculated.
If the stepper ISR is executed before recalculate()
is called, I see a final_rate
of jerk speed. If the ISR happens afterwards, the move ends with a rate of 120.
I think this can't be a wanted behavior. The manual calculate_trapezoid_for_block()
at the end of the planner is useless as recalculate()
will do the same a few ticks later. For me, it looks like it tries to be faster than the stepper ISR so it will not execute a non-calculated trapezoid.
This may also be the root cause why the E stepper is rattling in arcs with LIN_ADVANCE
: When the buffer runs dry, it will sometimes stop at MINIMUM_PLANNER_SPEED
and immediately when the buffer has something in it, it will start at jerk speed again. A lot of unwanted speed changes.
Is there a reason why there is no "block_calculation_finished" flag?
Is there a reason why there is no "block_calculation_finished" flag?
None that I know of. Let me review your analysis and see if I can also make sense of it.
There is a flag if the stepper ISR is using a block, but not one if a block is ready to be executed by the stepper ISR.
The reasoning for this seems to be, the block is only added to the buffer once it's "ready" so there's no need for a flag. The calculate()
function apparently needs the block to be already in the buffer, which is why it's called afterward.
Quick question… I'm having trouble getting a definite answer by benchmarking because the compiler is optimizing. Is the following change going to save any cycles? Instead of this:
if (next) {
float nom = next->nominal_speed;
calculate_trapezoid_for_block(next, next->entry_speed / nom, (MINIMUM_PLANNER_SPEED) / nom);
next->flag &= ~BLOCK_FLAG_RECALCULATE;
}
…this (14 bytes more code):
if (next) {
float inv_nom = 1.0 / next->nominal_speed;
calculate_trapezoid_for_block(next, next->entry_speed * inv_nom, (MINIMUM_PLANNER_SPEED) * inv_nom);
next->flag &= ~BLOCK_FLAG_RECALCULATE;
}
Notes online say a float division takes ~34µs (544 cycles) and multiplication only ~9µs (144 cycles). So, theoretically, this should save a lot of time (256 cycles) on each usage. The extra 14 bytes of code might cost a little more. So ~242 cycles might be saved.
this time Marlin decides to set the final rate to
MINIMAL_STEP_RATE
I assume it does this because, as the last block in the planner, it should stop at the end.
@thinkyhead : Yes the use of inv_nom will be faster. There is quite some room for improvements in the planner / stepper calculations, that's why I also started an investigation some days ago. For example it would be even better to store inv_nom in the block as it's never changed after the block is added. So you can safe the 1/ next->nominal_speed inside recalculate_trapezoids completely.
I see I skipped a important line yesterday:
// Move buffer head
block_buffer_head = next_buffer_head;
So you are right, there is no need for an extra flag. Nevertheless we are inconsequent here, first we say the block should start and stop at jerk speed, later on we say stop at MINIMUM_PLANNER_SPEED. There should be one, true value and not something depending on throwing a dice (if the stepper executes the block before recalculate()).
It will make no big difference on normal use, but with LIN_ADVANCE this results in a lot of uneccesary moves when the buffer drains during circles: With about 50% probability the last block ends at MINIMUM_PLANNER_SPEED, but the next block starts at jerk speed. That means there is a delta_v jump that LIN_ADVANCE tries to compensate.
I would therefore recomend to check if jerk speed > MINIMUM_PLANNER_SPEED, and if yes use jerk speed also when the last block is calculated. I will test this and create a PR if it works.
I know that Grbl has evolved a lot since Marlin was born from it. Maybe it's worth taking a look to see what insights have been applied there. Some of them might be apropos to this thread.
The planner of grbl isn't very different regarding the way they are calculating things.
What's realy new in grbl is that they calculate no trapezoids after doing the reverse and forward pass, they calculate it somewhat "on the fly" when the block is called by the stepper ISR. They can even modify a block that is in execution. I'm quite sure this is not possible in Marlin due to the other things the FW has todo, we will not be able to do such fine just-in-time code due to the delays.
So do you think it's actually possible that Marlin can't complete its trapezoid calculation before the block starts being processed in the stepper isr? The general idea is to always stay ahead of that, and to prevent having tiny movements too.
I wasn't checking it up to now. At the moment i want to concentrate at LIN_ADVANCE
, then I will come back to the planner optimisations. Except for the stop speed I found no error in the code up to now. Even the stop speed would be ok if the buffer never runs dry.
A simple check will show us the truth: check the amount of blocks in the buffer when the isr tries to get a new block. If there is only one, print something to serial. At least that's my plan to do.
check the amount of blocks in the buffer when the isr tries to get a new block. If there is only one, print something to serial.
Sounds like a good plan! Note that you can't print to serial in the ISR. But you can set a flag or store some values and print to serial elsewhere, such as in loop()
.
Well, than I was lucky not to know that up to now - I'm printing debug information inside the stepper isr quite often and it works for me?
Of course it makes the printer stuttering like hell, but thats ok for test runs without filament and a few lines of gcode.
@thinkyhead I just checked if the buffer becomes empty in circles: It clearly does. It can be seen and heard when the printer movement isn't smooth any more and the test described above proves it's the buffer.
The segment length was about 0.3-1mm, so nothing incredibly small and the print speed was set to 60mm/s.
So another one good thing todo would be to check the slowdown feature.
The next 1-2 days are blocked by other things, so here is a last update if someone wants to think about them:
After some tests it seems that the buffer drains because planner.buffer_line
is simply not called regularly enough. Slowdown can't do anything in this case, the last time a block is added the buffer was full enough. If the buffer drains until the next buffer_line
is called, the printer comes to a full stop for a short time.
I'm not having very much knowledge about other Marlin parts beyond planner and stepper, but I guess buffer_line
is called from a normal loop if there is a new command and sometimes the loop takes longer due to LCD update, heater management etc. So it would help to increase the "priority" if possible, or, in other words, it has to be called at a frequency not less than it takes for short blocks to be executed until the buffer is empty.
Two main things I tested:
Would it be possible to check for new commands and execute buffer_line
with an ISR that is not-blocking other ISRs at a fixed frequency? This way things that takes long like an LCD update wouldn't be too bad.
So it would help to increase the "priority" if possible, or, in other words, it has to be called at a frequency not less than it takes for short blocks to be executed until the buffer is empty.
Sorry... There is only one main 'Task' and it drains the buffer of commands as fast it can. In a perfect world, it blocks waiting for space to become available towards the start of planner.buffer_line()
function.
Right now, block_t Planner::block_buffer[BLOCK_BUFFER_SIZE]
is an array of size 16. It is too big to make 32 with a Graphics LCD Panel. And for speed the management of this buffer needs to be a power of two. I'm thinking it might make sense to change how the block_buffer[]
is managed so we can bump it up to being 24 items in size. That extra size probably would help.
And I'm also toying with the idea of not giving the LCD Panel as much CPU time when ever the block_buffer[]
is less than half full and there are more than a couple unprocessed commands in the static char command_queue[BUFSIZE][MAX_CMD_SIZE]
.
Stated differently... If the printer is having a hard time keeping the Planner's work scheduled, let's not waste as many CPU cycles messing with the LCD Panel.
If I do these things... They will show up in the devel_ubl
branch and get tested there.
buffer_line
before waiting for a block to become available.Should we call this issue "solved"?
I would like to discuss if we want to stay with MINIMAL_STEP_RATE as the final rate for the last block in the buffer. For me, it's not logical why we start a line at jerk speed, but end it with something not jerk speed. First we calculate the final speed of the block to jerk speed in the planner, just to set it to minimal rate again in the forward/reverse pass as few steps later..
As long as a buffer underun is possible, this is also critical for LIN_ADVANCE because the next block starts at jerk speed again so we have a sudden speed jump at the junction which might cause lost esteps.
Aditionaly, for single executed moves or the first start of the print while the buffer fils up, this means the final speed for the first block is not defined. It may end with jerk speed or at minimum speed, depending on how fast the ISR blocks the block.
MINIMAL_STEP_RATE
as the final rate for the last block
I only see MINIMAL_STEP_RATE
being used in calculate_trapezoid_for_block
. And according to the commentary in the code, MINIMAL_STEP_RATE
is used to prevent the timer calculation from overflowing with a too-low initial_rate
or final_rate
. It implies that any lower value would fail.
why we start a line at jerk speed, but end it with something not jerk speed
Is this not corrected as soon as the block becomes no longer the last block?
for single executed moves or the first start of the print while the buffer fils up, this means the final speed for the first block is not defined.
Is this a longstanding bug that also still exists in grbl?
It's MINIMUM_PLANNER_SPEED
from recalculate_trapezoids():
// Last/newest block in buffer. Exit speed is set with MINIMUM_PLANNER_SPEED. Always recalculated.
if (next) {
float nom = next->nominal_speed;
calculate_trapezoid_for_block(next, next->entry_speed / nom, (MINIMUM_PLANNER_SPEED) / nom);
next->flag &= ~BLOCK_FLAG_RECALCULATE;
}
But as it's defined as 0.05mm/s by default it's replaced by MINIMAL_STEP_RATE in calculate_trapezoid_for_block() by:
NOLESS(final_rate, MINIMAL_STEP_RATE);
Is this not corrected as soon as the block becomes no longer the last block?
It is. But if this should be an intended behaviour, which I don't think, we should set the final speed with (MINIMUM_PLANNER_SPEED) / nom
already inside .buffer_line()
where we calculate the trapezoid first. Then the behaviour of single block movements would be determined again.
But again, as long as a buffer underun can happen, this leads to a jerk that shouldn't be there at block junctions.
Is this a longstanding bug that also still exists in grbl?
Grbl might not have this problem as they are not calculating trapezoids anymore inside the forward/reverse pass loop.
For Marlin, I don't know when the MINIMUM_PLANNER_SPEED
was introduced in the way it is today.
I don't know when the MINIMUM_PLANNER_SPEED was introduced
Maybe it corresponds to MINIMUM_JUNCTION_SPEED
in grbl…
Well, it bears some comparison. Perhaps we can re-adapt the latest grbl code to Marlin and see if it makes a difference. Probably grbl has never needed to account for something interrupting the planner for as long as a graphical display can do, so maybe it would make no difference.
In any case, I'm not sure there's a 100% solution. If there's a long enough interruption of the planner, the steppers are going to come to an unintended halt. So I guess the best solution lies in preventing the planner from running out. Which means we really need to optimize that graphical display further, split up the display update so the planner gets more calls…. Just anything to keep that planner buffer full.
Time to get out the oscilloscope again and count cycles!
It was proposed recently to add an extra state to lcdDrawUpdate
, so it would draw the first segment of the screen, return back to the main loop, and only on the next call of lcd_update
actually draw the second half of the screen. I'm going to look closely at that concept. I think that might be enough to keep the planner going.
It was proposed recently to add an extra state to lcdDrawUpdate, so it would draw the first segment of the screen, return back to the main loop, and only on the next call of lcd_update actually draw the second half of the screen.
I wonder if it would make sense to go even more extreme. First, maybe it makes sense to divide the screen drawing in 3 or 4 parts. If we can make it 2 parts, it might make sense to make it 3 or 4 parts. And in addition to doing this, I wonder if it would make sense to not do any update at all if Planner's Block Buffer is less than 1/2 (or maybe 1/4 ???) full. (If the Planner's Block Buffer is totally empty, the printer is not doing anything and of course it is OK to do the update in that case.)
@Roxy-3D this is nearly exactly what my PR #5125 is intended to do, except we have to look at the planned duration of the block(s) instead of the number of blocks available. If we have 3 blocks that need a long time because the are each 200mm long it's Ok to update the LCD, if the buffer is full with 0.1mm blocks due to an arc it might be not OK.
But this PR is also blocking the button handling because it's combined with the LCD update loop..
That's a good point. Something else just occurred to me. Could it be argued that in general the LCD should not be updated unless the Planner's block buffer is full? If the printer is so busy it can't fill up the Planner's block buffer, maybe it makes sense to just leave the LCD Screen frozen until the processor has some free time? (I'm not sure it would be acceptable to do exactly as I suggested. But the idea is to starve the LCD Panel of CPU resources instead of the Planner.)
That scheme would automatically address your concern about the duration of the different blocks. We just ignore the duration and instead insist the Planner's block buffer is full before we spend very many CPU cycles updating the LCD Screen.
It's not enough to look if the buffer is full. The reason for writing the PR was my found that even if the block buffer is full with 0.2-0.4mm lines @ 60mm/s the LCD update takes longer than draining the buffer. This leads to jerky moves because the printer is always stoping after the last segment, waiting until the LCD is redrawn and the next block is calculated afterwards. So it's impossible to go fast around a circle even if we are not talking about super-high resolution models with LCD enabled.
If we could have a big block buffer, this would be also a valid solution but that's not possible with the ATMega.
maybe it makes sense to divide the screen drawing in 3 or 4 parts
I believe this is a fixed value determined by the LCD controller.
I'm integrating support for SPI-based TMC2130 this week. And @Blue-Marlin did the legwork to support interrupt-based endstops (on most boards). So there are two strategies that will help to cut the overhead a bit.
Of course I hope to optimize the graphical display code too, but without having my printer handy I'll need help from testers. See if #4793 helps (or even works). It returns control to the main loop in-between display segment updates, allowing the planner to queue up more blocks.
maybe it makes sense to divide the screen drawing in 3 or 4 parts
I believe this is a fixed value determined by the LCD controller.
The more I think about it... Maybe we should move to a system where we keep a copy of what is on the LCD Panel, and another copy of what should be on the LCD Panel. If we were setup that way we could start updating locations on the screen any time we are busy waiting. And stop as soon as the busy wait ended. (and only spend time updating things that changed!!!!)
In that kind of a model, we could always update a few locations every idle() loop also and we wouldn't be spending so much time doing non-productive data transfers.
Double-buffering is an interesting idea, but of course requires lots of SRAM. And then there's the question of just how to manage that. I presume u8glib already provides some method to set aside a screen buffer and draw into that…? It would save time right away, because we currently redraw _all screen elements_ with every call to (*currentScreen)()
. With a display buffer we would draw only once, and the screen updates within the graphical display loop would only do copies of the buffer, which would be faster than the redraw of all elements we currently do.
It would save time right away, because we currently redraw all screen elements with every call to (*currentScreen)().
Yes. And what is worse is every character we need to update on the screen results in a huge sequence of bits that have to be clocked out. It is a huge amount of bit shifting to send anything to the LCD Displays. (At least, on some of the displays!)
One thing I see we can do right away is to unroll some of the loops in ultralcd_st7920_u8glib_rrd.h
…
-#define ST7920_WRITE_BYTES(p,l) {uint8_t i;for(i=0;i<l;i++){ST7920_SWSPI_SND_8BIT(*p&0xf0);ST7920_SWSPI_SND_8BIT(*p<<4);p++;}u8g_10MicroDelay();}
+#define ST7920_WRITE_BYTES(p,l) do{ \
+ for (uint8_t i = 0; i < l; i++) { \
+ ST7920_SWSPI_SND_8BIT(*p&0xf0); \
+ ST7920_SWSPI_SND_8BIT(*p<<4); \
+ p++; \
+ ST7920_SWSPI_SND_8BIT(*p&0xf0); \
+ ST7920_SWSPI_SND_8BIT(*p<<4); \
+ p++; \
+ ST7920_SWSPI_SND_8BIT(*p&0xf0); \
+ ST7920_SWSPI_SND_8BIT(*p<<4); \
+ p++; \
+ ST7920_SWSPI_SND_8BIT(*p&0xf0); \
+ ST7920_SWSPI_SND_8BIT(*p<<4); \
+ p++; \
+ } \
+ u8g_10MicroDelay(); \
+}while(0)
. . .
- ST7920_WRITE_BYTES(ptr, (LCD_PIXEL_WIDTH) / 8); //ptr is incremented inside of macro
+ ST7920_WRITE_BYTES(ptr, (LCD_PIXEL_WIDTH) / 32); //ptr is incremented inside of macro
- for (i = 0; i < 2 * (LCD_PIXEL_WIDTH) / 8; i++) //2x width clears both segments
+ for (i = 0; i < (LCD_PIXEL_WIDTH) / 32; i++) { //2x width clears both segments
ST7920_WRITE_BYTE(0);
+ ST7920_WRITE_BYTE(0);
+ ST7920_WRITE_BYTE(0);
+ ST7920_WRITE_BYTE(0);
+ ST7920_WRITE_BYTE(0);
+ ST7920_WRITE_BYTE(0);
+ ST7920_WRITE_BYTE(0);
+ ST7920_WRITE_BYTE(0);
+ }
Could even go a little nuts with it… as in… #5279 ?
Spend 802 bytes to speed up the draw loop slightly.
Ok, so apart from the loop unrolling —which is only a minor optimization— I've just asked Oliver how to get the current page bounds in the draw loop so we can skip drawing items that are outside of the page. If we can get this information easily, then we can patch up the Info screen and menus to only render the items within the current page, saving perhaps half of the cycles currently required. We can use this information to adjust the Info Screen and make sure that everything in the upper part is within the upper 50%, and everything in the lower part is within the lower 50%. If the screen were to draw in 4 parts, we could even break up the Info Screen into quarters, drawing, for example, just the top half of the fan on the first page, then just the lower part of the fan on the next page, etc.
The page size is probably simple to get from u8glib, but I'm being lazy, and Oliver seems to be helpful. If it's not accessible directly from the u8g
object, then we can just glean it by counting up the number of pages in the draw loop.
Hmm… Actually, I see that we have a PAGE_HEIGHT
define. And also the ultralcd_st7920_u8glib_rrd.h
file has this code:
u8g_pb_t* pb = (u8g_pb_t*)(dev->dev_mem);
y = pb->p.page_y0;
Where the y
variable contains the starting line of the page.
So maybe we already have all we need to do selective rendering.
I think I'm still leaning towards keeping the full page in memory, and letting anything that is busy waiting move things that need updating to the LCD Panel. Personally... trading a few hundred bytes of RAM to get 1/2 of our CPU cycles back seems like a good investment.
a few hundred bytes of RAM to get 1/2 of our CPU cycles back
By using selective rendering, plus not using so much solid fill, and shorter delays between bytes sent to SPI, we have managed to save ~19ms — around 380,000 CPU cycles — per render! We can look at adding a 1024-byte buffer as an option when U8g2 is optimized (still too slow currently).
And as a side note... This kind of stuff is a good example of why 32-Bit processors are so important. The amount of developer time it takes to work around the limitations of these 8-bit processors really slows down what can be accomplished.
In a couple of years... Nobody will be buying 8-bit processors for their 3D Printer any more. But in the mean time... We burn developer time which is a very scarce commodity.
Roxy,
That was exactly my point when talking about 32 bit boards. It is nice to be able to optimise code, but if too much time is wasted doing so because of processor limitations, instead of developing new features, then it is worth moving to 32 bits.
Regards,
Ernesto
Agreed... It is kind of fun to squeeze every last cycle out of a piece of code.... The re-write of Mesh_Buffer_Line.cpp took over a week... But it cut out 1/2 of the floating point operations on a line move over a mesh. It was fun to write that code. But the truth is, I didn't have a choice if I want the code to run on Deltas.
I'm ready for a real CPU!
I disagree, people will be buying 8-bit controllers for far more than just 2 years - simply as a matter of vendors using up their existing stock.
Besides, they've been in use for over 40 years, and they're easy to program. If you need more speed, there are a number of 8-bitters that go way faster than the little 2560 we usually use. Besides, it's retro, and that's "in" these days. ;-)
me would also change to 32 Bit, but only if forced Hard to it...
ie if 8 bit becomes so rare that they get more expensive than 32 or 64 bit.
The original issue is receiving attention on a couple of fronts, both in limiting screen updates that might block the planner, and in optimizing screen updates so the planner doesn't get starved. And we have a lively 32-bit topic too. So let's migrate over to those threads.
Most helpful comment
The more I think about it... Maybe we should move to a system where we keep a copy of what is on the LCD Panel, and another copy of what should be on the LCD Panel. If we were setup that way we could start updating locations on the screen any time we are busy waiting. And stop as soon as the busy wait ended. (and only spend time updating things that changed!!!!)
In that kind of a model, we could always update a few locations every idle() loop also and we wouldn't be spending so much time doing non-productive data transfers.