Scratch-vm: Investigate performance issues in "Stacky Build" project

Created on 31 Oct 2017  Â·  14Comments  Â·  Source: LLK/scratch-vm

Expected Behavior

  • Project should run at 30 fps on testing Chromebook (2016 Samsung Chromebook 3)

Actual Behavior

  • Project runs slowly (5 – 10 fps)

Steps to Reproduce

https://scratch.mit.edu/projects/155128646/

bug performance

All 14 comments

This project is heavily bound by the virtual machine's internal performance. The virtual machine's performance is dependent on how it has been written in javascript and some of the data structures used. execute in scratch-vm/src/engine/execute and Sequencer#stepThreads in scratch-vm/src/engine/sequencer are responsible.

execute

The execute function is being deoptimized by at least Chrome. Chrome in trying to optimize the function is make assumptions like a variable is always a string and then seeing that assumption broken when its set to a number. Happening enough times Chrome is giving up and running the function in a safe but very slow way. This can be improved by either figuring what assumptions Chrome is making and ensures those don't break or break up execute into smaller functions where most of them can be optimized.

Each call of execute that calls a blockFunction (instead of returning early) creates at least 16 objects, both costing time to allocate the objects and later contributing to some regular Minor GC operations the javascript VM is performing. Many of these objects would be best turned into static reused instances with internal values set by each execute call. The first one to tackle is the util argument to blockFunction since it should be safe as a single instance. That would save 12 allocations, one for the util object, and one for each javascript closure letting the functions refer to the higher scopes they're created in.

stepThreads

stepThreads in this project is spending a lot of time in indexOf calls and its loops and tests on threads. This project is running a lot of threads in other words. And a lot of that is likely churning threads, starting and ending threads frequently, growing doneThreads every step to the point indexOf spends a fair amount of time checking for stepThreads if a thread is not in doneThreads so it can be added or if it is in doneThreads so it can be dropped from runtime threads.

While still using an array for doneThreads adding another strategy to test if a thread is in doneThreads or not would offer a recognizable performance boost.

Some numbers

Over a quick 5 second sample of the game running here are the amount of time spent in some of these expensive operations:

Operation | % of 5 seconds
------------|---------
stepThreads | 10.4%
execute | 24.8%
indexOf | 9.7%
Minor GC | 6.4%

From my own experiments I've found that many optimisations have lead to small increases in performance, often far less that I hoped. Things I've tried:

  • Replacing StackFrame & Stack with a single linked list with the top node as the root of the list. This would mean that the head of the stack 'was' the currently executing stackframe object so no new lookups were required. This resulted in a small speed up of a few percent only (but was a lot neater).

  • Strip out nested functions into first order classes - executeRecord, stackFrame - I expected this to lead to better optimisation, but again only improved things by a few percent.

  • Split out handleReport, makePromise, extractArgInputs & recursivelyEvaluateInputs functions to remove closure. Not much difference.

  • Changed the timer.js to stop indirectly calling of Timer.nowObj getter. This also improved things only a small amount.

I noted that execute was not being optimised by the js runtime, but it's interesting that you say it is down to the variable type being changed... that makes sense, but it there a way to tell the js compiler to expect this and optimise appropriately?

The only thing that really sped things up is caching:

  • engine.getInputs
  • engine.getProcedureDefinition
  • engine.getProcedureParamNames

These changes had a major impact on execution speed, around the x2 mark.
With all these changes added together it can make a substantial gain in performance (depending on the project).

You can see these changes here: https://github.com/griffpatch/scratch-vm/tree/optimise_root
This doesn't include the linkedlist fix which was in a much older branch and I never brought it forward when I started looking into this again.

If it's any help, here is a test setup showing the optimisations in action (takes 12 seconds to run):

http://llk.github.io/scratch-vm/#141514080
https://griffpatch.github.io/scratch-vm/#141514080

My personal recommendation is, if possible, to try dropping the util object altogether, instead passing the thread object to blockFunction calls. Most of the methods in util are just proxies to the thread object; some are proxies to sequencer or runtime (which the blockFunction calls already have access to).

If it can be removed then I'd say absolutely! All of my optimisations have
assumed keeping the existing framework incase it impacted 3rd party
modules, etc...

On 2 November 2017 at 14:46, Tim Mickel notifications@github.com wrote:

My personal recommendation is, if possible, to try dropping the util
object altogether, instead passing the thread object to blockFunction
calls. Most of the methods in util are just proxies to the thread object;
some are proxies to sequencer or runtime (which the blockFunction calls
already have access to).

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/LLK/scratch-vm/issues/740#issuecomment-341444770, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AGbNvqRh3l4IFYCxC_UdQ0lZZJRteOgYks5sydXagaJpZM4QM_lg
.

Strip out nested functions into first order classes - executeRecord, stackFrame

const primitiveReportedValue = blockFunction(argValues,
  new ExecuteRecord(currentStackFrame.executionContext, thread, sequencer, blockContainer));

https://github.com/LLK/scratch-vm/compare/develop...griffpatch:optimise_root#diff-6175aeaffc8aee750c94d0db41b34718R262

@griffpatch The real win for popping out those classes will be documenting them. stackFrame as is and StackFrame as a class won't import any negligible or no recordable difference in performance. Pulling out the executeRecord definitely helps but its hard to measure this performance win as it really affects frame stuttering when JS decides to run a Minor GC. Instead of seeing overall performance improvement really need to measure the framerate of 99 or 95 percentile and compare with the 100%.

The next step for ExecuteRecord would be to instantiate one instance at the file scope of execute and set its internal values before passing it to the block function. Reusing or passing in objects instead of creating new ones inside the VM's overhead (stepThreads, stepThread, execute) will free up time by cutting down on object allocation and as regular a need to collect the garbage.

Right now there are bigger fish like execute being deoptimized and indexOf use in stepThreads to fix. But all this object allocation and collection will become a bigger issue after those big fish. #741 has to do a few milliseconds of GC every frame on my main machine.

Split out handleReport, makePromise, extractArgInputs & recursivelyEvaluateInputs functions to remove closure.

extractArgInputs and recursivelyEvaluateInputs don't have closures, but it doesn't necessarily hurt to pop them out of the execute super function.

Your makePromise still has the same two closures that it would have with the code originally inside execute.

handleReport is definitely saving a closure.

const threads = this.runtime.threads;

https://github.com/LLK/scratch-vm/compare/develop...griffpatch:optimise_root#diff-73faef1f944dbb62b51218068b0669e3R46

I'm not super worried about optimizations like this. They're hard to measure and some cases (mostly really small functions) paradoxically make the function take longer.

@tmickel Performance wise I don't think replacing the util object with directly passing the thread object will provide much performance improvement. Right now the problem is its created for every execute call where we could create a static instance one time and update its internal values for every execute call. We can do that as long as we mandate that block functions do not store or asynchronously use them, which to my knowledge is already true. (I kind of like the util object, it makes the block functions more readable.)

@griffpatch Side note: I hope I don't come off snarky or something. Hope my thoughts on your optimizations are useful.

but it's interesting that you say it is down to the variable type being changed... that makes sense

This was just one example of what can deoptimize functions. Its also easy to illustrate. I'm having read execute closely enough to say if this is specifically what is deoptimizing it.

Performance wise I don't think replacing the util object with directly passing the thread object will provide much performance improvement.

Think that makes sense. Agree that it is also a style question. At any rate I would focus a lot of effort on addressing that spot one way or another. Removing the util altogether in a branch at some point improved execution speed dramatically for me. I believe changing the internal values on a static instance would have a similar effect.

We can do that as long as we mandate that block functions do not store or asynchronously use them, which to my knowledge is already true.

I believe there are examples where this is not quite true:

https://github.com/LLK/scratch-vm/blob/c5d3e2dbb45d1e05b6d010f467ed484b07690bb0/src/blocks/scratch3_looks.js#L246-L256

(uses util.target asynchronously)
However I'm sure they could be re-written to be compatible :)

@tmickel Yeah I did a thorough scan through the blocks and saw that and its sibling using util asynchronously. A small change should let us make this rule. Another option would be to provide a method to make a copy the async methods can depend on. Since the async blocks that'd need that are less frequent that wouldn't be a performance issue and they can opt-in to it instead of execute assuming that behaviour.

@griffpatch Can you make a PR with your getInputs, getProcedureDefintion, getProcedureParamNames caching? Or I can make one if you'd be prefer that.

Yes I can, although I don't admit to knowing whether how I cached things
was done in a way that you would like, I was making the changes more as a
guide to what could be done... not the right way of doing it ;)
But I don't mind creating the PR so that it can then be reviewed.

Griffpatch

On 6 November 2017 at 21:12, mzgoddard notifications@github.com wrote:

@tmickel https://github.com/tmickel Yeah I did a thorough scan through
the blocks and saw that and its sibling using util asynchronously. A small
change should let us make this rule. Another option would be to provide a
method to make a copy the async methods can depend on. Since the async
blocks that'd need that are less frequent that wouldn't be a performance
issue and they can opt-in to it instead of execute assuming that behaviour.

@griffpatch https://github.com/griffpatch Can you make a PR with your
getInputs, getProcedureDefintion, getProcedureParamNames caching? Or I can
make one if you'd be prefer that.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/LLK/scratch-vm/issues/740#issuecomment-342288251, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AGbNvsRjkVSxSOJayijqyjSm2Ox-B459ks5sz3ZUgaJpZM4QM_lg
.

@griffpatch No worries. We'll figure it out in the PR review.

I have updated and added a PR for my caching scripts #770
The resulting speed up for general block execution is pretty good.
See if you would prefer to store the cache in a different way or not, and we need to ensure the invalidation works as intended under 'all' circumstances.

This project is now running at > 30 fps in the VM Playground on our testing hardware.

When I play a few rounds on scratch-gui the project has a bug where layers aren't correctly deleted between rounds. I can't reproduce this bug on 2.0

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Micircle picture Micircle  Â·  5Comments

towerofnix picture towerofnix  Â·  5Comments

cwillisf picture cwillisf  Â·  4Comments

fsih picture fsih  Â·  7Comments

thisandagain picture thisandagain  Â·  3Comments