PowerShell: Speed up Foreach-Object and Where-Object by using Steppable Pipeline

Created on 4 Nov 2019 · 28 comments · Source: PowerShell/PowerShell

Foreach-Object and Where-Object are among the most frequently used pipeline cmdlets, yet they tend to be very slow with a large number of iterations.

In comparison, a classic foreach loop is often 100x faster than the pipeline.
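A quick way to reproduce the gap (an illustrative snippet, not from the original report; exact timings vary by machine and PowerShell version):

# time the classic loop vs. the pipeline over the same workload
(Measure-Command { foreach ($i in 1..100000) { "I am at $i" } }).TotalSeconds
(Measure-Command { 1..100000 | ForEach-Object { "I am at $_" } }).TotalSeconds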

Suspected Design Flaw? Reason for Slowness:

That's because both take a scriptblock and invoke it via InvokeReturnAsIs() on each iteration. This prevents code and compiler optimizations.

Technical Improvement - how to make Foreach-Object as fast as foreach loops:

By replacing the per-iteration InvokeReturnAsIs() invocation with a steppable pipeline, Foreach-Object and Where-Object can be just as fast as foreach loops.

Prototypes and Explanation

A detailed explanation, use-cases, and prototypes are available here:
https://powershell.one/tricks/performance/pipeline
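For context, here is a condensed sketch of the prototype pattern the article describes (simplified; the full Foreach-ObjectFast at the link handles parameters and edge cases more robustly). The user-supplied scriptblocks are baked into one anonymous function, and that function is stepped manually instead of being invoked once per object:

function Foreach-ObjectFast
{
    param(
        [ScriptBlock] $Process,
        [ScriptBlock] $Begin,
        [ScriptBlock] $End
    )

    begin
    {
        # compose one anonymous function with begin/process/end
        # blocks from the user-supplied scriptblocks...
        $code = @"
& {
    begin { $Begin }
    process { $Process }
    end { $End }
}
"@
        # ...and step the resulting pipeline manually: scope setup
        # happens once here instead of once per incoming object
        $pipeline = [ScriptBlock]::Create($code).GetSteppablePipeline($MyInvocation.CommandOrigin)
        $pipeline.Begin($true)
    }

    process { $pipeline.Process($_) }
    end     { $pipeline.End() }
}

# usage:
1..100000 | Foreach-ObjectFast -Process { "I am at $_" }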

Area-Cmdlets-Core Issue-Discussion Issue-Enhancement

Most helpful comment

I think I identified part of the problem:
When full scriptblock logging is enabled, you see the full impact and speed penalty. Apparently, each scriptblock invocation triggers a log entry, while a steppable pipeline is logged only once.
So with scriptblock logging enabled in full mode, Foreach-ObjectFast is roughly 100x faster. When logging is disabled, it is "only" 3x faster...

All 28 comments

This is a good and pretty thorough investigation, thanks! I'd like to have a look at the code here and see if I can put something together for the cmdlet proper, but I would have to wait until the weekend, I think. If someone else wants to have a stab in the meantime, please do!

However, I notice some points missing from the solution. Unless we're willing to accept breaking changes, we will have some issues creating the scriptblock sequence quite as you describe.

For one, ForEach-Object currently accepts an _arbitrary_ number of scriptblocks, and then dynamically determines which should be treated as Begin, Process or End according to the number of blocks and whether any were specified by name.

Two, if I'm not mistaken, we will cause potential breaking changes implementing the same solution as you propose, since we'd be altering the way in which the blocks are invoked. This isn't immediately apparent, and isn't generally an issue until you start working with multiple separate process blocks with variables. I think @jaykul might have some examples of that if he's interested; I remember him showing me a while back.

Three, this method of constructing the scriptblock by converting it back to string form is a bit wasteful, as it requires parsing all the blocks at least twice, and we should investigate if there's a more direct route available to us.

The concept is solid, but we need to find a solution that is more appropriate to the behaviour of the cmdlet. (Also, ForEach-Object is using .InvokeWithCmdlet() if I recall correctly, instead of .InvokeReturnAsIs())

That's because both take a scriptblock and invoke it via InvokeReturnAsIs() on each iteration. This prevents code and compiler optimizations.

InvokeReturnAsIs isn't what's used, and doesn't prevent optimization. Instead InvokeWithCmdlet is used (as @vexx32 pointed out), which does prevent optimization, but only because it's dot sourced.

That's not to say that InvokeWithCmdlet would be the same speed as SteppablePipeline if it wasn't for the fact that one dot sources and the other doesn't, but the speed difference wouldn't be as drastic (Edit: yeah it would, Invoke* methods are super slow).

Here's my understanding.

There are only two things that control whether the compiler is in "optimized" mode:

  1. Debugging is enabled (e.g. a breakpoint is set, the debugger is currently stepping, etc)
  2. If a new local scope is created

And iirc this mainly affects the lookup speed of variables local to that scope in particular. That explains the Testing Functions with Dynamic Scriptblocks section. It's not that InvokeReturnAsIs is slow, it's that when it looks up the variable $_, that variable isn't actually in the current scope, so it has to fall back to the scope enumerator, which is a lot slower than the local lookup tuple. (Edit: While that does affect performance, the Invoke* methods are also just real slow)
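To illustrate that lookup (a minimal demonstration, not from the original comment): $_ is resolved through PowerShell's dynamic scoping, so a scriptblock invoked in a fresh scope has to walk parent scopes to find it.

# $_ is not local to the invoked scriptblock's scope, so the
# engine falls back to enumerating parent scopes to resolve it
$_ = 42
& { "found: $_" }   # prints 'found: 42' via the parent-scope lookup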

However, I notice some points missing from the solution. Unless we're willing to accept breaking changes, we will have some issues creating the scriptblock sequence quite as you describe.

My suggestion is just a proof-of-concept. We should keep the exact parameter logic that Foreach-Object uses. In the end, Foreach-Object bakes a scriptblock from the user input, and that's where we would kick in: instead of invoking it for each iteration, we would get a steppable pipeline and use Process(). That would also take care of the string-to-scriptblock conversion.
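The core mechanic is small. A minimal sketch (note that GetSteppablePipeline requires the outer scriptblock body to be a single pipeline, hence the & { } wrapper around the process block):

# set up the pipeline once, then feed objects in one at a time
$pipeline = { & { process { $_ * 2 } } }.GetSteppablePipeline()
$pipeline.Begin($true)        # $true: the pipeline expects input
foreach ($i in 1..5)
{
    $pipeline.Process($i)     # emits 2, 4, 6, 8, 10
}
$pipeline.End()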

For one, ForEach-Object currently accepts an _arbitrary_ number of scriptblocks...
Two, if I'm not mistaken, we will cause potential breaking changes implementing the same solution as you propose, since we'd be altering the way in which the blocks are invoked.

I would have to look at the sources but I believe these are two separate things:

  • first, Foreach-Object takes all user arguments and builds ONE scriptblock from them
  • next, it invokes the scriptblock.
    The first part would remain unchanged. Only the way the scriptblock is invoked would change. So hopefully there would be no breaking changes.

The concept is solid, but we need to find a solution that is more appropriate to the behaviour of the cmdlet. (Also, ForEach-Object is using .InvokeWithCmdlet() if I recall correctly, instead of .InvokeReturnAsIs())

Correct, but InvokeUsingCmdlet() is an internal method so I couldn't use it for prototyping in PowerShell and simply used InvokeReturnAsIs(). Both internally end up running InvokeWithPipe() so that shouldn't make too much of a difference.

InvokeReturnAsIs isn't what's used, and doesn't prevent optimization. Instead InvokeWithCmdlet is used (as @vexx32 pointed out), which _does_ prevent optimization, but _only_ because it's dot sourced.

InvokeReturnAsIs() is a public method, InvokeUsingCmdlet() is private. Both end up calling the same InvokeWithPipe(). If InvokeReturnAsIs() does not prevent optimization, and since I used this method and still see the hefty performance penalty, then I assume dot-sourcing InvokeUsingCmdlet() won't make much of a difference.

I am not an optimization expert, but a scriptblock cannot optimize itself. When it is invoked, PowerShell simply doesn't know that it will be repeatedly called. Only loops have this information. In the pipeline, a steppable pipeline is the equivalent of a loop. So IMHO the scriptblock should be invoked via a steppable pipeline and not via Invoke...() method calls. The findings in my prototypes seem to at least support this theory.

Here's my understanding.

There are only two things that control whether the compiler is in "optimized" mode:

  1. Debugging is enabled (e.g. a breakpoint is set, the debugger is currently stepping, etc)
  2. If a new local scope is created

That raises the question of how PowerShell knows that a scriptblock is going to be repeated. I assumed, without looking at the code, that PowerShell needs some knowledge about the looping constructs that embed the scriptblock, but maybe I am wrong.

That said, there could be multiple layers of optimization. In any case, the numbers speak for themselves, and the time penalty we are currently seeing isn't exactly "academic", so something major must be amiss.

If InvokeReturnAsIs() does not prevent optimization, and since I used this method and still see the hefty performance penalty, then I assume dot-sourcing InvokeUsingCmdlet() won't make much of a difference.

Sorry, it wasn't clear what I was referring to, but the section below is the explanation for that: (Edit: except also the Invoke* methods are generally just super slow comparatively)

And iirc this mainly affects the lookup speed of variables local to that scope in particular. That explains the Testing Functions with Dynamic Scriptblocks section. It's not that InvokeReturnAsIs is slow, it's that when it looks up the variable $_, that variable isn't actually in the current scope, so it has to fall back to the scope enumerator, which is a lot slower than the local lookup tuple.

Continuing:

I am not an optimization expert, but a scriptblock cannot optimize itself. When it is invoked, PowerShell simply doesn't know that it will be repeatedly called. Only loops have this information.

As you mention later on, we're talking about different layers of optimization. You're talking about interpretation vs JIT-compiled IL; I'm talking about differences in the behavior of SMA.Compiler, which it internally refers to as optimization. The effect you're seeing is due to the latter, and it is the bane of a whole bunch of performance discussions here because of how often it makes other things look like a problem: the Invoke* methods being a super slow code path in general (Edit: fixed).

That said, a scriptblock can optimize itself in the way you are referring to. The optimization threshold is stored on the LightLambda object that serves as the closure for a block. For instance, if you look at a ScriptBlock with ImpliedReflection, follow this path: $sb.EndBlock.Target._compilationThreshold. That number is decremented on every invocation, and when it hits 0 the block compiles itself (assuming the block is less than 300 statements; otherwise it's set to never compile).

In any case, the numbers speak for themselves, and the time penalty we are currently seeing isn't exactly "academic", so something major must be amiss.

Yeah, some work being put into optimizing variable lookup in dot-sourced scopes, along with optimizing lookup of variables from previous scopes in general, would be fantastic. (Edit: but more importantly, also making the Invoke* methods faster)

As a fun side note, if you follow these steps interactively (either use reflection manually or tab complete with ImpliedReflection) you can see exactly when it's compiled.

  1. Create a ScriptBlock $sb = { Write-Host Invoked! }
  2. Invoke it once however you'd like so EndBlock is populated
  3. Navigate to $sb.EndBlock.Target.add_Compile
  4. Run $sb.EndBlock.Target.add_Compile{ Write-Host Compiled! }
  5. Run 0..40 | % { $sb.InvokeReturnAsIs() }

As a fun side note, if you follow these steps interactively (either use reflection manually or tab complete with ImpliedReflection) you can see exactly when it's compiled.

Awesome stuff! Thanks for sharing!

Actually, I got a little caught up in semantics there. All of the Invoke* methods are way slower no matter what settings are used. I don't think it's related to either layer of optimization, though. My guess is that it's related to all of the spin-up/tear-down of pipeline/command processors.

@TobiasPSP SteppablePipeline being faster is a good observation; maybe ForEach-Object should create its own PipelineProcessor similar to ScriptBlock.GetSteppablePipeline as a fast path. @daxian-dbw I know you did some work on rewriting the pipeline in very simple cases; maybe this could be a way to expand that.

One observation is that the time penalty differs wildly from machine to machine; it may depend on the CPU type or on mobile processors. On some systems, the script below takes 15-25 seconds; on others it takes a fraction of a second. So the suggested improvements would target and fix those systems where it takes 15-25 seconds. Finding out why this varies so much from machine to machine, and who and how many are affected, should be the next step to investigate.

$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

# pipe 100,000 integers through ForEach-Object and collect the output
$result = 1..100000 | ForEach-Object {
    "I am at $_"
}

$report = '{0} elements in {1:n2} seconds'
$report -f $result.Count, $stopwatch.Elapsed.TotalSeconds

I think I identified part of the problem:
When full scriptblock logging is enabled, you see the full impact and speed penalty. Apparently, each scriptblock invocation triggers a log entry, while a steppable pipeline is logged only once.
So with scriptblock logging enabled in full mode, Foreach-ObjectFast is roughly 100x faster. When logging is disabled, it is "only" 3x faster...
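For anyone who wants to reproduce this on Windows, scriptblock logging can be enabled via the documented Group Policy registry key (elevated prompt; see about_Logging_Windows). Whether "full mode" also implies per-invocation logging is my assumption, so both values are set here:

# enable scriptblock logging via the Group Policy registry key
$key = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\PowerShell\ScriptBlockLogging'
if (-not (Test-Path $key)) { New-Item -Path $key -Force | Out-Null }
# log scriptblocks as they are compiled/run
Set-ItemProperty -Path $key -Name EnableScriptBlockLogging -Value 1
# additionally log start/stop events per invocation (assumed "full mode")
Set-ItemProperty -Path $key -Name EnableScriptBlockInvocationLogging -Value 1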

Great write-up! I made efforts in the same space 2 months back trying to rewrite the pipeline for ForEach-Object for its most commonly used scenario, see #10454.
Unfortunately, it resulted in a breaking change -- the value of $MyInvocation was then different in the script block specified to ForEach-Object. This is because InvokeUsingCmdlet() sets the $MyInvocation variable when invoking the script block ... And $MyInvocation is used by many existing scripts with ForEach-Object. I searched the scripts in the PowerShell Corpus and saw many uses of $MyInvocation with ForEach-Object.

Given that, that PR was reverted by #10485. After this, it feels to me the best we can do is to have a new ForEach-Object command that is simpler and invokes the specified script blocks in a different way, because any manipulation of how ForEach-Object currently invokes a script block will very likely introduce one breaking change or another.

Agree, that makes sense. I have just revised the article and differentiated between systems with scriptblock logging enabled and those without, so it gives a better picture of the gains to be expected.
Here are my thoughts:

  • To preserve compatibility, I would suggest adding a switch parameter to Foreach-Object/Where-Object, e.g. -Simple. Then, based on that switch, I would use the steppable pipeline instead of the Invoke...() methods on the scriptblock baked from the existing parameters, taking advantage of the existing parameter logic and the way Foreach-Object composes the scriptblock.
  • Using the steppable pipeline instead of the Invoke...() methods also takes care of the scriptblock-logging issue. Maybe this could be fixed from the other end as well (by revising the way scriptblock logging works); however, the steppable pipeline seems to be generally pretty fast.
  • As the article illustrates, it is still faster to directly invoke a scriptblock or simple function, so aside from the steppable pipeline, there seem to be additional opportunities to improve speed.

Another thing you want to consider is the debugging experience of script block arguments provided to ForEach-Object. #10454 tried very hard to keep the same experience as much as possible. I'm not sure how the experience would be if you use steppable pipeline.

it is still faster to directly invoke a scriptblock or simple function, so aside from the steppable pipeline, there seem to be additional opportunities to improve speed.

The main reason it's slow is:

InvokeUsingCmdlet needs to do the necessary setup/cleanup before/after invoking the script block, and Foreach-Object has to pay that tax over and over again, while for a filter or a simple function the setup and cleanup are done only once, because the invocation stays in the same function.
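Illustrating that difference (Test-FastFilter is a hypothetical name): a filter keeps the invocation inside one function, so scope setup is paid once per pipeline rather than once per object.

# scope setup/cleanup happens once for the entire pipeline:
filter Test-FastFilter { "I am at $_" }
1..100000 | Test-FastFilter

# scope setup/cleanup happens once per object:
1..100000 | ForEach-Object { "I am at $_" }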

I summarized why ForEach-Object is slow in my PR #10047.

@daxian-dbw would it perhaps be possible to, for example, just have ForEach-Object simply call GetSteppablePipeline on each block individually, cache the pipeline processors and call them in sequence rather than using InvokeWithCmdlet and incurring the additional cost every time?

@vexx32 I think the steppable pipeline is a good idea for replacing InvokeWithPipeImpl when invoking the script block in ForEach/Where-Object. The problem is more about the breaking changes brought in by changing how script blocks are invoked by those two cmdlets, as there will always be subtle differences like $MyInvocation. Adding a new parameter set like -Simple for a completely different implementation is something worth considering, but I guess it will still raise questions and complaints about inconsistencies like $MyInvocation compared with the legacy implementation.

@daxian-dbw I'm wondering if it's possible to mimic how the invocation presents itself with a steppable pipeline... I guess I'll have a look at it probably this weekend, see how it works with a naive implementation, and get back to y'all with potential pain points if I see them.

Would you be willing to write and PR some regression tests to cover these scenarios for ForEach-Object, so that we can better make an informed judgement of whether a separate switch is or is not necessary?

Also, rather than a permanent cmdlet switch, we do have the option of putting it behind an experimental feature flag.

@vexx32 The reason SteppablePipeline is faster, as shown in @TobiasPSP's work, is that, like invoking a filter function or a script block, the setup (scopes, function context, etc.) and cleanup are done only once because the invocation stays in the same function/script, while InvokeUsingPipeImpl has to set up and then clean up every time the script block is invoked.

As for the regression tests, currently the one I'm aware of is $MyInvocation, see #10477 for related discussion. I can write tests targeting it without problem, but I think there will be other subtle things depending on how we are invoking script blocks in ForEach-Object and Where-Object today.

@daxian-dbw yep, I fully expect there to be. I'm just looking for some baseline tests we can put possible solutions against, and then we can explore from there and cover edge cases as best we can.

I think that from a UX standpoint, having a cmdlet that has two competing behaviour patterns with no easily demonstrated difference between them should be avoided.

If we must break it, we should just break it imo. But I think that break should be as minimal as we can possibly make it, and ideally there should be no user-facing difference apart from speed. 🙂

At the end of the day, the current implementation of Foreach-Object constructs one single scriptblock from the user-submitted parameters, so if we left this as-is and just changed the way the internal scriptblock is invoked, this should preserve most backwards compatibility. The other major change I see is $MyInvocation; maybe someone can shed light on why it is different in the first place, and whether it is doable to emit a "corrected" $MyInvocation. Note that my prototype is calling GetSteppablePipeline() without any arguments, whereas proxy functions do submit invocation details, as shown in the skeleton below. Maybe there is an easy way of tuning $MyInvocation so it is compatible.
So my hope would be to implement the invocation changes without breaking changes.
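For reference, this is the invocation-detail plumbing the standard proxy-function pattern uses (a trimmed skeleton of what [System.Management.Automation.ProxyCommand]::Create emits; the real generated code also mirrors the wrapped command's parameters and adds error handling, and Get-ChildItem is just an example target):

function Get-ChildItemProxy
{
    [CmdletBinding()]
    param()
    begin
    {
        $wrappedCmd = $ExecutionContext.InvokeCommand.GetCommand(
            'Get-ChildItem', [System.Management.Automation.CommandTypes]::Cmdlet)
        $scriptCmd = { & $wrappedCmd @PSBoundParameters }
        # the caller's invocation details are handed to the pipeline here:
        $steppablePipeline = $scriptCmd.GetSteppablePipeline($MyInvocation.CommandOrigin)
        $steppablePipeline.Begin($PSCmdlet)
    }
    process { $steppablePipeline.Process($_) }
    end     { $steppablePipeline.End() }
}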

@TobiasPSP the reason changing this has a tendency to break the way $MyInvocation is set is that ForEach-Object (currently) is not constructing a single scriptblock from the input. All the blocks are invoked separately, every time; even each individual process block is invoked separately, one at a time, using InvokeWithCmdlet().

I think you may be right in that we may be able to preserve the MyInvocation information, but I'll have to take a closer look at the code and try a few things.

@TobiasPSP InvokeUsingCmdlet explicitly uses the InvocationInfo object from the caller's scope; see the code here:
https://github.com/PowerShell/PowerShell/blob/d58a82ad19fbfad81e85778c8b08cb1b28f58fce/src/System.Management.Automation/engine/lang/scriptblock.cs#L676-L677

When invoking a filter/function/simple scriptblock (including through SteppablePipeline), on the other hand, the InvocationInfo object is generated from the script being invoked, which is correct for that invocation style. I don't see a way to force a specific InvocationInfo object when a script block is invoked in this way.
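A quick way to observe the difference being discussed (an illustrative probe, not from the thread; run both and compare what each style reports):

# ForEach-Object: $MyInvocation is copied from the caller's scope
1..1 | ForEach-Object { $MyInvocation.Line }

# steppable pipeline: $MyInvocation is generated for the inner script
$probe = { & { process { $MyInvocation.Line } } }.GetSteppablePipeline()
$probe.Begin($true)
$probe.Process(1)
$probe.End()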

I don't see a way to force a specific InvocationInfo object when a script block is invoked in this way.

SteppablePipeline could be altered to support this. It already does something similar when the pipeline it's wrapping has parameters:

https://github.com/PowerShell/PowerShell/blob/d58a82ad19fbfad81e85778c8b08cb1b28f58fce/src/System.Management.Automation/engine/runtime/Operations/MiscOps.cs#L651-L667

Another parameter could be added to GetSteppablePipeline (or somewhere else in the code path) that would do something similar, but just for MyInvocation. Might/probably will hurt performance or alter behavior though.

@SeeminglyScience Yes, that's something worth looking into. Note that if we are aiming at replacing the current ForEach/Where-Object implementation, then it's important to retain the current debugging experience as much as possible, which might be a problem if using SteppablePipeline with a wrapping script.

Note that if we are aiming at replacing the current ForEach/Where-Object implementation, then it's important to retain the current debugging experience as much as possible, which might be a problem if using SteppablePipeline with a wrapping script.

Yeah... I think there are a lot of obstacles to specifically using SteppablePipeline. I think it makes sense to pursue creating your own pipeline processor manually, similar to how it's done for a steppable pipeline, but purpose-built so the experience can be specifically catered to this scenario. Ideally it would be built out into a public API similar to SteppablePipeline, but maybe more specifically for performance-sensitive areas (though obviously I'd settle for just having ForEach-Object be faster).

Is there anything we can do? Foreach/Where-Object are fundamental to PowerShell, and they are extremely slow. Any investment here would pay off considerably. Plus there is another severe issue tied to it: when scriptblock logging is enabled, the time penalty grows to a level that I'd call a bug.

@TobiasPSP I'm pretty sure creating a pipeline processor manually as outlined above is the right path. Problem is, that's a pretty significant investment. It'll be a pretty time-consuming task, and will likely require one of the most experienced members of the team.

As much as I'd love to see this happen, I can see why it might be hard to justify. The number of users who are going to use ForEach-Object at a scale where it becomes a problem, and who actually care about the extra runtime, isn't likely to be that high. High enough that if the fix were pretty quick it'd be an easy win for sure, but I'm not sure it balances the scales atm.

I hear you. I'd think it is an investment in one of the most fundamental features of PowerShell, a one-time investment for good. Plus, with the advent of scriptblock logging, the current implementation is essentially broken (the performance degradation with scriptblock logging turned on is crippling when processing large collections). I'd rather cut some bells and whistles if that would buy me solid and fast core pipeline cmdlets. But I understand budget is always difficult.

While fixing the issue via a steppable pipeline would be super simple and low-cost, I can see there is a slight chance of compatibility issues, especially with the debug experience. I agree a new pipeline processor is technically the best choice. Yet if that means we won't ever get it, I'd rather settle for second-best ;-).

It would be unfortunate if we had to live with this fundamental flaw forever. The PowerShell pipeline has become notorious for being super slow compared to non-streaming loops, and for being generally avoided, which is sad because this flaw alone is the primary reason for the disadvantage: PowerShell's pipeline is very fast by design.

Should we close this thread?

I hear you. I'd think it is an investment in one of the most fundamental features of PowerShell, a one-time investment for good. Plus, with the advent of scriptblock logging, the current implementation is essentially broken (the performance degradation with scriptblock logging turned on is crippling when processing large collections). I'd rather cut some bells and whistles if that would buy me solid and fast core pipeline cmdlets. But I understand budget is always difficult.

Note I'm mainly talking from what I am guessing is the PowerShell team's perspective. Personally, I run into this often enough that I think it's worth it as well. The worth diminishes a bit if you consider the ~95% of users for whom this scenario will (more than likely) be exceedingly rare.

While fixing the issue via a steppable pipeline would be super simple and low-cost, I can see there is a slight chance of compatibility issues, especially with the debug experience. I agree a new pipeline processor is technically the best choice. Yet if that means we won't ever get it, I'd rather settle for second-best ;-).

FWIW it seems like risk of breaking change is a bigger hurdle to the PowerShell team than implementation difficulty.

It would be unfortunate if we had to live with this fundamental flaw forever.

Forever is a long time. Maybe the landscape changes in a year and everyone is using large collections for some reason. Maybe more advanced users flood to PowerShell and high-performance scenarios become important. Maybe other engine changes enable a change like this with relative ease. Maybe the PowerShell team doubles in size.

The PowerShell pipeline has become notorious for being super slow compared to non-streaming loops, and for being generally avoided, which is sad because this flaw alone is the primary reason for the disadvantage: PowerShell's pipeline is very fast by design.

If we're not accounting for the allocation of the foreach "condition", it'll always be the better choice when performance is important. The pipeline will always be doing more or less the same, but with extra architecture around it. Absolutely worth improving, though.

Should we close this thread?

Nah, unless the PowerShell team comes out and says it'll never happen there's still hope.

