Foreach-Object and Where-Object are among the most frequently used pipeline cmdlets, yet they tend to be very slow with a large number of iterations. In comparison, a classic foreach loop is often 100x faster than the pipeline.
That's because both cmdlets take a scriptblock and invoke it via InvokeReturnAsIs() on each iteration, which prevents code and compiler optimizations.
By replacing the per-iteration scriptblock invocation via InvokeReturnAsIs() with a steppable pipeline, Foreach-Object and Where-Object can be just as fast as foreach loops.
A detailed explanation, use-cases, and prototypes are available here:
https://powershell.one/tricks/performance/pipeline
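For reference, here is a condensed sketch of the idea (simplified from the linked prototype; parameter handling, Begin/End blocks, and error handling are omitted, and the function name is purely illustrative):

```powershell
function ForEach-ObjectFast {
    param([scriptblock]$Process)

    begin {
        # Bake the user's scriptblock into an anonymous function once, then
        # drive it via a steppable pipeline instead of calling
        # InvokeReturnAsIs() on every iteration:
        $code = [scriptblock]::Create("& { process { $Process } }")
        $pipeline = $code.GetSteppablePipeline()
        $pipeline.Begin($true)       # $true: the pipeline expects input
    }
    process {
        # feed each incoming object through the pre-built pipeline:
        $pipeline.Process($_)
    }
    end {
        $pipeline.End()
    }
}

# usage:
1..100000 | ForEach-ObjectFast -Process { "I am at $_" }
```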
This is a good and pretty thorough investigation, thanks! I'd like to have a look at the code here and see if I can put something together for the cmdlet proper, but I would have to wait until the weekend, I think. If someone else wants to have a stab in the meantime, please do!
However, I notice some points missing from the solution. Unless we're willing to accept breaking changes, we will have some issues creating the scriptblock sequence quite as you describe.
For one, ForEach-Object currently accepts an _arbitrary_ number of scriptblocks, and then dynamically determines which should be treated as Begin, Process or End according to the number of blocks and whether any were specified by name.
Two, if I'm not mistaken, we will cause potential breaking changes implementing the same solution as you propose, since we'd be altering the way in which the blocks are invoked. This isn't immediately apparent, and isn't generally an issue until you start working with multiple separate process blocks with variables. I think @jaykul might have some examples of that if he's interested; I remember him showing me a while back.
Three, this method of constructing the scriptblock by converting it back to string form is a bit wasteful, as it requires parsing all the blocks at least twice; we should investigate whether there's a more direct route available to us.
The concept is solid, but we need to find a solution that is more appropriate to the behaviour of the cmdlet. (Also, ForEach-Object is using .InvokeWithCmdlet() if I recall correctly, instead of .InvokeReturnAsIs())
> That's because both cmdlets take a scriptblock and invoke it via InvokeReturnAsIs() on each iteration, which prevents code and compiler optimizations.

`InvokeReturnAsIs` isn't what's used, and doesn't prevent optimization. Instead `InvokeWithCmdlet` is used (as @vexx32 pointed out), which _does_ prevent optimization, but _only_ because it's dot sourced. That's not to say that `InvokeWithCmdlet` would be the same speed as `SteppablePipeline` if it wasn't for the fact that one dot sources and the other doesn't, but the speed difference wouldn't be as drastic (Edit: yeah it would, `Invoke*` methods are super slow).
Here's my understanding.
There are only two things that control whether the compiler is in "optimized" mode:

- Debugging is enabled (e.g. a breakpoint is set, the debugger is currently stepping, etc)
- If a new local scope is created
And iirc this mainly affects the lookup speed of variables local to that scope in particular. That explains the _Testing Functions with Dynamic Scriptblocks_ section. It's not that `InvokeReturnAsIs` is slow, it's that when it looks up the variable `$_`, that variable isn't actually in the current scope, so it has to fall back to the scope enumerator, which is a lot slower than the local lookup tuple. (Edit: While that does affect performance, the `Invoke*` methods are also just real slow.)
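A rough way to observe the scope effect (my own illustration, not from the article; absolute numbers vary wildly by machine):

```powershell
# The same scriptblock, invoked once in a fresh local scope (&, optimized)
# and once dot sourced into the caller's scope (., unoptimized):
$sb = { for ($i = 0; $i -lt 1mb; $i++) { } }
(Measure-Command { & $sb }).TotalMilliseconds   # new local scope
(Measure-Command { . $sb }).TotalMilliseconds   # dot sourced, noticeably slower
```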
> However, I notice some points missing from the solution. Unless we're willing to accept breaking changes, we will have some issues creating the scriptblock sequence quite as you describe.
My suggestion is just a proof-of-concept. We should keep the exact parameter logic that Foreach-Object uses. In the end, Foreach-Object bakes a scriptblock from the user input, and that's where we would kick in: instead of invoking it for each iteration, we would get a steppable pipeline and use Process(). That would also take care of the string-to-scriptblock conversion.
> For one, ForEach-Object currently accepts an _arbitrary_ number of scriptblocks...

> Two, if I'm not mistaken, we will cause potential breaking changes implementing the same solution as you propose, since we'd be altering the way in which the blocks are invoked.
I would have to look at the sources, but I believe these are two separate things.
> The concept is solid, but we need to find a solution that is more appropriate to the behaviour of the cmdlet. (Also, ForEach-Object is using .InvokeWithCmdlet() if I recall correctly, instead of .InvokeReturnAsIs())
Correct, but InvokeUsingCmdlet() is an internal method so I couldn't use it for prototyping in PowerShell and simply used InvokeReturnAsIs(). Both internally end up running InvokeWithPipe() so that shouldn't make too much of a difference.
> `InvokeReturnAsIs` isn't what's used, and doesn't prevent optimization. Instead `InvokeWithCmdlet` is used (as @vexx32 pointed out), which _does_ prevent optimization, but _only_ because it's dot sourced.
InvokeReturnAsIs() is a public method, InvokeUsingCmdlet() is private. Both end up calling the same InvokeWithPipe(). If InvokeReturnAsIs() does not prevent optimization, and since I used this method and still see the hefty performance penalty, then I assume dot-sourcing InvokeUsingCmdlet() won't make much of a difference.
I am not an optimization expert, but a scriptblock cannot optimize itself. When it is invoked, PowerShell simply doesn't know that it will be repeatedly called. Only loops have this information. In the pipeline, a steppable pipeline is the equivalent of a loop. So IMHO the scriptblock should be invoked via a steppable pipeline and not via Invoke...() method calls. The findings in my prototypes seem to at least support this theory.
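To illustrate what I mean by "equivalent of a loop": with a steppable pipeline, the setup runs once and objects are pushed through one at a time (a minimal, hand-stepped sketch):

```powershell
# Begin() runs the pipeline's setup once; each Process() call then flows a
# single object through; End() flushes and tears down:
$pipeline = { Where-Object { $_ -gt 2 } }.GetSteppablePipeline()
$pipeline.Begin($true)          # $true: pipeline input is expected
foreach ($n in 1..5) {
    $pipeline.Process($n)       # emits 3, 4, 5 as they pass the filter
}
$pipeline.End()
```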
> Here's my understanding.
>
> There are only two things that control whether the compiler is in "optimized" mode:
>
> - Debugging is enabled (e.g. a breakpoint is set, the debugger is currently stepping, etc)
> - If a new local scope is created
Which raises the question of how PowerShell knows that a scriptblock is going to be repeated. I assumed, without looking at the code, that PowerShell needs some knowledge about the looping constructs that embed the scriptblock, but maybe I am wrong.
That said, there could be multiple layers of optimizations. At least the facts speak for themselves, and the time penalty we are currently seeing isn't exactly "academic", so something major must be amiss.
> If InvokeReturnAsIs() does not prevent optimization, and since I used this method and still see the hefty performance penalty, then I assume dot-sourcing InvokeUsingCmdlet() won't make much of a difference.

Sorry, it wasn't clear what I was referring to, but the section below is the explanation for that: (Edit: except also the `Invoke*` methods are generally just super slow comparatively)
> And iirc this mainly affects the lookup speed of variables local to that scope in particular. That explains the _Testing Functions with Dynamic Scriptblocks_ section. It's not that `InvokeReturnAsIs` is slow, it's that when it looks up the variable `$_`, that variable isn't actually in the current scope, so it has to fall back to the scope enumerator, which is a lot slower than the local lookup tuple.
Continuing:
> I am not an optimization expert, but a scriptblock cannot optimize itself. When it is invoked, PowerShell simply doesn't know that it will be repeatedly called. Only loops have this information.
As you mention later on, we're talking about different layers of optimization. You're talking about interpretation vs JIT compiled IL; I'm talking about differences in the behavior of `SMA.Compiler`, which it internally refers to as optimization. The effect you're seeing is due to the latter, and is the bane of a whole bunch of performance discussions here because of how often it makes other things look like the problem, the `Invoke*` methods being a super slow code path in general (Edit: fixed).
That said, a scriptblock can optimize itself in the way you are referring to. The optimization threshold is stored on the `LightLambda` object created that serves as the closure for a block. For instance, if you look at a ScriptBlock with ImpliedReflection, follow this path: `$sb.EndBlock.Target._compilationThreshold`. That number is decremented every invocation, and when it hits 0 it compiles itself (assuming that the block is less than 300 statements, otherwise it's set to never compile).
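Without ImpliedReflection, something like this manual reflection walk should show the countdown (purely illustrative; `EndBlock` and `_compilationThreshold` are internal members whose names and shapes may differ between versions):

```powershell
$flags = [System.Reflection.BindingFlags]'Instance,NonPublic'
$sb = { 'hi' }
$null = & $sb   # invoke once so the delegate behind EndBlock exists

# walk ScriptBlock -> delegate -> LightLambda closure -> private counter:
$endBlock = [scriptblock].GetProperty('EndBlock', $flags).GetValue($sb)
$target   = $endBlock.Target
$target.GetType().GetField('_compilationThreshold', $flags).GetValue($target)
```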
> At least the facts speak for themselves, and the time penalty we are currently seeing isn't exactly "academic", so something major must be amiss.

Yeah, some work being put into optimizing variable lookup in dot sourced scopes, along with optimizing lookup of variables from previous scopes in general, would be fantastic. (Edit: but more importantly, also making the `Invoke*` methods faster.)
As a fun side note, if you follow these steps interactively (either use reflection manually or tab complete with `ImpliedReflection`) you can see exactly when it's compiled.

- Create a ScriptBlock: `$sb = { Write-Host Invoked! }`
- Invoke it once however you'd like so `EndBlock` is populated
- Navigate to `$sb.EndBlock.Target.add_Compile`
- Run `$sb.EndBlock.Target.add_Compile({ Write-Host Compiled! })`
- Run `0..40 | % { $sb.InvokeReturnAsIs() }`
Awesome stuff! Thanks for sharing!
Actually I got a little caught up in semantics there. All of the `Invoke*` methods are way slower no matter what settings are used. I don't think it's related to either layer of optimization though. My guess is that it's related to all of the spin up/tear down of pipeline/command processors.

@TobiasPSP `SteppablePipeline` being faster is a good observation; maybe `ForEach-Object` should create its own `PipelineProcessor` similar to `ScriptBlock.GetSteppablePipeline` as a fast path. @daxian-dbw I know you did some work on rewriting the pipeline in very simple cases, maybe this could be a way to expand that.
One observation is that the time penalty is very different from machine to machine; it may depend on the CPU type or mobile processors. On some systems, the script below takes 15-25 seconds, on others it takes a fraction of a second. So the suggested improvements would target and fix those systems where it takes 15-25 seconds. Finding out why this varies so much from machine to machine, and who/how many are affected, should be the next step to investigate.

```powershell
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
$result = 1..100000 | ForEach-Object {
    "I am at $_"
}
$report = '{0} elements in {1:n2} seconds'
$report -f $result.Count, $stopwatch.Elapsed.TotalSeconds
```
I think I identified part of the problem:
When full scriptblock logging is enabled, you see the full impact and speed penalty. Apparently, invoking scriptblocks triggers logging for each invocation, while a steppable pipeline is only logged once.
So with scriptblock logging enabled in full mode, Foreach-ObjectFast is roughly 100x faster. When it is disabled, it is "only" 3x faster...
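For anyone who wants to reproduce the difference on Windows: full script block logging is controlled by the usual policy registry key (flip it on a test machine only; requires elevation):

```powershell
# Turn on full script block logging via the local policy registry key:
$key = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\PowerShell\ScriptBlockLogging'
if (-not (Test-Path $key)) { New-Item -Path $key -Force | Out-Null }
Set-ItemProperty -Path $key -Name EnableScriptBlockLogging -Value 1
```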
Great write-up! I made efforts in the same space 2 months back trying to rewrite the pipeline for `ForEach-Object` for its most commonly used scenario, see #10454.

Unfortunately, it resulted in a breaking change -- the value of `$MyInvocation` was then different in the script block specified to `ForEach-Object`. This is because `InvokeUsingCmdlet()` sets the `$MyInvocation` variable when invoking the script block ... And, `$MyInvocation` is used by many existing scripts with `ForEach-Object`. I searched the scripts in PowerShell Corpus, and see many uses of `$MyInvocation` with `ForEach-Object`.
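A minimal illustration of the kind of pattern that breaks (hypothetical example, just to show what scripts observe):

```powershell
# Today $MyInvocation inside the block reflects the *caller's* invocation
# info, because InvokeUsingCmdlet() sets it explicitly. Invoking the block
# as its own pipeline changes what this reports:
1..3 | ForEach-Object { $MyInvocation.Line }
```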
Given that, that PR was reverted by #10485. After this, it feels to me the best we can do is to have a new `ForEach-Object` command that is simpler and invokes the specified script blocks in a different way, because any manipulation of how `ForEach-Object` currently invokes a script block will very likely introduce one breaking change or another.
Agree, that makes sense. I have just revised the article and differentiated between systems with scriptblock logging enabled and those without, so this gives a better picture of the gains to be expected.
Here are my thoughts:
Another thing you want to consider is the debugging experience of script block arguments provided to `ForEach-Object`. #10454 tried very hard to keep the same experience as much as possible. I'm not sure how the experience would be if you use a steppable pipeline.

> it is still faster to directly invoke a scriptblock or simple function, so aside from the steppable pipeline, there seems to be additional opportunities to improve speed.
The main reason it's slow is: `InvokeUsingCmdlet` needs to do the necessary setup/cleanup before/after invoking the script block, and `Foreach-Object` has to pay that tax over and over again, while for a filter or a simple function the setup and cleanup is done only once because the invocation stays in the same function.

I summarized why `ForEach-Object` is slow in my PR #10047.
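The per-invocation tax is easy to see by comparing against a filter, which pays the setup cost once (numbers vary by machine, but the gap is consistently large):

```powershell
filter Test-Filter { "I am at $_" }   # setup/cleanup happens once per pipeline

(Measure-Command { $null = 1..100000 | ForEach-Object { "I am at $_" } }).TotalSeconds
(Measure-Command { $null = 1..100000 | Test-Filter }).TotalSeconds
```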
@daxian-dbw would it perhaps be possible to, for example, just have ForEach-Object simply call GetSteppablePipeline on each block individually, cache the pipeline processors and call them in sequence rather than using InvokeWithCmdlet and incurring the additional cost every time?
@vexx32 I think the steppable pipeline is a good idea for replacing the `InvokeWithPipeImpl` used for invoking script blocks in `ForEach/Where-Object`. The problem is more about breaking changes brought in by changing how script blocks are invoked by those two cmdlets, as there will always be subtle differences like `$MyInvocation`. Adding a new parameter set like `-Simple` for a completely different implementation is something worth considering, but I guess it will still raise questions and complaints about inconsistencies like `$MyInvocation` compared with the legacy implementation.
@daxian-dbw I'm wondering if it's possible to mimic how the invocation presents itself with a steppable pipeline... I guess I'll have a look at it probably this weekend, see how it works with a naive implementation, and get back to y'all with potential pain points if I see them.
Would you be willing to write and PR some regression tests to cover these scenarios for ForEach-Object, so that we can better make an informed judgement of whether a separate switch is or is not necessary?
Also, rather than a permanent cmdlet switch, we do have the option of putting it behind an experimental feature flag.
@vexx32 The reason SteppablePipeline is faster, as shown in @TobiasPSP's work, is that, like invoking a filter function or a script block, the setup (scopes, function_context, etc.) and cleanup are done only once because the invocation stays in the same function/script, while for `InvokeUsingPipeImpl`, it has to set up and then clean up every time the script block is invoked.

As for the regression tests, currently the one I'm aware of is `$MyInvocation`, see #10477 for related discussion. I can write tests targeting it without problem, but I think there will be other subtle things depending on how we are invoking script blocks in `ForEach-Object` and `Where-Object` today.
@daxian-dbw yep, I fully expect there to be. I'm just looking for some baseline tests we can put possible solutions against, and then we can explore from there and cover edge cases as best we can.
I think that from a UX standpoint, having a cmdlet that has two competing behaviour patterns with no easily demonstrated difference between them should be avoided.
If we must break it, we should just break it imo. But I think that break should be as minimal as we can possibly make it, and ideally there should be no user-facing difference apart from speed. :slightly_smiling_face:
At the end of the day, the current implementation of Foreach-Object constructs one single scriptblock from the user-submitted parameters, so if we left this as-is and just changed how the internal scriptblock is invoked, this should preserve most backwards compatibility. The other major change I see is $MyInvocation; maybe someone can shed light on why it is different in the first place, and whether it is doable to emit a "corrected" $MyInvocation. Note that my prototype is calling GetSteppablePipeline() w/o any arguments, where proxy functions do submit invocation details. Maybe there is an easy way of tuning $MyInvocation so it is compatible.
So my hope would be to implement the invocation changes w/o breaking changes.
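For comparison, this is roughly the wiring that generated proxy functions use (as emitted by `[System.Management.Automation.ProxyCommand]::Create()`), where the caller's invocation details are handed to the pipeline:

```powershell
# begin block of a typical proxy function:
$wrappedCmd = $ExecutionContext.InvokeCommand.GetCommand(
    'ForEach-Object', [System.Management.Automation.CommandTypes]::Cmdlet)
$scriptCmd = { & $wrappedCmd @PSBoundParameters }
$steppablePipeline = $scriptCmd.GetSteppablePipeline($MyInvocation.CommandOrigin)
$steppablePipeline.Begin($PSCmdlet)   # passing $PSCmdlet carries invocation info
# ...followed by $steppablePipeline.Process($_) in process{} and
# $steppablePipeline.End() in end{}
```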
@TobiasPSP the reason changing this has a tendency to break the way $MyInvocation is set is because ForEach-Object (currently) is not constructing a single scriptblock from the input. All the blocks are invoked separately, every time; even each individual process block is invoked separately, one at a time, using InvokeWithCmdlet().
I think you may be right in that we may be able to preserve the MyInvocation information, but I'll have to take a closer look at the code and try a few things.
@TobiasPSP `InvokeUsingCmdlet` explicitly uses the `InvocationInfo` object from the caller scope, see the code here:
https://github.com/PowerShell/PowerShell/blob/d58a82ad19fbfad81e85778c8b08cb1b28f58fce/src/System.Management.Automation/engine/lang/scriptblock.cs#L676-L677
While when invoking a filter/function/simple scriptblock (including through SteppablePipeline), the `InvocationInfo` object is generated based on the script being invoked, which is correct when invocation happens this way. I don't see a way to force using a specific `InvocationInfo` object when a script block is invoked in this way.
> I don't see a way to force using a specific `InvocationInfo` object when a script block is invoked in this way.
`SteppablePipeline` could be altered to support this. It already does something similar when the pipeline it's wrapping has parameters. Another parameter could be added to `GetSteppablePipeline` (or somewhere else in the code path) that would do something similar, but just for `MyInvocation`. It might (probably will) hurt performance or alter behavior though.
@SeeminglyScience Yes, that's something worth looking into. Be noted, if we are aiming at replacing the current `ForEach/Where-Object` implementation, then it's important to retain the current debugging experience as much as possible, which might be a problem if using `SteppablePipeline` with a wrapping script.
> Be noted, if we are aiming at replacing the current `ForEach/Where-Object` implementation, then it's important to retain the current debugging experience as much as possible, which might be a problem if using `SteppablePipeline` with a wrapping script.
Yeah... I think there are a lot of obstacles to specifically using `SteppablePipeline`. I think it makes sense to pursue creating your own pipeline processor manually, similar to how it's done for a steppable pipeline, but purpose built so the experience can be specifically catered to this scenario. Ideally it would be built out into a public API similar to `SteppablePipeline`, but maybe more specifically for performance sensitive areas (though obviously I'd settle for just having `ForEach-Object` be faster).
Is there anything we can do? Foreach/Where-Object are fundamental to PowerShell, and they are extremely slow. Any investment here would pay off considerably. Plus there is another severe issue tied to it: when scriptblock logging is enabled, the time penalty grows to a level that I'd call a bug.
@TobiasPSP I'm pretty sure creating a pipeline processor manually as outlined above is the right path. Problem is, that's a pretty significant investment. It'll be a pretty time consuming task, and likely require one of the most experienced members of the team.
As much as I'd love to see this happen, I can see why it might be hard to justify. The number of users who are going to use `ForEach-Object` at a scale where it becomes a problem, and who actually care about the extra runtime, isn't likely to be that high. High enough that if the fix was pretty quick it'd be an easy win for sure, but I'm not sure it balances the scales atm.
I hear you. I'd think it is an investment in one of the most fundamental features of PowerShell, a one-time investment for good; plus, with the advent of scriptblock logging, the current implementation is essentially broken (the performance degradation with scriptblock logging turned on is crippling when processing large collections). I'd rather cut some bells and whistles if that would buy me solid and fast core pipeline cmdlets. But I understand budget is always difficult.
While fixing the issue via a steppable pipeline would be super simple and low cost, I can see there is a slight chance of compatibility issues, especially with the debug experience. I agree a new pipeline processor is technically the best choice. Yet if that means we won't ever get it, I'd rather settle for second-best ;-).
It would be unfortunate if we had to live with this fundamental flaw forever. The PowerShell pipeline has become notorious for being super slow compared to nonstreaming loops, and for being generally avoided, which is sad because the primary reason for the disadvantage is this flaw alone. PowerShell's pipeline is very fast by design.
Should we close this thread?
> I hear you. I'd think it is an investment in one of the most fundamental features of PowerShell, a one-time investment for good; plus, with the advent of scriptblock logging, the current implementation is essentially broken (the performance degradation with scriptblock logging turned on is crippling when processing large collections). I'd rather cut some bells and whistles if that would buy me solid and fast core pipeline cmdlets. But I understand budget is always difficult.

Note I'm mainly talking from what I am guessing is the PowerShell team's perspective. Personally I run into this often enough that I think it's worth it as well. The worth diminishes a bit if you consider the ~95% of users for which this scenario will (more than likely) be exceedingly rare.

> While fixing the issue via a steppable pipeline would be super simple and low cost, I can see there is a slight chance of compatibility issues, especially with the debug experience. I agree a new pipeline processor is technically the best choice. Yet if that means we won't ever get it, I'd rather settle for second-best ;-).
FWIW it seems like risk of breaking change is a bigger hurdle to the PowerShell team than implementation difficulty.
> It would be unfortunate if we had to live with this fundamental flaw forever.

Forever is a long time. Maybe the landscape changes in a year and everyone is using large collections for some reason. Maybe more advanced users flood to PowerShell and high performance scenarios become important. Maybe other engine changes enable a change like this with relative ease. Maybe the PowerShell team doubles in size.

> The PowerShell pipeline has become notorious for being super slow compared to nonstreaming loops, and for being generally avoided, which is sad because the primary reason for the disadvantage is this flaw alone. PowerShell's pipeline is very fast by design.
If we're not accounting for the allocation of the `foreach` "condition", it'll always be the better choice when performance is important. The pipeline will always be doing more or less the same but with extra architecture around it. Absolutely worth improving though.

> Should we close this thread?
Nah, unless the PowerShell team comes out and says it'll never happen there's still hope.