Powershell: Improve performance of Group-Object

Created on 30 Jul 2018  Â·  14Comments  Â·  Source: PowerShell/PowerShell

As of pwsh 6.1, preview-4, we have quadratic performance when the number of unique values -> n.

By gathering the input first, sorting it, and then only each new value to the last group, we can come much closer to n * log(n) instead of n * n.

I have a prototype with the following perf measurements:

Count| Unique| OldImpl | newImpl | Speedup | Command
------| ---------|--------- | ----------| ----------|-----------
10689| 8220 | 00:00:06.81 | 00:00:00.23| 29,1 | $allItemsInPowerShellSrcTree \| group {[io.path]::GetFileName($_)}
1690765 |3761|00:02:30.34 | 00:00:22.32 | 6,7 | $u \| group

where $u is a dataset of string values out of which 3700 is unique.

The only downside I have seen is that the both the order of the output objects are different, and so is the order within the groups.

Is that part of the public contract?
Is it worth a PR?

Area-Cmdlets-Utility Breaking-Change Committee-Reviewed Issue-Enhancement Resolution-Fixed

Most helpful comment

image

Performance graph - before is the seven column to the left, showing the quadratic nature of the number of groups.

The others are from this PR.

All 14 comments

I submitted my implementation since I figured it is easier to reason about and review the change when it is more concrete.

Changed the sorting to a stable sort.
Now it is only the order of the GroupInfos that are changed, but not the objects in GroupInfo.Group.

How does it handles?

'007', 7 | group
#vs
7, '007' | group

@PetSerAl , I like how you think! I just made a special case when the types differ, and fall back to the slower algorithm.

Update:
I maintain the behavior of windows powershell.

PS> '007', 7 | group

Count Name                      Group
----- ----                      -----
    1 007                       {007}
    1 7                         {7}


PS> '007', 7 | group

Count Name                      Group
----- ----                      -----
    2 7                         {7, 007}

@PowerShell/powershell-committee reviewed this. Although users should not rely on the order, they could be broken by this if they wanted the first group. We agreed that this change should be taken and the impact of someone dependent on previous behavior is low.

image

Performance graph - before is the seven column to the left, showing the quadratic nature of the number of groups.

The others are from this PR.

@dfinke Of course, Export-Excel was used to generate the graphs! Kudos!

Well, I was just about to file a new issue about this, because this impacted me. Here's what I was going to file:

In PowerShell 5.x and below, the Group-Object cmdlet does not sort the groups by the property used to group the objects. In PowerShell 6.1, Group-Object is sorting the groups by the property used to group the objects.

I ran into this as an issue while implementing some changes in Pester. The results of tests are grouped by Contexts (for Gherkin style tests), and the result of the Context (e.g. passed, failed, etc.) is determined from the results of the individual steps that form the members of the group. There are unit tests which check that the results are grouped according to the Context status. The check uses an array of expected context names for a given result status, which is compared with the context names coming from the Pester TestResult object. This object uses the Group-Object cmdlet to perform that sorting. As a consequence, on PowerShell 6.1, these unit tests are broken because the results are not in the same order as the expected array results.

Steps to reproduce

[PSCustomObject]@{ Key = 'z'; Value = 'z' },
[PSCustomObject]@{ Key = 'y'; Value = 'y' },
[PSCustomObject]@{ Key = 'b'; Value = 'b' },
[PSCustomObject]@{ Key = 'a'; Value = 'a' } |
    Group-Object -Property Key

Expected behavior

Count Name Group
----- ---- -----
    1 z    {@{Key=z; Value=z}}
    1 y    {@{Key=y; Value=y}}
    1 b    {@{Key=b; Value=b}}
    1 a    {@{Key=a; Value=a}}

Actual behavior

Count Name Group
----- ---- -----
    1 a    {@{Key=a; Value=a}}
    1 b    {@{Key=b; Value=b}}
    1 y    {@{Key=y; Value=y}}
    1 z    {@{Key=z; Value=z}}

Environment data

Name                           Value
----                           -----
PSVersion                      6.1.0
PSEdition                      Core
GitCommitId                    6.1.0
OS                             Microsoft Windows 10.0.17763
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Conceivably, you could refactor your test to take this into account, no?

$Groups.Name | Sort | Should -Be ($Items.Key | Sort)

That thought occurred to me. What I wasn't sure about was whether or not the order of the names of the scenarios as reported in the test results object was important. Because that what you said would be the simplest solution and is the first one I thought of.

@fourpastmidnight See Steves comment above. This is a breaking change, in that it changes the order of the output, but it was deemed to be worth it for the perf improvements.

I saw that. That is a dramatic improvement! There is, however, no mention of this breaking change in the documentation on docs.microsoft.com. This is a fairly significant change. In most cases, people were probably already piping the results to Sort-Object or something equivalent. But I’m sure there are others like me who View this cmdlet as another Linq-like PS “operator” and that it would simply stream over the results, preserving the order of the results. I have to remember it’s like Linq, it’s not Linq.

Anyway, I managed to find another way to accomplish what I was trying to do which did not necessitate the use of Group-Object, thereby avoiding the issue entirely.

Thank you for replying to me in the first place on this closed issue. I appreciate your time and consideration.

Sent from my Windows 10 phone

From: Staffan Gustafsson
Sent: Friday, April 5, 2019 4:08
To: PowerShell/PowerShell
Cc: Craig E. Shea; Mention
Subject: Re: [PowerShell/PowerShell] Improve performance of Group-Object(#7409)

@fourpastmidnight See Steves comment above. This is a breaking change, in that it changes the order of the output, but it was deemed to be worth it for the perf improvements.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@powercode have we created a docs issue for it? It is a notable change and should definitely be mentioned in the help for the cmdlet.

Was this page helpful?
0 / 5 - 0 ratings