PowerShell: Get-Content -ReadCount 0 combined with -Last / -Tail seems to read ALL lines internally

Created on 11 Feb 2020 · 5 Comments · Source: PowerShell/PowerShell

Get-Content -ReadCount 0 is a convenient way to request that all lines be read _at once, into an array_ and to have that array be output _as a single_ object to the success stream.

When combined with -First aka -TotalCount aka -Head, this sensibly allocates an array for, and outputs, _only the requested number of lines_ from the beginning rather than _all_ lines.
Note: Meaningfully combining -First with -ReadCount 0 was only recently implemented, in #10749.
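
To make the "single object" behavior concrete, here is a minimal sketch that counts how many objects travel through the pipeline with and without -ReadCount 0. The three-line sample file and the variable names `$streamed` and `$batched` are illustrative assumptions, not part of the original report.

```powershell
# Illustrative sketch: a small sample file (contents are an assumption for demonstration only).
$f = [IO.Path]::GetTempFileName()
'a', 'b', 'c' | Set-Content -LiteralPath $f

# Default behavior: each line is sent through the pipeline as its own [string].
$streamed = @(Get-Content -LiteralPath $f | ForEach-Object { $_.GetType().Name })

# -ReadCount 0: all lines are collected and sent as ONE array object.
$batched = @(Get-Content -LiteralPath $f -ReadCount 0 | ForEach-Object { $_.GetType().Name })

"Default:      $($streamed.Count) pipeline object(s) of type $($streamed[0])"
"-ReadCount 0: $($batched.Count) pipeline object(s) of type $($batched[0])"

Remove-Item -LiteralPath $f
```

With the default read count the pipeline should carry three String objects, whereas -ReadCount 0 should yield a single Object[] that holds all three lines.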

While combining -Last aka -Tail with -ReadCount 0 is _functional_, performance timings suggest that _all_ lines are needlessly being read into an array behind the scenes, before only the sub-array of interest is output.

In other words: while -Tail <n> -ReadCount 0 _should_ be the same as -Tail <n> -ReadCount <n> (explicitly setting the read count to the same number as the number of trailing lines requested), it currently isn't in terms of _performance and memory use_.

Steps to reproduce

# Create a temporary file with 1 million lines
$f = [IO.Path]::GetTempFileName(); (, 'foo') * 1e6 > $f
# Warm up the cache.
$tmp = gc $f -ReadCount 0

# Read 1000 lines from the end, as a single array
$n = 1000
{ $v = gc $f -Tail $n -ReadCount $n },
{ $v = gc $f -Tail $n -ReadCount 0  }, 
{ $v = gc $f          -ReadCount 0  } <# control: read all lines #> |  % {
  "$_`: " + (Measure-Command $_).TotalSeconds 
}

Remove-Item $f

Expected behavior

-Tail $n -ReadCount $n and -Tail $n -ReadCount 0 should perform virtually the same and should be faster than -ReadCount 0 by itself.

Actual behavior

-Tail $n -ReadCount 0 is not only slower than -Tail $n -ReadCount $n , but also slower than -ReadCount 0 by itself, suggesting that _all_ lines were read behind the scenes.

 $v = gc $f -Tail $n -ReadCount $n : 0.1515997
 $v = gc $f -Tail $n -ReadCount 0  : 0.2118531
 $v = gc $f          -ReadCount 0  : 0.209674

Environment data

PowerShell Core 7.0.0-rc.2
Issue-Question Resolution-Fixed

All 5 comments

Looked into this a bit. I think you discovered an interesting issue / circumstance. In PowerShell Core, the command [IO.Path]::GetTempFileName(); (, 'foo') * 1e6 > $f produces a UTF-8 file without a byte-order mark (BOM). Side note: this differs from Windows PowerShell, which produces a UTF-16 file with a BOM. When processing UTF-8 data with no BOM, Get-Content cannot detect the file encoding when the file is read in reverse. As a result, it falls back to a forward search that enumerates the whole file (albeit in different chunk sizes depending on -ReadCount, which might explain your performance differences). Can you confirm that doing the following changes the behavior for you?

$f = [IO.Path]::GetTempFileName()
(, 'foo') * 1e6 | Set-Content -Encoding utf8BOM -LiteralPath $f
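
To check which case a given file falls into, one option is to inspect its first three bytes for the UTF-8 BOM (EF BB BF). The following is a rough sketch only: `Test-Utf8Bom` is a hypothetical helper written for this illustration, not a built-in cmdlet and not the detection logic Get-Content itself uses. It assumes PowerShell 6 or later, where the utf8BOM and utf8NoBOM encoding names are available.

```powershell
# Hypothetical helper (an assumption for illustration): reports whether a file
# starts with the UTF-8 byte-order mark EF BB BF.
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string] $LiteralPath)

    # Pass a full path: .NET resolves relative paths against the process
    # working directory, which may differ from PowerShell's current location.
    $stream = [IO.File]::OpenRead($LiteralPath)
    try {
        $buffer = [byte[]]::new(3)
        $read = $stream.Read($buffer, 0, 3)
        return ($read -eq 3 -and
                $buffer[0] -eq 0xEF -and
                $buffer[1] -eq 0xBB -and
                $buffer[2] -eq 0xBF)
    }
    finally {
        $stream.Dispose()
    }
}

# Usage: compare a file written with a BOM against one written without.
$withBom = [IO.Path]::GetTempFileName()
$noBom   = [IO.Path]::GetTempFileName()
'foo' | Set-Content -Encoding utf8BOM   -LiteralPath $withBom
'foo' | Set-Content -Encoding utf8NoBOM -LiteralPath $noBom

$withBomResult = Test-Utf8Bom -LiteralPath $withBom
$noBomResult   = Test-Utf8Bom -LiteralPath $noBom
"With BOM:    $withBomResult"
"Without BOM: $noBomResult"

Remove-Item -LiteralPath $withBom, $noBom
```

Running the timing test from the original report against both kinds of file should then show whether the presence of a BOM accounts for the difference.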

Intriguing, @NoMoreFood, thanks for the sleuthing - using a file _with a BOM_ indeed makes a big difference in the resulting performance (and even seems to make -ReadCount 0 marginally faster than -ReadCount $n).

I would never expect Get-Content to _detect_ encodings, however: if a BOM is present, the encoding is unambiguously specified; in the absence of a BOM, the _default_ encoding should be _assumed_ (UTF-8 in PS Core), so there is no good reason for this variation in behavior.

Yeah, I think you're right and I believe I see the detection bug in the code. More to come....

A pull request to resolve this issue has been created.

This issue was addressed in #11899, which has now been successfully released as v7.1.0-preview.1.
