Using Get-Content to read an example 170,000-line wordlist text file:
# Default use, slow. Roughly 6 seconds. Over 100x longer than the alternatives.
$lines = Get-Content -Path '/path/to/bigfile.txt'
# Fast. Roughly 40ms - 90ms.
$lines = [System.IO.File]::ReadAllLines('/path/to/bigfile.txt')
# Fast. Roughly 50-100ms. NB. the ReadCount has to be larger than the file line count,
# otherwise $lines is not a 1-dimensional array. i.e. you need to know the file line
# count to be able to do this in one move.
$lines = Get-Content -Path '/path/to/bigfile.txt' -ReadCount 200kb
# Fastest. Roughly 30ms - 50ms.
$lines = Get-Content -Path '/path/to/bigfile.txt' -ReadCount 100 | ForEach-Object { $_ }
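For anyone who wants to reproduce the comparison, here is a rough timing sketch. The temp-file name, the generated `word1..wordN` content and the `$strategies` table are all made up for illustration, and the timings will differ by machine and PowerShell version. It also uses `-ReadCount 0`, which, as far as I know, sends the whole file down the pipeline in one array, so the line count would not need to be known up front.

```powershell
# Build a throwaway test file (hypothetical name); 170,000 lines mirrors the post above.
$lineCount = 170000
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-timing-demo.txt'
[string[]]$words = foreach ($i in 1..$lineCount) { "word$i" }
[System.IO.File]::WriteAllLines($path, $words)

# One script block per read strategy discussed above.
$strategies = [ordered]@{
    'Get-Content (default)'      = { Get-Content -LiteralPath $path }
    'ReadAllLines'               = { [System.IO.File]::ReadAllLines($path) }
    'Get-Content -ReadCount 0'   = { Get-Content -LiteralPath $path -ReadCount 0 }
    'Get-Content -ReadCount 100' = { Get-Content -LiteralPath $path -ReadCount 100 | ForEach-Object { $_ } }
}

# Time each strategy once and record how many lines it produced.
$results = foreach ($name in $strategies.Keys) {
    $block = $strategies[$name]
    $stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
    $lines = & $block
    $stopwatch.Stop()

    [pscustomobject]@{
        Strategy     = $name
        Lines        = @($lines).Count
        Milliseconds = $stopwatch.ElapsedMilliseconds
    }
}

$results | Format-Table -AutoSize
Remove-Item -LiteralPath $path
```

A single run like this is noisy (disk cache, JIT warm-up), so repeating each strategy a few times and comparing the spread is probably more meaningful than any one number.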
The reason for the slow version is explained here, apparently by Bruce Payette in 2006:
This is a known issue with the way Get-Content works. For each object
returned from the pipe, it adds a bunch of extra information to that object
in the form of NoteProperties.
These properties are being added for every object processed in the
pipeline. We do this to allow cmdlets to work more effectively together.
It's important because things like the Path property may vary across
different object types. In effect, we're doing "property name
normalization". Unfortunately, while this technique provides significant
benefits by making the system more consistent, it isn't free. It adds
significant overhead both in terms of processing time and memory space.
We're investigating ways to reduce these costs without losing the benefits
but in the end, we may need to add a way to suppress adding this extra
information.
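The decoration described in that quote can be seen directly. This is a small sketch, assuming a scratch file in the temp directory (the file name and contents are invented for the demo); it lists the NoteProperties on a line returned by Get-Content and on the same line read through .NET.

```powershell
# Scratch file for the demo (hypothetical name and contents).
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-noteprops-demo.txt'
[System.IO.File]::WriteAllLines($path, [string[]]@('alpha', 'beta', 'gamma'))

# First line as emitted by Get-Content, and the same line read via .NET.
$decorated = Get-Content -LiteralPath $path | Select-Object -First 1
$plain     = [System.IO.File]::ReadAllLines($path)[0]

# Collect the names of any NoteProperties attached to each string.
$decoratedNames = @($decorated.PSObject.Properties |
    Where-Object { $_.MemberType -eq 'NoteProperty' } |
    ForEach-Object { $_.Name })

$plainNames = @($plain.PSObject.Properties |
    Where-Object { $_.MemberType -eq 'NoteProperty' } |
    ForEach-Object { $_.Name })

"Get-Content line: $($decoratedNames -join ', ')"
"ReadAllLines line has $($plainNames.Count) NoteProperties"

Remove-Item -LiteralPath $path
```

On the versions I have seen, the Get-Content line carries provider properties such as PSPath and PSProvider plus ReadCount, while the .NET string carries none. That per-line bookkeeping is the overhead the quote is talking about.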
I think it's a shame that the default usage of Get-Content is the slow version, but that's likely not going to change. But, 12 years on from this posting, is it time to add a way to suppress adding this extra information?
For example, a parameter on Get-Content that switches off the NoteProperties. I have no good suggestion for the parameter name. Ideally it would communicate "this is faster" to people who see it in written code, or who read the documentation wondering how to speed up Get-Content on large files.
I'm willing to take this if the PS team OKs it, although I'm unsure what to name the parameter. Someone suggested RawLines to me.
I have decided to go ahead and work on this issue. @powershell/powershell can I get an assignment?
I've marked it as an enhancement and up-for-grabs. You should just be able to assign it to yourself. We added -Raw a long while back to address the perf issue but it doesn't really do the right thing. Naming the new parameter -RawLines sounds ok but maybe a -ReadMode parameter that took lines, text, rawlines, rawtext etc. might be more flexible.
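To make the -ReadMode idea concrete, here is a purely hypothetical sketch written as a wrapper function rather than a change to Get-Content. `Read-TextFile` is an invented name, no `-ReadMode` parameter exists on Get-Content today, and the meaning given to each mode below (decorated vs. undecorated, lines vs. one string) is my own guess at what was intended, not something stated in this thread.

```powershell
# Hypothetical wrapper illustrating a -ReadMode style parameter.
# ASSUMPTION: 'Lines'/'Text' keep Get-Content's behavior, while
# 'RawLines'/'RawText' return plain .NET strings with no NoteProperties.
function Read-TextFile {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)]
        [string] $LiteralPath,

        [ValidateSet('Lines', 'Text', 'RawLines', 'RawText')]
        [string] $ReadMode = 'Lines'
    )

    # .NET resolves relative paths against the process directory, not the
    # current PowerShell location, so resolve to a full filesystem path first.
    $fullPath = Convert-Path -LiteralPath $LiteralPath

    switch ($ReadMode) {
        'Lines'    { Get-Content -LiteralPath $fullPath }
        'Text'     { Get-Content -LiteralPath $fullPath -Raw }
        'RawLines' { [System.IO.File]::ReadAllLines($fullPath) }
        'RawText'  { [System.IO.File]::ReadAllText($fullPath) }
    }
}

# Usage against a scratch file (hypothetical name and contents).
$demoPath = Join-Path ([System.IO.Path]::GetTempPath()) 'readmode-demo.txt'
[System.IO.File]::WriteAllLines($demoPath, [string[]]@('one', 'two', 'three'))

$rawLines = Read-TextFile -LiteralPath $demoPath -ReadMode RawLines
$rawText  = Read-TextFile -LiteralPath $demoPath -ReadMode RawText
$lines    = Read-TextFile -LiteralPath $demoPath -ReadMode Lines

Remove-Item -LiteralPath $demoPath
```

One caveat with a wrapper like this: the .NET readers and Get-Content may pick different default encodings on some PowerShell versions, so a real implementation would need an encoding parameter to keep the modes consistent.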
@BrucePay maybe I'm just being daft, but I don't see a way to assign this to myself.
See #7501
Someone just brought PR #7502 to my attention. Would that make solving this issue unnecessary? It seems to solve the same problem of Get-Content being slow, but in a more elegant manner.
Only individuals marked as Collaborators show up in the Assignees list. @jcotton42 I'll assign this to myself to avoid someone else duplicating the work. You can assume this is assigned to you.
The WIP PR is still under review as it is a breaking change. However, although that change will improve things if accepted, it may still make sense to add a parameter to Get-Content.
"We added -Raw a long while back to address the perf issue but it doesn't really do the right thing."
Ohhh, I didn't imagine it had already been acted on. That's partly what I meant by "ideally I would want a parameter name to communicate 'this is faster' to people who see it". I've used -Raw in other circumstances, not noticed it was faster, and not twigged that it was related to this.
@SteveL-MSFT ok sounds good. Given that it looks like that PR will affect mine I will wait until it's merged.
As for what to name the parameter:
I suggest -Bare, which avoids the -Raw confusion (_raw_ also has inapplicable connotations of reading _raw bytes_).
(Conversely, a more sensibly named parameter alias for -Raw should be introduced - see #7715)
Using -Bare - without including the term _lines_ - also opens the door for implementing similar logic for other cmdlets (opting out of output-object decoration) that may be emitting different types of output objects - see #7713 (though there the "bare" objects happen to be lines too, except if combined with the proposed option to return only matching portions of a line (#7712)).