Importing a 150 MB CSV file with 50k rows, 700 columns and 5 +/- 3 characters per entry results in a memory consumption of 7 GB by the PowerShell process. That is a ratio of around 50 compared to the file size. I have seen the same ratio with a similar file of smaller size (1.5 MB), so I think this is a more generic issue rather than something specific to the structure of this particular CSV file.
$csv = Import-Csv $pathToCsvFile
Note that this does not happen if I do not assign the result to a variable or if I pipe it to Out-Null.
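A rough way to observe the ratio (a sketch; it compares the process working set before and after the import with the file size, so it includes everything else the process holds; $pathToCsvFile is the same placeholder as above):

# Sketch: file-size-to-memory ratio via the process working set
$before = (Get-Process -Id $PID).WorkingSet64
$csv = Import-Csv $pathToCsvFile
$after = (Get-Process -Id $PID).WorkingSet64
'{0:N0}x the file size' -f (($after - $before) / (Get-Item $pathToCsvFile).Length)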
Expected behavior: file size to memory ratio is reasonable (around 10).
Actual behavior: file size to memory ratio is around 50.
Environment data (latest daily build and RC1):
> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      6.1.0-preview.761
PSEdition                      Core
GitCommitId                    6.1.0-preview.761
OS                             Microsoft Windows 6.3.9600
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Name                           Value
----                           -----
PSVersion                      6.1.0-rc.1
PSEdition                      Core
GitCommitId                    6.1.0-rc.1
OS                             Microsoft Windows 6.3.9600
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
Dup #7112
@thezim I do not see this as a duplicate because the behaviour in Windows PowerShell was the same.
@bergmeister Could you please test with the latest build or RC1? We have already added two performance fixes; I guess you will be pleasantly surprised.
@iSazonov I tried with the latest daily build and with RC1 today, and there was no noticeable difference.
@bergmeister Thanks!
> the behaviour in Windows PowerShell was the same.
Please add the results of the comparison (memory and speed) to the PR description.
@bergmeister I tried with a randomly generated file (600 columns, 50,000 lines, 10 MB) and cannot confirm this - RC1 works very well (process size is 40 MB at start, then drops to 22 MB in Task Manager).
@iSazonov How did you run it? If I run it using Import-Csv $file | Out-Null
there is hardly any memory usage, but if I assign it to a variable using $csv = Import-Csv $file
I get the big memory allocation, which I can only get rid of at the end by calling [System.GC]::Collect().
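One way to compare the two cases side by side (a sketch; $file is the same placeholder as above, and GetTotalMemory only reports the managed heap, so the numbers will differ from Task Manager):

# Sketch: measure retained managed memory for piping vs. assigning
[System.GC]::Collect()
$baseline = [System.GC]::GetTotalMemory($true)

Import-Csv $file | Out-Null
$afterPipe = [System.GC]::GetTotalMemory($true)       # drops back near the baseline

$csv = Import-Csv $file
$afterAssign = [System.GC]::GetTotalMemory($true)     # stays high while $csv is alive

'pipe: {0:N0} MB, assign: {1:N0} MB' -f (($afterPipe - $baseline) / 1MB), (($afterAssign - $baseline) / 1MB)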
Test script
cd c:\tmp\

function perf_test($source, $result) {
    Write-Host "Measuring Import-Csv performance over time..."
    "index,duration_ms,bytes_consumed" | Out-File $result
    for ($i = 0; $i -le 400; $i++) {
        $millis = (Measure-Command { Import-Csv $source }).TotalMilliseconds
        # Record the managed heap size next to the duration so the results can be analyzed in Excel
        $memory = [System.GC]::GetTotalMemory($false)
        $i.ToString() + "," + $millis.ToString() + "," + $memory | Out-File $result -Append
    }
    Write-Host "Done"
}

###############
# For measuring the cost of adding new properties to PSObjects:
# 700 column headers, but only the first 20 columns get values in each row,
# so Import-Csv still creates 700 note properties per object while the file stays small.
$fields = 0..699 | ForEach-Object { "random_numbers$_" }
($fields -join ",") | Out-File .\source2.csv
Get-Random -SetSeed 1 | Out-Null
for ($i = 0; $i -le 50000; $i++) {
    $values = 0..19 | ForEach-Object { (Get-Random).ToString() }
    ($values -join ",") | Out-File .\source2.csv -Append
}

perf_test .\source2.csv .\results2.csv
Your issue is not directly related to Import-Csv. It seems we have extra allocations in the "=" (assignment) operator.
The memory pressure most likely comes from the cost of PSNoteProperty. Each PSNoteProperty has an overhead of 48 bytes, so when you just store a few bytes per property, that becomes massive.
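For the file from the original report that overhead alone is already huge; a quick back-of-the-envelope check (not a profiler measurement):

# 50,000 rows x 700 columns x ~48 bytes of PSNoteProperty overhead
$rows = 50000; $cols = 700; $bytesPerNoteProperty = 48
'{0:N1} GB' -f ($rows * $cols * $bytesPerNoteProperty / 1GB)
# roughly 1.6 GB before counting the property names, the string values and the PSObject wrappers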
A workaround you can try is to generate a type for the items, and store the data there.
Import-Csv .\source2.csv |
    Select-Object -First 1 |
    ForEach-Object { $_.psobject.Properties.Name } |
    Join-String -Separator "`r`n" -OutputPrefix "class MyCsv {`r`n" -OutputSuffix "`n}" -Property { "`t`$$_" } |
    Invoke-Expression

$a = Import-Csv .\source2.csv | ForEach-Object { [MyCsv] $_ }
That should be more efficient. You can improve the typing as well; ints are cheaper than strings, for example, but you get the gist.
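For instance, a sketch of the same generator with typed columns. It assumes a file where every column really holds an integer (so not the sparse source2.csv generated above, where most columns are empty); .\allints.csv is just a placeholder name:

# Sketch: same class generation as above, but each property is declared [int]
# so the values are stored as ints instead of strings.
Import-Csv .\allints.csv |
    Select-Object -First 1 |
    ForEach-Object { $_.psobject.Properties.Name } |
    Join-String -Separator "`r`n" -OutputPrefix "class MyIntCsv {`r`n" -OutputSuffix "`n}" -Property { "`t[int]`$$_" } |
    Invoke-Expression

$a = Import-Csv .\allints.csv | ForEach-Object { [MyIntCsv] $_ }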
Additional thoughts: the cmdlet works well on .NET Framework, where the GC frees memory more aggressively. The .NET Core GC behaves differently. Perhaps we could adjust the Core GC settings.
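For anyone who wants to experiment with that locally, the usual knobs are runtime environment variables that a child process reads at startup (a sketch; whether they actually help this workload is untested):

# Sketch: toggle server/concurrent GC for a child pwsh via COMPlus_* variables,
# which the .NET Core runtime reads at process start, then rerun the import there.
$env:COMPlus_gcServer = '1'
$env:COMPlus_gcConcurrent = '0'
pwsh -NoProfile -Command {
    $csv = Import-Csv C:\tmp\source2.csv
    '{0:N0} MB working set' -f ((Get-Process -Id $PID).WorkingSet64 / 1MB)
}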
@bergmeister I created a gist here https://gist.github.com/powercode/19eeb7c65e7542f4f221d54bb487307a where I made that into a function.
@iSazonov The IDataView looks really promising! But this isn't so much a GC issue; it's about how we store our properties. If you pipe to Out-Null, we do a lot of work creating the properties, but they probably stay in Gen0, which is fast to collect. The memory pressure is only noticeable when the data is saved to a variable.
We really hit the degenerate case when we have very many properties and very little data in each.
It would be interesting to play with IDataView and use it in PSObject as another way of storing properties (instead of noteproperties, that is).
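A quick way to see that distinction (a sketch, with $file standing in for any large CSV): Gen0 collections happen in both runs, but only the assignment keeps the heap large.

# Sketch: count Gen0 collections and look at retained managed memory for both cases
$gen0 = [System.GC]::CollectionCount(0)
Import-Csv $file | Out-Null
'pipe:   {0} Gen0 collections, {1:N0} MB retained' -f ([System.GC]::CollectionCount(0) - $gen0), ([System.GC]::GetTotalMemory($true) / 1MB)

$gen0 = [System.GC]::CollectionCount(0)
$csv = Import-Csv $file
'assign: {0} Gen0 collections, {1:N0} MB retained' -f ([System.GC]::CollectionCount(0) - $gen0), ([System.GC]::GetTotalMemory($true) / 1MB)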
I was thinking about a global property cache. I don't know what benefits and problems this would bring. We could consider IDataView for this cache, although perhaps a rather simpler structure would do. Combined with Simple Case Folding (#8120), we would get a huge resource saving.
@bergmeister I have a working implementation that imports CSV into typed objects.
If a class has properties or constructor parameters that match the names of the columns in the file, either the default ctor is invoked and the properties are set (with conversion if necessary), or the ctor with the matching parameter count and names is invoked, also with conversion of the parameters if necessary.
It works by generating expression trees that are then compiled to a Func<IList<string>, object>, which is invoked for each line of the CSV.
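Not the PR code itself, but roughly the shape of the idea in script form (the Person class, the field order and people.csv are all made up for illustration; the real implementation compiles an expression tree instead of using a scriptblock):

# Sketch of the concept: build one converter per target type, then apply it to
# the raw fields of every line instead of creating PSNoteProperty-backed objects.
class Person { [int]$Id; [string]$Name }

# Stand-in for the compiled Func<IList<string>, object> delegate
$convertLine = { param([string[]] $fields) [Person] @{ Id = $fields[0]; Name = $fields[1] } }

Get-Content .\people.csv | Select-Object -Skip 1 | ForEach-Object {
    & $convertLine ($_ -split ',')
}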
Maybe time for a PR?
I have a prototype here: https://github.com/PowerShell/PowerShell/pull/8860
Just to also link to the proposal relating to PR #8860: #8862