Importing a 150 MB CSV file with 50k rows, 700 columns and 5 +/- 3 characters per entry results in a memory consumption of 7 GB by the PowerShell process. That is a ratio of around 50 compared to the file size. I have seen the same ratio with a similar file of smaller size (1.5 MB), so I think this is a more generic issue rather than something specific to the structure of this particular CSV file.
$csv = Import-Csv $pathToCsvFile
Note that this does not happen if I do not assign the result to a variable or if I pipe it to Out-Null.
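A rough way to observe the ratio (a sketch; it compares the process working set before and after the import with the file size, so it includes everything else the process holds; $pathToCsvFile is the same placeholder as above):

# Sketch: file-size-to-memory ratio via the process working set
$before = (Get-Process -Id $PID).WorkingSet64
$csv = Import-Csv $pathToCsvFile
$after = (Get-Process -Id $PID).WorkingSet64
'{0:N0}x the file size' -f (($after - $before) / (Get-Item $pathToCsvFile).Length)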
Expected behavior: file size to memory ratio is reasonable (around 10).
Actual behavior: file size to memory ratio is around 50.
Environment data (latest daily build and RC1):
> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      6.1.0-preview.761
PSEdition                      Core
GitCommitId                    6.1.0-preview.761
OS                             Microsoft Windows 6.3.9600
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Name                           Value
----                           -----
PSVersion                      6.1.0-rc.1
PSEdition                      Core
GitCommitId                    6.1.0-rc.1
OS                             Microsoft Windows 6.3.9600
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
Dup #7112
@thezim I do not see this as a duplicate because the behaviour in Windows PowerShell was the same.
@bergmeister Could you please test with the latest build or RC1? We have already added two performance fixes; I guess you will be pleasantly surprised.
@iSazonov I tried with the latest daily build and with RC1 today, and there was no noticeable difference.
@bergmeister Thanks!
> the behaviour in Windows PowerShell was the same.
Please add the results of the comparison (memory and speed) to the PR description.
@bergmeister I tried with a randomly generated file (600 columns, 50,000 lines, 10 MB) and cannot confirm this - RC1 works very well (process size is 40 MB at start, then drops to 22 MB in Task Manager).
@iSazonov How did you run it? If I run it using Import-Csv $file | Out-Null
there is hardly any memory usage, but if I assign it to a variable using $csv = Import-Csv $file
I get the big memory allocation, which I can only get rid of at the end by calling [System.GC]::Collect().
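One way to compare the two cases side by side (a sketch; $file is the same placeholder as above, and GetTotalMemory only reports the managed heap, so the numbers will differ from Task Manager):

# Sketch: measure retained managed memory for piping vs. assigning
[System.GC]::Collect()
$baseline = [System.GC]::GetTotalMemory($true)

Import-Csv $file | Out-Null
$afterPipe = [System.GC]::GetTotalMemory($true)       # drops back near the baseline

$csv = Import-Csv $file
$afterAssign = [System.GC]::GetTotalMemory($true)     # stays high while $csv is alive

'pipe: {0:N0} MB, assign: {1:N0} MB' -f (($afterPipe - $baseline) / 1MB), (($afterAssign - $baseline) / 1MB)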
Test script
cd c:\tmp\

function perf_test($source, $result) {
    Write-Host "Measuring Import-Csv performance over time..."
    "index,duration_ms,bytes_consumed" | Out-File $result
    for ($i = 0; $i -le 400; $i++) {
        $millis = (Measure-Command { Import-Csv $source }).TotalMilliseconds
        # Record the managed heap size next to the duration so the results can be analyzed in Excel
        $memory = [System.GC]::GetTotalMemory($false)
        $i.ToString() + "," + $millis.ToString() + "," + $memory | Out-File $result -Append
    }
    Write-Host "Done"
}

###############
# For measuring the cost of adding new properties to PSObjects:
# 700 column headers, but only the first 20 columns get values in each row,
# so Import-Csv still creates 700 note properties per object while the file stays small.
$fields = 0..699 | ForEach-Object { "random_numbers$_" }
($fields -join ",") | Out-File .\source2.csv
Get-Random -SetSeed 1 | Out-Null
for ($i = 0; $i -le 50000; $i++) {
    $values = 0..19 | ForEach-Object { (Get-Random).ToString() }
    ($values -join ",") | Out-File .\source2.csv -Append
}

perf_test .\source2.csv .\results2.csv
Your issue is not directly related to Import-Csv. It seems we have extra allocations in the "=" (assignment) operator.
The memory pressure most likely comes from the cost of PSNoteProperty. Each PSNoteProperty has an overhead of 48 bytes, so when you just store a few bytes per property, that becomes massive.
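For the file from the original report that overhead alone is already huge; a quick back-of-the-envelope check (not a profiler measurement):

# 50,000 rows x 700 columns x ~48 bytes of PSNoteProperty overhead
$rows = 50000; $cols = 700; $bytesPerNoteProperty = 48
'{0:N1} GB' -f ($rows * $cols * $bytesPerNoteProperty / 1GB)
# roughly 1.6 GB before counting the property names, the string values and the PSObject wrappers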
A workaround you can try is to generate a type for the items, and store the data there.
Import-Csv .\source2.csv |
    Select-Object -First 1 |
    ForEach-Object { $_.psobject.Properties.Name } |
    Join-String -Separator "`r`n" -OutputPrefix "class MyCsv {`r`n" -OutputSuffix "`n}" -Property { "`t`$$_" } |
    Invoke-Expression

$a = Import-Csv .\source2.csv | ForEach-Object { [MyCsv] $_ }
That should be more efficient. You can improve the typing as well; ints are cheaper than strings, for example, but you get the gist.
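For instance, a sketch of the same generator with typed columns. It assumes a file where every column really holds an integer (so not the sparse source2.csv generated above, where most columns are empty); .\allints.csv is just a placeholder name:

# Sketch: same class generation as above, but each property is declared [int]
# so the values are stored as ints instead of strings.
Import-Csv .\allints.csv |
    Select-Object -First 1 |
    ForEach-Object { $_.psobject.Properties.Name } |
    Join-String -Separator "`r`n" -OutputPrefix "class MyIntCsv {`r`n" -OutputSuffix "`n}" -Property { "`t[int]`$$_" } |
    Invoke-Expression

$a = Import-Csv .\allints.csv | ForEach-Object { [MyIntCsv] $_ }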
Additional thoughts: the cmdlet works well on .NET Framework, where the GC frees memory more aggressively. The .NET Core GC behaves differently. Perhaps we could adjust the Core GC settings.
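For anyone who wants to experiment with that locally, the usual knobs are runtime environment variables that a child process reads at startup (a sketch; whether they actually help this workload is untested):

# Sketch: toggle server/concurrent GC for a child pwsh via COMPlus_* variables,
# which the .NET Core runtime reads at process start, then rerun the import there.
$env:COMPlus_gcServer = '1'
$env:COMPlus_gcConcurrent = '0'
pwsh -NoProfile -Command {
    $csv = Import-Csv C:\tmp\source2.csv
    '{0:N0} MB working set' -f ((Get-Process -Id $PID).WorkingSet64 / 1MB)
}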
@bergmeister I created a gist here https://gist.github.com/powercode/19eeb7c65e7542f4f221d54bb487307a where I made that into a function.
@iSazonov The IDataView looks really promising! But this isn't so much a GC issue; it's about how we store our properties. If you pipe to Out-Null, we do a lot of work creating the properties, but they probably stay in Gen0, which is fast to collect. The memory pressure is only noticeable when the data is saved to a variable.
We really hit the degenerate case when we have very many properties and very little data in each.
It would be interesting to play with IDataView and use it in PSObject as another way of storing properties (instead of noteproperties, that is).
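A quick way to see that distinction (a sketch, with $file standing in for any large CSV): Gen0 collections happen in both runs, but only the assignment keeps the heap large.

# Sketch: count Gen0 collections and look at retained managed memory for both cases
$gen0 = [System.GC]::CollectionCount(0)
Import-Csv $file | Out-Null
'pipe:   {0} Gen0 collections, {1:N0} MB retained' -f ([System.GC]::CollectionCount(0) - $gen0), ([System.GC]::GetTotalMemory($true) / 1MB)

$gen0 = [System.GC]::CollectionCount(0)
$csv = Import-Csv $file
'assign: {0} Gen0 collections, {1:N0} MB retained' -f ([System.GC]::CollectionCount(0) - $gen0), ([System.GC]::GetTotalMemory($true) / 1MB)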
I was thinking about a global property cache. I don't know what benefits and problems this would bring. We could consider IDataView for this cache, although perhaps a rather simpler structure would do. Combined with Simple Case Folding (#8120), we would get a huge resource saving.
@bergmeister I have a working implementation that imports CSV into typed objects.
If a class has properties or constructor parameters that match the names of the columns in the file, either the default ctor is invoked and the properties are set (with conversion if necessary), or the ctor with the matching parameter count and names is invoked, also with conversion of the parameters if necessary.
It works by generating expression trees that are then compiled to a Func<IList<string>, object>, which is invoked for each line of the CSV.
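Not the PR code itself, but roughly the shape of the idea in script form (the Person class, the field order and people.csv are all made up for illustration; the real implementation compiles an expression tree instead of using a scriptblock):

# Sketch of the concept: build one converter per target type, then apply it to
# the raw fields of every line instead of creating PSNoteProperty-backed objects.
class Person { [int]$Id; [string]$Name }

# Stand-in for the compiled Func<IList<string>, object> delegate
$convertLine = { param([string[]] $fields) [Person] @{ Id = $fields[0]; Name = $fields[1] } }

Get-Content .\people.csv | Select-Object -Skip 1 | ForEach-Object {
    & $convertLine ($_ -split ',')
}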
Maybe time for a PR?
I have a prototype here: https://github.com/PowerShell/PowerShell/pull/8860
Just to also link to the proposal relating to PR #8860: #8862