Write-Output "Case 1:"
$csv = '#;1;"foo";2
1;1;"bar";2
1;1;"barbar";2'
$csv | ConvertFrom-Csv -Delimiter ";"
Write-Output "Case 2:"
$csv = '#;"foo";2
1;"bar";2
1;1;"barbar";2'
$csv | ConvertFrom-Csv -Delimiter ";"
Case 1:
# 1 "foo" 2
- - ----- -
1 1 bar 2
1 1 barbar 2
Case 2:
# "foo" 2
- ----- -
1 bar 2
1 barbar 2
Case 1:
ConvertFrom-Csv:
Line |
4 | $csv | ConvertFrom-Csv -Delimiter ";"
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| The member "1" is already present.
Case 2:
1 bar 2
- --- -
1 barbar 2
Name Value
---- -----
PSVersion 7.0.3
PSEdition Core
GitCommitId 7.0.3
OS Microsoft Windows 10.0.18363
Platform Win32NT
PSCompatibleVersions {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion 2.3
SerializationVersion 1.1.0.1
WSManStackVersion 3.0
If you do Export-Csv (in Windows PowerShell) you'll see that by default the first line says
#TYPE <full type name>
So here
#;1;"foo";2
is read as a type name. Then
1;1;"bar";2
creates the field names 1, 1, bar and 2.
So when it tries to add the values from the next row, it tries to add a member named 1 twice.
So: by design.
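The failure mode described above can be modeled language-agnostically. The sketch below is illustrative Python only (the real cmdlet is implemented in C#, and `parse_semicolon_csv` / `split_fields` are hypothetical names, not cmdlet internals): a leading `#` line is consumed as the type annotation, the next line becomes the header, and a duplicate column name triggers the "member already present" error.

```python
# Minimal Python model of the described behavior (illustrative only; the
# real cmdlet is C#). A leading "#" line is consumed as a type annotation,
# the next line becomes the header, and duplicate column names fail.
def parse_semicolon_csv(text, delimiter=";"):
    lines = text.splitlines()
    if lines and lines[0].startswith("#"):
        lines = lines[1:]  # first line consumed as the #TYPE annotation

    def split_fields(line):
        return [field.strip('"') for field in line.split(delimiter)]

    header = split_fields(lines[0])
    seen = set()
    for name in header:
        if name in seen:
            # Analogous to the ConvertFrom-Csv error in Case 1
            raise ValueError(f'The member "{name}" is already present.')
        seen.add(name)
    return [dict(zip(header, split_fields(row))) for row in lines[1:]]
```

With the Case 1 input, the second line (`1;1;"bar";2`) becomes the header and the duplicate name `1` raises, matching the error shown above; with the Case 2 input, the unique header `1, bar, 2` parses without complaint.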
Didn't the #TYPE stop being standard from PS 6.0 and onwards?
Beginning with PowerShell 6.0 the default behavior of Export-CSV is to not include the #TYPE information in the CSV and NoTypeInformation is implied. IncludeTypeInformation can be used to include the #TYPE Information and emulate the default behavior of Export-CSV prior to PowerShell 6.0.
https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/export-csv?view=powershell-7#notes
Information about #TYPE is also only documented in Export-Csv and Import-Csv, but not in ConvertFrom-Csv.
Should the behaviour still be the same?
The Export-CSV cmdlet creates a CSV file of the objects that you submit. Each object is a row that includes a comma-separated list of the object's property values. You can use the Export-CSV cmdlet to create spreadsheets and share data with programs that accept CSV files as input. Do not format objects before sending them to the Export-CSV cmdlet. If Export-CSV receives formatted objects the CSV file contains the format properties rather than the object properties. To export only selected properties of an object, use the Select-Object cmdlet.
I also saw this in the ConvertFrom-Csv documentation, under Description:
You can also use the Export-Csv and Import-Csv cmdlets to convert objects to CSV strings in a file (and back). These cmdlets are the same as the ConvertTo-Csv and ConvertFrom-Csv cmdlets, except that they save the CSV strings in a file.
https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/convertfrom-csv?view=powershell-7#description
It seems that the underlying method doing the transformation is the same.
I still find it weird that it would try to interpret a line starting with "#" as #TYPE in PowerShell 7
The ConvertFrom-Csv cmdlet creates objects from CSV variable-length strings that are generated by the ConvertTo-Csv cmdlet. You can use the parameters of this cmdlet to specify the column header row, which determines the property names of the resulting objects, to specify the item delimiter, or to direct this cmdlet to use the list separator for the current culture as the delimiter. The objects that ConvertFrom-Csv creates are CSV versions of the original objects. The property values of the CSV objects are string versions of the property values of the original objects. The CSV versions of the objects do not have any methods. You can also use the Export-Csv and Import-Csv cmdlets to convert objects to CSV strings in a file (and back). These cmdlets are the same as the ConvertTo-Csv and ConvertFrom-Csv cmdlets, except that they save the CSV strings in a file.
Didn't the #TYPE stop being standard from PS 6.0 and onwards?
Beginning with PowerShell 6.0 the default behavior of Export-CSV is to not include the #TYPE information
I'm still specifying -NoTypeInformation everywhere, and that detail had been swapped out of working memory.
Even so when Import- and Convert see a type, they process it.
information about #TYPE is also only documented in Export-Csv and Import-Csv, but not in ConvertFrom-Csv.
Should the behaviour still be the same?
I've always understood that under the covers Import was Get-Content | convertfrom.
I still find it weird that it would try to interpret a line starting with "#" as #TYPE in PowerShell 7
If it stopped doing so, then scripts from (and, worse, data files produced by) PowerShell 5 would fail in newer versions.
Someone chose # as the type designator in the beta of Windows PowerShell 1 before it even had the name PowerShell and you have a dozen years of scripts and data to overturn.
@larsthelord
Import-Csv and ConvertFrom-Csv should act consistently.
It is unfortunate that that consistent behavior is marred by a historical oddity kept around for backward compatibility, but that is a separate issue.
Backward compatibility was _broken_ in PowerShell Core when _writing_ CSV files, for a very good reason: including the #TYPE line should never have been the default, given that it results in _nonstandard_ CSV files, and that in virtually all cases you do _not_ want or need it.
Backward compatibility was _preserved_ for _reading_ such nonstandard CSV files, and that's what you're seeing; also, again for backward compatibility you can still produce such files in PowerShell Core, but only on an _opt-in_ basis, with -IncludeTypeInformation.
The current limitation is that your header row's _first_ column name _must not start with a #_ - whether or not the column name is double-quoted and whether or not there is leading whitespace (in Windows PowerShell, quoting and leading whitespace matter) - otherwise the first line will be interpreted as a type-annotation line, causing it to be ignored as a header row.
This implies that the check for whether the first line is a #TYPE annotation is needlessly lax, leading to unnecessary false positives.
@jhoneill, there is a way to deal with this without breaking backward compatibility - mostly:
A simple improvement would be to check the first line with line.StartsWith("#TYPE", StringComparison.OrdinalIgnoreCase) and only _then_ treat the line as a type annotation, because only then it is truly recognized as such (as reflected in the output objects containing the specified type name - both as-is (in PS Core) and with a CSV: prefix (in both editions - in their .pstypenames ETS property).
Technically, this would amount to a breaking change, because a first line that starts with something like #foo would then be considered a header row, whereas it is currently quietly ignored.
However, given that such malformed #-prefixed lines shouldn't be present in CSV files (Export-Csv would never produce them), to me that falls into Bucket 3: Unlikely Grey Area and is therefore still worth doing.
(This still leaves the inability to have a header row's first column name start with #TYPE, but that is much less likely to occur and can be worked around by double-quoting the name; using a regex to make the test stricter could also help; e.g.
^#TYPE\s+\S+\s* (stricter versions are possible))
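The suggested check can be sketched as follows. This is a Python stand-in for the proposed C# `line.StartsWith(...)` test, purely illustrative; the function names `is_type_annotation_lax` / `is_type_annotation_strict` are invented for the sketch:

```python
import re

# Lax variant: the first line is a type annotation only if it literally
# starts with "#TYPE", case-insensitively -- mirroring the proposed
# line.StartsWith("#TYPE", StringComparison.OrdinalIgnoreCase) check.
def is_type_annotation_lax(line):
    return line[:5].upper() == "#TYPE"

# Stricter regex variant, per the suggestion above: "#TYPE" must be
# followed by whitespace and a type name.
_TYPE_RE = re.compile(r"^#TYPE\s+\S+\s*", re.IGNORECASE)

def is_type_annotation_strict(line):
    return _TYPE_RE.match(line) is not None
```

Under either variant a first line such as `#foo` would no longer be swallowed as an annotation and would instead be treated as a header row, which is exactly the (unlikely) breaking change discussed above.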
@mklement0 Agree. Your suggestion is the sensible place between the current "Treat everything with a # as a type even if it is malformed" and what I took the OP as suggesting (which he probably didn't mean) of "Ignore #TYPE even when well formed". It's a very unlikely breaker. It is far, far more likely that something depends on the default behaviour of Windows PowerShell writing #TYPE and produces invalid data files on 6 and 7 - and _that_ was considered low enough risk, and changing sufficiently obviously right and doable, that it went ahead.
I can confirm that it is by-design for the _edge_ case.
Our intentions were to make the cmdlets (1) smarter (support standard headers), and (2) more performant (we still hope to add more improvements here).
This is the second question I remember about this behavior, but we would need to see an important business scenario to fix this in the repo; otherwise it is simpler to use a workaround in a custom script.
If there is a gap in docs please open new issue in PowerShell-Docs repository.
@iSazonov: A # line above a CSV file's first row is definitely _not_ a standard header.
While there is no formal standard for CSV, RFC 4180 describes a format followed by most implementations, which doesn't allow for annotations of this kind, so it's likely that other tools interpret such lines as _data_.
It is really unfortunate that PowerShell ever introduced this custom variation, which provides virtually no benefit and causes only problems.
@mklement0 It is de-facto standard #2480 We support this because it is widely used.
Fair enough, @iSazonov (those are log files with specific formats that start with a block of #-prefixed metadata (comment) lines followed by a separator-based format).
However, what _doesn't_ make sense and is worth fixing is that a _double_-quoted # as the first non-whitespace token on a line is also considered a comment line:
PS> @'
"#","Why","am","I","a","comment?"
"Col1","Col2","Col3","Col4"
"1","2","3","4"
'@ | ConvertFrom-Csv
Col1 Col2 Col3 Col4 # !! "#"... line was ignored, despite the # in double quotes
---- ---- ---- ----
1 2 3 4
This means that you cannot round-trip something like the following, because the header row isn't recognized as such on re-import:
PS> [pscustomobject]@{ '#' = 10; Name = 'foo' } | ConvertTo-Csv
"#","Name"
"10","foo"
PS> [pscustomobject]@{ '#' = 10; Name = 'foo' } | ConvertTo-Csv | ConvertFrom-Csv
# !! No output, because the header row was ignored, and the first data row became the header row.
Note that Export-Csv / Import-Csv _always_ double-quote the fields, so once this problem is fixed, at least CSV files created with these cmdlets would be immune to misinterpretation of #.
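A comment-detection test that no longer misfires on a double-quoted leading # might look like this. This Python sketch assumes comment handling amounts to a simple prefix test (the actual C# implementation may differ), and both function names are invented for illustration:

```python
# Approximation of the current (buggy) behavior: a leading double quote
# is effectively skipped before testing for "#", so '"#","Name"' is
# misread as a comment line.
def is_comment_line_current(line):
    return line.lstrip().lstrip('"').startswith("#")

# Proposed: test the raw first non-whitespace character, so that the
# double-quoted field "#" counts as CSV data, not a comment marker.
def is_comment_line_fixed(line):
    return line.lstrip().startswith("#")
```

With the fixed test, the `"#","Why","am","I","a","comment?"` line from the example above is recognized as a header row, while genuine `#TYPE`-style lines are still treated as comments.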
Just so we don't lose track of this: see #13907.
This issue has been marked as answered and has not had any activity for 1 day. It has been closed for housekeeping purposes.