PowerShell: Converting from Windows-1252 to UTF-8

Created on 3 Apr 2018 · 8 Comments · Source: PowerShell/PowerShell

Steps to reproduce

Using Windows-1252 encoding, create a file "test.txt" that contains this sentence:
cette fonction doit être appelée avant l'initialisation de l'API

Try to convert the file "test.txt" from Windows 1252 to UTF8 using this script.

Param (
    [Parameter(Mandatory=$True)][String]$SourcePath
)

Get-ChildItem $SourcePath* -Recurse -Include *.txt | ForEach-Object {
    $content = $_ | Get-Content
    Set-Content -PassThru $_.FullName $content -Encoding UTF8 -Force
}

Expected behavior

In UTF8 :

cette fonction doit être appelée avant l'initialisation de l'API

Actual behavior

In UTF8:

cette fonction doit �tre appel�e avant l'initialisation de l'API

Environment data

Name                           Value
----                           -----
PSVersion                      6.1.0-preview.1
PSEdition                      Core
GitCommitId                    v6.1.0-preview.1
OS                             Microsoft Windows 6.1.7601 S
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Note

PowerShell 4.0 does not have this issue.

Area-Cmdlets-Core Issue-Bug Resolution-Answered

All 8 comments

The default encoding in PowerShell Core is now UTF-8 (without a BOM when creating files).

That means that a Windows 1252-encoded file - in the absence of a BOM defining it as such (there is none for Windows 1252) - is now interpreted as _UTF-8_.

The upshot is that you must now tell Get-Content what encoding to assume - unless it is UTF-8 or there is a BOM.
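
To illustrate why the accented characters are lost, here is a minimal sketch that encodes a string as Windows-1252 and then decodes the same bytes as UTF-8. It assumes a session in which code page 1252 is available through `[System.Text.Encoding]::GetEncoding`; the variable names are illustrative only.

```powershell
# Build "être" from a code point so the result does not depend on
# the encoding of this script file.
$text = "$([char]0xEA)tre"

$cp1252 = [System.Text.Encoding]::GetEncoding(1252)
$utf8   = [System.Text.Encoding]::UTF8

# In Windows-1252, "ê" is the single byte 0xEA.
$bytes = $cp1252.GetBytes($text)

# 0xEA followed by "t" is not a valid UTF-8 sequence, so the decoder
# substitutes the replacement character U+FFFD.
$misread = $utf8.GetString($bytes)
$misread.Contains([string][char]0xFFFD)   # True

# Decoding with the encoding the bytes were written in restores the text.
$roundTrip = $cp1252.GetString($bytes)
$roundTrip -ceq $text                     # True
```

The replacement character is written back out by Set-Content, which is why the damage in the "Actual behavior" output is permanent once the file has been overwritten.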

Regrettably, Get-Content doesn't currently allow you to specify Windows 1252, because Default now represents UTF-8 and no longer the active "ANSI" code page (such as Windows 1252), as on Windows PowerShell, and you cannot pass a [System.Text.Encoding] instance directly.

This is an oversight that must be corrected.

My suggestion: add an ANSI encoding enumeration value on Windows that represents the system's legacy "ANSI" code page (e.g., Windows 1252 on US-English systems).


The - cumbersome - workaround to use in the meantime requires use of the .NET framework directly:

$content = [IO.File]::ReadAllText($_.FullName, [text.encoding]::GetEncoding(1252))

Or, more generically:

$content = [IO.File]::ReadAllText($_.FullName, [text.encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage))
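
For completeness, the read shown above can be paired with a matching write. The following is only a sketch: the function name `Convert-AnsiFileToUtf8` and its parameters are hypothetical, and it assumes the file really is encoded in the given code page (1252 by default) and is small enough to hold in memory.

```powershell
# Hypothetical helper; the name and parameters are illustrative only.
function Convert-AnsiFileToUtf8 {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [int]$CodePage = 1252
    )

    # .NET resolves relative paths against the process working directory,
    # which can differ from PowerShell's current location, so resolve first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $source = [System.Text.Encoding]::GetEncoding($CodePage)
    $target = New-Object System.Text.UTF8Encoding $false   # $false = no BOM

    # Read with the legacy code page, then rewrite the same file as UTF-8.
    $content = [System.IO.File]::ReadAllText($fullPath, $source)
    [System.IO.File]::WriteAllText($fullPath, $content, $target)
}
```

A call would look like `Convert-AnsiFileToUtf8 -Path .\test.txt`. Because it uses only .NET APIs, this approach does not depend on how the `-Encoding` parameter of Get-Content behaves in a given edition.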

@mklement0

PowerShell Core 6.0 accepts a System.Text.Encoding instance for the -Encoding parameter (#5080).

We can write it as follows.

$content = $_ | Get-Content -Encoding ([System.Text.Encoding]::GetEncoding(1252))

# or

$content = $_ | Get-Content -Encoding ([System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage))
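
Putting this together with the script from the original post, a corrected version might look like the sketch below. It assumes PowerShell Core 6.0 or later (where `-Encoding` accepts an Encoding instance) and that every matched file is Windows-1252; the function name `Convert-TextFilesToUtf8` is hypothetical.

```powershell
# Sketch of the original script with an explicit source encoding.
# Assumes PowerShell Core 6.0+; the function name is illustrative only.
function Convert-TextFilesToUtf8 {
    param(
        [Parameter(Mandatory)][string]$SourcePath
    )

    $cp1252 = [System.Text.Encoding]::GetEncoding(1252)

    Get-ChildItem -Path $SourcePath -Recurse -Filter *.txt -File | ForEach-Object {
        # -Raw returns the file as one string, so line endings are preserved.
        $content = Get-Content -LiteralPath $_.FullName -Raw -Encoding $cp1252

        # -NoNewline stops Set-Content from appending a trailing newline.
        Set-Content -LiteralPath $_.FullName -Value $content -Encoding UTF8 -NoNewline -Force
    }
}
```

Note that running such a conversion twice on the same file would decode already-converted UTF-8 bytes as Windows-1252 and garble the text, so it should be applied only once to files known to be in the legacy encoding.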

Additionally, WindowsLegacy is proposed in an RFC
(but WindowsLegacy is not implemented yet...)

If backward compatibility is necessary, it would be better to discuss it in that RFC.


Possibly related: #5204.

@stknohg:

Ah, thanks. Somehow I had wrongly convinced myself that you couldn't directly pass a System.Text.Encoding instance - thanks for clarifying that.

I think the discussion around the linked RFC eventually led to the current Core behavior of globally defaulting to BOM-less UTF-8 - see https://github.com/PowerShell/PowerShell-RFC/issues/71

The WindowsLegacy meta-setting was intended for a never-implemented $PSDefaultEncoding preference variable, and was meant to _globally_ revert to the old, inconsistent encoding behavior for the sake of backward compatibility - an approach that I personally think is not worth pursuing.

Again, given that OEM - the OEM code page implied by the legacy system locale - already exists as a predefined encoding enumeration value, it should be complemented with an ANSI identifier for the "ANSI" code page implied by the system locale (on Windows only; the equivalent of what Default represents for _Windows_ PowerShell).

Certainly, introducing ANSI is simpler and, as you say, would not apply globally.
I think it's a good idea.

The workaround proposed by mklement0 works for me.
I propose closing this issue, since the rest of the discussion mainly concerns BOM-less UTF-8, which is indeed covered in PowerShell/PowerShell-RFC#71.
Thanks.

@Calimerou: Alternatively, we could retitle your issue and modify the initial post to propose the missing ANSI encoding-enumeration value, as discussed. If you prefer my creating a new issue instead, let me know.

I would prefer yours.
Thanks in advance.

For now, I work around this issue in my scripts as follows:

````
$iswinps = ($null, 'Desktop') -contains $PSVersionTable.PSEdition
if (!$iswinps)
{
    $encoding = [System.Text.Encoding]::GetEncoding(1252)
}
else
{
    $encoding = [Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding]::Default
}

Get-Content -Encoding $encoding ...

````

HTH
