PowerShell: Add Get-FileEncoding cmdlet or function.

Created on 17 Sep 2016 · 20 Comments · Source: PowerShell/PowerShell

This is a common task I see across many PowerShell modules, and I think it would add value for cross-platform tasks.

Area-Cmdlets Committee-Reviewed Issue-Enhancement Up-for-Grabs

All 20 comments

Do you mean this?
http://poshcode.org/2059
https://gist.github.com/jpoehls/2406504

This suggests that we also need the following cmdlets: Convert-FileEncoding and Convert-StringEncoding

And an RFC is required.

@iSazonov Yes. The additional cmdlets are nice to haves as well.

> This is common task I see across many PowerShell modules

@thezim Could you give examples of such modules?

I looked into this area, and it is not straightforward. We need a reference algorithm from experts in the field.
Example: http://gnuwin32.sourceforge.net/packages/file.htm

For compatibility we need to use the ported file utility. Can we rewrite it in C# and include it in the repo as a cmdlet?

Posted by @sdwheeler in our Community Call, this is a version from Lee: http://poshcode.org/2153

@PowerShell/powershell-committee discussed this, and the recommendation is to have a cmdlet that supports this capability instead of adding it to FileInfo. Usage will be more common now that we are cross-platform, and the cmdlet should be part of the Utility module. Get-FileEncoding and Convert-FileEncoding make sense from a discovery standpoint. It seems we can just review the parameters at PR time rather than requiring an RFC for this one.

@joeyaiello If we use a different algorithm than file, it may mislead Unix users.

@SteveL-MSFT Could you please clarify whether porting the file utility is a possibility?

@iSazonov Porting file as a cmdlet makes sense (assuming appropriate licensing). Alternatively, since I see that file has already been ported to Windows, perhaps it's not worth the effort to port file to C#, and we could instead just wrap it in a cmdlet?
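
A rough sketch of the "wrap it in a cmdlet" option, assuming a `file` binary is on PATH and that it supports the `-b` and `--mime-encoding` options (recent GNU and BSD/macOS builds do; older ports may not). The function name Get-FileEncodingViaFile is hypothetical and not an existing PowerShell command.

```powershell
# Hypothetical helper: wraps the external `file` utility, assuming it is on PATH.
function Get-FileEncodingViaFile {
    param(
        [Parameter(Mandatory)]
        [string] $Path
    )

    # Look for an external application only, not an alias or function named "file".
    $file = Get-Command -Name file -CommandType Application -ErrorAction SilentlyContinue |
        Select-Object -First 1
    if (-not $file) {
        throw "The 'file' utility was not found on PATH."
    }

    $resolved = (Resolve-Path -LiteralPath $Path).ProviderPath

    # -b omits the file name from the output; --mime-encoding prints only the charset guess.
    $encoding = & $file.Source -b --mime-encoding $resolved

    [pscustomobject]@{
        Path     = $resolved
        Encoding = "$encoding".Trim()
    }
}
```

The trade-off raised in this thread still applies: a wrapper like this only works where the utility is actually installed.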

Our conclusion on this issue was specifically about wanting better support for encodings, nothing more.

I think we also questioned the value in porting file to PowerShell because extensions are the primary way of understanding file types on Windows.

@SteveL-MSFT We cannot expect the file utility to be present on every Unix system, especially on OS X.

Today I looked more deeply into how the file utility works. Encoding detection is very simple (yes, file type detection is overkill for us) and can be easily ported to C#. Thus we can easily achieve compliance with the de facto Unix standard. The bad news is that the code is very old and would have to be brought into line with modern standards (from FSS-UTF (1992) / UTF-8 (1993) to UTF-8 (2003)).

More bad news is that this utility does not detect code pages. Do we want to detect code pages? If so, do we want high-speed heuristics (sample), or will we use simpler but slower methods?
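
A minimal sketch of this kind of simple detection, assuming that BOM sniffing followed by a strict UTF-8 validity check is an acceptable approximation. It is not a port of the file utility's code. The function name Get-FileEncodingGuess and the returned labels are hypothetical, and the whole file is read into memory for brevity.

```powershell
# Hypothetical helper: BOM sniffing first, then a strict UTF-8 validity check.
function Get-FileEncodingGuess {
    param(
        [Parameter(Mandatory)]
        [string] $Path
    )

    $resolved = (Resolve-Path -LiteralPath $Path).ProviderPath
    [byte[]] $bytes = [System.IO.File]::ReadAllBytes($resolved)

    # Join the first four bytes as hex so the BOM can be matched as a string prefix.
    $count = [Math]::Min(4, $bytes.Length)
    $head = ''
    for ($i = 0; $i -lt $count; $i++) { $head += $bytes[$i].ToString('X2') }

    # UTF-32 LE is tested before UTF-16 LE because both start with FF FE.
    if ($head.StartsWith('EFBBBF')) { return 'utf-8 (BOM)' }
    if ($head -eq 'FFFE0000')       { return 'utf-32le (BOM)' }
    if ($head -eq '0000FEFF')       { return 'utf-32be (BOM)' }
    if ($head.StartsWith('FFFE'))   { return 'utf-16le (BOM)' }
    if ($head.StartsWith('FEFF'))   { return 'utf-16be (BOM)' }

    # No BOM: content with no byte above 0x7F is treated as ASCII.
    $hasHighBit = $false
    foreach ($b in $bytes) { if ($b -ge 0x80) { $hasHighBit = $true; break } }
    if (-not $hasHighBit) { return 'us-ascii' }

    # A throwing decoder rejects byte sequences that are not well-formed UTF-8.
    $strictUtf8 = [System.Text.UTF8Encoding]::new($false, $true)
    try {
        [void] $strictUtf8.GetString($bytes)
        return 'utf-8'
    }
    catch {
        return 'unknown 8-bit (possibly a legacy code page)'
    }
}
```

Anything that falls through to the last branch is where code page heuristics would have to take over.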

Now about the conversion. Simple test:

[text.encoding]::GetEncodings().count

returns:
in PowerShell 5.1 - 140 codepages
in PowerShell 6.0 (alpha 13) - 8 codepages
(Unix iconv - ~300 codepages)

Should we rely completely on .NET Core in the expectation that it will support multiple charsets? Or should we write our own implementation?
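
For context, a small sketch of how code page availability can be probed on a given runtime. It assumes that on .NET Core the legacy code pages are supplied by `System.Text.CodePagesEncodingProvider`, which may or may not already be registered by the host. The count reported by `GetEncodings()` may not include provider-supplied code pages, depending on the runtime version, so the sketch probes a few code pages directly. The code pages chosen are arbitrary examples.

```powershell
# Count the encodings the runtime reports out of the box.
$builtIn = [System.Text.Encoding]::GetEncodings().Count
"Built-in encodings: $builtIn"

# On .NET Core, legacy code pages come from a separate provider.
# The type may be absent on some runtimes, hence the guard.
$providerType = 'System.Text.CodePagesEncodingProvider' -as [type]
if ($providerType) {
    [System.Text.Encoding]::RegisterProvider($providerType::Instance)
}

# Probe a few code pages directly instead of relying on the reported count.
foreach ($codePage in 437, 1251, 1252, 28591) {
    try {
        $name = [System.Text.Encoding]::GetEncoding($codePage).WebName
        "{0,6}: {1}" -f $codePage, $name
    }
    catch {
        "{0,6}: not available" -f $codePage
    }
}
```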

@SteveL-MSFT For my part, I was just looking for detection of the encodings that existing cmdlets such as Out-File currently accept. No code page usage. I do see the value in a full set of encoding cmdlets though.

Opened - Initial discussion about encoding cmdlets https://github.com/PowerShell/PowerShell-RFC/issues/67

@iSazonov: As an aside re:

> We cannot expect that there is the file utility on each Unix system especially on OsX.

file is a POSIX-mandated utility and therefore available on most (all?) modern Unix platforms, including macOS (OS X).

That said, the focus of the POSIX file utility spec is on classifying files by content - encodings aren't even mentioned.

In practice, however, both the GNU and the BSD/macOS implementations _do_ report a text file's encoding, including the presence/absence of the UTF-8 pseudo-BOM.

@mklement0 Thank you for pointing out that this utility is part of POSIX. In most cases, however, it is installed as part of a _separate_ package. That would force us to require installation of this utility when installing PowerShell Core, which I believe is unacceptable for us.
I recently did a short review of the GNU file utility and found that its code is too out of date.
I suppose we should not rely on it. Perhaps there is a more modern version, but I don't know of one.

And welcome to the discussion https://github.com/PowerShell/PowerShell-RFC/issues/67

I'm not (nearly) as advanced a PowerShell user as you guys, and I have a weak understanding of file encoding (I don't have a clue what the point of a BOM is honestly) but once every year or two, I get stung by file encoding, and the last time (a few days ago), cost us a Production migration as we were scratching our heads why our automation tool could not run batch scripts (the reason was that the batch scripts were generated by PowerShell which defaults to UTF-8 which made the batch scripts broken, but the errors made us think that it was the automation tool that was failing in some way). Such a scenario might all be very trivial/obvious to you guys, but it is not to most users (a "text file" has no deeper complexity than "text file" to most people, most of the time).

Both required tools (Get-FileEncoding and Convert-FileEncoding in https://github.com/PowerShell/PowerShell-RFC/issues/67) are long-overdue as core components of PowerShell. Get- would greatly enhance appreciation of file encoding issues (and the more information the better in my mind, codepages etc), while Convert- becomes more and more important in making PowerShell a useful cross-platform tool. Would really appreciate if this two-years-since-last-comment thread was un-mothballed?
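
To make the Convert- side concrete, here is a minimal sketch of a re-encoding helper, assuming the whole file fits in memory. Convert-FileEncodingSketch and its parameters are hypothetical names for illustration only and are not the design discussed in the RFC issue.

```powershell
# Hypothetical helper: re-encodes a text file; not an existing PowerShell cmdlet.
function Convert-FileEncodingSketch {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Destination,
        [Parameter(Mandatory)] [System.Text.Encoding] $To,
        # Encoding to assume for the source when it has no BOM (an assumption of this sketch).
        [System.Text.Encoding] $From = [System.Text.UTF8Encoding]::new($false)
    )

    $source = (Resolve-Path -LiteralPath $Path).ProviderPath
    # Resolve the target against the PowerShell location, not the process working directory.
    $target = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Destination)

    # ReadAllText honors a BOM if one is present and otherwise falls back to $From.
    $text = [System.IO.File]::ReadAllText($source, $From)
    # WriteAllText emits the preamble (BOM) that $To defines, if any.
    [System.IO.File]::WriteAllText($target, $text, $To)
}
```

For the batch script scenario above, a call might look like `Convert-FileEncodingSketch -Path .\deploy.bat -Destination .\deploy-ascii.bat -To ([System.Text.Encoding]::ASCII)`, where the file names are placeholders. Characters that the target encoding cannot represent are replaced, so the conversion can be lossy.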

> Would really appreciate if this two-years-since-last-comment thread was _un-mothballed_?

@roysubs This was approved and you can grab the work.

I really wish that I had the ability to do that @iSazonov !

I know that @mklement0 has a very deep understanding of file encoding, I'm hoping that he might have the time to build this... 🙂

@mklement0 is a great analyst but not a fan of coding :-)

Implementation is simple using StreamReader.CurrentEncoding. Of course, later we could make the cmdlet smarter in a more "PowerShell-y" way with some heuristics.
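
A minimal sketch of that approach, assuming BOM detection alone is good enough for a first version. Get-FileEncodingFromReader and its -Default parameter are hypothetical names, and a file without a BOM simply reports whatever default was passed in.

```powershell
# Hypothetical helper built on StreamReader's BOM detection.
function Get-FileEncodingFromReader {
    param(
        [Parameter(Mandatory)]
        [string] $Path,

        # Encoding to assume when no BOM is present (an assumption of this sketch).
        [System.Text.Encoding] $Default = [System.Text.UTF8Encoding]::new($false)
    )

    $resolved = (Resolve-Path -LiteralPath $Path).ProviderPath
    # The third argument enables detection of byte order marks.
    $reader = [System.IO.StreamReader]::new($resolved, $Default, $true)
    try {
        # CurrentEncoding is only updated after the first read from the stream.
        [void] $reader.Peek()
        $reader.CurrentEncoding
    }
    finally {
        $reader.Dispose()
    }
}
```

For example, `(Get-FileEncodingFromReader -Path .\script.bat).WebName` would show the detected encoding name for a placeholder file `script.bat`.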

Sounds great, and I'll help if I can, but presumably you'd have to do this in C# (I'm more of just a SysAdmin / DevOps type scripter, I just use PowerShell and Python to manage some tasks on my work environments). I want to see PowerShell take over on Linux though, it's just a much better language imo 🙂.
