I was looking for a regex to match non-ASCII characters (code points outside 0-127), and found an expression here: https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions . But weirdly, both capital and small letter 'i' match as non-ASCII. ("\P" means "not".)
'i' -match '\P{IsBasicLatin}'
'I' -match '\P{IsBasicLatin}'
Expected:
False
False
Actual:
True
True
No other ASCII character matches:
0..127 | foreach { if ([char]$_ -match '\P{IsBasicLatin}') { [char]$_ } }
I
i
A workaround is to use -cmatch:
'i' -cmatch '\P{IsBasicLatin}'
False
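If the goal is only to find characters outside 0-127, another option is to avoid case folding entirely with an explicit range. This is a minimal sketch, not a tested recommendation: `Get-NonAsciiChar` is a hypothetical helper name, and it relies on `[regex]::Matches` being case-sensitive unless `RegexOptions.IgnoreCase` is passed.

```powershell
# Hypothetical helper (not part of PowerShell): emits every character
# outside the range 0-127 found in the input string.
function Get-NonAsciiChar {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [string] $InputString
    )
    process {
        # [regex]::Matches does not ignore case by default, so case folding
        # cannot pull 'i', 'I' or 'k' into the result.
        foreach ($m in [regex]::Matches($InputString, '[^\x00-\x7F]')) {
            $m.Value
        }
    }
}

# Sample input built from code points so the script file encoding does not matter.
$sample = "i I k K $([char]0x130) $([char]0x212A)"
$sample | Get-NonAsciiChar
```

With this sample, only the two characters built from 0x130 and 0x212A should come back, because the ASCII letters never go through a case-insensitive comparison.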
Ah, without case sensitivity 'i' matches a Turkish character, İ (0x130). It's a capital I with a little dot over it.
'i' -match 'İ'
True
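To look for other characters that behave like this, the same test can be run the other way around: scan everything above 127 and report what `-match` treats as Basic Latin. This is a rough sketch; the variable name `$hits` is only for illustration, and the result depends on the .NET runtime underneath PowerShell, so no particular output is claimed here.

```powershell
# Scan U+0080..U+FFFF for characters that the case-insensitive -match
# operator accepts as Basic Latin. Surrogate code units are included in
# the scan; as lone chars they are simply tested like any other value.
$hits = 0x80..0xFFFF |
    Where-Object { [string][char]$_ -match '\p{IsBasicLatin}' } |
    ForEach-Object { 'U+{0:X4} {1}' -f $_, [char]$_ }

$hits
```

Replacing `-match` with `-cmatch` in the filter gives a case-sensitive baseline to compare against.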
Also, the Kelvin sign K (0x212a) matches as ASCII when case is ignored:
[char]0x212a | select-string '\p{IsBasicLatin}'
K
Name                           Value
----                           -----
PSVersion                      7.0.0
PSEdition                      Core
GitCommitId                    7.0.0
OS                             Microsoft Windows 10.0.14393
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
This will be dependent on system culture settings and whether you use case sensitive or insensitive matching. Also, any perceived or actual discrepancies there are entirely down to how the regex processor in .NET Core is handling it. While we can document such discrepancies, we cannot fix them here; if we need them fixed or there are corrections to be made, we'll need to file issues in the https://github.com/dotnet/runtime repo. 🙂
The culture is irrelevant. My culture is en-US, and somehow the Turkish İ gets conflated with small i and capital I when case is ignored. I'm not sure which culture the Kelvin sign K belongs to. I posted a little about it on Stack Overflow: https://stackoverflow.com/questions/30805741/match-high-ascii-characters-but-not-the-letter-i/60590324#60590324
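One way to check the culture question directly is to call the .NET regex class with and without `RegexOptions.CultureInvariant`. This is only a sketch: it assumes `-match` behaves like plain `RegexOptions.IgnoreCase`, the `$result` variable is just for illustration, and the first two values may differ between .NET versions.

```powershell
# Build the two option sets by casting from their string names.
$ignoreCase = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase'
$invariant  = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase, CultureInvariant'

$result = [pscustomobject]@{
    Culture          = [cultureinfo]::CurrentCulture.Name
    # Case-insensitive, using the current culture.
    IgnoreCase       = [regex]::IsMatch('i', '\P{IsBasicLatin}', $ignoreCase)
    # Case-insensitive, with culture-specific casing rules switched off.
    CultureInvariant = [regex]::IsMatch('i', '\P{IsBasicLatin}', $invariant)
    # Case-sensitive baseline; 'i' is inside Basic Latin, so this is False.
    CaseSensitive    = [regex]::IsMatch('i', '\P{IsBasicLatin}')
}

$result
```

If the `IgnoreCase` and `CultureInvariant` columns agree on a given machine, that would support the point that the current culture is not what causes the match.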
It's funny how this kind of thing happens even outside .NET, for example with findstr in cmd. Maybe I should make a ticket with the Unicode Consortium.
echo i | findstr /i İ
i