PowerShell: regex '\P{IsBasicLatin}' (non-ASCII) matches the letter 'i'

Created on 8 Mar 2020 · 3 Comments · Source: PowerShell/PowerShell

I was looking for a regex to match non-ASCII characters (code points outside 0-127) and found an expression here: https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions . Weirdly, both capital and small letter 'i' match as non-ASCII. ("\P" means "not".)

Steps to reproduce

'i' -match '\P{IsBasicLatin}'
'I' -match '\P{IsBasicLatin}'

Expected behavior

False
False

Actual behavior

True
True

No other ascii character matches.

0..127 | foreach { if ([char]$_ -match '\P{IsBasicLatin}') { [char]$_ } }
I
i

A workaround is to use -cmatch:

'i' -cmatch '\P{IsBasicLatin}'
False
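
The case-sensitive workaround above can be wrapped in a small helper. This is a minimal sketch: `Test-NonAscii` is a hypothetical name, not a built-in cmdlet. It calls `[regex]::IsMatch` without `RegexOptions.IgnoreCase`, so the match is case-sensitive and no case folding is applied.

```powershell
# Hypothetical helper (not part of PowerShell): reports whether a string
# contains any character outside the Basic Latin block (U+0000..U+007F).
function Test-NonAscii {
    param([Parameter(Mandatory)][string]$Text)
    # No RegexOptions.IgnoreCase is passed, so the match is case-sensitive
    # and 'i' / 'I' are not folded onto non-ASCII characters.
    return [regex]::IsMatch($Text, '\P{IsBasicLatin}')
}

Test-NonAscii 'i'                      # False
Test-NonAscii 'I'                      # False
Test-NonAscii ([string][char]0x130)    # True (capital I with dot above)
Test-NonAscii ([string][char]0x212A)   # True (Kelvin sign)
```

Because the check never ignores case, it should behave the same way as -cmatch in the workaround.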

Ah, without case sensitivity 'i' matches a Turkish character, İ (U+0130): a capital I with a little dot over it.

'i' -match 'İ'
True

Also, the Kelvin sign K (U+212A) matches as ASCII when case is ignored:

[char]0x212a | select-string '\p{IsBasicLatin}'

K
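
Select-String ignores case by default, which is why the Kelvin sign matches above. As a sketch of the same check with case folding switched off, the -CaseSensitive switch can be added. The variable names here are illustrative only.

```powershell
# Kelvin sign, U+212A (looks like a capital K but is not ASCII).
$kelvin = [string][char]0x212A

# With -CaseSensitive, Select-String does not fold U+212A onto 'k'/'K',
# so the Basic Latin class no longer matches and nothing is returned.
$hit = $kelvin | Select-String -CaseSensitive -Pattern '\p{IsBasicLatin}'
$null -eq $hit                          # True

# Print the code point to confirm which character is being tested.
'U+{0:X4}' -f [int][char]0x212A         # U+212A
```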

Environment data

Name                           Value
----                           -----
PSVersion                      7.0.0
PSEdition                      Core
GitCommitId                    7.0.0
OS                             Microsoft Windows 10.0.14393
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
Issue-Question Resolution-Answered

All 3 comments

This will be dependent on system culture settings and whether you use case sensitive or insensitive matching. Also, any perceived or actual discrepancies there are entirely down to how the regex processor in .NET Core is handling it. While we can document such discrepancies, we cannot fix them here; if we need them fixed or there are corrections to be made, we'll need to file issues in the https://github.com/dotnet/runtime repo. 🙂

The culture is irrelevant. My culture is en-US, and somehow the Turkish İ gets conflated with small i and capital I when case is ignored. I'm not sure which culture the Kelvin sign K belongs to. I posted a little about it on Stack Overflow: https://stackoverflow.com/questions/30805741/match-high-ascii-characters-but-not-the-letter-i/60590324#60590324

It's funny how this kind of thing happens even outside .NET. Maybe I should file a ticket with the Unicode Consortium.

echo i | findstr /i İ
i