PowerShell: regex '\P{IsBasicLatin}' (non-ASCII) matches the letter 'i'

Created on 8 Mar 2020 · 3 Comments · Source: PowerShell/PowerShell

I was looking for a regex that matches non-ASCII characters (code points outside 0-127) and found the expression '\P{IsBasicLatin}' here: https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions . Oddly, both the capital and the small letter 'i' match as non-ASCII. ("\P" means "not".)

Steps to reproduce

'i' -match '\P{IsBasicLatin}'
'I' -match '\P{IsBasicLatin}'

Expected behavior

False
False

Actual behavior

True
True

No other ASCII character matches.

0..127 | foreach { if ([char]$_ -match '\P{IsBasicLatin}') { [char]$_ } }
I
i

A workaround is to use -cmatch:

'i' -cmatch '\P{IsBasicLatin}'
False
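
A minimal sketch of that workaround as a reusable check. The function name Test-NonAscii is hypothetical, not part of PowerShell. It assumes the goal is only to detect any character outside 0-127, and it uses an explicit code-point range instead of the IsBasicLatin block name.

```powershell
# Hypothetical helper (not a built-in cmdlet): returns $true when the input
# string contains at least one character outside the ASCII range 0-127.
function Test-NonAscii {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string]$Text
    )
    # -cmatch is case-sensitive, so no case folding is applied to the input.
    return $Text -cmatch '[^\x00-\x7F]'
}

Test-NonAscii 'i'                      # False
Test-NonAscii 'I'                      # False
Test-NonAscii ([string][char]0x130)    # True (capital I with dot above)
```

Because the comparison is case-sensitive, plain 'i' and 'I' are never folded onto a non-ASCII character.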

Ah, without case sensitivity 'i' matches the Turkish capital İ (0x130), a capital I with a little dot over it.

'i' -match 'İ'
True

Also, the Kelvin sign (0x212a) matches as ASCII when case is ignored:

[char]0x212a | select-string '\p{IsBasicLatin}'

K
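
Select-String ignores case by default, just like -match. As a sketch, adding the -CaseSensitive switch should stop the Kelvin sign from being treated as an ASCII K. The variable names $kelvin and $hit are only for illustration.

```powershell
$kelvin = [string][char]0x212A   # KELVIN SIGN, outside the Basic Latin block

# Case-sensitive search: no case folding, so no match is expected here.
$hit = $kelvin | Select-String -Pattern '\p{IsBasicLatin}' -CaseSensitive
$null -eq $hit                        # True when nothing matched

# The case-sensitive operator gives the same answer.
$kelvin -cmatch '\p{IsBasicLatin}'    # False
```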

Environment data

Name                           Value
----                           -----
PSVersion                      7.0.0
PSEdition                      Core
GitCommitId                    7.0.0
OS                             Microsoft Windows 10.0.14393
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Issue-Question Resolution-Answered

jszabo98

All 3 comments

This will be dependent on system culture settings and whether you use case sensitive or insensitive matching. Also, any perceived or actual discrepancies there are entirely down to how the regex processor in .NET Core is handling it. While we can document such discrepancies, we cannot fix them here; if we need them fixed or there are corrections to be made, we'll need to file issues in the https://github.com/dotnet/runtime repo. 🙂

vexx32 on 9 Mar 2020

👍1

The culture is irrelevant. My culture is en-US, and the Turkish İ still gets conflated with small i and capital I when case is ignored. I'm not sure what culture the Kelvin sign belongs to. I posted a little about it on Stack Overflow: https://stackoverflow.com/questions/30805741/match-high-ascii-characters-but-not-the-letter-i/60590324#60590324
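
One way to check whether culture plays a role on a given machine is to call the .NET regex engine directly with explicit options. This is only a diagnostic sketch: the variable names are illustrative, and the two case-insensitive results may differ between .NET versions and cultures, so no output is shown for them.

```powershell
$pattern = '\P{IsBasicLatin}'
$options = [System.Text.RegularExpressions.RegexOptions]

# Same input and pattern, three option sets. Only the first one is
# guaranteed to be False, because it applies no case folding at all.
$results = [ordered]@{
    'None'                         = [regex]::IsMatch('i', $pattern, $options::None)
    'IgnoreCase'                   = [regex]::IsMatch('i', $pattern, $options::IgnoreCase)
    'IgnoreCase, CultureInvariant' = [regex]::IsMatch('i', $pattern, $options'IgnoreCase, CultureInvariant')
}

# Show the culture the session is running under, then the three results.
[cultureinfo]::CurrentCulture.Name
$results
```

If the last two rows agree, the current culture is probably not what causes the match.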

jszabo98 on 9 Mar 2020

It's funny how this kind of thing happens even outside .NET. Maybe I should file a ticket with the Unicode Consortium.

echo i | findstr /i İ
i

jszabo98 on 15 Mar 2020

😄1
