Shellcheck: [A-Z] (range) pattern matches depend on locale in bash.

Created on 13 Dec 2018 · 19 Comments · Source: koalaman/shellcheck

I tried this on shellcheck.net online, and got no results (no issues found), and I expected to get a warning.

Consider the following shellscript './Demo'

#!/bin/bash
for i in "$@"; do
  case "$i" in
    *[A-Z]*) echo "$i has upper case";;
  esac
done

In dash, mksh, ksh, and zsh, this will always produce the following output (regardless of locale) :

$ env - LANG=en_US.UTF-8 /bin/dash ./Demo a b C y z
C has upper case

But in bash, the results will depend on the locale that is set. For example, when LANG is set to en_US.UTF-8, you will get the following results :

$ env - LANG=en_US.UTF-8 /bin/bash ./Demo a b C y z
b has upper case
C has upper case
y has upper case
z has upper case

The rationale/reason for this is that the bash developers feel that this is how character sets work, and since the locale specifies the ordering, under this locale, the [A-Z] range means: [AbBcCdD..zZ] (that's why it doesn't match 'a', and does match 'y' and 'z'). In bash, the only locale guaranteed to have that ordering for ranges ([a-z], [A-Z], ...) is the POSIX locale (LANG=C or LANG=POSIX). If you always want the set of upper case characters in bash regardless of locale, you need to use [:upper:] instead of [A-Z] (or explicitly set LANG to C or POSIX).
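A minimal sketch of the Demo script rewritten with the character class, assuming only POSIX `case` pattern matching. The helper name `has_upper` is made up for illustration. Note that inside a bracket expression the class is written `[[:upper:]]`, and that in a non-C locale it may also match non-ASCII upper case letters.

```shell
#!/bin/sh
# has_upper is a hypothetical helper: it succeeds when the argument
# contains at least one character of the current locale's upper class.
has_upper() {
  case "$1" in
    *[[:upper:]]*) return 0 ;;
    *) return 1 ;;
  esac
}

for i in a b C y z; do
  if has_upper "$i"; then
    echo "$i has upper case"
  fi
done
```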

Here's what shellcheck currently says:

No issues detected!

Here's what I wanted or expected to see:

A warning, stating that when using bash, either use [:upper:] instead of [A-Z] and [:lower:] instead of [a-z], or explicitly set LANG to either C or POSIX. Of course, when you want sub-ranges like [A-D] you cannot use [:upper:], which is [A-Z], so the only option left is setting LANG.
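For short sub-ranges there is also the option of spelling out the members, which does not depend on collation order at all. A sketch under that assumption; the helper name `in_a_to_d` is hypothetical.

```shell
#!/bin/sh
# in_a_to_d is a hypothetical helper: it succeeds when the argument
# contains one of the four letters A, B, C or D, whatever the locale.
in_a_to_d() {
  case "$1" in
    *[ABCD]*) return 0 ;;
    *) return 1 ;;
  esac
}

for i in a b C y z; do
  if in_a_to_d "$i"; then
    echo "$i is in A-D"
  fi
done
```

This gets tedious for long ranges, where changing the locale is the more practical route.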

I guess this is closely related to issue #985.

PS:
The piece of information in the GNU AWK manual at the url below describes the same issue that is/was true for (certain versions of) gawk (and GNU grep as well), but I guess those commands deserve their own issue here.

https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

mhoes

All 19 comments

This is essentially a bash bug...it should be fixed in bash.

orbea on 29 Dec 2018

This is essentially a bash bug...it should be fixed in bash.

Unfortunately, this is not a 'bug'. The behavior of Bash is intended to be this way, and in fact is even POSIX compliant [1]. (For reference, back in 2014 an issue was opened for Bash for this behavior on the GNU website [2], which was closed for the valid reason that the behavior is indeed intended). Please read the link to the GNU website [3] that explains the issue in some more detail (even though a different approach that is equally POSIX compliant was eventually settled on for GNU Awk).

[1]
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05

[2]
https://savannah.gnu.org/support/index.php?108609

[3]
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

mhoes on 29 Dec 2018

Hrm. Reading back my own link to the relevant part of the POSIX standard and the issue #985 I referenced, I guess that indeed the correct warning message to issue here when using range matches in a shellscript, is that their usage is not portable or even has 'undefined' behavior in some cases. Accompanied by an appropriate longer explanation of the issue on the wiki, that states that the behavior is only defined in POSIX when using the 'C' or 'POSIX' locales, or when using the following character class expressions defined in POSIX that are supported in all locales :

[:alnum:]   [:cntrl:]   [:lower:]   [:space:]
[:alpha:]   [:digit:]   [:print:]   [:upper:]
[:blank:]   [:graph:]   [:punct:]   [:xdigit:]
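To illustrate how these classes are used in shell patterns, here is a small sketch. The helper name `classify` is invented for the example, and only a handful of the twelve classes are shown. Each class name goes inside an outer pair of brackets.

```shell
#!/bin/sh
# classify is a hypothetical helper that reports which POSIX class the
# first matching pattern assigns to a single character.
classify() {
  case "$1" in
    [[:digit:]]) echo digit ;;
    [[:upper:]]) echo upper ;;
    [[:lower:]]) echo lower ;;
    [[:space:]]) echo space ;;
    [[:punct:]]) echo punct ;;
    *) echo other ;;
  esac
}

for c in 7 Q q ' ' '!'; do
  printf '%s -> %s\n' "$c" "$(classify "$c")"
done
```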

mhoes on 29 Dec 2018

No, this is just a bash bug. Until you start using special characters, [A-Z] should be entirely valid, and a, b, C, y, z are not special characters.

Not that I expect the bash devs to care, though, just like they don't care about this infinite loop...

#!/bin/sh

# Setup the test directory:
#   mkdir -p /tmp/test
#   cd /tmp/test
#   for i in $(seq 100000); do touch "$i" ; done

empty () { case "${1:-}" in "${2:-}") return 0 ;; *) return 1 ;; esac; }

set -- ./*

for i do
  empty "$i"
done

# Performance is greatly reduced with each
# additional 0 added to the number of files.

# Issues are mostly avoided with:
#   for i in ./*; do
# or
#   case "$i" in '' ) : ;; esac

Or how they don't care that they are incorrectly using .so in their man pages forcing downstream distributions to fix it for them.

orbea on 29 Dec 2018

No, this is just a bash bug

Hi, it sounds like you have not done so yet, so may I ask you if you could please at least once read the page in the GNU Awk manual I linked to ?

By the way, this exact same behavior used to be present in both GNU Awk and GNU grep (see below) for the exact same reason (see the GNU Awk manual page link). It's just that it has been modified in more recent versions of GNU grep/awk because the current POSIX specification allows both, not because one of them is right and the other wrong.

$ awk --version
GNU Awk 3.1.8
$ grep --version
GNU grep 2.6.3


$ cat ./foo.txt
b c d e f
B C D E F
$ env - LANG=C awk '/[A-Z]/ {print $0}' ./foo.txt
B C D E F
$ env - LANG=en_US.UTF-8 awk '/[A-Z]/ {print $0}' ./foo.txt
b c d e f
B C D E F
$ env - LANG=C grep '[A-Z]' ./foo.txt
B C D E F
$ env - LANG=en_US.UTF-8  grep '[A-Z]' ./foo.txt
b c d e f
B C D E F
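A sketch of two ways to get the same result from grep regardless of its version, using the sample data shown above. The function names are hypothetical. The first uses a POSIX character class, the second sets `LC_ALL=C` for the single command.

```shell
#!/bin/sh
# Recreate the two-line sample file from the example on stdout.
sample() {
  printf '%s\n' 'b c d e f' 'B C D E F'
}

# Option 1: a POSIX character class, which does not rely on range order.
upper_by_class() {
  sample | grep '[[:upper:]]'
}

# Option 2: force the C locale for this one command only.
upper_by_c_locale() {
  sample | LC_ALL=C grep '[A-Z]'
}

upper_by_class
upper_by_c_locale
```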

mhoes on 29 Dec 2018

If you want a good indication that this is a bash bug, see literally every other shell that does this right.

For example ash, dash, ksh, loksh, mksh, oksh, posh, pdksh, yash and zsh.

This even works correctly in hsh which is a true non-posix bourne shell. Also note that yash does it right which strictly follows the posix spec.

Whether awk also shares this bug with bash is not relevant.

orbea on 29 Dec 2018

If you want a good indication that this is a bash bug, see literally every other shell that does this right.

For example ash, dash, ksh, loksh, mksh, oksh, posh, pdksh, yash and zsh.

Well, as you can see in my original post, I already noticed that dash, mksh, ksh, and zsh behave the way you prefer. It's just that it is not the only correct way to behave, you have options here. You may or may not like all the options personally, but they are there to choose from.

This even works correctly in hsh which is a true non-posix bourne shell.

And I guess that means that it does not have to follow POSIX, whereas I am writing about POSIX compliance, so what's your point here ?

Also note that yash does it right which strictly follows the posix spec.

Again, POSIX allows you to do it multiple ways, so you can be POSIX compliant doing it either way.

Whether awk also shares this bug with bash is not relevant.

The point is that the same rationale/reason (following POSIX) is the root cause for both older GNU awk/grep versions behavior and current bash behavior. I find the page on the GNU awk website more readable, but if you'd like you are free to read the relevant bits in the POSIX specification which I also linked to.

But now that you have demonstrated multiple times over that you are not going to be able to contribute in a meaningful manner, I will be off to doing more constructive things.

mhoes on 29 Dec 2018

But now that you have demonstrated multiple times over that you are not going to be able to contribute in a meaningful manner, I will be off to doing more constructive things.

Just because this is a bash bug doesn't mean you have to be insulting...

And I guess that means that it does not have to follow POSIX, whereas I am writing about POSIX compliance, so what's your point here ?

I suppose I should make this clearer, this is a bash bug which is ignoring around 40 years of unix history and breaks backwards compatibility.

orbea on 29 Dec 2018

Just because this is a bash bug doesn't mean you have to be insulting...

I honestly do not mean to be insulting (and if I was I apologize), but from where I am standing it does appear that you consciously decided multiple times over to simply not read the reference material that was presented to you.

I suppose I should make this clearer, this is a bash bug which is ignoring around 40 years of unix history and breaks backwards compatibility.

And if you had read that GNU Awk manual page I linked to, you might have understood by now why backwards compatibility is currently broken in some cases, in which cases it is not, and for which historical reasons this is so.

mhoes on 29 Dec 2018

I have read the reference material, the posix spec discusses how there is much disagreement while the gawk manual explains this behavior is unexpected and reverted.

My interpretation is that bash is the odd one out for no good reason other than the usual and tired "We know best!" routine. They have no good reason to break compatibility with most if not all other programs, but for some further demonstration...

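#!/bin/bash

# The closing quote was missing in the original post.
foo='a b C y z'
# Quote the tr arguments so the shell cannot expand them as globs.
echo "$foo" | tr '[A-Z]' '[a-z]'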

$ env - LANG=en_US.UTF-8 /bin/bash /tmp/test.sh 
a b c y z
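For comparison, a sketch of the same conversion written with character classes, which avoids relying on range order in tr at all. The helper name `to_lower` is hypothetical. The quotes matter: an unquoted `[A-Z]` is a glob and would be replaced by a matching file name if one exists in the current directory.

```shell
#!/bin/sh
# to_lower is a hypothetical helper; the quotes keep the shell from
# treating the tr arguments as glob patterns.
to_lower() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

to_lower 'a b C y z'
```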

orbea on 29 Dec 2018

I have read the reference material

Ah, now we are getting somewhere.

the posix spec discusses how there is much disagreement

Agreed. Although the specification does leave room for implementations to disagree: the behavior under the 'C' and 'POSIX' locales is defined, while the behavior under other locales 'is up to the particular implementation', and even then 'however it is implemented' is still POSIX compliant. GNU grep/awk eventually chose to revert to the pre-locales situation, while bash chose to stick with an earlier POSIX specification that introduced locale-dependent ranges. Both are POSIX compliant according to the latest POSIX specification.

while the gawk manual explains this behavior is unexpected and reverted.

Agreed. Although it is worth mentioning that both this and the bash behavior are POSIX compliant according to the latest POSIX spec (see above), and that the gawk 'reversion' was 'made possible' by the updated POSIX specification.

My interpretation is that bash is the odd one out

Based on personal experience, I agree; However, it is still not 'incorrect' in relation to POSIX compliance.

for no good reason other than the usual and tired "We know best!" routine.

I disagree. The bash developers deliberately chose to stick with an earlier POSIX standard, which the current POSIX standard acknowledges and leaves room for. (hence the 'undefined')

They have no good reason to break compatibility with most if not all other programs

But they do, namely an earlier POSIX specification, which the current POSIX specification still leaves room for, explicitly.

mhoes on 29 Dec 2018

I read them when you first posted them too...

My point still stands, adding warnings for bash bugs which only affect certain locales and certain commands while only using bash is wrong. It should still be fixed in bash...

This issue should really be brought up with bash devs again instead of forcing their misfeatures onto other programs.

orbea on 29 Dec 2018

My point still stands

In fact, all of your points have fallen flat on their face and have been run into the ground thoroughly, while all of my points still stand.

for bash bugs

It is not a bug, it is POSIX allowed/specified behavior.

which only affect certain locales

It affects almost all locales, just not 'C' and 'POSIX'.

and certain commands

It affects at least current bash, and older versions of GNU awk and grep.

It should still be fixed in bash...

There is nothing to fix.

This issue should really be brought up with bash devs

This has been done, and was rightfully so dismissed.

mhoes on 29 Dec 2018

Well, - insulting or not - I am off now. Nevermind me.

mhoes on 29 Dec 2018

It affects at least current bash, and older versions of GNU awk and grep.

That is because those programs fixed the bug...

This has been done, and was rightfully so dismissed.

Their reasons are terrible and it's clearly a bug.

It affects almost all locales, just not 'C' and 'POSIX'.

I tried several other locales and it appears you are right, correction noted.

orbea on 29 Dec 2018

I think it is fixed by default in bash 5.0, as globasciiranges is now enabled by default.

globasciiranges: If set, range expressions used in pattern matching bracket expressions (see Pattern Matching above) behave as if in the traditional C locale when performing comparisons. That is, the current locale's collating sequence is not taken into account, so b will not collate between A and B, and upper-case and lower-case ASCII characters will collate together.
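A sketch of turning the option on explicitly rather than relying on the default, assuming a bash recent enough to know the option (to my knowledge it was introduced in bash 4.3). The helper name `is_ascii_upper` is hypothetical, and the guard only exists so the snippet stays harmless under other shells.

```shell
#!/bin/bash
# Enable ASCII-ordered ranges explicitly; the guard keeps the snippet
# harmless when it is run by a shell other than bash.
if [ -n "${BASH_VERSION:-}" ]; then
  shopt -s globasciiranges 2>/dev/null ||
    echo "globasciiranges not supported by this bash" >&2
fi

# is_ascii_upper is a hypothetical helper for the demonstration.
is_ascii_upper() {
  case "$1" in
    [A-Z]) return 0 ;;
    *) return 1 ;;
  esac
}

for i in a b C y z; do
  if is_ascii_upper "$i"; then
    echo "$i has upper case"
  fi
done
```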

pawamoy on 8 Jan 2019

Nice catch. It seems you are correct, the 'NEWS' file for bash 5.0 does indeed state that the 'globasciiranges' option (which forces the same behavior as the 'C' locale, regardless of the locale in effect) is now enabled by default.

So I guess the question becomes what to do with that information. Just ignore the issue raised here, or still issue a warning since it has been the default behavior for bash for years ? For example, does shellcheck currently make a distinction between ksh88 and ksh93 (or newer) behavior when they differ, or does shellcheck just assume 'latest ksh' when the ksh shell is specified ? (For example, ksh88 doesn't have associative arrays, but currently shellcheck does not seem to issue a warning when you use them with ksh).

PS: It does appear that shellcheck lets you differentiate between the ksh versions with 'shellcheck --shell=ksh88 scriptname' and 'shellcheck --shell=ksh93 scriptname'. ... And then, if I interpret the source code correctly, proceeds to treat both ksh88 and ksh93 the same way. Nevermind.

mhoes on 8 Jan 2019

Regardless of all the discussion here about whether or not this behavior should exist: this is how bash behaves (or at least how the most common versions in the wild today behave), and so it would be beneficial for ShellCheck to provide at least a severity=info level warning when encountering [A-Z]/[a-z] and recommend the POSIX character classes for portability.

xPMo on 31 Mar 2019

I agree. If it is currently intended behavior, then shellcheck should issue a warning for it.

However, I would personally prefer it if shellcheck would recommend to set LANG to C or POSIX (LANG=C or LANG=POSIX) instead of recommending POSIX character classes. Because a character class like [:upper:] or [:lower:] can only be used if you want to match [A-Z] or [a-z] (the entire alphabet), but not if you want to match a subset like for example [a-e] or [G-P] (and setting LANG does work for those cases as well). The same applies for [:digit:] versus [1-5], for example.
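A sketch of the locale approach for a sub-range, with one caveat: `LC_ALL` takes precedence over `LC_COLLATE` and `LANG`, so setting `LANG` alone has no effect when either of the others is already set in the environment. The example therefore sets `LC_ALL`. The helper name `in_g_to_p` is hypothetical.

```shell
#!/bin/sh
# in_g_to_p is a hypothetical helper. The function body is a subshell,
# so the locale change does not leak into the caller.
in_g_to_p() (
  LC_ALL=C
  export LC_ALL
  case "$1" in
    *[G-P]*) exit 0 ;;
    *) exit 1 ;;
  esac
)

for i in a h H Q; do
  if in_g_to_p "$i"; then
    echo "$i is in G-P"
  fi
done
```

Running the match in a subshell costs a fork per call; a script that never needs locale-aware behavior could instead export `LC_ALL=C` once at the top.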

mhoes on 1 Apr 2019
