Termux-packages: GNU grep can't match foreign language characters and outputs everything

Created on 2 Nov 2018  ·  22 Comments  ·  Source: termux/termux-packages

Hi, I noticed that GNU grep has trouble matching Czech characters, so it outputs more lines than it should.

Reproducible example:

### Setup

# upgrade grep from busybox to gnu
pkg install grep

# I also installed coreutils, but not sure if it is relevant
pkg install coreutils

# libandroid-support is supposed to extend locale support in Bionic, but it had no effect on my usecase
# I tried with and without it
pkg install libandroid-support

# restart bash session; I also rebooted


### Test grep
LANG=cs_CZ.UTF-8
LC_ALL=cs_CZ.UTF-8
# I also tried "CODESET=cs_CZ.UTF-8" because I saw it in another issue; probably not relevant

echo bar > test.txt
echo hezky česky >> test.txt
echo foo >> test.txt

# busybox grep returns one line as it should
cat test.txt | /data/data/com.termux/files/usr/bin/applets/grep česky

# gnu grep returns all the lines
cat test.txt | grep česky

Shortened output

-bash-4.4$ LANG=cs_CZ.UTF-8
-bash-4.4$ LC_ALL=cs_CZ.UTF-8
-bash: warning: setlocale: LC_ALL: cannot change locale (cs_CZ.UTF-8): No such file or directory

-bash-4.4$ echo bar > test.txt
-bash-4.4$ echo hezky česky >> test.txt
-bash-4.4$ echo foo >> test.txt

-bash-4.4$ cat test.txt | /data/data/com.termux/files/usr/bin/applets/grep česky
hezky česky

-bash-4.4$ cat test.txt | grep česky
bar
hezky česky
foo

I think the LANG variable alone should make this work, but I set LC_ALL along with it. Unfortunately, setting LC_ALL fails, but that is a problem I reported separately in #3009.
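One detail that may matter when reproducing this: a plain assignment such as LC_ALL=cs_CZ.UTF-8 only reaches child processes like grep if the variable is exported. The thread does not say whether that played a role here. A minimal sketch of the difference, assuming a POSIX shell (the value C is used only because that locale exists everywhere):

```shell
# Start from a state where LC_ALL is neither set nor exported.
unset LC_ALL

# A plain assignment creates a shell variable that children do not inherit.
LC_ALL=C
child_plain=$(sh -c 'echo "${LC_ALL:-unset}"')

# After export, child processes (grep included) see the value.
export LC_ALL
child_exported=$(sh -c 'echo "${LC_ALL:-unset}"')

echo "plain assignment: $child_plain"
echo "after export:     $child_exported"
```

If LANG or LC_ALL is already exported by the login environment, a plain assignment updates the exported value, so this only matters for variables that were not in the environment before.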


tl;dr

grep variant | android 7.1 @ arm | fornwall's device | android 8 @ aarch64
-------------------------------------------- | ---------------------| ------------------- | -----------------------
(reported) | 2x | 1x | 1x
busybox grep | ✔ | ✔ |
busybox grep -E; egrep | ✔ | |
busybox grep -F | ✔ | |
gnu grep; grep -G | ❌ | ✔ | ✔
gnu grep -E; egrep | ❌ | |
gnu grep -F, fgrep | ✔ | |
gnu grep -P | ✔ | |
freebsd /system/bin/grep; grep -G | ✔ | |
freebsd /system/bin/grep -E; egrep | ✔ | |
freebsd /system/bin/grep -F; fgrep | ✔ | |
freebsd /system/bin/grep -P | not supported | |

bug report help wanted

All 22 comments

> LANG=cs_CZ.UTF-8
> LC_ALL=cs_CZ.UTF-8

@vovcacik We don't have locale support.

I don't want to make this issue about the LC_ALL (there is #3009 for that), but rather about the LANG variable and about the gnu grep.

It seems that grep sees the UTF-8 bytes of č (0xC4 0x8D) correctly:

grep --color='auto' -P -n "[\x80-\xFF]" test.txt
grep --color='auto' -P -n "[\x80-\xFF]+" test.txt
grep --color='auto' -P -n "[\xC4]" test.txt
grep --color='auto' -P -n "[\x8D]" test.txt

[Screenshot: 2018-11-03_154149_ivymrcr]
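The byte values can be checked independently of grep. A small sketch, assuming od and tr are available; the character is written with octal escapes so the result does not depend on the terminal encoding:

```shell
# "č" (U+010D) encoded in UTF-8 is the two bytes 0xC4 0x8D (octal 304 215).
bytes=$(printf '\304\215' | od -An -tx1 | tr -d ' \n')
echo "$bytes"
```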

UTF-8 support and locales are different things.
libandroid-support doesn't install locale files for grep and other packages.

No doubt about that.

I am afraid that two things

  • the -bash: warning: setlocale: LC_ALL:... locale error and
  • the fact that I also reported a locale-related problem in #3009 alongside this issue #3010

made you think this is also a locale problem. I am not saying it is not, but notice that grep č test.txt does not itself require a locale per se. It is basically just character matching, and everything is in UTF-8, so I don't see what problem grep is having.

An example of a regex operation that does require a locale would be grep "[a-k]" test.txt, since the character class uses the locale's collating sequence to determine which characters fall in the "a" to "k" range.
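To illustrate the range case under the one locale that is always available: with LC_ALL=C the range is evaluated by byte value, so the result is predictable. This is only a sketch of the idea; in other locales the set of characters covered by the same range may differ.

```shell
# In the C locale, [a-k] covers the bytes "a" through "k".
# "bar" and "foo" each contain such a letter; "xyz" does not.
range_out=$(printf 'bar\nfoo\nxyz\n' | LC_ALL=C grep '[a-k]')
echo "$range_out"
```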

More findings:

-bash-4.4$ /data/data/com.termux/files/usr/bin/applets/grep č test.txt
hezky česky
-bash-4.4$ /data/data/com.termux/files/usr/bin/applets/grep -F č test.txt
hezky česky
-bash-4.4$ /data/data/com.termux/files/usr/bin/applets/grep -E č test.txt
hezky česky
-bash-4.4$ grep č test.txt
bar
hezky česky
foo
-bash-4.4$ grep -G č test.txt
bar
hezky česky
foo
-bash-4.4$ grep -E č test.txt
bar
hezky česky
foo
-bash-4.4$ grep -P č test.txt
hezky česky
-bash-4.4$ grep -F č test.txt
hezky česky

I guess the success of grep -F is not that surprising after my previous comment; the success of grep -P, however, definitely is.

So I've got my workaround, feel free to close if you don't consider this a bug.
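The comparison above can be repeated in one go. A rough sketch, assuming mktemp is available; the temporary directory and the variable names are illustrative, and -P is skipped gracefully because not every grep build supports it:

```shell
# Recreate the three-line sample file from the report in a temporary directory.
tmpdir=$(mktemp -d)
printf 'bar\nhezky česky\nfoo\n' > "$tmpdir/test.txt"

# Count matching lines for each matcher; exactly one line contains "č".
for mode in -G -E -F -P; do
    # Some grep builds lack -P, so a failure is reported instead of aborting.
    count=$(grep -c "$mode" -e 'č' "$tmpdir/test.txt" 2>/dev/null) || true
    echo "grep $mode: ${count:-not supported} matching line(s)"
done

# The workaround from this comment: fixed-string matching.
fixed_count=$(grep -c -F -e 'č' "$tmpdir/test.txt")
echo "grep -F count: $fixed_count"
rm -rf "$tmpdir"
```

On an affected device the -G and -E rows would be expected to report 3 instead of 1.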

@vovcacik Thanks a lot for reporting! I'm unable to reproduce it on a device I tested with just now. The below transcript indicates that I cannot reproduce your problem, right?

localhost$ echo bar > test.txt
localhost$ echo hezky česky >> test.txt
localhost$ echo foo >> test.txt
localhost$ cat test.txt | grep česky
hezky česky
localhost$ cat test.txt | busybox grep česky
hezky česky
localhost$ which grep
/data/data/com.termux/files/usr/bin/grep

As seen, both busybox grep and GNU grep correctly find only the matching line. This is regardless of whether I set LANG=cs_CZ.UTF-8 LC_ALL=cs_CZ.UTF-8 or not.

Some things to try:

  1. Update to latest packages with pkg up if you haven't already done so.
  2. Try running grep without environment variables set and see if that makes a difference (cat test.txt | env -i grep česky).

Does that make a change? If not, could you paste the output from running termux-info here, as it may be specific to arch/android version/device?
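A caveat for the env -i suggestion: it also drops PATH, so the name grep may resolve to a different binary than in the normal session. A quick way to compare, as a sketch that assumes sh can still be found with an empty environment:

```shell
# Resolve grep with the current PATH.
with_env=$(command -v grep)

# Resolve grep again inside an emptied environment (PATH falls back to a default).
without_env=$(env -i sh -c 'command -v grep') || true

echo "current environment: $with_env"
echo "empty environment:   ${without_env:-not found}"
```

If the two paths differ, the env -i test exercises a different grep implementation and says little about the one under investigation.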

Yes, it appears alright on your device. You could maybe double check that you are running gnu grep from /data/data/com.termux/files/usr/bin/grep, but I don't see why you wouldn't.

I'll try the suggestions as soon as possible and get back to you.

I followed the suggestion to run pkg up, restarted the ssh session and tried env -i. It didn't really help with GNU grep, but with an empty environment it seems that the FreeBSD /system/bin/grep gets executed instead.

  • setup
$ echo foo > test.txt
$ echo hezky česky >> test.txt
$ echo bar >> test.txt
  • performing grep
$ cat test.txt | grep česky
foo
hezky česky
bar
$ cat test.txt | busybox grep česky
hezky česky
$ cat test.txt | env -i grep česky
hezky česky
  • identifying grep
$ which grep
/data/data/com.termux/files/usr/bin/grep
$ sha1sum `which grep`
48e865431d5ceffc4cc414885560dd5c4b831f2a  /data/data/com.termux/files/usr/bin/grep
$ env -i which grep
$ env -i /data/data/com.termux/files/usr/bin/applets/which grep
/system/bin/grep
  • identifying grep again
$ grep --version
grep (GNU grep) 3.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
$ env -i grep --version
grep (BSD grep) 2.5.1-FreeBSD
$ busybox grep --version
busybox: unrecognized option `--version'
BusyBox v1.29.3 (2018-09-10 22:47:04 UTC) multi-call binary.
  • more info for you
$ termux-info
Updatable packages:
All packages up to date
System information:
Linux localhost 3.4.0-eXcaliBur+ #1 SMP PREEMPT Thu Jul 20 00:11:26 EDT 2017 armv7l Android
Termux-packages arch:
arm
Android version:
7.1.2
Device manufacturer:
OnePlus
Device model:
One
$ env
LD_LIBRARY_PATH=/data/data/com.termux/files/usr/lib
SSH_CONNECTION=192.168.1.5 60264 192.168.1.10 8022
LANG=en_US.UTF-8
PREFIX=/data/data/com.termux/files/usr
USER=u0_a83
PWD=/data/data/com.termux/files/home/__test
HOME=/data/data/com.termux/files/home
SSH_CLIENT=192.168.1.5 60264 8022
TMPDIR=/data/data/com.termux/files/usr/tmp
SSH_TTY=/dev/pts/3
SHELL=/data/data/com.termux/files/usr/bin/bash
TERM=xterm
SHLVL=1
ANDROID_ROOT=/system
ANDROID_DATA=/data
LOGNAME=u0_a83
EXTERNAL_STORAGE=/sdcard
PATH=/data/data/com.termux/files/usr/bin:/data/data/com.termux/files/usr/bin/applets
LD_PRELOAD=/data/data/com.termux/files/usr/lib/libtermux-exec.so
OLDPWD=/data/data/com.termux/files/home
_=/data/data/com.termux/files/usr/bin/env
  • grepping through $PATH dirs
$ ls -la /data/data/com.termux/files/usr/bin | grep grep
-rwx------ 1 u0_a83 u0_a83      59 Jul 10  2017 egrep
-rwx------ 1 u0_a83 u0_a83      59 Jul 10  2017 fgrep
-rwx------ 1 u0_a83 u0_a83  129180 Jul 10  2017 grep
$ ls -la /data/data/com.termux/files/usr/bin/applets | grep grep
lrwxrwxrwx 1 u0_a83 u0_a83    10 Nov 11 17:47 egrep -> ../busybox
lrwxrwxrwx 1 u0_a83 u0_a83    10 Nov 11 17:47 grep -> ../busybox
lrwxrwxrwx 1 u0_a83 u0_a83    10 Nov 11 17:47 pgrep -> ../busybox

I can confirm the problem on arm and Android 7.1, but busybox grep works as intended.

[Screenshot: screenshot_20181111-200705_termux]

On AArch64 and Android 8 no problem with gnu grep.

@vovcacik Just being curious, what do

grep -n česky test.txt
grep -no . test.txt

give you?

@tomty89 Interesting. grep switches to binary mode and stops printing the rest of the line when it hits č:

$ echo foo > test.txt
$ echo hezky česky >> test.txt
$ echo bar >> test.txt
$ cat test.txt
foo
hezky česky
bar
$ cat test.txt | grep č
foo
hezky česky
bar
$ grep -n česky test.txt
1:foo
2:hezky česky
3:bar
$ grep -no . test.txt
1:f
1:o
1:o
2:h
2:e
2:z
2:k
2:y
2:
3:b
3:a
3:r
Binary file test.txt matches

But I can't say whether this is expected or not.
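For context, the "Binary file test.txt matches" line is what GNU grep prints when it decides the input is not text. GNU grep documents -a (--text) to force text handling. Nothing in this thread tests whether -a would have helped on the affected devices, so the sketch below only demonstrates the option itself, using a NUL byte as a stand-in trigger for binary detection:

```shell
# A NUL byte on the first line is a stand-in for whatever trips binary detection.
# Without -a, GNU grep may report a "binary file matches" notice instead of the line.
printf 'foo\000\nbar\n' | grep bar || true

# With -a the input is treated as text and the matching line is printed.
text_out=$(printf 'foo\000\nbar\n' | grep -a bar)
echo "$text_out"
```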

Hmm, looks like it's even more messed up than I thought (I had assumed the newlines were being ignored for some reason, for example because multiple characters were treated as a single character).

Now I wonder if grep -no česky test.txt gives the same output as grep -no . test.txt does...

> looks like it's even more messed up than I thought

Not sure grep itself is at fault. This looks more like a libandroid-support (or libc-specific) problem, as I can reproduce the issue on Android 5.1 (x86_64 AVD) but not on Android 9 (x86_64 AVD). It also never happens on my AArch64 device.

> Now I wonder if grep -no česky test.txt gives the same output as grep -no . test.txt does...

Output is different. See output of these 2 commands on Android 5.1 (x86_64):

[Screenshot: a51]

On Android 9 (x86_64) it seems okay, though:
[Screenshot: a90]


Busybox's grep -no . test.txt:
[Screenshot: a90_bb]

@vovcacik @tomty89 I guess this PR will fix that: https://github.com/termux/termux-packages/pull/3060. At least it worked for me. I can provide *.deb files so you can test it yourself.

I know. Most likely it's some old bionic bug.

> Output is different.

That actually makes the problem look even more irrational. It seems grep ignores newlines, but only in a peculiar manner? (Partially ignoring them when producing the final output, but not when matching?)

Not sure if it's relevant, but I can't make grep in Termux do what's in your second post. In Arch (proot) I can make that happen by unsetting LANG or setting it to C. It seems Termux is always UTF-8.

@xeffyr if depending on libandroid-support fixes it I wonder if it's a duplicate of #3047

When the libandroid-support dependency is set, the script ./build-package.sh appends its include directory to CPPFLAGS:

if [ "$TERMUX_PKG_DEPENDS" != "${TERMUX_PKG_DEPENDS/libandroid-support/}" ]; then
    # If using the android support library, link to it and include its headers as system headers:
    CPPFLAGS+=" -isystem $TERMUX_PREFIX/include/libandroid-support"
    LDFLAGS+=" -landroid-support"
fi
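The condition in that snippet is a substring test: removing "libandroid-support" from the dependency string changes it only if the name was present. The same check can be written portably with case. This is a sketch for illustration; depends_on_support is a hypothetical helper, not part of build-package.sh, and the dependency strings are made up:

```shell
# Hypothetical helper: succeeds when the dependency list mentions libandroid-support.
depends_on_support() {
    case "$1" in
        *libandroid-support*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example dependency strings (illustrative values only).
if depends_on_support "pcre, libandroid-support"; then
    echo "would add -isystem and -landroid-support"
fi
if ! depends_on_support "pcre"; then
    echo "flags left unchanged"
fi
```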

I know, which is silly. I don't see any reason why we should have a symlink for one of the headers but not the other (and explicitly depend on libandroid-support package by package whenever we notice a problem). In fact, I'm not sure there's a good reason for not putting them directly under include/.

The updated 3.1-1 version of the grep package, now available for installation, should fix this.

@tomty89 Agreed, this whack-a-mole of adding libandroid-support when a problem pops up is a bit silly.

It's fixed, thank you!
