Hi, I noticed that GNU grep has a problem matching Czech characters, so it outputs more lines than it should.
Reproducible example:
### Setup
# upgrade grep from busybox to gnu
pkg install grep
# I also installed coreutils, but not sure if it is relevant
pkg install coreutils
# libandroid-support is supposed to extend locale support in Bionic, but it had no effect on my use case
# I tried with and without it
pkg install libandroid-support
# restart bash session; I also rebooted
### Test grep
LANG=cs_CZ.UTF-8
LC_ALL=cs_CZ.UTF-8
# I also tried "CODESET=cs_CZ.UTF-8" because I saw it in another issue; probably not relevant
echo bar > test.txt
echo hezky česky >> test.txt
echo foo >> test.txt
# busybox grep returns one line as it should
cat test.txt | /data/data/com.termux/files/usr/bin/applets/grep česky
# gnu grep returns all the lines
cat test.txt | grep česky
Shortened output
-bash-4.4$ LANG=cs_CZ.UTF-8
-bash-4.4$ LC_ALL=cs_CZ.UTF-8
-bash: warning: setlocale: LC_ALL: cannot change locale (cs_CZ.UTF-8): No such file or directory
-bash-4.4$ echo bar > test.txt
-bash-4.4$ echo hezky česky >> test.txt
-bash-4.4$ echo foo >> test.txt
-bash-4.4$ cat test.txt | /data/data/com.termux/files/usr/bin/applets/grep česky
hezky česky
-bash-4.4$ cat test.txt | grep česky
bar
hezky česky
foo
I think the LANG variable alone should make this work, but I set LC_ALL along with it. Unfortunately, setting LC_ALL fails, but that is a separate problem that I reported in #3009.
tl;dr
grep variant | android 7.1 @ arm | fornwall's device | android 8 @ aarch64
-------------------------------------------- | ---------------------| ------------------- | -----------------------
(reported) | 2x | 1x | 1x
busybox grep | ✔ | ✔ |
busybox grep -E; egrep | ✔ | |
busybox grep -F | ✔ | |
gnu grep; grep -G | ❌ | ✔ | ✔
gnu grep -E; egrep | ❌ | |
gnu grep -F, fgrep | ✔ | |
gnu grep -P | ✔ | |
freebsd /system/bin/grep; grep -G | ✔ | |
freebsd /system/bin/grep -E; egrep | ✔ | |
freebsd /system/bin/grep -F; fgrep | ✔ | |
freebsd /system/bin/grep -P | not supported | |
> LANG=cs_CZ.UTF-8
> LC_ALL=cs_CZ.UTF-8
@vovcacik We don't have locale support.
I don't want to make this issue about the LC_ALL (there is #3009 for that), but rather about the LANG variable and about the gnu grep.
It seems that grep sees the UTF-8 bytes of č (0xC4 0x8D) correctly:
grep --color='auto' -P -n "[\x80-\xFF]" test.txt
grep --color='auto' -P -n "[\x80-\xFF]+" test.txt
grep --color='auto' -P -n "[\xC4]" test.txt
grep --color='auto' -P -n "[\x8D]" test.txt
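For reference, the byte values quoted above can be checked independently of grep. This is a minimal sketch, assuming a POSIX `printf`, `od`, and `tr`; the names `c_caron_bytes` and `c_caron_hex` are made up for illustration.

```shell
# Emit the two bytes 0xC4 0x8D (the UTF-8 encoding of č) using octal escapes,
# which work the same way in every POSIX printf.
c_caron_bytes() {
    printf '\304\215'
}

# Dump the bytes as hex; od pads its output with spaces, so strip them with tr.
c_caron_hex=$(c_caron_bytes | od -An -tx1 | tr -d ' \n')
echo "$c_caron_hex"   # prints c48d
```

Both bytes are in the 0x80-0xFF range, which is why the byte-level bracket expressions above can match them.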

UTF-8 support and locales are different things.
libandroid-support doesn't install locale files for grep and other packages.
No doubt about that.
I am afraid that the -bash: warning: setlocale: LC_ALL:... locale error made you think this is also a locale problem. I am not saying it is not, but notice that grep č test.txt does not itself require a locale per se. It is basically just character matching, and everything is in UTF-8, so I don't see what problem grep is having.
An example of a regex operation that requires a locale would be grep "[a-k]" test.txt, since a range expression uses the locale's collating sequence to determine which characters fall between "a" and "k".
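To illustrate the range case, here is a small sketch that pins the locale to C, where a range is defined purely by byte values. The sample input and the variable name `range_matches` are assumptions for illustration. In other locales the result may differ, because what a range contains there depends on the collating sequence and on the grep implementation.

```shell
# In the C locale, [a-k] covers exactly the bytes from "a" to "k",
# so only the lowercase "b" line matches; "B" and "z" fall outside the range.
range_matches=$(printf 'b\nB\nz\n' | LC_ALL=C grep '[a-k]')
echo "$range_matches"   # prints b
```

A literal pattern such as č involves no range or character class, which is the point made above: matching it should not depend on collation data.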
More findings:
-bash-4.4$ /data/data/com.termux/files/usr/bin/applets/grep č test.txt
hezky česky
-bash-4.4$ /data/data/com.termux/files/usr/bin/applets/grep -F č test.txt
hezky česky
-bash-4.4$ /data/data/com.termux/files/usr/bin/applets/grep -E č test.txt
hezky česky
-bash-4.4$ grep č test.txt
bar
hezky česky
foo
-bash-4.4$ grep -G č test.txt
bar
hezky česky
foo
-bash-4.4$ grep -E č test.txt
bar
hezky česky
foo
-bash-4.4$ grep -P č test.txt
hezky česky
-bash-4.4$ grep -F č test.txt
hezky česky
I guess the success of grep -F is not that surprising after my previous comment; the success of grep -P, however, definitely is.
So I've got my workaround, feel free to close if you don't consider this a bug.
@vovcacik Thanks a lot for reporting! I'm unable to reproduce it on a device I tested with just now. The below transcript indicates that I cannot reproduce your problem, right?
localhost$ echo bar > test.txt
localhost$ echo hezky česky >> test.txt
localhost$ echo foo >> test.txt
localhost$ cat test.txt | grep česky
hezky česky
localhost$ cat test.txt | busybox grep česky
hezky česky
localhost$ which grep
/data/data/com.termux/files/usr/bin/grep
As seen, both busybox grep and GNU grep correctly find only the matching line. This is regardless of whether I set LANG=cs_CZ.UTF-8 LC_ALL=cs_CZ.UTF-8 or not.
Some things to try:
- Run pkg up if you haven't already done so.
- Run grep with an empty environment (cat test.txt | env -i grep česky).

Does that make a change? If not, could you paste the output from running termux-info here, as the problem may be specific to the arch/Android version/device?
Yes, it appears alright on your device. You could maybe double check that you are running gnu grep from /data/data/com.termux/files/usr/bin/grep, but I don't see why you wouldn't.
I'll try the suggestions as soon as possible and get back to you.
I followed the suggestion to run pkg up, restarted the ssh session, and tried env -i. It didn't really help with GNU grep: with an empty environment, it is the FreeBSD /system/bin/grep that gets executed instead.
$ echo foo > test.txt
$ echo hezky česky >> test.txt
$ echo bar >> test.txt
$ cat test.txt | grep česky
foo
hezky česky
bar
$ cat test.txt | busybox grep česky
hezky česky
$ cat test.txt | env -i grep česky
hezky česky
$ which grep
/data/data/com.termux/files/usr/bin/grep
$ sha1sum `which grep`
48e865431d5ceffc4cc414885560dd5c4b831f2a /data/data/com.termux/files/usr/bin/grep
$ env -i which grep
$ env -i /data/data/com.termux/files/usr/bin/applets/which grep
/system/bin/grep
$ grep --version
grep (GNU grep) 3.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
$ env -i grep --version
grep (BSD grep) 2.5.1-FreeBSD
$ busybox grep --version
busybox: unrecognized option `--version'
BusyBox v1.29.3 (2018-09-10 22:47:04 UTC) multi-call binary.
$ termux-info
Updatable packages:
All packages up to date
System information:
Linux localhost 3.4.0-eXcaliBur+ #1 SMP PREEMPT Thu Jul 20 00:11:26 EDT 2017 armv7l Android
Termux-packages arch:
arm
Android version:
7.1.2
Device manufacturer:
OnePlus
Device model:
One
$ env
LD_LIBRARY_PATH=/data/data/com.termux/files/usr/lib
SSH_CONNECTION=192.168.1.5 60264 192.168.1.10 8022
LANG=en_US.UTF-8
PREFIX=/data/data/com.termux/files/usr
USER=u0_a83
PWD=/data/data/com.termux/files/home/__test
HOME=/data/data/com.termux/files/home
SSH_CLIENT=192.168.1.5 60264 8022
TMPDIR=/data/data/com.termux/files/usr/tmp
SSH_TTY=/dev/pts/3
SHELL=/data/data/com.termux/files/usr/bin/bash
TERM=xterm
SHLVL=1
ANDROID_ROOT=/system
ANDROID_DATA=/data
LOGNAME=u0_a83
EXTERNAL_STORAGE=/sdcard
PATH=/data/data/com.termux/files/usr/bin:/data/data/com.termux/files/usr/bin/applets
LD_PRELOAD=/data/data/com.termux/files/usr/lib/libtermux-exec.so
OLDPWD=/data/data/com.termux/files/home
_=/data/data/com.termux/files/usr/bin/env
$ ls -la /data/data/com.termux/files/usr/bin | grep grep
-rwx------ 1 u0_a83 u0_a83 59 Jul 10 2017 egrep
-rwx------ 1 u0_a83 u0_a83 59 Jul 10 2017 fgrep
-rwx------ 1 u0_a83 u0_a83 129180 Jul 10 2017 grep
$ ls -la /data/data/com.termux/files/usr/bin/applets | grep grep
lrwxrwxrwx 1 u0_a83 u0_a83 10 Nov 11 17:47 egrep -> ../busybox
lrwxrwxrwx 1 u0_a83 u0_a83 10 Nov 11 17:47 grep -> ../busybox
lrwxrwxrwx 1 u0_a83 u0_a83 10 Nov 11 17:47 pgrep -> ../busybox
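Since the transcript shows that different grep binaries run depending on the environment, it can help to list every candidate in lookup order. The following is a minimal sketch; `find_all_grep` is a hypothetical helper written for this example, not a Termux tool.

```shell
# Print every executable file named "grep" found in a colon-separated
# search path, in the order the shell would look them up.
find_all_grep() {
    search_path=$1
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        if [ -f "$dir/grep" ] && [ -x "$dir/grep" ]; then
            printf '%s\n' "$dir/grep"
        fi
    done
    IFS=$old_ifs
}

# The first line printed is the binary that a plain "grep" would run.
find_all_grep "$PATH"
```

With the PATH shown in the env output above, this would be expected to list the GNU grep in usr/bin before the busybox applet in usr/bin/applets. Under env -i the PATH variable is unset, so the lookup falls back to a default search path, which would explain why /system/bin/grep ran there.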
I can confirm the problem on arm and Android 7.1; busybox grep works as intended.

On AArch64 and Android 8 no problem with gnu grep.
@vovcacik Just being curious, what do
grep -n česky test.txt
grep -no . test.txt
give you?
@tomty89 Interesting. grep switches to binary mode and stops printing the rest of the line when it hits č:
$ echo foo > test.txt
$ echo hezky česky >> test.txt
$ echo bar >> test.txt
$ cat test.txt
foo
hezky česky
bar
$ cat test.txt | grep č
foo
hezky česky
bar
$ grep -n česky test.txt
1:foo
2:hezky česky
3:bar
$ grep -no . test.txt
1:f
1:o
1:o
2:h
2:e
2:z
2:k
2:y
2:
3:b
3:a
3:r
Binary file test.txt matches
But I can't say whether this is expected or not.
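As a diagnostic, GNU grep can be told to treat its input as text with -a (--binary-files=text). GNU grep classifies input as binary when it contains NUL bytes or bytes that are not validly encoded for the current locale. The sketch below uses a NUL byte as the trigger only because that works in any locale; it is an illustration of the option, not a reproduction of the bug, and `sample_with_nul` and `text_mode_out` are made-up names.

```shell
# Three lines, the second of which starts with a NUL byte.
sample_with_nul() {
    printf 'foo\n\000hezky\nbar\n'
}

# With -a, grep prints the matching line as text instead of a
# "Binary file ... matches" notice. tr removes the NUL so the result
# can be stored in a shell variable.
text_mode_out=$(sample_with_nul | grep -a 'hezky' | tr -d '\000')
echo "$text_mode_out"   # prints hezky
```

If grep -a česky test.txt printed only the expected line on the affected device, that would suggest the pattern matching itself is also affected by the misclassified bytes, rather than only the output stage.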
Hmm, looks like it's even more messed up than I thought (I had assumed the newlines were being ignored for some reason, for example because multiple characters were treated as a single character).
Now I wonder if grep -no česky test.txt gives the same output as grep -no . test.txt does...
> looks like it's even more messed up than I thought
Not sure that grep is the one messing up. More likely this is a libandroid-support (or libc-specific) problem, as I can reproduce this issue on Android 5.1 (x86_64 AVD) but not on Android 9 (x86_64 AVD). It also never happens on my AArch64 device.
> Now I wonder if grep -no česky test.txt gives the same output as grep -no . test.txt does...
Output is different. See output of these 2 commands on Android 5.1 (x86_64):

On Android 9 (x86_64) it seems okay, though:

Busybox's grep -no . test.txt:

@vovcacik @tomty89 I guess this PR will fix that: https://github.com/termux/termux-packages/pull/3060. At least it worked for me. I can provide *.deb files so you can test it yourself.
I know. Most likely it's some old bionic bug.
> Output is different.
That actually makes the problem look even more irrational. Seems like grep ignores newlines, but only in a peculiar manner? (Partially ignoring them when producing the final output but not when matching?)
Not sure if it's relevant, but I can't make grep in Termux do what's in your second post. In Arch (proot) I can make that happen by unsetting LANG or setting it to C. It seems Termux is always UTF-8.
@xeffyr if depending on libandroid-support fixes it I wonder if it's a duplicate of #3047
When the libandroid-support dependency is set, the ./build-package.sh script appends its include directory to CPPFLAGS and links against the library:
if [ "$TERMUX_PKG_DEPENDS" != "${TERMUX_PKG_DEPENDS/libandroid-support/}" ]; then
    # If using the android support library, link to it and include its headers as system headers:
    CPPFLAGS+=" -isystem $TERMUX_PREFIX/include/libandroid-support"
    LDFLAGS+=" -landroid-support"
fi
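The condition in that snippet is a substring test: it removes libandroid-support from the dependency string and checks whether the result differs from the original. Here is a minimal sketch of the same test written with `case`, so that it also runs in a plain POSIX sh. The function name `depends_on_support` and the sample dependency strings are hypothetical.

```shell
# Succeed if the given dependency string mentions libandroid-support.
depends_on_support() {
    case $1 in
        *libandroid-support*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical dependency lists for two packages.
for deps in "libandroid-support, pcre" "pcre"; do
    if depends_on_support "$deps"; then
        echo "[$deps] would get -landroid-support"
    else
        echo "[$deps] would be built without it"
    fi
done
```

Either form means that a package only picks up the extra headers and library when it lists the dependency explicitly.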
grep packages for testing:
I know, which is silly. I don't see any reason why we should have a symlink for one of the headers but not the other (and explicitly depend on libandroid-support package by package whenever we notice a problem). In fact, I'm not sure there's a good reason for not putting them directly under include/.
The updated 3.1-1 version of the grep package, now available for installation, should fix this.
@tomty89 Agreed, this whack-a-mole of adding libandroid-support when a problem pops up is a bit silly.
It's fixed, thank you!