On Sun, Nov 20, 2016 at 9:53 PM, Jim Meyering wrote:
> On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas wrote:
>> 2016-11-20 21:50:28 +0000, Stephane Chazelas:
>>> $ locale charmap
>>> GB18030
>>> $ printf '\uC9\n' | grep '.*7' | hd
>>> 00000000  81 30 87 37 0a  |.0.7.|
>>> 00000005
>>>
>>> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
>> [...]
>>> Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
>> [...]
>>
>> Same behaviour with 2.26 on Solaris 11.
>
> Thank you for the report.
> I can reproduce that error on Fedora 25 with this:
>
>   $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
>   5
>
> I confirmed that the problem does not arise (i.e., no match, with exit
> status of 1) when we force the use of glibc's regex matcher by
> inserting a trivial back-reference:
>
>   $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E '()\1.*7' k); echo $?
>   1
>
> This bisected to v2.18-54-g3ef4c8e, but that commit was just the
> messenger: it exposed the latent bug by making it so this case was no
> longer handled by glibc's regexp matcher, but rather by grep's dfa.c.

I've fixed this by forcing any non-UTF8 multibyte locale to use regex
rather than DFA matcher with the following. The gnulib/dfa patch makes
that change, and the grep change updates to latest gnulib, adds tests
and NEWS.

I suspect this won't be the last word in this area, because it feels
like we should be able to adjust DFA's tables so that people using such
locales can retain DFA's efficiency without the bug in the current
implementation.
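
For anyone following along without a GB18030 locale installed, here is a
small illustration of the ambiguity behind the report. It is only a
sketch in Python, assuming its built-in gb18030 codec and `re` module; it
is not part of the patch and does not model how dfa.c or glibc's regex
actually work. It just contrasts matching on raw bytes with matching on
decoded characters for the same input line.

```python
import re

# U+00C9 followed by a newline, encoded as a GB18030 locale would write it.
line = "\u00c9\n".encode("gb18030")
print(line.hex(" "))  # 81 30 87 37 0a

# Byte-oriented view: the last byte of the 4-byte character is 0x37,
# which is also ASCII '7', so a matcher walking raw bytes sees a '7'.
byte_match = re.search(rb".*7", line) is not None

# Character-oriented view: decode first, then match on characters.
# The line holds the single character U+00C9 and no '7' at all.
char_match = re.search(r".*7", line.decode("gb18030")) is not None

print(byte_match, char_match)  # True False
```

The byte-oriented result corresponds to the false match reported above,
and the character-oriented one to the expected "no match, exit status 1"
behaviour.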