GNU bug report logs - #24973
[regression] [d-f] no longer includes e with acute accent in single-byte locales

Package: grep

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Sun, 20 Nov 2016 21:15:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


Report forwarded to bug-grep <at> gnu.org:
bug#24973; Package grep. (Sun, 20 Nov 2016 21:15:01 GMT)

Acknowledgement sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 20 Nov 2016 21:15:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: [regression] [d-f] no longer includes e with acute accent in
 single-byte locales
Date: Sun, 20 Nov 2016 21:14:31 +0000
Hello,

In grep 2.26,

echo é | grep '[d-f]'

no longer matches in locales like fr_FR.iso885915@euro or
en_GB.iso88591 where the character set is single-byte like
ISO-8859-1. It still works OK with UTF-8.

2.25 was OK. git bisect points to commit
2769d5331a38d623b67b1860ac46b39ff7e54aca

Reproduce with:

printf '\351\n' | LC_ALL=en_US.iso88591 ./src/grep '[d-f]' || echo fail

(assuming that locale is available on the system).
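
To tell a missing locale apart from the regression itself, the reproduction can be guarded by an availability check. The following is only a sketch: `normalize_locale`, `have_locale`, and the `GREP` override variable are hypothetical names introduced here, and the normalization (lowercasing, dropping hyphens) is an assumption about how spellings such as `en_US.ISO8859-1` and `en_US.iso88591` relate.

```shell
#!/bin/sh
# Hypothetical helper: lowercase a locale name and drop hyphens so that
# en_US.ISO8859-1 and en_US.iso88591 compare equal.
normalize_locale() {
    printf '%s\n' "$1" | LC_ALL=C tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# Hypothetical helper: succeed if the named locale is listed by `locale -a`.
have_locale() {
    want=$(normalize_locale "$1")
    locale -a 2>/dev/null |
        while IFS= read -r name; do normalize_locale "$name"; done |
        grep -Fqx -- "$want"
}

loc=en_US.iso88591        # locale from the report; adjust for your system
grep_cmd=${GREP:-grep}    # assumption: set GREP=./src/grep to test a build tree

if ! have_locale "$loc"; then
    echo "skip: locale $loc is not installed"
elif printf '\351\n' | LC_ALL=$loc "$grep_cmd" '[d-f]' >/dev/null; then
    echo "ok: [d-f] matches byte 0xe9 in $loc"
else
    echo "fail: [d-f] does not match byte 0xe9 in $loc"
fi
```

A "skip" result says nothing about grep; only "ok" or "fail" reflects the range behaviour.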

Tested on Ubuntu 16.04 amd64.

-- 
Stephane




Information forwarded to bug-grep <at> gnu.org:
bug#24973; Package grep. (Sun, 20 Nov 2016 21:24:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Re: [regression] [d-f] no longer includes e with acute accent in
 single-byte locales
Date: Sun, 20 Nov 2016 21:23:16 +0000
2016-11-20 21:14:31 +0000, Stephane Chazelas:
[...]
> echo é | grep '[d-f]'
> 
> no longer matches in locales like fr_FR.iso885915@euro or
> en_GB.iso88591 where the character set is single-byte like
> ISO-8859-1. It still works OK with UTF-8.
[...]

It also seems to still be OK with other multi-byte locales like
zh_HK.big5hkscs:

$ locale charmap
BIG5-HKSCS
$ printf '\ue9' | ./src/grep '[d-f]' | hd
00000000  88 6d 0a                                          |.m.|
00000003

Though '.*m' matches as well, even though the 'm' (0x6d) is only the
second byte of that character:

$ printf '\ue9' | ./src/grep '.*m' | hd
00000000  88 6d 0a                                          |.m.|

However, that seems to be a separate issue as it also failed in
earlier versions. I'll raise that separately.

-- 
Stephane




Information forwarded to bug-grep <at> gnu.org:
bug#24973; Package grep. (Sun, 20 Nov 2016 21:39:01 GMT)

Message #11 received at submit <at> debbugs.gnu.org:

From: Dennis Clarke <dclarke <at> blastwave.org>
To: bug-grep <at> gnu.org, stephane.chazelas <at> gmail.com
Subject: Re: bug#24973: [regression] [d-f] no longer includes e with acute
 accent in single-byte locales
Date: Sun, 20 Nov 2016 16:38:29 -0500
On 11/20/2016 04:14 PM, Stephane Chazelas wrote:
> printf '\351\n' | LC_ALL=en_US.iso88591

On a Solaris 10 system the locales are named a bit differently:

dasoyva_$ locale -a
C
POSIX
en_CA
en_CA.ISO8859-1
en_CA.UTF-8
en_US
en_US.ISO8859-1
en_US.ISO8859-15
en_US.ISO8859-15@euro
en_US.UTF-8
es
es_MX
es_MX.ISO8859-1
es_MX.UTF-8
fr
fr_CA
fr_CA.ISO8859-1
fr_CA.UTF-8

dasoyva_$ LC_ALL=en_US.ISO8859-1 /usr/bin/printf '\351\n' | od -Ax -t x1 -v
0000000 e9 0a
0000002

I am not sure if the single byte 0xe9h is correct at all for this test.

dasoyva_$ LC_ALL=en_US.UTF-8 /usr/bin/printf '\351\n' | od -Ax -t x1 -v
0000000 e9 0a
0000002

dasoyva_$ LC_ALL=en_US.ISO8859-1 /usr/bin/printf '\351\n'
�

Wonder how I would test this on a strict POSIX system here. Any thoughts?

Dennis





Information forwarded to bug-grep <at> gnu.org:
bug#24973; Package grep. (Sun, 20 Nov 2016 22:07:01 GMT)

Message #14 received at 24973 <at> debbugs.gnu.org:

From: Stephane CHAZELAS <stephane.chazelas <at> gmail.com>
To: Dennis Clarke <dclarke <at> blastwave.org>
Cc: 24973 <at> debbugs.gnu.org
Subject: Re: bug#24973: [regression] [d-f] no longer includes e with acute
 accent in single-byte locales
Date: Sun, 20 Nov 2016 22:06:29 +0000
2016-11-20 16:38:29 -0500, Dennis Clarke:
[...]
> On a Solaris 10 system the locales are named a bit different :
> dasoyva_$ locale -a
[...]
> en_US.ISO8859-1
> en_US.ISO8859-15
[...]
> dasoyva_$ LC_ALL=en_US.ISO8859-1 /usr/bin/printf '\351\n' | od -Ax -t x1 -v
> 0000000 e9 0a
> 0000002
> 
> I am not sure if the single byte 0xe9h is correct at all for this test.
[...]

Note that

printf '\351'

will print the byte 0xe9 regardless of the locale.

0xe9 happens to be the code point for é in ISO8859-1 and
ISO8859-15.
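
A small sketch of the point above: the octal escape names a byte, not a character, so its meaning depends on the charset used to read it. `hex_bytes` is a helper name introduced here; the `iconv` step assumes an `iconv` that knows ISO-8859-1 and UTF-8, and is skipped if none is installed.

```shell
#!/bin/sh
# Helper (name is ours): dump stdin as hex digits with no separators.
hex_bytes() {
    od -An -tx1 | tr -d ' \t\n'
}

# The escape always yields the single byte 0xe9.
printf '\351' | hex_bytes; echo        # e9

# Reading that byte as ISO-8859-1 and re-encoding it as UTF-8 gives the
# two-byte sequence for é, so UTF-8 input carries different bytes.
if command -v iconv >/dev/null 2>&1; then
    printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes; echo   # c3a9
fi
```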


> dasoyva_$ LC_ALL=en_US.UTF-8 /usr/bin/printf '\351\n' | od -Ax -t x1 -v
> 0000000 e9 0a
> 0000002
> 
> dasoyva_$ LC_ALL=en_US.ISO8859-1 /usr/bin/printf '\351\n'
> �
> 
> Wonder how I would test this on a strict POSIX system here. Any thoughts?
[...]

POSIX leaves all that unspecified. It doesn't specify any locale
other than C/POSIX. It leaves '[d-f]' unspecified in locales
other than C/POSIX.

Here, the problem is a change of behaviour between GNU grep 2.25
and 2.26 (and the 2.26 behaviour makes it inconsistent with other
GNU utilities). Both behaviours are POSIX compliant, since [d-f] is
unspecified anyway.

On your Solaris machine, you can check:

printf '\351\n' | LC_ALL=en_US.ISO8859-1 gnu-grep '[d-f]' | od -An -vtx1

And check if it's consistent with /usr/xpg4/bin/grep.

-- 
Stephane




Information forwarded to bug-grep <at> gnu.org:
bug#24973; Package grep. (Sun, 20 Nov 2016 22:23:01 GMT)

Message #17 received at 24973 <at> debbugs.gnu.org:

From: Stephane CHAZELAS <stephane.chazelas <at> gmail.com>
To: Dennis Clarke <dclarke <at> blastwave.org>
Cc: 24973 <at> debbugs.gnu.org
Subject: Re: bug#24973: [regression] [d-f] no longer includes e with acute
 accent in single-byte locales
Date: Sun, 20 Nov 2016 22:22:36 +0000
2016-11-20 16:38:29 -0500, Dennis Clarke:
> On 11/20/2016 04:14 PM, Stephane Chazelas wrote:
> >printf '\351\n' | LC_ALL=en_US.iso88591
> 
> On a Solaris 10 system
[...]

FWIW, on Solaris 11, it looks as if (speculated from very few
tests) GNU grep's ranges ([x-y]) are only based on code point,
both in 2.25 and 2.26 so [d-f] doesn't match é in any locale.

It seems to behave like /bin/grep in that instance, not
/usr/xpg4/bin/grep.
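
For comparison, a range based purely on code points can be sketched with the C locale, where (an assumption of this sketch) grep compares byte values, so `[d-f]` covers only 0x64 to 0x66. `c_range_match` is a name introduced here for illustration; this says nothing about how the Solaris grep implementations work internally.

```shell
#!/bin/sh
# Illustrative helper: succeed if stdin has a line matching [d-f] when the
# range is evaluated in the C locale, i.e. by byte value.
c_range_match() {
    LC_ALL=C grep '[d-f]' >/dev/null
}

if printf 'e\n' | c_range_match; then
    echo "0x65 (e): inside [d-f]"
fi
if ! printf '\351\n' | c_range_match; then
    echo "0xe9: outside [d-f] by code point"
fi
```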

-- 
Stephane




Information forwarded to bug-grep <at> gnu.org:
bug#24973; Package grep. (Mon, 21 Nov 2016 04:20:01 GMT)

Message #20 received at 24973 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-gnulib <at> gnu.org,
	24973 <at> debbugs.gnu.org,
	stephane.chazelas <at> gmail.com
Cc: Paul Eggert <eggert <at> cs.ucla.edu>
Subject: [PATCH] dfa: fix logic typo
Date: Sun, 20 Nov 2016 20:18:38 -0800
Problem reported by Stephane Chazelas (Bug#24973).
* lib/dfa.c (using_simple_locale): Fix typo that caused some
non-simple locales like fr_FR to be treated as simple.
---
 ChangeLog | 7 +++++++
 lib/dfa.c | 4 ++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 88139c3..fbdecf0 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,10 @@
+2016-11-20  Paul Eggert  <eggert <at> cs.ucla.edu>
+
+	dfa: fix logic typo
+	Problem reported by Stephane Chazelas (Bug#24973).
+	* lib/dfa.c (using_simple_locale): Fix typo that caused some
+	non-simple locales like fr_FR to be treated as simple.
+
 2016-11-20  Jim Meyering  <meyering <at> fb.com>
 
 	fix test driver leaks: exclude, malloc, realloc
diff --git a/lib/dfa.c b/lib/dfa.c
index 744a9f1..7b80a1a 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -815,8 +815,8 @@ using_simple_locale (bool multibyte)
      && '}' == 125 && '~' == 126)
   };
 
-  if (native_c_charset && !multibyte)
-    return true;
+  if (!native_c_charset || multibyte)
+    return false;
   else
     {
       /* Treat C and POSIX locales as being compatible.  Also, treat
-- 
2.7.4
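
The effect of the two-line change can be modelled outside C. The sketch below is a shell model, not the Gnulib code: `old_simple` and `new_simple` are hypothetical names, their arguments stand in for `native_c_charset`, `multibyte`, and the current locale name, and `is_c_or_posix` is an assumption about the `else` branch based only on the comment visible in the diff.

```shell
#!/bin/sh
# Assumed stand-in for the else branch: only C and POSIX count as simple.
is_c_or_posix() {
    case $1 in
        C|POSIX) return 0 ;;
        *)       return 1 ;;
    esac
}

# Model of the code before the patch.
# Usage: old_simple NATIVE_C_CHARSET MULTIBYTE LOCALE   (flags: true/false)
old_simple() {
    if [ "$1" = true ] && [ "$2" = false ]; then
        return 0                  # early "simple"; locale never consulted
    fi
    is_c_or_posix "$3"
}

# Model of the code after the patch (same arguments).
new_simple() {
    if [ "$1" != true ] || [ "$2" = true ]; then
        return 1                  # non-native or multibyte: never simple
    fi
    is_c_or_posix "$3"
}

for loc in C fr_FR.iso88591; do
    old_simple true false "$loc" && old=simple || old=not-simple
    new_simple true false "$loc" && new=simple || new=not-simple
    echo "$loc: before=$old after=$new"
done
```

Under these assumptions, a single-byte French locale is reported as simple before the patch and as not simple after it, while C is simple in both cases.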





Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 21 Nov 2016 04:35:02 GMT)

Notification sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>:
bug acknowledged by developer. (Mon, 21 Nov 2016 04:35:02 GMT)

Message #25 received at 24973-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>, 24973-done <at> debbugs.gnu.org
Subject: Re: bug#24973: [regression] [d-f] no longer includes e with acute
 accent in single-byte locales
Date: Sun, 20 Nov 2016 20:34:35 -0800
Stephane Chazelas wrote:
> 2.25 was OK. git bisect points to commit
> 2769d5331a38d623b67b1860ac46b39ff7e54aca

Thanks for pinpointing the bug. It was my logic error in that commit. Fixed by 
altering Gnulib as follows:

http://lists.gnu.org/archive/html/bug-gnulib/2016-11/msg00086.html

and by installing the attached patches into grep.
[0001-build-update-gnulib-submodule-to-latest.patch (text/x-diff, attachment)]
[0002-tests-check-for-unibyte-French-range-bug.patch (text/x-diff, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 19 Dec 2016 12:24:04 GMT)
