GNU bug report logs - #24973
[regression] [d-f] no longer includes e with acute accent in single-byte locales
Report forwarded to bug-grep <at> gnu.org: bug#24973; Package grep. (Sun, 20 Nov 2016 21:15:01 GMT)
Acknowledgement sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 20 Nov 2016 21:15:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello,
In grep 2.26,
echo é | grep '[d-f]'
no longer matches in locales like fr_FR.iso885915 <at> euro or
en_GB.iso88591 where the character set is single-byte like
ISO-8859-1. It still works OK with UTF-8.
2.25 was OK. git bisect points to commit
2769d5331a38d623b67b1860ac46b39ff7e54aca
Reproduce with:
printf '\351\n' | LC_ALL=en_US.iso88591 ./src/grep '[d-f]' || echo fail
(assuming that locale is available on the system).
Tested on Ubuntu 16.04 amd64.
--
Stephane
Information forwarded to bug-grep <at> gnu.org: bug#24973; Package grep. (Sun, 20 Nov 2016 21:24:02 GMT)
Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):
2016-11-20 21:14:31 +0000, Stephane Chazelas:
[...]
> echo é | grep '[d-f]'
>
> no longer matches in locales like fr_FR.iso885915 <at> euro or
> en_GB.iso88591 where the character set is single-byte like
> ISO-8859-1. It still works OK with UTF-8.
[...]
It also seems to still be OK with other multi-byte locales like
zh_HK.big5hkscs:
$ locale charmap
BIG5-HKSCS
$ printf '\ue9' | ./src/grep '[d-f]' | hd
00000000 88 6d 0a |.m.|
00000003
Though:
$ printf '\ue9' | ./src/grep '.*m' | hd
00000000 88 6d 0a |.m.|
However, that seems to be a separate issue as it also failed in
earlier versions. I'll raise that separately.
--
Stephane
Information forwarded to bug-grep <at> gnu.org: bug#24973; Package grep. (Sun, 20 Nov 2016 21:39:01 GMT)
Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):
On 11/20/2016 04:14 PM, Stephane Chazelas wrote:
> printf '\351\n' | LC_ALL=en_US.iso88591
On a Solaris 10 system the locales are named a bit differently:
dasoyva_$ locale -a
C
POSIX
en_CA
en_CA.ISO8859-1
en_CA.UTF-8
en_US
en_US.ISO8859-1
en_US.ISO8859-15
en_US.ISO8859-15 <at> euro
en_US.UTF-8
es
es_MX
es_MX.ISO8859-1
es_MX.UTF-8
fr
fr_CA
fr_CA.ISO8859-1
fr_CA.UTF-8
dasoyva_$ LC_ALL=en_US.ISO8859-1 /usr/bin/printf '\351\n' | od -Ax -t x1 -v
0000000 e9 0a
0000002
I am not sure if the single byte 0xe9 is correct at all for this test.
dasoyva_$ LC_ALL=en_US.UTF-8 /usr/bin/printf '\351\n' | od -Ax -t x1 -v
0000000 e9 0a
0000002
dasoyva_$ LC_ALL=en_US.ISO8859-1 /usr/bin/printf '\351\n'
�
Wonder how I would test this on a strict POSIX system here. Any thoughts?
Dennis
Information forwarded to bug-grep <at> gnu.org: bug#24973; Package grep. (Sun, 20 Nov 2016 22:07:01 GMT)
Message #14 received at 24973 <at> debbugs.gnu.org (full text, mbox):
2016-11-20 16:38:29 -0500, Dennis Clarke:
[...]
> On a Solaris 10 system the locales are named a bit differently:
> dasoyva_$ locale -a
[...]
> en_US.ISO8859-1
> en_US.ISO8859-15
[...]
> dasoyva_$ LC_ALL=en_US.ISO8859-1 /usr/bin/printf '\351\n' | od -Ax -t x1 -v
> 0000000 e9 0a
> 0000002
>
> I am not sure if the single byte 0xe9 is correct at all for this test.
[...]
Note that
printf '\351'
will print the byte 0xe9 regardless of the locale.
0xe9 happens to be the code point for é in ISO8859-1 and
ISO8859-15.
> dasoyva_$ LC_ALL=en_US.UTF-8 /usr/bin/printf '\351\n' | od -Ax -t x1 -v
> 0000000 e9 0a
> 0000002
>
> dasoyva_$ LC_ALL=en_US.ISO8859-1 /usr/bin/printf '\351\n'
> �
>
> Wonder how I would test this on a strict POSIX system here. Any thoughts?
[...]
POSIX leaves all that unspecified. It doesn't specify any locale
other than C/POSIX. It leaves '[d-f]' unspecified in locales
other than C/POSIX.
Here, the problem is a change of behaviour between GNU grep 2.25
and 2.26 (and the 2.26 behaviour makes it inconsistent with other
GNU utilities). Both behaviours are POSIX compliant, since [d-f] is
unspecified anyway.
On your Solaris machine, you can check:
printf '\351\n' | LC_ALL=en_US.ISO8859-1 gnu-grep '[d-f]' | od -An -vtx1
And check if it's consistent with /usr/xpg4/bin/grep.
--
Stephane
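The point made in this message, that printf '\351' emits the byte 0xe9 whatever the locale, can be illustrated with a minimal C sketch. The helper names octal_351_byte and utf8_encode_small are hypothetical and appear nowhere in grep or printf. The second helper shows why the same byte is not a valid é in a UTF-8 locale: there, U+00E9 is encoded as the two bytes 0xc3 0xa9.

```c
#include <stddef.h>

/* The octal escape \351 denotes one byte with value 0xE9 (decimal 233).
   The current locale plays no part in this. */
static unsigned char
octal_351_byte (void)
{
  return (unsigned char) '\351';
}

/* Hypothetical helper: encode a code point below U+0800 as UTF-8.
   Returns the number of bytes written (1 or 2), or 0 if CP is out of
   the range this sketch handles. */
static size_t
utf8_encode_small (unsigned int cp, unsigned char out[2])
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  return 0;
}
```

In ISO8859-1 and ISO8859-15 the byte returned by octal_351_byte is é itself, which is why the reproduction recipe pipes it into grep under a single-byte locale.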
Information forwarded to bug-grep <at> gnu.org: bug#24973; Package grep. (Sun, 20 Nov 2016 22:23:01 GMT)
Message #17 received at 24973 <at> debbugs.gnu.org (full text, mbox):
2016-11-20 16:38:29 -0500, Dennis Clarke:
> On 11/20/2016 04:14 PM, Stephane Chazelas wrote:
> >printf '\351\n' | LC_ALL=en_US.iso88591
>
> On a Solaris 10 system
[...]
FWIW, on Solaris 11, it looks as if (speculated from very few
tests) GNU grep's ranges ([x-y]) are only based on code point,
both in 2.25 and 2.26, so [d-f] doesn't match é in any locale.
It seems to behave like /bin/grep in that instance, not
/usr/xpg4/bin/grep.
--
Stephane
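The two range behaviours contrasted in this thread, code-point order versus locale order, can be sketched in C as follows. This is only an approximation under stated assumptions. It is not how grep or any regex library implements bracket ranges. The function names are hypothetical, and strcoll on one-character strings stands in for whatever collation data a real matcher consults.

```c
#include <stdbool.h>
#include <string.h>

/* Range test by raw byte value: 0xE9 is never between 'd' (0x64)
   and 'f' (0x66). */
static bool
in_range_by_code_point (unsigned char lo, unsigned char c, unsigned char hi)
{
  return lo <= c && c <= hi;
}

/* Range test by the collation order of the current LC_COLLATE locale,
   approximated with strcoll on one-character strings.  In the "C"
   locale strcoll behaves like strcmp, so this agrees with the
   code-point test above. */
static bool
in_range_by_collation (unsigned char lo, unsigned char c, unsigned char hi)
{
  char l[2] = { (char) lo, '\0' };
  char m[2] = { (char) c, '\0' };
  char h[2] = { (char) hi, '\0' };
  return strcoll (l, m) <= 0 && strcoll (m, h) <= 0;
}
```

A program starts in the "C" locale, so both functions reject 0xE9 for the range d to f until setlocale (LC_COLLATE, ...) selects a single-byte locale. Whether the collation variant then accepts it depends on the locale data installed on the system.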
Information forwarded to bug-grep <at> gnu.org: bug#24973; Package grep. (Mon, 21 Nov 2016 04:20:01 GMT)
Message #20 received at 24973 <at> debbugs.gnu.org (full text, mbox):
Problem reported by Stephane Chazelas (Bug#24973).
* lib/dfa.c (using_simple_locale): Fix typo that caused some
non-simple locales like fr_FR to be treated as simple.
---
ChangeLog | 7 +++++++
lib/dfa.c | 4 ++--
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 88139c3..fbdecf0 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,10 @@
+2016-11-20 Paul Eggert <eggert <at> cs.ucla.edu>
+
+ dfa: fix logic typo
+ Problem reported by Stephane Chazelas (Bug#24973).
+ * lib/dfa.c (using_simple_locale): Fix typo that caused some
+ non-simple locales like fr_FR to be treated as simple.
+
2016-11-20 Jim Meyering <meyering <at> fb.com>
fix test driver leaks: exclude, malloc, realloc
diff --git a/lib/dfa.c b/lib/dfa.c
index 744a9f1..7b80a1a 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -815,8 +815,8 @@ using_simple_locale (bool multibyte)
      && '}' == 125 && '~' == 126)
   };
 
-  if (native_c_charset && !multibyte)
-    return true;
+  if (!native_c_charset || multibyte)
+    return false;
   else
     {
       /* Treat C and POSIX locales as being compatible.  Also, treat
--
2.7.4
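The effect of the two-line change can be checked with a small truth-table sketch. The functions below are not the code in lib/dfa.c. The names simple_locale_before and simple_locale_after and the parameter locale_is_c_or_posix are hypothetical. The else branch is assumed, from the comment visible in the hunk, to reduce to a test for the C or POSIX locale.

```c
#include <stdbool.h>

/* Logic before the patch: any unibyte locale on a host with the
   native C charset is reported as simple, without looking at the
   locale name.  This is the typo described in the ChangeLog entry. */
static bool
simple_locale_before (bool native_c_charset, bool multibyte,
                      bool locale_is_c_or_posix)
{
  if (native_c_charset && !multibyte)
    return true;
  else
    return locale_is_c_or_posix;   /* assumed shape of the else branch */
}

/* Logic after the patch: a foreign charset or a multibyte locale is
   rejected early; everything else must still be the C or POSIX locale
   to count as simple. */
static bool
simple_locale_after (bool native_c_charset, bool multibyte,
                     bool locale_is_c_or_posix)
{
  if (!native_c_charset || multibyte)
    return false;
  else
    return locale_is_c_or_posix;   /* assumed shape of the else branch */
}
```

For a locale such as fr_FR with ISO-8859-1 (native charset, unibyte, not C or POSIX), simple_locale_before (true, false, false) yields true while simple_locale_after (true, false, false) yields false, which matches the ChangeLog description of non-simple locales being treated as simple.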
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Mon, 21 Nov 2016 04:35:02 GMT)
Notification sent to Stephane Chazelas <stephane.chazelas <at> gmail.com>: bug acknowledged by developer. (Mon, 21 Nov 2016 04:35:02 GMT)
Message #25 received at 24973-done <at> debbugs.gnu.org (full text, mbox):
Stephane Chazelas wrote:
> 2.25 was OK. git bisect points to commit
> 2769d5331a38d623b67b1860ac46b39ff7e54aca
Thanks for pinpointing the bug. It was my logic error in that commit. Fixed by
altering Gnulib as follows:
http://lists.gnu.org/archive/html/bug-gnulib/2016-11/msg00086.html
and by installing the attached patches into grep.
[0001-build-update-gnulib-submodule-to-latest.patch (text/x-diff, attachment)]
[0002-tests-check-for-unibyte-French-range-bug.patch (text/x-diff, attachment)]
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 19 Dec 2016 12:24:04 GMT)