GNU bug report logs -
#16581
suggested code simplification in dfa.c
Previous Next
Reported by: Aharon Robbins <arnold <at> skeeve.com>
Date: Tue, 28 Jan 2014 20:12:01 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16581 in the body.
You can then email your comments to 16581 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Tue, 28 Jan 2014 20:12:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Aharon Robbins <arnold <at> skeeve.com>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Tue, 28 Jan 2014 20:12:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi.
The code in atom() looks to me like it could use a little refactoring
and simplification. I suggest the diff below. With it both grep and gawk
still pass their tests.
Thanks,
Arnold
diff --git a/src/dfa.c b/src/dfa.c
index b79c604..d2916ee 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1725,17 +1725,20 @@ add_utf8_anychar (void)
static void
atom (void)
{
- if (0)
+ if (MBS_SUPPORT && tok == WCHAR)
{
- /* empty */
- }
- else if (MBS_SUPPORT && tok == WCHAR)
- {
- addtok_wc (case_fold ? towlower (wctok) : wctok);
- if (case_fold && iswalpha (wctok))
+ if (! case_fold)
+ {
+ addtok_wc (wctok);
+ }
+ else
{
- addtok_wc (towupper (wctok));
- addtok (OR);
+ addtok_wc (towlower (wctok));
+ if (iswalpha (wctok))
+ {
+ addtok_wc (towupper (wctok));
+ addtok (OR);
+ }
}
tok = lex ();
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Tue, 28 Jan 2014 21:51:01 GMT)
Full text and
rfc822 format available.
Message #8 received at 16581 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
I like that as far as it goes, but it pulls loose a thread that has been
nagging me for a while. How about the attached instead? It includes
somewhat more simplification, entailing more-efficient handling of
caseless letters when ignoring case.
[0001-Simplify-handling-of-letter-case.patch (text/x-patch, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Wed, 29 Jan 2014 02:52:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 16581 <at> debbugs.gnu.org (full text, mbox):
Hi Paul.
I skimmed the patch.
All that exclusive-ORing looks a little scary to me. Will that work,
for example, on EBCDIC systems? Gawk supports z/OS - a POSIX enviornment
on top of OS/390. Will it work on systems using some of the older
far Eastern, non-Unicode locales?
What is it even doing? What do you expect to get from
wc ^ towlower(wc) ^ towupper(wc)
?
I'm worried that you've embedded a deep assumption about how characters
are encoded and how upper and lower case relate to each other in
every possible character set we might be called upon to handle, and
it feels really risky to me.
I think I'd be happier if you did the simplification in smaller, more
comprehensible, steps.
My two cents, of course. :-)
Thanks,
Arnold
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Wed, 29 Jan 2014 06:49:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 16581 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Aharon Robbins wrote:
> that exclusive-ORing looks a little scary to me.
It is derived from a coding hack that I picked up from Dijkstra back in
the 1970s. He gave the most boring computer-science lecture I have ever
attended -- so boring that Kit Fine walked out a few minutes into it --
but I stubbornly stayed through to the end and I've never forgotten the
hack.
The hack works everywhere, including platforms that use EBCDIC,
shift-JIS, DBCS, etc., because it doesn't rely on the encoding scheme at
all. Attached is a revised patch that adds some commentary and breaks
the hack into some functions that I hope help explain things. Sorry, I
don't know how to break this into smaller patches that would be easier
to understand.
[0001-Simplify-handling-of-letter-case.patch (text/x-patch, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Wed, 29 Jan 2014 13:43:02 GMT)
Full text and
rfc822 format available.
Message #17 received at 16581 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 01/28/2014 11:48 PM, Paul Eggert wrote:
>
> +/* The following functions exploit the commutativity and associativity of ^,
> + and the fact that X ^ X is zero. POSIX requires that C equals
> + either tolower (C) or toupper (C);
Unfortunately, while this is true, I'm not sure if it accurately covers
all possible case-folded comparisons outside of the C locale.
http://www.unicode.org/faq/casemap_charprop.html
Consider the Greek locale, el_GR.UTF-8, which has two lower-case sigma:
L'\x3c3' and L'\x3c2', but only one upper-case: L'\x3a3'. As a result,
all three wchar_t values must compare case-insensitively to one another.
Or consider titlecase characters, such as Unicode L'\x1c8' (Lj), which
has both an uppercase mapping L'\x1c7' (LJ) and lowercase mapping
L'\x1c9' (lj) - again, all three wchar_t values must compare
case-insensitively to one another.
Your hack is great at finding characters that have a case mapping, but
not necessarily at finding all such characters that map to the same
result when passed through towlower(towupper(c)).
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Wed, 29 Jan 2014 13:50:01 GMT)
Full text and
rfc822 format available.
Message #20 received at 16581 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 01/29/2014 06:42 AM, Eric Blake wrote:
> Your hack is great at finding characters that have a case mapping, but
> not necessarily at finding all such characters that map to the same
> result when passed through towlower(towupper(c)).
>
In particular, note that the Java language has formalized
case-insensitive comparison as follows:
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#equalsIgnoreCase%28java.lang.String%29
Two characters c1 and c2 are considered the same, ignoring case if at
least one of the following is true:
The two characters are the same (as compared by the == operator).
Applying the method Character.toUpperCase(char) to each character
produces the same result.
Applying the method Character.toLowerCase(char) to each character
produces the same result.
and lower down, compareToIgnoreCase():
Compares two strings lexicographically, ignoring case differences. This
method returns an integer whose sign is that of calling compareTo with
normalized versions of the strings where case differences have been
eliminated by calling
Character.toLowerCase(Character.toUpperCase(character)) on each character.
Note that this method does not take locale into account, and will result
in an unsatisfactory ordering for certain locales. The java.text package
provides collators to allow locale-sensitive ordering.
In particular, the specification was careful to require double-case
conversion, with uppercase first, in order to normalize all
single-character oddities, while still mentioning that true Unicode
collation has even more special cases that can't be decided on a
character-by-character basis.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Wed, 29 Jan 2014 15:30:03 GMT)
Full text and
rfc822 format available.
Message #23 received at 16581 <at> debbugs.gnu.org (full text, mbox):
Aaron Crane <grep <at> aaroncrane.co.uk> wrote:
> I don't think this works for the wide-character case. For example, ....
Maybe just use the xor stuff for the single byte case and the more
straightforward code for the multibyte case?
Otherwise it sounds like we're asking for trouble.
Also, maybe name the routines ..._other_case instead of just _other ?
Thanks,
Arnold
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Wed, 29 Jan 2014 16:58:03 GMT)
Full text and
rfc822 format available.
Message #26 received at 16581 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> +/* The following functions exploit the commutativity and associativity of ^,
> + and the fact that X ^ X is zero. POSIX requires that C equals
> + either tolower (C) or toupper (C); if the former, then C ^ tolower (C)
> + is zero so C ^ xor_other (C) equals toupper (C), and similarly
> + for the latter. */
> +
> +/* Return the exclusive-OR of C and C's other case, or zero if C is
> + not a letter that changes case. */
> +
> +static wint_t
> +xor_wother (wint_t c)
> +{
> + return towlower (c) ^ towupper (c);
> +}
[…]
> + if (case_fold)
> {
> + wchar_t xor = xor_wother (wc);
> + if (xor)
> + {
> + addtok_wc (wc ^ xor);
> + addtok (OR);
> + }
I don't think this works for the wide-character case. For example, in
a suitable locale, I'd expect U+01C8 LATIN CAPITAL LETTER L WITH SMALL
LETTER J ("Lj", roughly) to be U+01C7 LATIN CAPITAL LETTER LJ ("LJ")
under towupper(), and U+01C9 LATIN SMALL LETTER LJ ("lj") under
towlower(). This matches the behaviour I can observe with a simple
test program under the en_GB.UTF-8 locale on both Linux and Mac OS.
Since 0x1c7 ^ 0x1c9 == 14, and 0x1c8 ^ 14 == 0x1c6, this means we'd
call addtok_wc(0x1c6), and U+01C6 is LATIN SMALL LETTER DZ WITH CARON,
which isn't a desired character.
--
Aaron Crane ** http://aaroncrane.co.uk/
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Thu, 30 Jan 2014 15:39:02 GMT)
Full text and
rfc822 format available.
Message #29 received at 16581 <at> debbugs.gnu.org (full text, mbox):
Aaron Crane wrote:
> I'd expect U+01C8 LATIN CAPITAL LETTER L WITH SMALL
> LETTER J ("Lj", roughly) to be U+01C7 LATIN CAPITAL LETTER LJ ("LJ")
> under towupper(), and U+01C9 LATIN SMALL LETTER LJ ("lj") under
> towlower().
Ouch, thanks, I hadn't considered that. So my idea was all wrong. But
this means the current code is all wrong too. I'll take a look at it. I
hope I don't regret picking up this thread....
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Thu, 30 Jan 2014 15:53:02 GMT)
Full text and
rfc822 format available.
Message #32 received at 16581 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Aaron Crane wrote:
> > I'd expect U+01C8 LATIN CAPITAL LETTER L WITH SMALL
> > LETTER J ("Lj", roughly) to be U+01C7 LATIN CAPITAL LETTER LJ ("LJ")
> > under towupper(), and U+01C9 LATIN SMALL LETTER LJ ("lj") under
> > towlower().
>
> Ouch, thanks, I hadn't considered that. So my idea was all wrong. But
> this means the current code is all wrong too. I'll take a look at it. I
> hope I don't regret picking up this thread....
This seems to be a weird (and very much corner) case: wc != towlower(wc)
and wc != towupper(wc). It can only be an issue if doing case folding,
and there are only a few spots in the code that deal with case folding
when compiling the dfa.
I suggest starting with the XOR changes for unibyte locales - they seem
(to me) to be good no matter what. And then separately try to deal with
the multibyte case.
And just to increase the need for Aspirin, any idea how regex handles
this case? I would not be surprised if the code there also doesn't
catch this. Wheeeeeeeee! :-)
Arnold
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Thu, 30 Jan 2014 21:57:02 GMT)
Full text and
rfc822 format available.
Message #35 received at 16581 <at> debbugs.gnu.org (full text, mbox):
On 01/30/2014 07:51 AM, arnold <at> skeeve.com wrote:
> I suggest starting with the XOR changes for unibyte locales - they seem
> (to me) to be good no matter what. And then separately try to deal with
> the multibyte case.
Unfortunately the changes don't work even for unibyte locales, since
unibyte locales can have the same problem, i.e., c != tolower (c) && c
!= toupper (c). Admittedly this is rare, but it's possible (users can
define their own locales, after all), and fixing the code for the
multibyte case will induce similar changes for unibyte, I hope.
>
> And just to increase the need for Aspirin, any idea how regex handles
> this case?
No idea, sorry. Rounds of aspirin for everybody!
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16581
; Package
grep
.
(Fri, 31 Jan 2014 09:21:02 GMT)
Full text and
rfc822 format available.
Message #38 received at 16581 <at> debbugs.gnu.org (full text, mbox):
> > I suggest starting with the XOR changes for unibyte locales - they seem
> > (to me) to be good no matter what. And then separately try to deal with
> > the multibyte case.
>
> Unfortunately the changes don't work even for unibyte locales, since
> unibyte locales can have the same problem, i.e., c != tolower (c) && c
> != toupper (c). Admittedly this is rare, but it's possible (users can
> define their own locales, after all), and fixing the code for the
> multibyte case will induce similar changes for unibyte, I hope.
I see. OK.
> > And just to increase the need for Aspirin, any idea how regex handles
> > this case?
>
> No idea, sorry. Rounds of aspirin for everybody!
Well, I am comforted that this has not been a big issue in practice (yet!).
If I get ambitious I will try to look at it; chances are that you, Paul,
will beat me to it.
It's been an interesting discussion, anyway. :-)
Thanks,
Arnold
Reply sent
to
Paul Eggert <eggert <at> cs.ucla.edu>
:
You have taken responsibility.
(Sat, 08 Mar 2014 18:17:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
Aharon Robbins <arnold <at> skeeve.com>
:
bug acknowledged by developer.
(Sat, 08 Mar 2014 18:17:02 GMT)
Full text and
rfc822 format available.
Message #43 received at 16581-done <at> debbugs.gnu.org (full text, mbox):
This particular issue seems to have been put to bed in the savannah git
master so I'm marking it as done.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Sun, 06 Apr 2014 11:24:05 GMT)
Full text and
rfc822 format available.
This bug report was last modified 11 years and 78 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.