GNU bug report logs - #16581
suggested code simplification in dfa.c

Previous Next

Package: grep;

Reported by: Aharon Robbins <arnold <at> skeeve.com>

Date: Tue, 28 Jan 2014 20:12:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16581 in the body.
You can then email your comments to 16581 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Tue, 28 Jan 2014 20:12:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Aharon Robbins <arnold <at> skeeve.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 28 Jan 2014 20:12:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Aharon Robbins <arnold <at> skeeve.com>
To: bug-grep <at> gnu.org
Subject: suggested code simplification in dfa.c
Date: Tue, 28 Jan 2014 22:11:14 +0200
Hi.

The code in atom() looks to me like it could use a little refactoring
and simplification. I suggest the diff below. With it both grep and gawk
still pass their tests.

Thanks,

Arnold

diff --git a/src/dfa.c b/src/dfa.c
index b79c604..d2916ee 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1725,17 +1725,20 @@ add_utf8_anychar (void)
 static void
 atom (void)
 {
-  if (0)
+  if (MBS_SUPPORT && tok == WCHAR)
     {
-      /* empty */
-    }
-  else if (MBS_SUPPORT && tok == WCHAR)
-    {
-      addtok_wc (case_fold ? towlower (wctok) : wctok);
-      if (case_fold && iswalpha (wctok))
+      if (! case_fold)
+        {
+          addtok_wc (wctok);
+        }
+      else
         {
-          addtok_wc (towupper (wctok));
-          addtok (OR);
+          addtok_wc (towlower (wctok));
+          if (iswalpha (wctok))
+   	  {
+                addtok_wc (towupper (wctok));
+                addtok (OR);
+   	  }
         }
 
       tok = lex ();




Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Tue, 28 Jan 2014 21:51:01 GMT) Full text and rfc822 format available.

Message #8 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Aharon Robbins <arnold <at> skeeve.com>, 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Tue, 28 Jan 2014 13:50:52 -0800
[Message part 1 (text/plain, inline)]
I like that as far as it goes, but it pulls loose a thread that has been 
nagging me for a while.  How about the attached instead?  It includes 
somewhat more simplification, entailing more-efficient handling of 
caseless letters when ignoring case.
[0001-Simplify-handling-of-letter-case.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Wed, 29 Jan 2014 02:52:02 GMT) Full text and rfc822 format available.

Message #11 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: Aharon Robbins <arnold <at> skeeve.com>
To: eggert <at> cs.ucla.edu, arnold <at> skeeve.com, 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Wed, 29 Jan 2014 04:50:47 +0200
Hi Paul.

I skimmed the patch.

All that exclusive-ORing looks a little scary to me. Will that work,
for example, on EBCDIC systems?  Gawk supports z/OS - a POSIX enviornment
on top of OS/390.  Will it work on systems using some of the older
far Eastern, non-Unicode locales?

What is it even doing?  What do you expect to get from

	wc ^ towlower(wc) ^ towupper(wc)

?

I'm worried that you've embedded a deep assumption about how characters
are encoded and how upper and lower case relate to each other in
every possible character set we might be called upon to handle, and
it feels really risky to me.

I think I'd be happier if you did the simplification in smaller, more
comprehensible, steps.

My two cents, of course. :-)

Thanks,

Arnold




Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Wed, 29 Jan 2014 06:49:02 GMT) Full text and rfc822 format available.

Message #14 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Aharon Robbins <arnold <at> skeeve.com>, 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Tue, 28 Jan 2014 22:48:16 -0800
[Message part 1 (text/plain, inline)]
Aharon Robbins wrote:
> that exclusive-ORing looks a little scary to me.

It is derived from a coding hack that I picked up from Dijkstra back in 
the 1970s.  He gave the most boring computer-science lecture I have ever 
attended -- so boring that Kit Fine walked out a few minutes into it -- 
but I stubbornly stayed through to the end and I've never forgotten the 
hack.

The hack works everywhere, including platforms that use EBCDIC, 
shift-JIS, DBCS, etc., because it doesn't rely on the encoding scheme at 
all.  Attached is a revised patch that adds some commentary and breaks 
the hack into some functions that I hope help explain things.  Sorry, I 
don't know how to break this into smaller patches that would be easier 
to understand.
[0001-Simplify-handling-of-letter-case.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Wed, 29 Jan 2014 13:43:02 GMT) Full text and rfc822 format available.

Message #17 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Aharon Robbins <arnold <at> skeeve.com>,
 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Wed, 29 Jan 2014 06:42:10 -0700
[Message part 1 (text/plain, inline)]
On 01/28/2014 11:48 PM, Paul Eggert wrote:

>  
> +/* The following functions exploit the commutativity and associativity of ^,
> +   and the fact that X ^ X is zero.  POSIX requires that C equals
> +   either tolower (C) or toupper (C);

Unfortunately, while this is true, I'm not sure if it accurately covers
all possible case-folded comparisons outside of the C locale.

http://www.unicode.org/faq/casemap_charprop.html

Consider the Greek locale, el_GR.UTF-8, which has two lower-case sigma:
L'\x3c3' and L'\x3c2', but only one upper-case: L'\x3a3'.  As a result,
all three wchar_t values must compare case-insensitively to one another.

Or consider titlecase characters, such as Unicode L'\x1c8' (Lj), which
has both an uppercase mapping L'\x1c7' (LJ) and lowercase mapping
L'\x1c9' (lj) - again, all three wchar_t values must compare
case-insensitively to one another.

Your hack is great at finding characters that have a case mapping, but
not necessarily at finding all such characters that map to the same
result when passed through towlower(towupper(c)).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Wed, 29 Jan 2014 13:50:01 GMT) Full text and rfc822 format available.

Message #20 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Aharon Robbins <arnold <at> skeeve.com>,
 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Wed, 29 Jan 2014 06:49:13 -0700
[Message part 1 (text/plain, inline)]
On 01/29/2014 06:42 AM, Eric Blake wrote:

> Your hack is great at finding characters that have a case mapping, but
> not necessarily at finding all such characters that map to the same
> result when passed through towlower(towupper(c)).
> 

In particular, note that the Java language has formalized
case-insensitive comparison as follows:

http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#equalsIgnoreCase%28java.lang.String%29

Two characters c1 and c2 are considered the same, ignoring case if at
least one of the following is true:

    The two characters are the same (as compared by the == operator).
    Applying the method Character.toUpperCase(char) to each character
produces the same result.
    Applying the method Character.toLowerCase(char) to each character
produces the same result.

and lower down, compareToIgnoreCase():

Compares two strings lexicographically, ignoring case differences. This
method returns an integer whose sign is that of calling compareTo with
normalized versions of the strings where case differences have been
eliminated by calling
Character.toLowerCase(Character.toUpperCase(character)) on each character.

Note that this method does not take locale into account, and will result
in an unsatisfactory ordering for certain locales. The java.text package
provides collators to allow locale-sensitive ordering.


In particular, the specification was careful to require double-case
conversion, with uppercase first, in order to normalize all
single-character oddities, while still mentioning that true Unicode
collation has even more special cases that can't be decided on a
character-by-character basis.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Wed, 29 Jan 2014 15:30:03 GMT) Full text and rfc822 format available.

Message #23 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: eggert <at> cs.ucla.edu
Cc: 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Wed, 29 Jan 2014 08:29:26 -0700
Aaron Crane <grep <at> aaroncrane.co.uk> wrote:

> I don't think this works for the wide-character case. For example, ....

Maybe just use the xor stuff for the single byte case and the more
straightforward code for the multibyte case?

Otherwise it sounds like we're asking for trouble.

Also, maybe name the routines  ..._other_case instead of just _other ?

Thanks,

Arnold




Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Wed, 29 Jan 2014 16:58:03 GMT) Full text and rfc822 format available.

Message #26 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: Aaron Crane <grep <at> aaroncrane.co.uk>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Aharon Robbins <arnold <at> skeeve.com>, 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Wed, 29 Jan 2014 14:20:10 +0000
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> +/* The following functions exploit the commutativity and associativity of ^,
> +   and the fact that X ^ X is zero.  POSIX requires that C equals
> +   either tolower (C) or toupper (C); if the former, then C ^ tolower (C)
> +   is zero so C ^ xor_other (C) equals toupper (C), and similarly
> +   for the latter.  */
> +
> +/* Return the exclusive-OR of C and C's other case, or zero if C is
> +   not a letter that changes case.  */
> +
> +static wint_t
> +xor_wother (wint_t c)
> +{
> +  return towlower (c) ^ towupper (c);
> +}
[…]
> +      if (case_fold)
>          {
> +          wchar_t xor = xor_wother (wc);
> +          if (xor)
> +            {
> +              addtok_wc (wc ^ xor);
> +              addtok (OR);
> +            }

I don't think this works for the wide-character case. For example, in
a suitable locale, I'd expect U+01C8 LATIN CAPITAL LETTER L WITH SMALL
LETTER J ("Lj", roughly) to be U+01C7 LATIN CAPITAL LETTER LJ ("LJ")
under towupper(), and U+01C9 LATIN SMALL LETTER LJ ("lj") under
towlower(). This matches the behaviour I can observe with a simple
test program under the en_GB.UTF-8 locale on both Linux and Mac OS.

Since 0x1c7 ^ 0x1c9 == 14, and 0x1c8 ^ 14 == 0x1c6, this means we'd
call addtok_wc(0x1c6), and U+01C6 is LATIN SMALL LETTER DZ WITH CARON,
which isn't a desired character.

-- 
Aaron Crane ** http://aaroncrane.co.uk/




Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Thu, 30 Jan 2014 15:39:02 GMT) Full text and rfc822 format available.

Message #29 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Aaron Crane <grep <at> aaroncrane.co.uk>
Cc: Aharon Robbins <arnold <at> skeeve.com>, 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Thu, 30 Jan 2014 07:37:53 -0800
Aaron Crane wrote:
> I'd expect U+01C8 LATIN CAPITAL LETTER L WITH SMALL
> LETTER J ("Lj", roughly) to be U+01C7 LATIN CAPITAL LETTER LJ ("LJ")
> under towupper(), and U+01C9 LATIN SMALL LETTER LJ ("lj") under
> towlower().

Ouch, thanks, I hadn't considered that.  So my idea was all wrong.  But 
this means the current code is all wrong too.  I'll take a look at it. I 
hope I don't regret picking up this thread....




Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Thu, 30 Jan 2014 15:53:02 GMT) Full text and rfc822 format available.

Message #32 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: grep <at> aaroncrane.co.uk, eggert <at> cs.ucla.edu
Cc: arnold <at> skeeve.com, 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Thu, 30 Jan 2014 08:51:34 -0700
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Aaron Crane wrote:
> > I'd expect U+01C8 LATIN CAPITAL LETTER L WITH SMALL
> > LETTER J ("Lj", roughly) to be U+01C7 LATIN CAPITAL LETTER LJ ("LJ")
> > under towupper(), and U+01C9 LATIN SMALL LETTER LJ ("lj") under
> > towlower().
>
> Ouch, thanks, I hadn't considered that.  So my idea was all wrong.  But 
> this means the current code is all wrong too.  I'll take a look at it. I 
> hope I don't regret picking up this thread....

This seems to be a weird (and very much corner) case: wc != towlower(wc)
and wc != towupper(wc).  It can only be an issue if doing case folding,
and there are only a few spots in the code that deal with case folding
when compiling the dfa.

I suggest starting with the XOR changes for unibyte locales - they seem
(to me) to be good no matter what. And then separately try to deal with
the multibyte case.

And just to increase the need for Aspirin, any idea how regex handles
this case?  I would not be surprised if the code there also doesn't
catch this.  Wheeeeeeeee!  :-)

Arnold




Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Thu, 30 Jan 2014 21:57:02 GMT) Full text and rfc822 format available.

Message #35 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: arnold <at> skeeve.com, grep <at> aaroncrane.co.uk
Cc: 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Thu, 30 Jan 2014 13:56:10 -0800
On 01/30/2014 07:51 AM, arnold <at> skeeve.com wrote:
> I suggest starting with the XOR changes for unibyte locales - they seem
> (to me) to be good no matter what. And then separately try to deal with
> the multibyte case.

Unfortunately the changes don't work even for unibyte locales, since 
unibyte locales can have the same problem, i.e., c != tolower (c) && c 
!= toupper (c).  Admittedly this is rare, but it's possible (users can 
define their own locales, after all), and fixing the code for the 
multibyte case will induce similar changes for unibyte, I hope.

>
> And just to increase the need for Aspirin, any idea how regex handles
> this case?

No idea, sorry.  Rounds of aspirin for everybody!




Information forwarded to bug-grep <at> gnu.org:
bug#16581; Package grep. (Fri, 31 Jan 2014 09:21:02 GMT) Full text and rfc822 format available.

Message #38 received at 16581 <at> debbugs.gnu.org (full text, mbox):

From: Aharon Robbins <arnold <at> skeeve.com>
To: grep <at> aaroncrane.co.uk, eggert <at> cs.ucla.edu
Cc: 16581 <at> debbugs.gnu.org
Subject: Re: bug#16581: suggested code simplification in dfa.c
Date: Fri, 31 Jan 2014 11:20:52 +0200
> > I suggest starting with the XOR changes for unibyte locales - they seem
> > (to me) to be good no matter what. And then separately try to deal with
> > the multibyte case.
>
> Unfortunately the changes don't work even for unibyte locales, since 
> unibyte locales can have the same problem, i.e., c != tolower (c) && c 
> != toupper (c).  Admittedly this is rare, but it's possible (users can 
> define their own locales, after all), and fixing the code for the 
> multibyte case will induce similar changes for unibyte, I hope.

I see. OK.

> > And just to increase the need for Aspirin, any idea how regex handles
> > this case?
>
> No idea, sorry.  Rounds of aspirin for everybody!

Well, I am comforted that this has not been a big issue in practice (yet!).

If I get ambitious I will try to look at it; chances are that you, Paul,
will beat me to it.

It's been an interesting discussion, anyway. :-)

Thanks,

Arnold




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sat, 08 Mar 2014 18:17:02 GMT) Full text and rfc822 format available.

Notification sent to Aharon Robbins <arnold <at> skeeve.com>:
bug acknowledged by developer. (Sat, 08 Mar 2014 18:17:02 GMT) Full text and rfc822 format available.

Message #43 received at 16581-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 16581-done <at> debbugs.gnu.org
Subject: Re:  suggested code simplification in dfa.c
Date: Sat, 08 Mar 2014 10:16:45 -0800
This particular issue seems to have been put to bed in the savannah git 
master so I'm marking it as done.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 06 Apr 2014 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 78 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.