GNU bug report logs - #16581: suggested code simplification in dfa.c
Reported by: Aharon Robbins <arnold <at> skeeve.com>
Date: Tue, 28 Jan 2014 20:12:01 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Message #26 received at 16581 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> +/* The following functions exploit the commutativity and associativity of ^,
> + and the fact that X ^ X is zero. POSIX requires that C equals
> + either tolower (C) or toupper (C); if the former, then C ^ tolower (C)
> + is zero so C ^ xor_other (C) equals toupper (C), and similarly
> + for the latter. */
> +
> +/* Return the exclusive-OR of C and C's other case, or zero if C is
> + not a letter that changes case. */
> +
> +static wint_t
> +xor_wother (wint_t c)
> +{
> + return towlower (c) ^ towupper (c);
> +}
[…]
> + if (case_fold)
> {
> + wchar_t xor = xor_wother (wc);
> + if (xor)
> + {
> + addtok_wc (wc ^ xor);
> + addtok (OR);
> + }
I don't think this works for the wide-character case. For example, in
a suitable locale, I'd expect U+01C8 LATIN CAPITAL LETTER L WITH SMALL
LETTER J ("Lj", roughly) to be U+01C7 LATIN CAPITAL LETTER LJ ("LJ")
under towupper(), and U+01C9 LATIN SMALL LETTER LJ ("lj") under
towlower(). This matches the behaviour I can observe with a simple
test program under the en_GB.UTF-8 locale on both Linux and Mac OS.
Since 0x1c7 ^ 0x1c9 == 14, and 0x1c8 ^ 14 == 0x1c6, this means we'd
call addtok_wc(0x1c6), and U+01C6 is LATIN SMALL LETTER DZ WITH CARON,
which is not a case variant of U+01C8.
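
The arithmetic above can be checked without relying on any particular
locale. The following is only a sketch: it takes the two case mappings
as explicit arguments, using the values reported above for U+01C8. The
names xor_other_from, xor_trick_applies and check_lj_example are
hypothetical and are not part of dfa.c or of the patch.

```c
#include <assert.h>
#include <wchar.h>

/* Same computation as the patch's xor_wother, but with the lowercase
   and uppercase mappings passed in, so the result does not depend on
   the running locale.  (Hypothetical helper, not from dfa.c.)  */
static wint_t
xor_other_from (wint_t lower, wint_t upper)
{
  return lower ^ upper;
}

/* Hypothetical guard: C ^ (lower ^ upper) is C's other case only when
   C is itself one of its two case mappings.  A titlecase letter such
   as U+01C8 is neither, so the trick does not apply to it.  */
static int
xor_trick_applies (wint_t c, wint_t lower, wint_t upper)
{
  return c == lower || c == upper;
}

/* Walk through the U+01C8 example with the mappings reported above.  */
static void
check_lj_example (void)
{
  wint_t lower = 0x1c9;   /* LATIN SMALL LETTER LJ */
  wint_t upper = 0x1c7;   /* LATIN CAPITAL LETTER LJ */
  wint_t title = 0x1c8;   /* LATIN CAPITAL LETTER L WITH SMALL LETTER J */
  wint_t x = xor_other_from (lower, upper);

  assert (x == 14);
  assert ((title ^ x) == 0x1c6);   /* DZ WITH CARON, unrelated to Lj */
  assert ((lower ^ x) == upper);   /* the trick is fine for lj and LJ */
  assert ((upper ^ x) == lower);
  assert (!xor_trick_applies (title, lower, upper));
}
```

Calling check_lj_example () from a test driver exercises all five
checks. The guard illustrates one possible condition under which the
XOR shortcut is safe; it is not a claim about how the patch was or
should be fixed.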
--
Aaron Crane ** http://aaroncrane.co.uk/
This bug report was last modified 11 years and 78 days ago.