GNU bug report logs - #16919
[PATCH] fix mismatch between dfa and regex for treatment of titlecase

Package: grep

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sun, 2 Mar 2014 00:34:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #29 received at 16919 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16919 <at> debbugs.gnu.org
Subject: Re: bug#16919: [PATCH] fix mismatch between dfa and regex for
 treatment of titlecase
Date: Wed, 05 Mar 2014 10:50:54 -0800

On 03/05/2014 07:11 AM, Norihiro Tanaka wrote:
> I still believe that upper or lower case of a character should
> also match title case

The (soon-to-be-fixed) gnulib regex code agrees with you, assuming that 
towupper (X) agrees for all three values of X, because it uses (towupper 
(input) == towupper (pattern)). However, the most-plausible reading of 
POSIX does not agree with you, as it would require (input == pattern || 
towlower (input) == pattern || towupper (input) == pattern), which means 
a titlecase pattern will match only itself.
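
To make the difference concrete, here is a rough sketch of the two
comparisons.  It is not the gnulib code.  The helpers demo_upper and
demo_lower are hypothetical stand-ins for towupper and towlower, limited to
the three forms of one digraph (U+01C4 uppercase, U+01C5 titlecase, U+01C6
lowercase) so that the result does not depend on the locale, and the two
predicate names are made up for illustration.

```c
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical stand-in for towupper: maps the titlecase and lowercase
   forms of the digraph to the uppercase form, leaves everything else alone. */
wchar_t demo_upper(wchar_t c)
{
    return (c == 0x01C5 || c == 0x01C6) ? 0x01C4 : c;
}

/* Hypothetical stand-in for towlower: maps the uppercase and titlecase
   forms of the digraph to the lowercase form, leaves everything else alone. */
wchar_t demo_lower(wchar_t c)
{
    return (c == 0x01C4 || c == 0x01C5) ? 0x01C6 : c;
}

/* The comparison attributed above to the regex code: fold both sides. */
bool fold_both_match(wchar_t input, wchar_t pattern)
{
    return demo_upper(input) == demo_upper(pattern);
}

/* The comparison under the most-plausible reading of POSIX: only the
   input and its case counterparts are tested against the raw pattern. */
bool posix_reading_match(wchar_t input, wchar_t pattern)
{
    return input == pattern
        || demo_lower(input) == pattern
        || demo_upper(input) == pattern;
}
```

With a titlecase pattern (0x01C5), fold_both_match accepts all three forms
as input, whereas posix_reading_match accepts only 0x01C5 itself, since
neither stand-in ever returns the titlecase form.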

It seems pretty clear to me that the most-plausible reading of POSIX is 
buggy, for this reason.  No wonder so many implementations fail to 
conform to it.

I thought of a different way in which gnulib/glibc regex does not conform 
to POSIX, and here there doesn't seem to be any ambiguity about it.  In 
the POSIX locale when ignoring case, the pattern '[Z-a]' matches the 
data 'Z', 'z', 'A', 'a', and the nonalphabetic characters like '^' that 
collate between 'Z' and 'a'.  But the glibc regex code rejects that 
pattern entirely.  Conversely, in the same situation the glibc regex 
code says '[A-z]' matches only alphabetic characters, whereas POSIX says 
it should also match the nonalphabetic characters like '^' that collate 
between 'Z' and 'a'.  It appears that nobody cares, as this 
incompatibility has been present for years and I don't recall anyone 
complaining.  It is weird, though, that this means "grep PAT" can match 
some lines that "grep -i PAT" doesn't.

Here POSIX is not merely ambiguous, it's clearly disagreeing with common 
practice.  It's not clear whether the bug is in POSIX or in the 
implementation.
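
As a rough model of the POSIX behavior described above for ranges (not the
glibc code, and not a call into regcomp), one can test the data character
and both of its case counterparts against the range.  The function names
are made up for illustration, and the sketch assumes an ASCII system where
the POSIX-locale collation order of these characters is their byte order.

```c
#include <ctype.h>
#include <stdbool.h>

/* True if c lies in the inclusive byte range lo..hi.  Assumes that the
   collation order equals the byte order (POSIX locale on ASCII). */
bool in_range(unsigned char c, unsigned char lo, unsigned char hi)
{
    return lo <= c && c <= hi;
}

/* Model of "[lo-hi]" when ignoring case under the POSIX reading: the
   data character matches if it, or either case counterpart, is in range. */
bool posix_icase_range_match(unsigned char c, unsigned char lo,
                             unsigned char hi)
{
    return in_range(c, lo, hi)
        || in_range((unsigned char) tolower(c), lo, hi)
        || in_range((unsigned char) toupper(c), lo, hi);
}
```

Under this model posix_icase_range_match (c, 'Z', 'a') holds for 'Z', 'z',
'A', 'a' and '^' but not for 'B', and posix_icase_range_match ('^', 'A', 'z')
also holds, which is the behavior the glibc code is said not to provide.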



