GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales

Package: grep

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18454 <at> debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sat, 20 Dec 2014 03:13:39 +0100
On 2014-12-20 10:31:46 +0900, Norihiro Tanaka wrote:
> On Fri, 19 Dec 2014 23:00:38 +0900
> Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -G .
> Binary file (standard input) matches
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -P .
> $
> 
> regex also behaves the same as grep -G; e.g. sed, which uses only regex,
> returns the line.  Therefore, I think that a character in the surrogate
> area matching a period with grep -G is not a bug, although the behavior
> might not conform to a standard.
> 
> $ printf "\xED\xA0\xBF\n" | LANG=en_US.utf8 sed -ne '/./p'
> 
> By the way, mbrlen() returns (size_t) -1 for the character.

IMHO, both grep and sed should be fixed to obey RFC 3629, which
specifies UTF-8 and prohibits encoding the surrogate code points
U+D800 through U+DFFF. The same goes for other tools (iconv...).
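
To make the point about the test bytes concrete, here is a minimal sketch in Python. It is not taken from grep, sed, or PCRE, and the helper names (decode_three_byte, is_surrogate, strict_utf8_accepts) are hypothetical, introduced only for illustration. It applies the 3-byte UTF-8 bit layout to "\xED\xA0\xBF" without any validity check, shows that the result lands in the surrogate range, and then asks Python's own strict UTF-8 codec whether it accepts the sequence.

```python
# Illustrative sketch only; all helper names here are hypothetical.
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def decode_three_byte(seq: bytes) -> int:
    """Apply the 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx)
    with no validity check and return the resulting code point."""
    if len(seq) != 3 or seq[0] & 0xF0 != 0xE0:
        raise ValueError("not a 3-byte UTF-8 sequence")
    if seq[1] & 0xC0 != 0x80 or seq[2] & 0xC0 != 0x80:
        raise ValueError("bad continuation byte")
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return SURROGATE_FIRST <= code_point <= SURROGATE_LAST


def strict_utf8_accepts(seq: bytes) -> bool:
    """True if Python's strict UTF-8 codec decodes the bytes."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    sample = b"\xED\xA0\xBF"  # the bytes used in the transcripts above
    cp = decode_three_byte(sample)
    print(f"U+{cp:04X}", is_surrogate(cp), strict_utf8_accepts(sample))
```

Run as a script, this reports U+D83F, confirms that it is a surrogate, and shows that a strict decoder rejects the bytes, which is the behavior a tool following RFC 3629 would be expected to have.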

> OTOH, if a character in the surrogate area does not match a period in
> PCRE, I think that the character should not match a period with
> grep -P either.

I agree.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




This bug report was last modified 3 years and 181 days ago.
