GNU bug report logs - #26193
[0-9] versus [[:digit:]]

Package: grep

Reported by: "John P. Linderman" <jpl.jpl <at> gmail.com>

Date: Mon, 20 Mar 2017 17:00:02 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Jim Meyering <jim <at> meyering.net>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#26193: closed ([0-9] versus [[:digit:]])
Date: Thu, 23 Mar 2017 01:58:02 +0000

[Message part 1 (text/plain, inline)]

Your message dated Wed, 22 Mar 2017 18:57:05 -0700
with message-id <CA+8g5KHJu0gLBfDTk5Y1OGgr-rVQ+G8NWfeXA+QzTmMMe_xekA <at> mail.gmail.com>
and subject line Re: bug#26193: [0-9] versus [[:digit:]]
has caused the debbugs.gnu.org bug report #26193,
regarding [0-9] versus [[:digit:]]
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
26193: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=26193
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems

[Message part 2 (message/rfc822, inline)]

From: "John P. Linderman" <jpl.jpl <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: [0-9] versus [[:digit:]]
Date: Mon, 20 Mar 2017 11:34:05 -0400

[Message part 3 (text/plain, inline)]

In what follows, file "conjectures" is a 6-billion-byte file in which each
line contains at most one letter P, and few (see output) have a digit
following a P. "rusage" is just a home-brew resource usage summary command.

  rusage egrep 'P[0-9]' conjectures > xxx
695.55 real 688.33 user 2.40 sys 0 pf 186 pr 0 sw 0 rb 8 wb 1 vcx 19206 icx 2488 mx 0 ix 0 id 0 is

  cat xxx
A[21]=11{11}:22<LP3

  rusage egrep 'P[[:digit:]]' conjectures > xxx
14.88 real 13.36 user 1.43 sys 0 pf 186 pr 0 sw 0 rb 8 wb 0 vcx 516 icx 2500 mx 0 ix 0 id 0 is

  cat xxx
A[21]=11{11}:22<LP3

Using what is to me the more obvious [0-9] pattern takes almost 50 times as
long as using the [[:digit:]] pattern. Seems very strange.

  grep --version
grep (GNU grep) 2.25
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

  uname -a
Linux jpl 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
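
A scaled-down version of the comparison above might look like the following sketch. It is not the reporter's setup: the sample file, its size, and the variable names (sample, range_out, class_out) are assumptions made for illustration. It only checks that the two patterns select the same lines on ASCII-only input; it does not measure speed.

```shell
#!/bin/sh
# Sketch only: a tiny stand-in for the reporter's 6 GB "conjectures" file.
# The file contents and the variable names are assumptions for illustration.
sample=$(mktemp)

# 1000 lines where P is followed by a letter, then one line with a digit after P.
i=0
while [ "$i" -lt 1000 ]; do
  printf '%s\n' 'A[21]=11{11}:22<LPx'
  i=$((i + 1))
done > "$sample"
printf '%s\n' 'A[21]=11{11}:22<LP3' >> "$sample"

# The two spellings from the report, run against the same input.
range_out=$(grep -E 'P[0-9]' "$sample")
class_out=$(grep -E 'P[[:digit:]]' "$sample")

if [ "$range_out" = "$class_out" ]; then
  echo "same output: $range_out"
else
  echo "outputs differ"
fi

rm -f "$sample"
```

To look at the timing difference itself, each grep command could be run under the shell's time facility (or a tool like the reporter's "rusage") against a much larger file; any difference would be expected to depend on the grep version and the locale in effect.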

[Message part 5 (message/rfc822, inline)]

From: Jim Meyering <jim <at> meyering.net>
To: "John P. Linderman" <jpl.jpl <at> gmail.com>
Cc: 26193-done <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 Gnulib bugs <bug-gnulib <at> gnu.org>
Subject: Re: bug#26193: [0-9] versus [[:digit:]]
Date: Wed, 22 Mar 2017 18:57:05 -0700

[Message part 6 (text/plain, inline)]

On Wed, Mar 22, 2017 at 2:58 PM, John P. Linderman <jpl.jpl <at> gmail.com> wrote:
> I used to use LC_ALL=C, but, as I vaguely recall, it got in the way of
> dealing with UNICODE. I tried a couple LC values aimed at UNICODE and the
> US, but something always went pear-shaped. I finally gave up. I am perfectly
> happy to sacrifice a tiny bit of performance to have most things work without
> thinking. A factor of 6, or 35, is not tiny, since I use grep and friends
> intensely. That's how I discovered the performance problem to begin with.
> Anyway, thank you for fixing my problem. I suspect that many of us pioneers
> (using UNIX since 1973) have '[0-9]' wired into our fingers.
>
> On Wed, Mar 22, 2017 at 2:01 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>>
>> On 03/22/2017 05:44 AM, John P. Linderman wrote:
>>>
>>> That puts the runtimes on equal footing:
>>>
>> In my measurements, P[0-9] is still a tiny bit slower if one is using
>> glibc regex, due to a performance problem in glibc. You can work around it
>> by configuring --with-included-regex. It's probably not worth worrying
>> about, though.
>>
>> By the way, using LC_ALL=C should help avoid performance problems like
>> these in the future, if all you're doing is something where single-byte
>> pattern matching suffices.
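
The LC_ALL=C suggestion quoted above can be applied to a single command rather than the whole session. A minimal sketch, assuming made-up sample input and a hypothetical variable name (match):

```shell
#!/bin/sh
# Sketch only: force the C locale for one grep invocation.
# The two input lines are invented for illustration.
match=$(printf '%s\n' 'LP3' 'LPx' | LC_ALL=C grep -E 'P[0-9]')
echo "$match"
```

Putting the assignment in front of the command changes the locale for that grep process only, so other tools in the same shell keep whatever locale they had.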

I've just pulled that gnulib change into grep's repository with the
attached, along with a NEWS update:

[grep-gnulib-dfa-NEWS.diff (text/plain, attachment)]

This bug report was last modified 8 years and 66 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
