GNU bug report logs - #26193 [0-9] versus [[:digit:]]
[Message part 1 (text/plain, inline)]
Your message dated Wed, 22 Mar 2017 18:57:05 -0700
with message-id <CA+8g5KHJu0gLBfDTk5Y1OGgr-rVQ+G8NWfeXA+QzTmMMe_xekA <at> mail.gmail.com>
and subject line Re: bug#26193: [0-9] versus [[:digit:]]
has caused the debbugs.gnu.org bug report #26193,
regarding [0-9] versus [[:digit:]]
to be marked as done.
(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)
--
26193: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=26193
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
[Message part 3 (text/plain, inline)]
In what follows, file "conjectures" is a 6-billion-byte file in which each
line contains at most one letter P, and few (see output) have a digit
following a P. "rusage" is just a home-brew resource usage summary command.
rusage egrep 'P[0-9]' conjectures > xxx
695.55 real 688.33 user 2.40 sys 0 pf 186 pr 0 sw 0 rb 8 wb 1 vcx 19206 icx 2488 mx 0 ix 0 id 0 is
cat xxx
A[21]=11{11}:22<LP3
rusage egrep 'P[[:digit:]]' conjectures > xxx
14.88 real 13.36 user 1.43 sys 0 pf 186 pr 0 sw 0 rb 8 wb 0 vcx 516 icx 2500 mx 0 ix 0 id 0 is
cat xxx
A[21]=11{11}:22<LP3
Using what is to me the more obvious [0-9] pattern takes almost 50 times as
long as using the [[:digit:]] pattern. Seems very strange.
grep --version
grep (GNU grep) 2.25
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
uname -a
Linux jpl 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
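A minimal sketch for checking, on a small scale, that the two patterns select the same lines. The temporary file is a hypothetical stand-in for "conjectures" (its contents are invented here, apart from the one matching line copied from the report), and grep -E is used as the equivalent of egrep. It compares output only; it does not reproduce the timing difference, which needs a far larger input and an affected grep version.

```shell
#!/bin/sh
# Build a small stand-in file: many lines with no digit after P,
# plus the single matching line shown in the report.
tmp=$(mktemp)
{
  i=0
  while [ "$i" -lt 1000 ]; do
    echo 'A[21]=11{11}:22<LPx'
    i=$((i + 1))
  done
  echo 'A[21]=11{11}:22<LP3'
} > "$tmp"

# Run both spellings of the pattern against the same input.
range_out=$(grep -E 'P[0-9]' "$tmp")
class_out=$(grep -E 'P[[:digit:]]' "$tmp")
rm -f "$tmp"

# Print what the range form matched; the class form should agree.
echo "$range_out"
```

To compare run times rather than output, each grep invocation could be prefixed with the shell's time keyword, or with a resource summary command such as the "rusage" used above.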
[Message part 5 (message/rfc822, inline)]
[Message part 6 (text/plain, inline)]
On Wed, Mar 22, 2017 at 2:58 PM, John P. Linderman <jpl.jpl <at> gmail.com> wrote:
> I used to use LC_ALL=C, but, as I vaguely recall, it got in the way of
> dealing with UNICODE. I tried a couple LC values aimed at UNICODE and the
> US, but something always went pear-shaped. I finally gave up. I am perfectly
> happy to suffer a tiny bit of performance, to have most things work without
> thinking. A factor of 6, or 35, is not tiny, since I use grep and friends
> intensely. That's how I discovered the performance problem to begin with.
> Anyway, thank you for fixing my problem. I suspect that many of us pioneers
> (using UNIX since 1973) have '[0-9]' wired into our fingers.
>
> On Wed, Mar 22, 2017 at 2:01 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>>
>> On 03/22/2017 05:44 AM, John P. Linderman wrote:
>>>
>>> That puts the runtimes on equal footing:
>>>
>> In my measurements, P[0-9] is still a tiny bit slower if one is using
>> glibc regex, due to a performance problem in glibc. You can work around it
>> by configuring --with-included-regex. It's probably not worth worrying
>> about, though.
>>
>> By the way, using LC_ALL=C should help avoid performance problems like
>> these in the future, if all you're doing is something where single-byte
>> pattern matching suffices.
I've just pulled that gnulib change into grep's repository with the
attached, along with a NEWS update:
[grep-gnulib-dfa-NEWS.diff (text/plain, attachment)]
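The LC_ALL=C suggestion quoted above can be limited to a single command, so the rest of the session keeps its usual locale. A minimal sketch, assuming single-byte matching is enough for the data; the second input line is invented for illustration:

```shell
#!/bin/sh
# One line from the report and one hypothetical non-matching line.
sample='A[21]=11{11}:22<LP3'

# The VAR=value prefix sets LC_ALL only for this grep invocation.
c_out=$(printf '%s\n' "$sample" 'no digit after P here' |
  LC_ALL=C grep -E 'P[0-9]')

echo "$c_out"
```

Because the assignment is scoped to that one grep process, other commands in the same shell still see the original locale settings.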
This bug report was last modified 8 years and 66 days ago.