GNU bug report logs -
#26193
[0-9] versus [[:digit:]]
Previous Next
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 26193 in the body.
You can then email your comments to 26193 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#26193
; Package
grep
.
(Mon, 20 Mar 2017 17:00:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
"John P. Linderman" <jpl.jpl <at> gmail.com>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Mon, 20 Mar 2017 17:00:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
In what follows, file "conjectures" is a 6 billion bytes file in which each
line contains at most one letter P, and few (see output) have a digit
following a P. "rusage" is just a home-brew resource usage summary command.
rusage egrep 'P[0-9]' conjectures > xxx
695.55 real 688.33 user 2.40 sys 0 pf 186 pr 0 sw 0 rb 8 wb 1 vcx 19206 icx
2488 mx 0 ix 0 id 0 is
cat xxx
A[21]=11{11}:22<LP3
rusage egrep 'P[[:digit:]]' conjectures > xxx
14.88 real 13.36 user 1.43 sys 0 pf 186 pr 0 sw 0 rb 8 wb 0 vcx 516 icx
2500 mx 0 ix 0 id 0 is
cat xxx
A[21]=11{11}:22<LP3
Using what is to me the more obvious [0-9] pattern takes almost 50 times as
long as using the [[:digit:]] pattern. Seems very strange.
grep --version
grep (GNU grep) 2.25
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html
>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/
grep.git/tree/AUTHORS>.
uname -a
Linux jpl 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017
x86_64 x86_64 x86_64 GNU/Linux
[Message part 2 (text/html, inline)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#26193
; Package
grep
.
(Mon, 20 Mar 2017 19:51:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 26193 <at> debbugs.gnu.org (full text, mbox):
tags 26193 moreinfo
done
On Mon, Mar 20, 2017 at 8:34 AM, John P. Linderman <jpl.jpl <at> gmail.com> wrote:
> In what follows, file "conjectures" is a 6 billion bytes file in which each
> line contains at most one letter P, and few (see output) have a digit
> following a P. "rusage" is just a home-brew resource usage summary command.
>
> rusage egrep 'P[0-9]' conjectures > xxx
> 695.55 real 688.33 user 2.40 sys 0 pf 186 pr 0 sw 0 rb 8 wb 1 vcx 19206 icx
> 2488 mx 0 ix 0 id 0 is
>
> cat xxx
> A[21]=11{11}:22<LP3
>
> rusage egrep 'P[[:digit:]]' conjectures > xxx
> 14.88 real 13.36 user 1.43 sys 0 pf 186 pr 0 sw 0 rb 8 wb 0 vcx 516 icx
> 2500 mx 0 ix 0 id 0 is
>
> cat xxx
> A[21]=11{11}:22<LP3
>
> Using what is to me the more obvious [0-9] pattern takes almost 50 times as
> long as using the [[:digit:]] pattern. Seems very strange.
...
Thank you for the report. However, there have been numerous
improvements since grep-2.25, which was released nearly a year ago.
The latest is grep-3.0, and using it, I am unable to reproduce the
problem on an input of 333M lines, each of length 19, and ending in
"P":
$ yes 'A[21]=11{11}:22<LP'| head -333000000 > /dev/shm/k
$ env time grep 'P[0-9]' /dev/shm/k
7.84user 2.06system 0:09.90elapsed 100%CPU (0avgtext+0avgdata
2008maxresident)k
0inputs+0outputs (0major+97minor)pagefaults 0swaps
[Exit 1]
$ env time grep 'P[[:digit:]]' /dev/shm/k
7.86user 1.96system 0:09.83elapsed 99%CPU (0avgtext+0avgdata 2004maxresident)k
0inputs+0outputs (0major+97minor)pagefaults 0swaps
[Exit 1]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#26193
; Package
grep
.
(Wed, 22 Mar 2017 02:10:01 GMT)
Full text and
rfc822 format available.
Message #11 received at 26193 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
John P. Linderman wrote:
> Using what is to me the more obvious [0-9] pattern takes almost 50 times as
> long as using the [[:digit:]] pattern. Seems very strange.
Thanks for reporting that. In general, patterns like [a-z] can be much slower
than [[:lower:]] due to poorly-thought-out POSIX interfaces. However, [0-9] is a
special case: we can optimize such patterns safely if both ends are ASCII
digits. I installed the attached patch to Gnulib to do that; it fixes the
performance glitch you noticed, at least for me.
[0001-dfa-make-0-9-faster-in-non-C-locales.patch (text/x-diff, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#26193
; Package
grep
.
(Wed, 22 Mar 2017 04:29:01 GMT)
Full text and
rfc822 format available.
Message #14 received at 26193 <at> debbugs.gnu.org (full text, mbox):
On Tue, Mar 21, 2017 at 7:09 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> John P. Linderman wrote:
>>
>> Using what is to me the more obvious [0-9] pattern takes almost 50 times
>> as
>> long as using the [[:digit:]] pattern. Seems very strange.
>
>
> Thanks for reporting that. In general, patterns like [a-z] can be much
> slower than [[:lower:]] due to poorly-thought-out POSIX interfaces. However,
> [0-9] is a special case: we can optimize such patterns safely if both ends
> are ASCII digits. I installed the attached patch to Gnulib to do that; it
> fixes the performance glitch you noticed, at least for me.
Thank you, Paul. I confirmed that that solves it for me, too, with a
multibyte locale. I didn't reproduce it initially because I was using
LC_ALL=C.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#26193
; Package
grep
.
(Wed, 22 Mar 2017 12:45:02 GMT)
Full text and
rfc822 format available.
Message #17 received at 26193 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Thanks, all. That puts the runtimes on equal footing:
+ wc conjectures
125441818 125441818 6249180939 conjectures
+ rusage /home/jpl/src/grep-3.0/src/grep P[[:digit:]] conjectures
A[21]=11{11}:22<LP3
5.85 real 5.14 user 0.70 sys 0 pf 118 pr 0 sw 0 rb 0 wb 1 vcx 11 icx 2420
mx 0 ix 0 id 0 is
+ rusage /home/jpl/src/grep-3.0/src/grep P[[:digit:]] conjectures
A[21]=11{11}:22<LP3
5.77 real 5.10 user 0.67 sys 0 pf 121 pr 0 sw 0 rb 0 wb 1 vcx 7 icx 2492 mx
0 ix 0 id 0 is
+ rusage /home/jpl/src/grep-3.0/src/grep P[0-9] conjectures
A[21]=11{11}:22<LP3
5.80 real 5.15 user 0.62 sys 0 pf 119 pr 0 sw 0 rb 0 wb 1 vcx 1001 icx 2424
mx 0 ix 0 id 0 is
On Wed, Mar 22, 2017 at 12:28 AM, Jim Meyering <jim <at> meyering.net> wrote:
> On Tue, Mar 21, 2017 at 7:09 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > John P. Linderman wrote:
> >>
> >> Using what is to me the more obvious [0-9] pattern takes almost 50 times
> >> as
> >> long as using the [[:digit:]] pattern. Seems very strange.
> >
> >
> > Thanks for reporting that. In general, patterns like [a-z] can be much
> > slower than [[:lower:]] due to poorly-thought-out POSIX interfaces.
> However,
> > [0-9] is a special case: we can optimize such patterns safely if both
> ends
> > are ASCII digits. I installed the attached patch to Gnulib to do that; it
> > fixes the performance glitch you noticed, at least for me.
>
> Thank you, Paul. I confirmed that that solves it for me, too, with a
> multibyte locale. I didn't reproduce it initially because I was using
> LC_ALL=C.
>
[Message part 2 (text/html, inline)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#26193
; Package
grep
.
(Wed, 22 Mar 2017 18:02:02 GMT)
Full text and
rfc822 format available.
Message #20 received at 26193 <at> debbugs.gnu.org (full text, mbox):
On 03/22/2017 05:44 AM, John P. Linderman wrote:
> That puts the runtimes on equal footing:
>
In my measurements, P[0-9] is still a tiny bit slower if one is using
glibc regex, due to a performance problem in glibc. You can work around
it by configuring --with-included-regex. It's probably not worth
worrying about, though.
By the way, using LC_ALL=C should help avoid performance problems like
these in the future, if all you're doing is something where single-byte
pattern matching suffices.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#26193
; Package
grep
.
(Wed, 22 Mar 2017 21:59:02 GMT)
Full text and
rfc822 format available.
Message #23 received at 26193 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
I used to use LC_ALL=C, but, as I vaguely recall, it got in the way of
dealing with UNICODE. I tried a couple LC values aimed at UNICODE and the
US, but something always went pear-shaped. I finally give up. I am
perfectly happy to suffer a tiny bit of performance, to have most things
work without thinking. A factor of 6, or 35, is not tiny, since I use grep
and friends intensely. That's how I discovered the performance problem to
begin with. Anyway, thank you for fixing my problem. I suspect that many of
us pioneers (using UNIX since 1973) have '[0-9]' wired into our fingers.
On Wed, Mar 22, 2017 at 2:01 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 03/22/2017 05:44 AM, John P. Linderman wrote:
>
>> That puts the runtimes on equal footing:
>>
>> In my measurements, P[0-9] is still a tiny bit slower if one is using
> glibc regex, due to a performance problem in glibc. You can work around it
> by configuring --with-included-regex. It's probably not worth worrying
> about, though.
>
> By the way, using LC_ALL=C should help avoid performance problems like
> these in the future, if all you're doing is something where single-byte
> pattern matching suffices.
>
>
[Message part 2 (text/html, inline)]
Reply sent
to
Jim Meyering <jim <at> meyering.net>
:
You have taken responsibility.
(Thu, 23 Mar 2017 01:58:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
"John P. Linderman" <jpl.jpl <at> gmail.com>
:
bug acknowledged by developer.
(Thu, 23 Mar 2017 01:58:02 GMT)
Full text and
rfc822 format available.
Message #28 received at 26193-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Wed, Mar 22, 2017 at 2:58 PM, John P. Linderman <jpl.jpl <at> gmail.com> wrote:
> I used to use LC_ALL=C, but, as I vaguely recall, it got in the way of
> dealing with UNICODE. I tried a couple LC values aimed at UNICODE and the
> US, but something always went pear-shaped. I finally give up. I am perfectly
> happy to suffer a tiny bit of performance, to have most things work without
> thinking. A factor of 6, or 35, is not tiny, since I use grep and friends
> intensely. That's how I discovered the performance problem to begin with.
> Anyway, thank you for fixing my problem. I suspect that many of us pioneers
> (using UNIX since 1973) have '[0-9]' wired into our fingers.
>
> On Wed, Mar 22, 2017 at 2:01 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>>
>> On 03/22/2017 05:44 AM, John P. Linderman wrote:
>>>
>>> That puts the runtimes on equal footing:
>>>
>> In my measurements, P[0-9] is still a tiny bit slower if one is using
>> glibc regex, due to a performance problem in glibc. You can work around it
>> by configuring --with-included-regex. It's probably not worth worrying
>> about, though.
>>
>> By the way, using LC_ALL=C should help avoid performance problems like
>> these in the future, if all you're doing is something where single-byte
>> pattern matching suffices.
I've just pulled that gnulib change into grep's repository with the
attached, along with a NEWS update:
[grep-gnulib-dfa-NEWS.diff (text/plain, attachment)]
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Thu, 20 Apr 2017 11:24:04 GMT)
Full text and
rfc822 format available.
This bug report was last modified 8 years and 66 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.