GNU bug report logs - #26193
[0-9] versus [[:digit:]]

Previous Next

Package: grep;

Reported by: "John P. Linderman" <jpl.jpl <at> gmail.com>

Date: Mon, 20 Mar 2017 17:00:02 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 26193 in the body.
You can then email your comments to 26193 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#26193; Package grep. (Mon, 20 Mar 2017 17:00:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "John P. Linderman" <jpl.jpl <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 20 Mar 2017 17:00:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "John P. Linderman" <jpl.jpl <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: [0-9] versus [[:digit:]]
Date: Mon, 20 Mar 2017 11:34:05 -0400
[Message part 1 (text/plain, inline)]
In what follows, file "conjectures" is a 6 billion bytes file in which each
line contains at most one letter P, and few (see output) have a digit
following a P. "rusage" is just a home-brew resource usage summary command.

  rusage egrep 'P[0-9]' conjectures > xxx
695.55 real 688.33 user 2.40 sys 0 pf 186 pr 0 sw 0 rb 8 wb 1 vcx 19206 icx
2488 mx 0 ix 0 id 0 is

  cat xxx
A[21]=11{11}:22<LP3

  rusage egrep 'P[[:digit:]]' conjectures > xxx
14.88 real 13.36 user 1.43 sys 0 pf 186 pr 0 sw 0 rb 8 wb 0 vcx 516 icx
2500 mx 0 ix 0 id 0 is

  cat xxx
A[21]=11{11}:22<LP3

Using what is to me the more obvious [0-9] pattern takes almost 50 times as
long as using the [[:digit:]] pattern. Seems very strange.

  grep --version
grep (GNU grep) 2.25
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html
>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/
grep.git/tree/AUTHORS>.

  uname -a
Linux jpl 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017
x86_64 x86_64 x86_64 GNU/Linux
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#26193; Package grep. (Mon, 20 Mar 2017 19:51:02 GMT) Full text and rfc822 format available.

Message #8 received at 26193 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: "John P. Linderman" <jpl.jpl <at> gmail.com>
Cc: 26193 <at> debbugs.gnu.org
Subject: Re: bug#26193: [0-9] versus [[:digit:]]
Date: Mon, 20 Mar 2017 12:49:52 -0700
tags 26193 moreinfo
done

On Mon, Mar 20, 2017 at 8:34 AM, John P. Linderman <jpl.jpl <at> gmail.com> wrote:
> In what follows, file "conjectures" is a 6 billion bytes file in which each
> line contains at most one letter P, and few (see output) have a digit
> following a P. "rusage" is just a home-brew resource usage summary command.
>
>   rusage egrep 'P[0-9]' conjectures > xxx
> 695.55 real 688.33 user 2.40 sys 0 pf 186 pr 0 sw 0 rb 8 wb 1 vcx 19206 icx
> 2488 mx 0 ix 0 id 0 is
>
>   cat xxx
> A[21]=11{11}:22<LP3
>
>   rusage egrep 'P[[:digit:]]' conjectures > xxx
> 14.88 real 13.36 user 1.43 sys 0 pf 186 pr 0 sw 0 rb 8 wb 0 vcx 516 icx
> 2500 mx 0 ix 0 id 0 is
>
>   cat xxx
> A[21]=11{11}:22<LP3
>
> Using what is to me the more obvious [0-9] pattern takes almost 50 times as
> long as using the [[:digit:]] pattern. Seems very strange.
...

Thank you for the report. However, there have been numerous
improvements since grep-2.25, which was released nearly a year ago.
The latest is grep-3.0, and using it, I am unable to reproduce the
problem on an input of 333M lines, each of length 19, and ending in
"P":

  $ yes 'A[21]=11{11}:22<LP'| head -333000000 > /dev/shm/k
  $ env time grep 'P[0-9]' /dev/shm/k
  7.84user 2.06system 0:09.90elapsed 100%CPU (0avgtext+0avgdata
2008maxresident)k
  0inputs+0outputs (0major+97minor)pagefaults 0swaps
  [Exit 1]
  $ env time grep 'P[[:digit:]]' /dev/shm/k
  7.86user 1.96system 0:09.83elapsed 99%CPU (0avgtext+0avgdata 2004maxresident)k
  0inputs+0outputs (0major+97minor)pagefaults 0swaps
  [Exit 1]




Information forwarded to bug-grep <at> gnu.org:
bug#26193; Package grep. (Wed, 22 Mar 2017 02:10:01 GMT) Full text and rfc822 format available.

Message #11 received at 26193 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "John P. Linderman" <jpl.jpl <at> gmail.com>, 26193 <at> debbugs.gnu.org
Cc: Gnulib bugs <bug-gnulib <at> gnu.org>
Subject: Re: bug#26193: [0-9] versus [[:digit:]]
Date: Tue, 21 Mar 2017 19:09:41 -0700
[Message part 1 (text/plain, inline)]
John P. Linderman wrote:
> Using what is to me the more obvious [0-9] pattern takes almost 50 times as
> long as using the [[:digit:]] pattern. Seems very strange.

Thanks for reporting that. In general, patterns like [a-z] can be much slower 
than [[:lower:]] due to poorly-thought-out POSIX interfaces. However, [0-9] is a 
special case: we can optimize such patterns safely if both ends are ASCII 
digits. I installed the attached patch to Gnulib to do that; it fixes the 
performance glitch you noticed, at least for me.
[0001-dfa-make-0-9-faster-in-non-C-locales.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#26193; Package grep. (Wed, 22 Mar 2017 04:29:01 GMT) Full text and rfc822 format available.

Message #14 received at 26193 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: "John P. Linderman" <jpl.jpl <at> gmail.com>, 26193 <at> debbugs.gnu.org,
 Gnulib bugs <bug-gnulib <at> gnu.org>
Subject: Re: bug#26193: [0-9] versus [[:digit:]]
Date: Tue, 21 Mar 2017 21:28:05 -0700
On Tue, Mar 21, 2017 at 7:09 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> John P. Linderman wrote:
>>
>> Using what is to me the more obvious [0-9] pattern takes almost 50 times
>> as
>> long as using the [[:digit:]] pattern. Seems very strange.
>
>
> Thanks for reporting that. In general, patterns like [a-z] can be much
> slower than [[:lower:]] due to poorly-thought-out POSIX interfaces. However,
> [0-9] is a special case: we can optimize such patterns safely if both ends
> are ASCII digits. I installed the attached patch to Gnulib to do that; it
> fixes the performance glitch you noticed, at least for me.

Thank you, Paul. I confirmed that that solves it for me, too, with a
multibyte locale. I didn't reproduce it initially because I was using
LC_ALL=C.




Information forwarded to bug-grep <at> gnu.org:
bug#26193; Package grep. (Wed, 22 Mar 2017 12:45:02 GMT) Full text and rfc822 format available.

Message #17 received at 26193 <at> debbugs.gnu.org (full text, mbox):

From: "John P. Linderman" <jpl.jpl <at> gmail.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 26193 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 Gnulib bugs <bug-gnulib <at> gnu.org>
Subject: Re: bug#26193: [0-9] versus [[:digit:]]
Date: Wed, 22 Mar 2017 08:44:23 -0400
[Message part 1 (text/plain, inline)]
Thanks, all. That puts the runtimes on equal footing:

+ wc conjectures
 125441818  125441818 6249180939 conjectures
+ rusage /home/jpl/src/grep-3.0/src/grep P[[:digit:]] conjectures
A[21]=11{11}:22<LP3
5.85 real 5.14 user 0.70 sys 0 pf 118 pr 0 sw 0 rb 0 wb 1 vcx 11 icx 2420
mx 0 ix 0 id 0 is
+ rusage /home/jpl/src/grep-3.0/src/grep P[[:digit:]] conjectures
A[21]=11{11}:22<LP3
5.77 real 5.10 user 0.67 sys 0 pf 121 pr 0 sw 0 rb 0 wb 1 vcx 7 icx 2492 mx
0 ix 0 id 0 is
+ rusage /home/jpl/src/grep-3.0/src/grep P[0-9] conjectures
A[21]=11{11}:22<LP3
5.80 real 5.15 user 0.62 sys 0 pf 119 pr 0 sw 0 rb 0 wb 1 vcx 1001 icx 2424
mx 0 ix 0 id 0 is


On Wed, Mar 22, 2017 at 12:28 AM, Jim Meyering <jim <at> meyering.net> wrote:

> On Tue, Mar 21, 2017 at 7:09 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> > John P. Linderman wrote:
> >>
> >> Using what is to me the more obvious [0-9] pattern takes almost 50 times
> >> as
> >> long as using the [[:digit:]] pattern. Seems very strange.
> >
> >
> > Thanks for reporting that. In general, patterns like [a-z] can be much
> > slower than [[:lower:]] due to poorly-thought-out POSIX interfaces.
> However,
> > [0-9] is a special case: we can optimize such patterns safely if both
> ends
> > are ASCII digits. I installed the attached patch to Gnulib to do that; it
> > fixes the performance glitch you noticed, at least for me.
>
> Thank you, Paul. I confirmed that that solves it for me, too, with a
> multibyte locale. I didn't reproduce it initially because I was using
> LC_ALL=C.
>
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#26193; Package grep. (Wed, 22 Mar 2017 18:02:02 GMT) Full text and rfc822 format available.

Message #20 received at 26193 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "John P. Linderman" <jpl.jpl <at> gmail.com>, Jim Meyering <jim <at> meyering.net>
Cc: 26193 <at> debbugs.gnu.org, Gnulib bugs <bug-gnulib <at> gnu.org>
Subject: Re: bug#26193: [0-9] versus [[:digit:]]
Date: Wed, 22 Mar 2017 11:01:43 -0700
On 03/22/2017 05:44 AM, John P. Linderman wrote:
> That puts the runtimes on equal footing:
>
In my measurements, P[0-9] is still a tiny bit slower if one is using 
glibc regex, due to a performance problem in glibc. You can work around 
it by configuring --with-included-regex. It's probably not worth 
worrying about, though.

By the way, using LC_ALL=C should help avoid performance problems like 
these in the future, if all you're doing is something where single-byte 
pattern matching suffices.





Information forwarded to bug-grep <at> gnu.org:
bug#26193; Package grep. (Wed, 22 Mar 2017 21:59:02 GMT) Full text and rfc822 format available.

Message #23 received at 26193 <at> debbugs.gnu.org (full text, mbox):

From: "John P. Linderman" <jpl.jpl <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 26193 <at> debbugs.gnu.org, Gnulib bugs <bug-gnulib <at> gnu.org>,
 Jim Meyering <jim <at> meyering.net>
Subject: Re: bug#26193: [0-9] versus [[:digit:]]
Date: Wed, 22 Mar 2017 17:58:39 -0400
[Message part 1 (text/plain, inline)]
I used to use LC_ALL=C, but, as I vaguely recall, it got in the way of
dealing with UNICODE. I tried a couple LC values aimed at UNICODE and the
US, but something always went pear-shaped. I finally give up. I am
perfectly happy to suffer a tiny bit of performance, to have most things
work without thinking. A factor of 6, or 35, is not tiny, since I use grep
and friends intensely. That's how I discovered the performance problem to
begin with. Anyway, thank you for fixing my problem. I suspect that many of
us pioneers (using UNIX since 1973) have '[0-9]' wired into our fingers.

On Wed, Mar 22, 2017 at 2:01 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 03/22/2017 05:44 AM, John P. Linderman wrote:
>
>> That puts the runtimes on equal footing:
>>
>> In my measurements, P[0-9] is still a tiny bit slower if one is using
> glibc regex, due to a performance problem in glibc. You can work around it
> by configuring --with-included-regex. It's probably not worth worrying
> about, though.
>
> By the way, using LC_ALL=C should help avoid performance problems like
> these in the future, if all you're doing is something where single-byte
> pattern matching suffices.
>
>
[Message part 2 (text/html, inline)]

Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Thu, 23 Mar 2017 01:58:02 GMT) Full text and rfc822 format available.

Notification sent to "John P. Linderman" <jpl.jpl <at> gmail.com>:
bug acknowledged by developer. (Thu, 23 Mar 2017 01:58:02 GMT) Full text and rfc822 format available.

Message #28 received at 26193-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: "John P. Linderman" <jpl.jpl <at> gmail.com>
Cc: 26193-done <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 Gnulib bugs <bug-gnulib <at> gnu.org>
Subject: Re: bug#26193: [0-9] versus [[:digit:]]
Date: Wed, 22 Mar 2017 18:57:05 -0700
[Message part 1 (text/plain, inline)]
On Wed, Mar 22, 2017 at 2:58 PM, John P. Linderman <jpl.jpl <at> gmail.com> wrote:
> I used to use LC_ALL=C, but, as I vaguely recall, it got in the way of
> dealing with UNICODE. I tried a couple LC values aimed at UNICODE and the
> US, but something always went pear-shaped. I finally give up. I am perfectly
> happy to suffer a tiny bit of performance, to have most things work without
> thinking. A factor of 6, or 35, is not tiny, since I use grep and friends
> intensely. That's how I discovered the performance problem to begin with.
> Anyway, thank you for fixing my problem. I suspect that many of us pioneers
> (using UNIX since 1973) have '[0-9]' wired into our fingers.
>
> On Wed, Mar 22, 2017 at 2:01 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>>
>> On 03/22/2017 05:44 AM, John P. Linderman wrote:
>>>
>>> That puts the runtimes on equal footing:
>>>
>> In my measurements, P[0-9] is still a tiny bit slower if one is using
>> glibc regex, due to a performance problem in glibc. You can work around it
>> by configuring --with-included-regex. It's probably not worth worrying
>> about, though.
>>
>> By the way, using LC_ALL=C should help avoid performance problems like
>> these in the future, if all you're doing is something where single-byte
>> pattern matching suffices.

I've just pulled that gnulib change into grep's repository with the
attached, along with a NEWS update:
[grep-gnulib-dfa-NEWS.diff (text/plain, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 20 Apr 2017 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 66 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.