GNU bug report logs - #19837
Announcing initial release of hstbm, Version 0.13.

Previous Next

Package: grep;

Reported by: sur-behoffski <sur_behoffski <at> grouse.com.au>

Date: Wed, 11 Feb 2015 16:51:03 UTC

Severity: normal

Tags: notabug

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: sur-behoffski <sur_behoffski <at> grouse.com.au>
To: undisclosed-recipients:;
Subject: Announcing initial release of hstbm, Version 0.13.
Date: Thu, 12 Feb 2015 00:20:41 +1030
G'day,

I've been working on an "exploratory fork" of GNU Grep, in order
to provide a framework where I can set up a pared-down program
that provides support for some simple search cases, but does not
have the daunting complexity and (to some extent) extensive
legacy support baggage associated with GNU Grep.

This has resulted in a new project, "hstbm", which is an acronym
for Hybrid Self-Tuning Boyer-Moore.

The Savannah folks have kindly agreed to host hstbm as a non-GNU
project at:

    https://savannah.nongnu.org/projects/hstbm

The release can be downloaded as a separate tarball from:

    http://git.savannah.gnu.org/cgit/hstbm.git/snapshot/hstbm-0.13.tar.gz

Git users probably want to clone the repository instead; the
latest sources (which may include changes after the 0.13 release)
can be retrieved via "git clone":

    git clone git://git.savannah.nongnu.org/hstbm.git

-------------

hstbm is a partner to the "untangle" script that I've been posting
to the GNU Grep over time; whereas "untangle" tries to lay the
foundations for a "bottom-up" refactoring of GNU Grep code, hstbm
is a "top-down" approach, reducing the layers of code between the
caller, pattern interpretation (lexing and/or parsing), and the
underlying search algorithm -- which is also called hstbm.

Only a small fraction of regular expression syntax -- classes --
is recognised, and the class support is competent but not
bullet-proof.  It's best to think of the current level of pattern
support as "fgrep (or grep -F) plus some class support".  This
level of support is enough to fulfil the needs of the string
search algorithm included in this release; more support, such as
anchors, optional elements, repetition, backreferences etc. are
not included for now (although there's a nod in one or two places
that these could included in the future).

-------------

The hstbm algorithm itself is competitive with the equivalent
algorithm in GNU Grep (culminating in bmexec () in src/kwset.c).
In my limited testing, choosing data and patterns that are
somewhat biased towards showing hstbm's strong points, it can
outperform GNU Grep.  I'm hopeful that these improvements can
hold up over a very wide range of data and patterns, with very
few or no performance regressions, such that it might be
worthwhile incorporating hstbm into GNU Grep:

["repl" is a Lua script to wrap (roughly) the following shell
script, run it multiple times (throwing the first iteration
away), and log the results, as well as outputting them:

     Cmd=$*
     time (let j=1 \
           while [ $((j++)) -lt 1000 ] ; do \
                 $Cmd    >/dev/null   \
           done  \
     ) 2>&1
]

--------------

$ grep --version | head 1
grep (GNU grep) 2.21
$ ./h --version | head -1
hstbm (Hybrid Self-Tuning Boyer-Moore) 0.13


$ cat /usr/src/linux-3.14.32-gentoo/*/*.[ch] >csrc
$ wc csrc
  420778  1348027 11521841 csrc


$ ./repl grep -c "123[abc]456" csrc
Cmd:    "grep" "-c" "123[abc]456" "csrc"
0
  real  0m7.268s  user  0m3.140s  sys   0m3.767s
  real  0m7.299s  user  0m3.317s  sys   0m3.590s
  real  0m7.298s  user  0m3.253s  sys   0m3.657s
  real  0m7.245s  user  0m3.233s  sys   0m3.673s

$ ./repl ./h -c "123[abc]456" csrc
Cmd:    "./h" "-c" "123[abc]456" "csrc"
0
  real  0m6.570s  user  0m2.237s  sys   0m1.327s
  real  0m6.586s  user  0m2.200s  sys   0m1.383s
  real  0m6.647s  user  0m2.317s  sys   0m1.367s
  real  0m6.598s  user  0m2.247s  sys   0m1.367s

$ ./repl grep -ci "123[abc]456" csrc
Cmd:    "grep" "-ci" "123[abc]456" "csrc"
0
  real  0m7.307s  user  0m3.163s  sys   0m3.743s
  real  0m7.269s  user  0m3.373s  sys   0m3.533s
  real  0m7.266s  user  0m3.273s  sys   0m3.633s
  real  0m7.266s  user  0m3.357s  sys   0m3.553s

$ ./repl ./h -ci "123[abc]456" csrc
Cmd:    "./h" "-ci" "123[abc]456" "csrc"
0
  real  0m6.582s  user  0m2.250s  sys   0m1.313s
  real  0m6.589s  user  0m2.287s  sys   0m1.293s
  real  0m6.629s  user  0m2.197s  sys   0m1.370s
  real  0m6.637s  user  0m2.217s  sys   0m1.357s

$ ./repl grep -c "123456[abc]" csrc
Cmd:    "grep" "-c" "123456[abc]" "csrc"
0
  real  0m6.652s  user  0m2.170s  sys   0m1.407s
  real  0m6.659s  user  0m2.257s  sys   0m1.330s
  real  0m6.662s  user  0m2.130s  sys   0m1.450s
  real  0m6.658s  user  0m2.157s  sys   0m1.430s

$ ./repl ./h -c "123456[abc]" csrc
Cmd:    "./h" "-c" "123456[abc]" "csrc"
0
  real  0m6.584s  user  0m2.227s  sys   0m1.337s
  real  0m6.546s  user  0m2.210s  sys   0m1.360s
  real  0m6.546s  user  0m2.180s  sys   0m1.393s
  real  0m6.544s  user  0m2.210s  sys   0m1.363s

$ ./repl grep -ci "123456[abc]" csrc
Cmd:    "grep" "-ci" "123456[abc]" "csrc"
0
  real  0m6.676s  user  0m2.177s  sys   0m1.407s
  real  0m6.681s  user  0m2.060s  sys   0m1.537s
  real  0m6.671s  user  0m2.250s  sys   0m1.337s
  real  0m6.683s  user  0m2.090s  sys   0m1.497s

$ ./repl ./h -ci "123456[abc]" csrc
Cmd:    "./h" "-ci" "123456[abc]" "csrc"
0
  real  0m6.644s  user  0m2.333s  sys   0m1.233s
  real  0m6.689s  user  0m2.113s  sys   0m1.457s
  real  0m6.602s  user  0m2.187s  sys   0m1.383s
  real  0m6.604s  user  0m2.220s  sys   0m1.353s

$ ./repl grep -c "123456a" csrc
Cmd:    "grep" "-c" "123456a" "csrc"
0
  real  0m12.035s  user 0m7.743s  sys   0m2.490s
  real  0m12.024s  user 0m7.763s  sys   0m2.473s
  real  0m12.019s  user 0m7.820s  sys   0m2.413s
  real  0m12.017s  user 0m7.717s  sys   0m2.517s

$ ./repl ./h -c "123456a" csrc
Cmd:    "./h" "-c" "123456a" "csrc"
0
  real  0m6.266s  user  0m1.923s  sys   0m1.640s
  real  0m6.273s  user  0m1.980s  sys   0m1.590s
  real  0m6.252s  user  0m2.217s  sys   0m1.357s
  real  0m6.254s  user  0m2.033s  sys   0m1.540s

$ ./repl grep -ci "123456a" csrc
Cmd:    "grep" "-ci" "123456a" "csrc"
0
  real  0m15.220s  user 0m11.060s  sys  0m2.507s
  real  0m15.327s  user 0m10.980s  sys  0m2.583s
  real  0m15.215s  user 0m11.090s  sys  0m2.473s
  real  0m15.218s  user 0m10.900s  sys  0m2.663s

$ ./repl ./h -ci "123456a" csrc
Cmd:    "./h" "-ci" "123456a" "csrc"
0
  real  0m6.651s  user  0m2.307s  sys   0m1.260s
  real  0m6.665s  user  0m2.253s  sys   0m1.327s
  real  0m6.602s  user  0m2.320s  sys   0m1.243s
  real  0m6.611s  user  0m2.243s  sys   0m1.327s



$ wc candide-8859-1.txt
  4352  35591 221126 candide-8859-1.txt


$ LC_ALL=fr_FR.iso88591 ./repl grep -c "tr[[=e=]]s" candide-8859-1.txt
Cmd:    "grep" "-c" "tr[[=e=]]s" "candide-8859-1.txt"
154
  real  0m4.330s  user  0m3.240s  sys   0m0.290s
  real  0m4.340s  user  0m3.273s  sys   0m0.263s
  real  0m4.345s  user  0m3.213s  sys   0m0.323s
  real  0m4.346s  user  0m3.213s  sys   0m0.327s

$ LC_ALL=fr_FR.iso88591 ./repl ./h -c "tr[[=e=]]s" candide-8859-1.txt
Cmd:    "./h" "-c" "tr[[=e=]]s" "candide-8859-1.txt"
154
  real  0m1.376s  user  0m0.043s  sys   0m0.133s
  real  0m1.382s  user  0m0.073s  sys   0m0.103s
  real  0m1.383s  user  0m0.043s  sys   0m0.133s
  real  0m1.386s  user  0m0.050s  sys   0m0.127s

$ LC_ALL=fr_FR.iso88591 ./repl grep -ci "tr[[=e=]]s" candide-8859-1.txt
Cmd:    "grep" "-ci" "tr[[=e=]]s" "candide-8859-1.txt"
155
  real  0m5.704s  user  0m3.357s  sys   0m0.177s
  real  0m5.683s  user  0m3.383s  sys   0m0.150s
  real  0m5.686s  user  0m3.343s  sys   0m0.190s
  real  0m5.690s  user  0m3.363s  sys   0m0.173s

$ LC_ALL=fr_FR.iso88591 ./repl ./h -ci "tr[[=e=]]s" candide-8859-1.txt
Cmd:    "./h" "-ci" "tr[[=e=]]s" "candide-8859-1.txt"
155
  real  0m1.598s  user  0m0.043s  sys   0m0.137s
  real  0m1.600s  user  0m0.060s  sys   0m0.117s
  real  0m1.608s  user  0m0.040s  sys   0m0.140s
  real  0m1.609s  user  0m0.070s  sys   0m0.110s


--------------------


While many, many features of GNU Grep are not supported, two switches,
-J and -K=SKIP_LOOP_EFFICIENCY_THRESHOLD, are included.  Here's an
example of the debug output created when -J is specified:




$ ./h -J -c "tr[[=e=]]s" candide-8859-1.txt
* show_internals:  Compiled pattern:  Number of tokens: 4
    OP_CHAR_DIRECT('t')      -- 0x74
    OP_CHAR_DIRECT('r')      -- 0x72
    OP_CHAR_DIRECT('e')      -- 0x65
    OP_CHAR_DIRECT('s')      -- 0x73

* debug: hstbm context at 0x20ee000:
- fold_table (not used):
    ................................ ................................
    ................................ ................................
    ................................ ................................
    ................................ ................................
- delta1:
    44444444444444444444444444444444 44444444444444444444444444444444
    44444444444444444444444444444444 44444144444444444420344444444444
    44444444444444444444444444444444 44444444444444444444444444444444
    44444444444444444444444444444444 44444444444444444444444444444444
-      Pattern:    t   r   e   s

* Hybrid (memchr/memchr2) efficiency variables:
- skip_loop_efficiency_threshold: 80
-   nr_aliases:    0   0   0   0
-  first_alias:    .   .   .   .
- memchr_offset_list [negated]:    3   2   1   0
- current_memchr_offset_index: 0

* Guard (delta2) and match (delta 3/4) skip variables:
- delta2 candidate(s):   4
- delta2: 4
-       delta3:    4   4   4
-       delta4:    4   4

* context check... [OK]
93
$ ./h -J -ci "tr[[=e=]]s" candide-8859-1.txt
* show_internals:  Compiled pattern:  Number of tokens: 4
    OP_CHAR_CLASS(ref=1)     -- members:2
      -- 54 74                                             Tt
    OP_CHAR_CLASS(ref=2)     -- members:2
      -- 52 72                                             Rr
    OP_CHAR_CLASS(ref=3)     -- members:2
      -- 45 65                                             Ee
    OP_CHAR_CLASS(ref=4)     -- members:2
      -- 53 73                                             Ss

* debug: hstbm context at 0x1a4e000:
- fold_table:
    ................................  !"#$%&'()*+,-./0123456789:;<=>?
    @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ `abcdEfghijklmnopqRsTuvwxyz{|}~.
    ................................ ................................
    ................................ ................................
- delta1:
    44444444444444444444444444444444 44444444444444444444444444444444
    44444144444444444420344444444444 44444144444444444420344444444444
    44444444444444444444444444444444 44444444444444444444444444444444
    44444444444444444444444444444444 44444444444444444444444444444444
-      Pattern:    T   R   E   S

* Hybrid (memchr/memchr2) efficiency variables:
- skip_loop_efficiency_threshold: 80
-   nr_aliases:    1   1   1   1
-  first_alias:    t   r   e   s
- memchr_offset_list [negated]:    3   2   1   0
- current_memchr_offset_index: 0

* Guard (delta2) and match (delta 3/4) skip variables:
- delta2 candidate(s):   4
- delta2: 4
-       delta3:    4   4   4
-       delta4:    4   4

* context check... [OK]
93
$ LC_ALL=fr_FR.iso88591 ./h -J -ci "tr[[=e=]]s" candide-8859-1.txt
* show_internals:  Compiled pattern:  Number of tokens: 4
    OP_CHAR_CLASS(ref=1)     -- members:2
      -- 54 74                                             Tt
    OP_CHAR_CLASS(ref=2)     -- members:2
      -- 52 72                                             Rr
    OP_CHAR_CLASS(ref=3)     -- members:10
      -- 45 65 c8 c9 ca cb e8 e9  ea eb                    EeÈÉÊËèé êë
    OP_CHAR_CLASS(ref=4)     -- members:2
      -- 53 73                                             Ss

* debug: hstbm context at 0x1ae7000:
- fold_table:
    ................................  !"#$%&'()*+,-./0123456789:;<=>?
    @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ `abcdEfghijklmnopqRsTuvwxyz{|}~.
    ................................  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿
    ÀÁÂÃÄÅÆÇEEEEÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß àáâãäåæçEEEEìíîïðñòóôõö÷øùúûüýþÿ
- delta1:
    44444444444444444444444444444444 44444444444444444444444444444444
    44444144444444444420344444444444 44444144444444444420344444444444
    44444444444444444444444444444444 44444444444444444444444444444444
    44444444111144444444444444444444 44444444111144444444444444444444
-      Pattern:    T   R   E   S

* Hybrid (memchr/memchr2) efficiency variables:
- skip_loop_efficiency_threshold: 80
-   nr_aliases:    1   1   9   1
-  first_alias:    t   r   e   s
- memchr_offset_list [negated]:    3   2   0
- current_memchr_offset_index: 0

* Guard (delta2) and match (delta 3/4) skip variables:
- delta2 candidate(s):   4
- delta2: 4
-       delta3:    4   4   4
-       delta4:    4   4

* context check... [OK]
155
$



--------------------



WARNING:  Because hstbm supports far fewer features than GNU Grep,
it takes less time to initialise.  This will give hstbm an advantage
when the total input is small (?? perhaps less than 4k bytes?), and
should be factored out when comparing performance.  (Note that both
csrc and candide-8859-1.txt above are greater than 100k bytes, so
this effect should not be a large factor in the above results.)


$ ./repl grep -c "123[abc]456" /dev/null
Cmd:    "grep" "-c" "123[abc]456" "/dev/null"
0
  real  0m0.990s  user  0m0.073s  sys   0m0.097s
  real  0m1.001s  user  0m0.050s  sys   0m0.123s
  real  0m0.993s  user  0m0.020s  sys   0m0.147s
  real  0m0.989s  user  0m0.040s  sys   0m0.127s

$ ./repl ./h -c "123[abc]456" /dev/null
Cmd:    "./h" "-c" "123[abc]456" "/dev/null"
0
  real  0m0.748s  user  0m0.047s  sys   0m0.103s
  real  0m0.758s  user  0m0.057s  sys   0m0.093s
  real  0m0.762s  user  0m0.037s  sys   0m0.113s
  real  0m0.754s  user  0m0.050s  sys   0m0.100s

$ ./repl grep -ci "123[abc]456" /dev/null
Cmd:    "grep" "-ci" "123[abc]456" "/dev/null"
0
  real  0m0.993s  user  0m0.043s  sys   0m0.127s
  real  0m0.998s  user  0m0.057s  sys   0m0.110s
  real  0m0.995s  user  0m0.037s  sys   0m0.130s
  real  0m0.993s  user  0m0.053s  sys   0m0.110s

$ ./repl ./h -ci "123[abc]456" /dev/null
Cmd:    "./h" "-ci" "123[abc]456" "/dev/null"
0
  real  0m0.768s  user  0m0.037s  sys   0m0.113s
  real  0m0.787s  user  0m0.033s  sys   0m0.123s
  real  0m0.770s  user  0m0.037s  sys   0m0.110s
  real  0m0.779s  user  0m0.027s  sys   0m0.127s



--------------------



Have fun hacking, and reading the code... all feedback, comments,
patches, etc. welcome.

cheers,

sur-behoffski (Brenton Hoff)
Programmer, Grouse Software




This bug report was last modified 9 years and 359 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.