I believe that memchr() isn't slower on HP-UX and Solaris but more
faster on x86 processors.

Now I wrote the additional patch.

It replaces `tp += d' to the other code in bmexec() and cwexec() for x86
processors.

I see that the code is more complex and slower theoretically, but it's
faster actually when `d' is small.  (We can test it by `grep -i jk k'
after patch#17230 and this patch.)

Therefore, I think that memchr() is faster than `tp += d' because the
increment instruction is more faster than the add instruction on x86
processors.  KWSet may be slower than DFA on them in the worst case,
even if doesn't use delta2.  Try to run and compare below and compare
without this patches.  The former uses only KWSet and the later uses only
DFA.  Later is faster on Linux x86, because DFA uses increment instrument
for shift of text pointer.  It may generate a wrong usage among users.
I think that it should be avoided.

  $ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 >k
  $ env LANG=C time -p src/grep jk k
  $ env LANG=C time -p src/grep -i jk k

> sometimes the former is more important than the latter, and this may be
> one of those times.

I see that grep attaches a high value to performance.  It uses not only
regex but also DFA.  Further more, DFA has the specific codes for utf-8
locale, which is used most frequently.  I think that grep may also have
specific codes for x86 processors, which are also used most frequently.

Norihiro