GNU bug report logs -
#17376
[PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Wed, 30 Apr 2014 15:03:01 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Sorry, two of the test cases were wrong. They should be as follows.
encode() { echo "$1" | tr ABC '\357\274\241'; }
encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
encode aABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
^
encode aABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
^
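To see which bytes these test cases feed to grep, it can help to dump the output of `encode'. The sketch below reuses `encode' exactly as defined in the test cases; `hex' is a hypothetical helper introduced only for this illustration, and it assumes `od' and `xargs' are available.

```shell
# encode is copied from the test cases: tr maps A, B, C to the bytes
# 0xEF, 0xBC, 0xA1 (together the UTF-8 encoding of U+FF21).
encode() { echo "$1" | tr ABC '\357\274\241'; }

# hex is a hypothetical helper (not part of the report): print stdin
# as space-separated hex bytes on one line.
hex() { od -An -tx1 | xargs; }

encode ABC  | hex   # ef bc a1 0a     complete 3-byte character + newline
encode A    | hex   # ef 0a           lone lead byte: an incomplete sequence
encode aABC | hex   # 61 ef bc a1 0a  'a' followed by the complete character
```

Because command substitution strips the trailing newline, the pattern `$(encode A)' is the single byte ef, which is not a valid UTF-8 character on its own, while the input contains that byte only as the first byte of a valid character.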
We expect none of these four grep commands to output anything, but the
4th prints one line. To make it behave correctly, we need to fix the
last line of is_mb_middle in searchutils.c:
return 0 < match_len && match_len < mbrlen (p, end - p, &cur_state);
We must check validity not at the start but at the end of the matched
text. Currently the check is made at the `a' of `a$(encode A)'; it
should instead be made at the `$(encode A)' part.
However, that may cause a slowdown in typical cases that do not include
any invalid sequences, and many users would not want that.
Furthermore, it seems that DFA does not treat invalid sequences correctly.
I checked with debugging enabled: the tokens are already broken in the 1st case.
encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
dfaanalyze:
0:c3 1:af 2:CAT 3:71 4:OR 5:END 6:CAT
I expected the following, because `encode ABC' is `ef bc a1'.
dfaanalyze:
0:ef 1:71 2:OR 3:END 4:CAT
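The report does not say where `c3 af' comes from. One possible reading, offered here only as an assumption, is that the lone pattern byte ef was taken as code point U+00EF and re-encoded as UTF-8, since that yields exactly those two bytes. The arithmetic can be checked with a small shell sketch (the function name `utf8_2byte' is hypothetical):

```shell
# Hypothetical helper: print the 2-byte UTF-8 encoding of a code point
# in the range U+0080..U+07FF as two hex bytes.
utf8_2byte() {
  cp=$(( $1 ))
  # Lead byte: 110xxxxx carries the top bits; continuation: 10xxxxxx the low 6.
  printf '%02x %02x\n' $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
}

utf8_2byte 0xEF   # c3 af
```

If that reading is right, tokens 0 and 1 in the debug output together form one two-byte character rather than the raw byte ef that the pattern actually contains.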
However, it will also be difficult to fix this correctly. Therefore,
I propose the simple fix in the patch.
Thanks,
Norihiro
This bug report was last modified 11 years and 15 days ago.