On Sun, Feb 14, 2016 at 12:02 PM, Ulya Fokanova wrote:
> I've explored the following case:
>
> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c
> 6
>
> It's a bug (there should be no match).
>
> This is what grep does:
>
> * tries to build DFA (as in dfa.c)
> * fails to expand character range [1-4] because of multibyte
>   locale en_US.utf-8 and gives up building DFA (marks [1-4] as BACKREF
>   that suppresses all dfa.c-related code), note the difference with
>   [1234] case in which there's no need to expand multibyte range
> * falls back to Regex (gnulib extension of regex.h)
> * Regex doesn't support '-z' semantics (the closest configuration to
>   '-z' is RE_NEWLINE_ALT, which is already included in RE_SYNTAX_GREP
>   set), so '\n' is treated as newline and match erroneously succeeds
>
> I think this should be worked around in grep: before calling 're_search' it
> should split the input string by 'eolbyte'.
>
> The bug also present with PCRE engine:
>
> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
> 6
> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
> 6

Thank you for the analysis and the report.
I have fixed the regex-oriented problem with the attached patch,
but not yet the case using -P -z (PCRE + --null-data):
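
For the regex-side mechanism described in the quoted analysis, here is a minimal sketch of why the anchors misfire. It is not the attached patch and does not use gnulib's re_search: it assumes plain POSIX regcomp/regexec, with REG_NEWLINE standing in for the newline-sensitive anchoring, and record_matches is a hypothetical helper name chosen only for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if `pattern` matches somewhere in `record`,
   0 if it does not, and -1 if the pattern fails to compile.

   When newline_is_separator is nonzero the pattern is compiled with
   REG_NEWLINE, so '^' and '$' also match next to an embedded '\n'.
   When it is zero, '\n' is an ordinary byte and the anchors match only
   at the start and end of the whole record, which is what -z needs. */
int record_matches(const char *pattern, const char *record,
                   int newline_is_separator)
{
    regex_t re;
    int cflags = REG_EXTENDED | REG_NOSUB;
    int rc;

    assert(pattern != NULL && record != NULL);

    if (newline_is_separator)
        cflags |= REG_NEWLINE;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    rc = regexec(&re, record, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the NUL terminator stripped, the record from the report is "12\n34". Calling record_matches("^[1-4]*$", "12\n34", 1) returns 1, because '$' matches just before the embedded newline; that mirrors the erroneous match. Calling it with 0 as the last argument returns 0, which is the result -z should give.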