GNU bug report logs - #24160
[PATCH 1/2] sed: cache results of mbrtowc for speed
Message #11 received at 24160 <at> debbugs.gnu.org (full text, mbox):
On Fri, 5 Aug 2016 10:45:59 -0400
Assaf Gordon <assafgordon <at> gmail.com> wrote:
> Hello Norihiro,
>
> Thank you for the patch.
>
> By using a cache table, isn't this code ignoring mbstate ?
> For example, in shift-jis encoding, the character '[' can either be standalone,
> or a second character in a sequence such as '\x83\x5b' ?
> Wouldn't it also prevent detection of invalid sequences ?
>
> As a side-note, gnu sed's current implementation has special code path for multibyte-non-utf8 input,
> so this change will not likely affect utf8 or C locales.
>
> regards,
> - assaf
Hi Assaf,
Thanks for the review.
When MBRTOWC() or MBRLEN() is called in Shift-JIS, mbstate is always in
the initial state, or in a state equivalent to the initial state, unless
an invalid or incomplete sequence has been found, because Shift-JIS is a
stateless encoding.
Even if such a sequence is found, mbstate has to be reset to the initial
state manually in order to check the following characters in the string.
So I think that we can ignore mbstate in a stateless encoding.
However, this assumption is wrong for stateful encodings such as ISO-2022
and UTF-7. Does sed support stateful encodings that have shift sequences?
At least, it seems that regex does not support stateful encodings.
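To illustrate the point about resetting mbstate, here is a minimal sketch of a scanning loop. It is not code from sed or from the patch; count_chars and its parameters are hypothetical names chosen for this example. It assumes a stateless encoding, where every valid character leaves mbstate in the initial state, so the state only needs attention after an invalid or incomplete sequence.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of sed: returns the number of
   characters decoded from buf[0..len) and stores the number of
   undecodable bytes in *bad.  */
size_t
count_chars (const char *buf, size_t len, size_t *bad)
{
  mbstate_t st;
  size_t i = 0, n = 0;

  memset (&st, 0, sizeof st);   /* initial conversion state */
  *bad = 0;
  while (i < len)
    {
      wchar_t wc;
      size_t r = mbrtowc (&wc, buf + i, len - i, &st);

      if (r == (size_t) -1 || r == (size_t) -2)
        {
          /* Invalid or incomplete sequence: the state is no longer
             usable for the next character, so reset it by hand and
             skip a single byte.  */
          memset (&st, 0, sizeof st);
          ++*bad;
          i++;
          continue;
        }
      if (r == 0)
        r = 1;                  /* a null character still occupies a byte */
      i += r;
      n++;
    }
  return n;
}
```

Because the state is back in the initial state at the start of every iteration, the result for a given byte sequence does not depend on what came before it, which is the property a cache of mbrtowc results would rely on.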
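If stateful encodings had to be ruled out before using such a cache, one possible guard is sketched below. This is an assumption about how it could be done, not something the patch does; encoding_is_stateful is a hypothetical name. It relies on the ISO C rule that mbtowc called with a null string pointer reports whether the current locale's multibyte encoding is state-dependent.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical guard, not part of sed: true if the multibyte encoding
   of the current locale has shift states.  */
bool
encoding_is_stateful (void)
{
  /* With a null string pointer, mbtowc returns nonzero if the encoding
     is state-dependent and zero otherwise; the call also resets
     mbtowc's internal conversion state.  */
  return mbtowc (NULL, NULL, 0) != 0;
}
```

Whether a particular platform actually offers locales with stateful encodings, and reports them this way, depends on its C library.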
Thanks,
Norihiro
This bug report was last modified 8 years and 270 days ago.