GNU bug report logs - #24160
[PATCH 1/2] sed: cache results of mbrtowc for speed
Message #8 received at 24160 <at> debbugs.gnu.org:
Hello Norihiro,
Thank you for the patch.
On 08/05/2016 09:51 AM, Norihiro Tanaka wrote:
> We can speed up sed by caching the result of mbrtowc() for single-byte
> characters. This is especially effective in non-UTF-8 multibyte
> locales, where the calculation is expensive.
Regarding this:
====
#define MBRTOWC(pwc, s, n, ps) \
- (mb_cur_max == 1 ? \
- (*(pwc) = btowc (*(unsigned char *) (s)), 1) : \
+ (mbrlen_cache[*(unsigned char *) (s)] == 1 ? \
+ (*(pwc) = mbrtowc_cache[*(unsigned char *) (s)], 1) : \
mbrtowc ((pwc), (s), (n), (ps)))
#define MBRLEN(s, n, ps) \
- (mb_cur_max == 1 ? 1 : mbrtowc (NULL, s, n, ps))
+ (mbrlen_cache[*(unsigned char *) (s)] == 1 ? \
+ 1 : mbrtowc (NULL, s, n, ps))
====
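For reference, here is a minimal sketch of how such tables might be filled. The table names come from the diff above, but their element types and the init_mb_cache() helper are assumptions for illustration; the patch's actual initialization is not shown in this message.

```c
#include <string.h>
#include <wchar.h>

/* Table names follow the diff; the element types are assumed. */
static size_t mbrlen_cache[256];
static wchar_t mbrtowc_cache[256];

/* Hypothetical helper: probe each byte value once, always starting
   from the initial conversion state of the current locale. */
static void
init_mb_cache (void)
{
  for (int b = 0; b < 256; b++)
    {
      char c = (char) b;
      wchar_t wc = 0;
      mbstate_t st;

      memset (&st, 0, sizeof st);   /* fresh initial state per probe */
      /* Returns 1 for a complete single-byte character, 0 for NUL,
         (size_t) -2 for an incomplete sequence, (size_t) -1 if invalid. */
      mbrlen_cache[b] = mbrtowc (&wc, &c, 1, &st);
      mbrtowc_cache[b] = wc;
    }
}
```

Every entry records what a byte means when decoding starts from the initial state, which is why the question of mbstate matters.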
By using a cache table, isn't this code ignoring mbstate?
For example, in the Shift-JIS encoding, the byte '[' (0x5b) can either be a standalone character
or the second byte of a two-byte sequence such as '\x83\x5b'.
Wouldn't it also prevent detection of invalid sequences?
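To make the concern concrete, here is a toy sketch, not sed's code. It uses a hypothetical two-function decoder with a simplified Shift-JIS lead-byte test and feeds input one byte at a time, as when a buffer ends in the middle of a character. Whether sed ever reaches the macros with a non-initial state is not established here; the sketch only shows what happens if it does.

```c
#include <stddef.h>

/* Toy conversion state (hypothetical): a pending lead byte, or 0. */
struct toy_state { unsigned char lead; };

/* Shift-JIS lead-byte ranges for two-byte characters (simplified). */
static int
is_lead (unsigned char b)
{
  return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* State-aware step: feed one byte.  Returns 1 and sets *out when a
   character completes, 0 when more input is needed. */
static int
step_with_state (struct toy_state *st, unsigned char b, unsigned *out)
{
  if (st->lead)
    {
      *out = ((unsigned) st->lead << 8) | b;
      st->lead = 0;
      return 1;
    }
  if (is_lead (b))
    {
      st->lead = b;
      return 0;
    }
  *out = b;
  return 1;
}

/* Cache-first step, mirroring the shape of the patched macros: a byte
   that is a single-byte character in the initial state short-circuits
   the decoder. */
static int
step_with_cache (struct toy_state *st, unsigned char b, unsigned *out)
{
  if (!is_lead (b))
    {
      *out = b;   /* st is never consulted; a pending lead is ignored */
      return 1;
    }
  return step_with_state (st, b, out);
}
```

Feeding 0x83 and then 0x5b through step_with_state() yields the single value 0x835b. The same bytes through step_with_cache() yield a standalone 0x5b, and the pending lead byte is left behind in the state.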
As a side note, GNU sed's current implementation has a special code path for multibyte non-UTF-8 input,
so this change is unlikely to affect UTF-8 or C locales.
regards,
- assaf
This bug report was last modified 8 years and 270 days ago.