GNU bug report logs -
#24160
[PATCH 1/2] sed: cache results of mbrtowc for speed
Previous Next
To reply to this bug, email your comments to 24160 AT debbugs.gnu.org.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-sed <at> gnu.org
:
bug#24160
; Package
sed
.
(Fri, 05 Aug 2016 13:52:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
New bug report received and forwarded. Copy sent to
bug-sed <at> gnu.org
.
(Fri, 05 Aug 2016 13:52:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hi,
We can speeds up sed by caching result of result mbrtowc() for single
byte characters. It is effective especially in non-UTF8 multibyte
locales which is expensive calculatation.
$ yes $(printf %040d 0) | head -1000000 >k
Before:
$ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
real 1.93
user 1.61
sys 0.27
After patching
$ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
real 0.46
user 0.42
sys 0.03
Thanks,
Norihiro
[0001-sed-cache-results-of-mbrtowc-for-speed.patch (text/plain, attachment)]
Information forwarded
to
bug-sed <at> gnu.org
:
bug#24160
; Package
sed
.
(Fri, 05 Aug 2016 14:47:01 GMT)
Full text and
rfc822 format available.
Message #8 received at 24160 <at> debbugs.gnu.org (full text, mbox):
Hello Norihiro,
Thank you for the patch.
On 08/05/2016 09:51 AM, Norihiro Tanaka wrote:
> We can speeds up sed by caching result of result mbrtowc() for single
> byte characters. It is effective especially in non-UTF8 multibyte
> locales which is expensive calculatation.
Regarding this:
====
#define MBRTOWC(pwc, s, n, ps) \
- (mb_cur_max == 1 ? \
- (*(pwc) = btowc (*(unsigned char *) (s)), 1) : \
+ (mbrlen_cache[*(unsigned char *) (s)] == 1 ? \
+ (*(pwc) = mbrtowc_cache[*(unsigned char *) (s)], 1) : \
mbrtowc ((pwc), (s), (n), (ps)))
#define MBRLEN(s, n, ps) \
- (mb_cur_max == 1 ? 1 : mbrtowc (NULL, s, n, ps))
+ (mbrlen_cache[*(unsigned char *) (s)] == 1 ? \
+ 1 : mbrtowc (NULL, s, n, ps))
====
By using a cache table, isn't this code ignoring mbstate ?
For example, in shift-jis encoding, the character '[' can either be standalone,
or a second character in a sequence such as '\x83\x5b' ?
Wouldn't it also prevent detection of invalid sequences ?
As a side-note, gnu sed's current implementation has special code path for multibyte-non-utf8 input,
so this change will not likely affect utf8 or C locales.
regards,
- assaf
Information forwarded
to
bug-sed <at> gnu.org
:
bug#24160
; Package
sed
.
(Sat, 06 Aug 2016 07:14:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 24160 <at> debbugs.gnu.org (full text, mbox):
On Fri, 5 Aug 2016 10:45:59 -0400
Assaf Gordon <assafgordon <at> gmail.com> wrote:
> Hello Norihiro,
>
> Thank you for the patch.
>
> By using a cache table, isn't this code ignoring mbstate ?
> For example, in shift-jis encoding, the character '[' can either be standalone,
> or a second character in a sequence such as '\x83\x5b' ?
> Wouldn't it also prevent detection of invalid sequences ?
>
> As a side-note, gnu sed's current implementation has special code path for multibyte-non-utf8 input,
> so this change will not likely affect utf8 or C locales.
>
> regards,
> - assaf
Hi Assaf,
Thanks for review.
When MBRTOWC() or MBRLEN() are called in shift-jis, mbstate is always
initial state or the equivalent to a state with initial state except
invalid sequence and incomplete sequence found, as shift-jis is
state-less encoding.
Even if their sequences were found, mbstate should be set to initial
state manually to check following characters in the string. So I think
that we can ignore mbstate in state-less encoding.
However, the assumption is wrong for state-full encoding as ISO-2022 and
UTF-7. Does sed support state-full encoding which has shift sequence?
At least, It seems that regex does not support state-full encoding.
Thanks,
Norihiro
Information forwarded
to
bug-sed <at> gnu.org
:
bug#24160
; Package
sed
.
(Mon, 19 Sep 2016 02:33:01 GMT)
Full text and
rfc822 format available.
Message #14 received at 24160 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Fri, 05 Aug 2016 22:51:16 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Hi,
>
> We can speeds up sed by caching result of result mbrtowc() for single
> byte characters. It is effective especially in non-UTF8 multibyte
> locales which is expensive calculatation.
>
> $ yes $(printf %040d 0) | head -1000000 >k
>
> Before:
>
> $ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
> real 1.93
> user 1.61
> sys 0.27
>
> After patching
>
> $ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
> real 0.46
> user 0.42
> sys 0.03
>
> Thanks,
> Norihiro
I rewrote the patch as using localeinfo in gnulib.
[0001-sed-use-cache-provided-by-localeinfo-for-mbrtowc-and.patch (text/plain, attachment)]
This bug report was last modified 8 years and 269 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.