GNU bug report logs - #24160
[PATCH 1/2] sed: cache results of mbrtowc for speed

Previous Next

Package: sed;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Fri, 5 Aug 2016 13:52:02 UTC

Severity: normal

Tags: patch

To reply to this bug, email your comments to 24160 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-sed <at> gnu.org:
bug#24160; Package sed. (Fri, 05 Aug 2016 13:52:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Fri, 05 Aug 2016 13:52:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-sed <at> gnu.org>
Subject: [PATCH 1/2] sed: cache results of mbrtowc for speed
Date: Fri, 05 Aug 2016 22:51:16 +0900
[Message part 1 (text/plain, inline)]
Hi,

We can speeds up sed by caching result of result mbrtowc() for single
byte characters.  It is effective especially in non-UTF8 multibyte
locales which is expensive calculatation.

$ yes $(printf %040d 0) | head -1000000 >k

Before:

$ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
real 1.93
user 1.61
sys 0.27

After patching

$ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
real 0.46
user 0.42
sys 0.03

Thanks,
Norihiro
[0001-sed-cache-results-of-mbrtowc-for-speed.patch (text/plain, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#24160; Package sed. (Fri, 05 Aug 2016 14:47:01 GMT) Full text and rfc822 format available.

Message #8 received at 24160 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 24160 <at> debbugs.gnu.org
Subject: Re: bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
Date: Fri, 5 Aug 2016 10:45:59 -0400
Hello Norihiro,

Thank you for the patch.

On 08/05/2016 09:51 AM, Norihiro Tanaka wrote:
> We can speeds up sed by caching result of result mbrtowc() for single
> byte characters.  It is effective especially in non-UTF8 multibyte
> locales which is expensive calculatation.

Regarding this:
====
 #define MBRTOWC(pwc, s, n, ps) \
-  (mb_cur_max == 1 ? \
-   (*(pwc) = btowc (*(unsigned char *) (s)), 1) : \
+  (mbrlen_cache[*(unsigned char *) (s)] == 1 ? \
+   (*(pwc) = mbrtowc_cache[*(unsigned char *) (s)], 1) : \
    mbrtowc ((pwc), (s), (n), (ps)))
 
 #define MBRLEN(s, n, ps) \
-  (mb_cur_max == 1 ? 1 : mbrtowc (NULL, s, n, ps))
+  (mbrlen_cache[*(unsigned char *) (s)] == 1 ? \
+   1 : mbrtowc (NULL, s, n, ps))
====

By using a cache table, isn't this code ignoring mbstate ?
For example, in shift-jis encoding, the character '[' can either be standalone,
or a second character in a sequence such as '\x83\x5b' ?
Wouldn't it also prevent detection of invalid sequences ?

As a side-note, gnu sed's current implementation has special code path for multibyte-non-utf8 input,
so this change will not likely affect utf8 or C locales.

regards,
 - assaf





Information forwarded to bug-sed <at> gnu.org:
bug#24160; Package sed. (Sat, 06 Aug 2016 07:14:02 GMT) Full text and rfc822 format available.

Message #11 received at 24160 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 24160 <at> debbugs.gnu.org
Subject: Re: bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
Date: Sat, 06 Aug 2016 16:13:27 +0900
On Fri, 5 Aug 2016 10:45:59 -0400
Assaf Gordon <assafgordon <at> gmail.com> wrote:

> Hello Norihiro,
> 
> Thank you for the patch.
> 
> By using a cache table, isn't this code ignoring mbstate ?
> For example, in shift-jis encoding, the character '[' can either be standalone,
> or a second character in a sequence such as '\x83\x5b' ?
> Wouldn't it also prevent detection of invalid sequences ?
> 
> As a side-note, gnu sed's current implementation has special code path for multibyte-non-utf8 input,
> so this change will not likely affect utf8 or C locales.
> 
> regards,
>   - assaf

Hi Assaf,

Thanks for review.

When MBRTOWC() or MBRLEN() are called in shift-jis, mbstate is always
initial state or the equivalent to a state with initial state except
invalid sequence and incomplete sequence found, as shift-jis is
state-less encoding.

Even if their sequences were found, mbstate should be set to initial
state manually to check following characters in the string.  So I think
that we can ignore mbstate in state-less encoding.

However, the assumption is wrong for state-full encoding as ISO-2022 and
UTF-7.  Does sed support state-full encoding which has shift sequence?
At least, It seems that regex does not support state-full encoding.

Thanks,
Norihiro





Information forwarded to bug-sed <at> gnu.org:
bug#24160; Package sed. (Mon, 19 Sep 2016 02:33:01 GMT) Full text and rfc822 format available.

Message #14 received at 24160 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 24160 <at> debbugs.gnu.org
Subject: Re: bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
Date: Mon, 19 Sep 2016 11:32:20 +0900
[Message part 1 (text/plain, inline)]
On Fri, 05 Aug 2016 22:51:16 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> Hi,
> 
> We can speeds up sed by caching result of result mbrtowc() for single
> byte characters.  It is effective especially in non-UTF8 multibyte
> locales which is expensive calculatation.
> 
> $ yes $(printf %040d 0) | head -1000000 >k
> 
> Before:
> 
> $ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
> real 1.93
> user 1.61
> sys 0.27
> 
> After patching
> 
> $ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
> real 0.46
> user 0.42
> sys 0.03
> 
> Thanks,
> Norihiro

I rewrote the patch as using localeinfo in gnulib.
[0001-sed-use-cache-provided-by-localeinfo-for-mbrtowc-and.patch (text/plain, attachment)]

This bug report was last modified 8 years and 269 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.