Package: gzip
Reported by: l00374334 <liqiang64 <at> huawei.com>
Date: Tue, 26 May 2020 05:18:02 UTC
Severity: normal
Tags: patch, wontfix
To reply to this bug, email your comments to 41535 AT debbugs.gnu.org.
Report forwarded to bug-gzip <at> gnu.org (bug#41535; Package gzip).
(Tue, 26 May 2020 05:18:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to l00374334 <liqiang64 <at> huawei.com> (bug#41535; Package gzip).
(Tue, 26 May 2020 05:18:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
From: l00374334 <liqiang64 <at> huawei.com>
To: <bug-gzip <at> gnu.org>, <eggert <at> cs.ucla.edu>
Cc: luanjianhai <at> huawei.com, liqiang64 <at> huawei.com, sangyan <at> huawei.com, luchunhua <at> huawei.com
Subject: [PATCH] performance optimization for aarch64
Date: Tue, 26 May 2020 10:39:40 +0800
From: liqiang <liqiang64 <at> huawei.com>

By analyzing the compression and decompression process of gzip, I found
that the CRC32 and longest_match functions are significant hot spots.

On the aarch64 architecture, we can optimize the efficiency of crc32
through the interface provided by the NEON instruction set (about 12x
faster on aarch64), and optimize the performance of random-access code
through prefetch instructions (about a 5%~8% improvement). In some
compression scenarios, loop unrolling can also give a certain
performance improvement (about 10%).

Modified by Li Qiang.
---
 configure | 14 ++++++++++++++
 deflate.c | 30 +++++++++++++++++++++++++++++-
 util.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/configure b/configure
index cab3daf..dc80cb6 100644
--- a/configure
+++ b/configure
@@ -14555,6 +14555,20 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
       ;;
    arm* | aarch64 )
+      cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#if defined __ARM_NEON__ || defined __ARM_NEON
+  int ok;
+  #else
+  error fail
+  #endif
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"
+then :
+  CFLAGS="$CFLAGS -march=armv8-a+crc"
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
       # Assume arm with EABI.
       # On arm64 systems, the C compiler may be generating code in one of
       # these ABIs:
diff --git a/deflate.c b/deflate.c
index 9d379e9..ee77ffd 100644
--- a/deflate.c
+++ b/deflate.c
@@ -378,6 +378,9 @@ longest_match(IPos cur_match)
     register int len;                           /* length of current match */
     int best_len = prev_length;                 /* best match length so far */
     IPos limit = strstart > (IPos)MAX_DIST ? strstart - (IPos)MAX_DIST : NIL;
+#ifdef __aarch64__
+    IPos next_match;
+#endif
     /* Stop when cur_match becomes <= limit. To simplify the code,
      * we prevent matches with the string of window index 0.
      */
@@ -411,6 +414,10 @@ longest_match(IPos cur_match)
     do {
         Assert(cur_match < strstart, "no future");
         match = window + cur_match;
+#ifdef __aarch64__
+        next_match = prev[cur_match & WMASK];
+        __asm__("PRFM PLDL1STRM, [%0]"::"r"(&(prev[next_match & WMASK])));
+#endif

         /* Skip to next match if the match length cannot increase
          * or if the match length is less than 2:
@@ -488,8 +495,14 @@ longest_match(IPos cur_match)
             scan_end = scan[best_len];
 #endif
         }
-    } while ((cur_match = prev[cur_match & WMASK]) > limit
+    }
+#ifdef __aarch64__
+    while ((cur_match = next_match) > limit
+             && --chain_length != 0);
+#else
+    while ((cur_match = prev[cur_match & WMASK]) > limit
              && --chain_length != 0);
+#endif

     return best_len;
 }
@@ -777,7 +790,22 @@ deflate (int pack_level)
             lookahead -= prev_length-1;
             prev_length -= 2;
             RSYNC_ROLL(strstart, prev_length+1);
+            while (prev_length >= 4) {
+                /* After actual verification, expanding this loop
+                 * can improve its performance in certain scenarios.
+                 */
+                prev_length -= 4;
+                strstart++;
+                INSERT_STRING(strstart, hash_head);
+                strstart++;
+                INSERT_STRING(strstart, hash_head);
+                strstart++;
+                INSERT_STRING(strstart, hash_head);
+                strstart++;
+                INSERT_STRING(strstart, hash_head);
+            }
             do {
+                if (prev_length == 0) break;
                 strstart++;
                 INSERT_STRING(strstart, hash_head);
                 /* strstart never exceeds WSIZE-MAX_MATCH, so there are
diff --git a/util.c b/util.c
index 0a0fc21..c9f0e52 100644
--- a/util.c
+++ b/util.c
@@ -38,6 +38,12 @@

 static int write_buffer (int, voidp, unsigned int);

+#if defined __ARM_NEON__ || defined __ARM_NEON
+#define CRC32D(crc, val) __asm__("crc32x %w[c], %w[c], %x[v]":[c]"+r"(crc):[v]"r"(val))
+#define CRC32W(crc, val) __asm__("crc32w %w[c], %w[c], %w[v]":[c]"+r"(crc):[v]"r"(val))
+#define CRC32H(crc, val) __asm__("crc32h %w[c], %w[c], %w[v]":[c]"+r"(crc):[v]"r"(val))
+#define CRC32B(crc, val) __asm__("crc32b %w[c], %w[c], %w[v]":[c]"+r"(crc):[v]"r"(val))
+#else
 /* ========================================================================
  * Table of CRC-32's of all single-byte values (made by makecrc.c)
  */
@@ -95,6 +101,7 @@ static const ulg crc_32_tab[] = {
   0x5d681b02L, 0x2a6f2b94L, 0xb40bbe37L, 0xc30c8ea1L, 0x5a05df1bL,
   0x2d02ef8dL
 };
+#endif

 /* Shift register contents. */
 static ulg crc = 0xffffffffL;

@@ -132,6 +139,43 @@ ulg updcrc(s, n)
     const uch *s;           /* pointer to bytes to pump through */
     unsigned n;             /* number of bytes in s[] */
 {
+#if defined __ARM_NEON__ || defined __ARM_NEON
+    register ulg c;
+    static ulg crc = (ulg)0xffffffffL;
+    register const uint8_t *buf1;
+    register const uint16_t *buf2;
+    register const uint32_t *buf4;
+    register const uint64_t *buf8;
+    int64_t length = (int64_t)n;
+    buf8 = (const uint64_t *)(const void *)s;
+
+    if (s == NULL) {
+        c = 0xffffffffL;
+    } else {
+        c = crc;
+        while(length >= sizeof(uint64_t)) {
+            CRC32D(c, *buf8++);
+            length -= sizeof(uint64_t);
+        }
+        buf4 = (const uint32_t *)(const void *)buf8;
+        if (length >= sizeof(uint32_t)) {
+            CRC32W(c, *buf4++);
+            length -= sizeof(uint32_t);
+        }
+        buf2 = (const uint16_t *)(const void *)buf4;
+        if(length >= sizeof(uint16_t)) {
+            CRC32H(c, *buf2++);
+            length -= sizeof(uint16_t);
+        }
+        buf1 = (const uint8_t *)(const void *)buf2;
+        if (length >= sizeof(uint8_t)) {
+            CRC32B(c, *buf1);
+            length -= sizeof(uint8_t);
+        }
+    }
+    crc = c;
+    return (c ^ 0xffffffffL);
+#else
     register ulg c;         /* temporary variable */

     if (s == NULL) {
@@ -144,6 +188,7 @@ ulg updcrc(s, n)
     }
     crc = c;
     return c ^ 0xffffffffL; /* (instead of ~c for 64-bit machines) */
+#endif
 }

 /* Return a current CRC value. */
-- 
2.17.1
Information forwarded to bug-gzip <at> gnu.org (bug#41535; Package gzip).
(Sat, 30 May 2020 09:19:02 GMT) Full text and rfc822 format available.

Message #8 received at 41535 <at> debbugs.gnu.org (full text, mbox):
From: Li Qiang <liqiang64 <at> huawei.com>
To: <41535 <at> debbugs.gnu.org>
Cc: luanjianhai <at> huawei.com, eggert <at> cs.ucla.edu, sangyan <at> huawei.com, colordev.jiang <at> huawei.com, luchunhua <at> huawei.com, huxinwei <at> huawei.com, meyering <at> fb.com
Subject: bug#41535: [PATCH] performance optimization for aarch64
Date: Sat, 30 May 2020 17:17:49 +0800
On 2020/5/26 10:39, l00374334 wrote:
> [... full text of the patch, quoted verbatim from message #5 above ...]

Please allow me to show a set of actual test data for this patch.

First, I made an original version of the program "gzip-1.10" based
on the gzip-1.10 source code, and then made an optimized version of
the program "gzip-optimized" after applying my optimization patch.
Next I use the gzip-1.10 version to test the compression and
decompression time on some XML files:

[XML]# time ./gzip-1.10 *.xml

real    0m5.099s
user    0m4.384s
sys     0m0.176s

[XML]# time ./gzip-1.10 -d *.gz

real    0m2.173s
user    0m1.821s
sys     0m0.348s

Then use the optimized version to compare:

[XML]# time ./gzip-optimized *.xml

real    0m2.785s
user    0m2.576s
sys     0m0.204s

[XML]# time ./gzip-optimized -d *.gz

real    0m0.497s
user    0m0.176s
sys     0m0.320s

The next test object is a large log file:

[LOG]# time ./gzip-1.10 *.log

real    0m8.883s
user    0m8.652s
sys     0m0.217s

[LOG]# time ./gzip-1.10 -d *.gz

real    0m3.049s
user    0m2.604s
sys     0m0.439s

Again using the optimized version to compare:

[LOG]# time ./gzip-optimized *.log

real    0m6.882s
user    0m6.607s
sys     0m0.264s

[LOG]# time ./gzip-optimized -d *.gz

real    0m1.054s
user    0m0.622s
sys     0m0.431s

The above experimental data are from the aarch64 platform.

-- 
Best regards,
Li Qiang
Information forwarded to bug-gzip <at> gnu.org (bug#41535; Package gzip).
(Thu, 20 Aug 2020 08:56:01 GMT) Full text and rfc822 format available.

Message #11 received at 41535 <at> debbugs.gnu.org (full text, mbox):
From: Li Qiang <liqiang64 <at> huawei.com>
To: <41535 <at> debbugs.gnu.org>
Cc: meyering <at> fb.com, eggert <at> cs.ucla.edu
Subject: Re: bug#41535: [PATCH] performance optimization for aarch64
Date: Thu, 20 Aug 2020 16:55:26 +0800
On 2020/5/30 17:17, Li Qiang wrote:
> [... the patch and benchmark results, quoted verbatim from messages #5 and #8 above ...]

Gentle ping. : )

-- 
Best regards,
Li Qiang
Information forwarded to bug-gzip <at> gnu.org (bug#41535; Package gzip).
(Sun, 29 Aug 2021 08:49:01 GMT) Full text and rfc822 format available.

Message #14 received at 41535 <at> debbugs.gnu.org (full text, mbox):
From: Jim Meyering <jim <at> meyering.net>
To: Li Qiang <liqiang64 <at> huawei.com>
Cc: luanjianhai <at> huawei.com, Paul Eggert <eggert <at> cs.ucla.edu>, sangyan <at> huawei.com, colordev.jiang <at> huawei.com, luchunhua <at> huawei.com, 41535 <at> debbugs.gnu.org, huxinwei <at> huawei.com, Jim Meyering <meyering <at> fb.com>
Subject: Re: bug#41535: [PATCH] performance optimization for aarch64
Date: Sun, 29 Aug 2021 10:47:45 +0200
On Sat, May 30, 2020 at 11:19 AM Li Qiang <liqiang64 <at> huawei.com> wrote:
> [... the original patch description and diffstat, quoted from message #5 above ...]

Thank you for that work and sorry for the delay in responding. However,
for now I prefer not to apply it. I'd prefer to see arch-specific
optimizations added to libz in the hope (perhaps naive) that someone
will find time to make gzip use libz.
Request from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org.
(Tue, 05 Apr 2022 01:37:02 GMT) Full text and rfc822 format available.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.