GNU bug report logs - #16472
PD: gzip - tree overflow protection algorithm doubt/suggestion

Previous Next

Package: gzip;

Reported by: "Krzysztof Rybak" <krzysztofrybak6 <at> wp.pl>

Date: Thu, 16 Jan 2014 23:52:02 UTC

Severity: normal

Full log


Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Krzysztof Rybak" <krzysztofrybak6 <at> wp.pl>
To: bug-gzip <at> gnu.org
Subject: PD: gzip - tree overflow protection algorithm doubt/suggestion
Date: Thu, 16 Jan 2014 22:55:25 +0100
[Message part 1 (text/plain, inline)]
Hi,
I sent a message to the author of gzip but without response. Maybe You can help me with my doubts.

best regards,
Krzysztof

Dnia Czwartek, 2 Stycznia 2014 22:41 Krzysztof Rybak <krzysztofrybak6 <at> wp.pl> napisaƂ(a) 
> Hello Jean-loup,

> I'm preparing my own algorithm based on Huffman compression method and deflate format. 

> 

> I have doubts if method implemented in gzip program is 100% correct. 

> I mean tree overflow protection:

> just to remind: in deflate (RFC1951) and therefore in gzip there is creating tree for lit/len, dist (maximum length of codes is 15 bits) and code table (maximum length is 7 bits). In case of too long code word, there is tree modification implemented in gen_bitlen(desc) function (trees.c):

> 

>     do {

>         bits = max_length-1;

>         while (bl_count[bits] == 0) bits--;

>         bl_count[bits]--;      /* move one leaf down the tree */

>         bl_count[bits+1] += 2; /* move one overflow item as its brother */

>         bl_count[max_length]--;

>         overflow -= 2;

>     } while (overflow > 0);

> 

> I'm not sure if this method is correct for original code length greater than 8 ( for maximum 7 ) and than 16 ( for maximum 15): even not every time. I couldn't obtain such symbol distribution to generate so long codes on real files indeed as such situation is extremely rare but generated it by modifying software:

> 

> just before mentioned above overflow protection I changed bl_count histogram to one as for Fibonacci numbers (which is the worst for Huffman overflow - the longest codes, please refer to attached image): in my case I modified tree when maximum length is 7 bits (max_length = 7 is for code book compression):

>     bl_count[0] = 1;

>     bl_count[1] = 1;

>     bl_count[2] = 1;

>     bl_count[3] = 1;

>     bl_count[4] = 1;

>     bl_count[5] = 1;

>     bl_count[6] = 1;

>     bl_count[7] = 1;

>     bl_count[8] = 1;

>     bl_count[9] = 1;

>     bl_count[10] = 2;

>     overflow = 4;

> 

> in fact in gzip there is setting length = 7 when it's longer than 7:

> if (bits > max_length){

>         bits = max_length;

>         overflow++;

> }

> 

> ,so my histogram looks like:

>     bl_count[0] = 1;

>     bl_count[1] = 1;

>     bl_count[2] = 1;

>     bl_count[3] = 1;

>     bl_count[4] = 1;

>     bl_count[5] = 1;

>     bl_count[6] = 1;

>     bl_count[7] = 5;

>     bl_count[8] = 0;

>     bl_count[9] = 0;

>     overflow = 4;

> and after overflow protection method from gzip:

>     bl_count[0] = 1

>     bl_count[1] = 1

>     bl_count[2] = 1

>     bl_count[3] = 1

>     bl_count[4] = 1

>     bl_count[5] = 0

>     bl_count[6] = 2

>     bl_count[7] = 5

> , which is incorrect Huffman tree: too many elements. 

> 

> trees.c with my modification and printf() function is attached. I compress obj2 from Calgary corpus.

> I know my description may look unclear, but I hope You will help me/clarify my doubts.  

> 

> best regards,

> Krzysztof Rybak

[Fibonacci.png (image/png, attachment)]
[trees.c (text/x-csrc, attachment)]

This bug report was last modified 11 years and 151 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.