Hi, I sent a message to the author of gzip but without response. Maybe You can help me with my doubts. best regards, Krzysztof Dnia Czwartek, 2 Stycznia 2014 22:41 Krzysztof Rybak napisał(a) > Hello Jean-loup, > I'm preparing my own algorithm based on Huffman compression method and deflate format. > > I have doubts if method implemented in gzip program is 100% correct. > I mean tree overflow protection: > just to remind: in deflate (RFC1951) and therefore in gzip there is creating tree for lit/len, dist (maximum length of codes is 15 bits) and code table (maximum length is 7 bits). In case of too long code word, there is tree modification implemented in gen_bitlen(desc) function (trees.c): > > do { > bits = max_length-1; > while (bl_count[bits] == 0) bits--; > bl_count[bits]--; /* move one leaf down the tree */ > bl_count[bits+1] += 2; /* move one overflow item as its brother */ > bl_count[max_length]--; > overflow -= 2; > } while (overflow > 0); > > I'm not sure if this method is correct for original code length greater than 8 ( for maximum 7 ) and than 16 ( for maximum 15): even not every time. I couldn't obtain such symbol distribution to generate so long codes on real files indeed as such situation is extremely rare but generated it by modifying software: > > just before mentioned above overflow protection I changed bl_count histogram to one as for Fibonacci numbers (which is the worst for Huffman overflow - the longest codes, please refer to attached image): in my case I modified tree when maximum length is 7 bits (max_length = 7 is for code book compression): > bl_count[0] = 1; > bl_count[1] = 1; > bl_count[2] = 1; > bl_count[3] = 1; > bl_count[4] = 1; > bl_count[5] = 1; > bl_count[6] = 1; > bl_count[7] = 1; > bl_count[8] = 1; > bl_count[9] = 1; > bl_count[10] = 2; > overflow = 4; > > in fact in gzip there is setting length = 7 when it's longer than 7: > if (bits > max_length){ > bits = max_length; > overflow++; > } > > ,so my histogram looks like: > bl_count[0] = 1; > bl_count[1] = 1; > bl_count[2] = 1; > bl_count[3] = 1; > bl_count[4] = 1; > bl_count[5] = 1; > bl_count[6] = 1; > bl_count[7] = 5; > bl_count[8] = 0; > bl_count[9] = 0; > overflow = 4; > and after overflow protection method from gzip: > bl_count[0] = 1 > bl_count[1] = 1 > bl_count[2] = 1 > bl_count[3] = 1 > bl_count[4] = 1 > bl_count[5] = 0 > bl_count[6] = 2 > bl_count[7] = 5 > , which is incorrect Huffman tree: too many elements. > > trees.c with my modification and printf() function is attached. I compress obj2 from Calgary corpus. > I know my description may look unclear, but I hope You will help me/clarify my doubts. > > best regards, > Krzysztof Rybak