GNU bug report logs - #30719
Progressively compressing piped input


Package: gzip

Reported by: "Garreau, Alexandre" <galex-713 <at> galex-713.eu>

Date: Mon, 5 Mar 2018 21:20:02 UTC

Severity: wishlist

To reply to this bug, email your comments to 30719 AT debbugs.gnu.org.




Report forwarded to bug-gzip <at> gnu.org:
bug#30719; Package gzip. (Mon, 05 Mar 2018 21:20:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Garreau, Alexandre" <galex-713 <at> galex-713.eu>:
New bug report received and forwarded. Copy sent to bug-gzip <at> gnu.org. (Mon, 05 Mar 2018 21:20:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Garreau, Alexandre" <galex-713 <at> galex-713.eu>
To: bug-gzip <at> gnu.org
Subject: Progressively compressing piped input
Date: Mon, 05 Mar 2018 22:18:53 +0100
Hi,

I have a script whose logged output is very repetitive text (mostly the
output of ping and date). To minimize disk usage, I thought of piping it
to gzip -9. Then I realized that the log, contrary to before, remained
empty, and recalled the GNU policy of “reading all input and only then
outputting” to maximize overall speed at the expense of ever-cheaper
memory.

Yet I want to run that script all the time and be able to kill it
dirtily, or just shut down, without losing all its output (nor am I sure
it is good practice anyway to keep everything in RAM until shutdown,
though I suppose gzip only keeps the compressed output in memory,
discarding the then-useless input), and to “tail -f” the files it
writes.

I guess piping the whole output is the way to go to achieve optimal
compression, since just gzipping each line or each command’s output
wouldn’t compress as well (the repetition occurs among the lines, not
inside them). Yet would there be a way to obtain this maximal
compression while having gzip emit output each time I stop feeding it
input (as I do every 30 seconds or so), without having to save the
uncompressed file, and without recompressing the whole file several
times?

I mean, it seems to me a good thing to wait until everything is
compressed before outputting, rather than outputting as soon as
possible, but isn’t there a way to trigger the output each time the
input so far has been processed and no more has arrived for a certain
amount of time (that is, ~30s)?

Am I looking at something like this:
[sample.sh (text/x-sh, inline)]
#!/bin/bash
while ping -c1 gnu.org ; do
    date --rfc-3339=seconds
    sleep 30
done | gzip -9 -f | tee sample.log | zcat

Information forwarded to bug-gzip <at> gnu.org:
bug#30719; Package gzip. (Mon, 05 Mar 2018 22:55:02 GMT) Full text and rfc822 format available.

Message #8 received at 30719 <at> debbugs.gnu.org (full text, mbox):

From: Mark Adler <madler <at> alumni.caltech.edu>
To: "Garreau, Alexandre" <galex-713 <at> galex-713.eu>
Cc: 30719 <at> debbugs.gnu.org
Subject: Re: bug#30719: Progressively compressing piped input
Date: Mon, 5 Mar 2018 14:54:21 -0800
deflate has an inherent latency that accumulates enough data in order to efficiently emit each deflate block. You can deliberately flush (with zlib, not gzip), but if you do that too frequently, e.g. each line, then you will get lousy compression or even expansion.
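The flush Mark describes is a zlib-level operation, not exposed by the gzip tool; for readers who want to see it from a scripting language, here is a minimal sketch using Python's standard zlib module (the ping-style sample line is made up, the zlib calls are standard):

```python
import zlib

# wbits=31 asks zlib for a gzip wrapper around the deflate stream.
comp = zlib.compressobj(9, zlib.DEFLATED, 31)

data = b"64 bytes from gnu.org: icmp_seq=1 ttl=52 time=98.4 ms\n"  # made-up line
out = comp.compress(data)
# Z_SYNC_FLUSH forces the pending deflate block out, so everything fed so
# far becomes decodable -- at the cost of a few wasted bytes per flush.
out += comp.flush(zlib.Z_SYNC_FLUSH)

# The flushed prefix decompresses even though the gzip trailer is missing:
d = zlib.decompressobj(31)
recovered = d.decompress(out)
print(recovered)
```

Without the Z_SYNC_FLUSH call, `out` would typically hold only the gzip header at this point, which is exactly the "empty log" behavior described in the original report.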

I wrote something called gzlog (https://github.com/madler/zlib/blob/master/examples/gzlog.h <https://github.com/madler/zlib/blob/master/examples/gzlog.h>), intended to solve this problem. It can take a small amount of input, e.g. a line, and update the output gzip file to be complete and valid after each line, yet also get good compression in the long run. It does this by writing the lines to the log.gz file effectively uncompressed (deflate has a “stored” block type), until it has accumulated, say, 1 MB of data. Then it goes back and compresses that uncompressed 1 MB, again always leaving the gzip file in a valid state. gzlog also maintains something like a journal, which allows gzlog to repair the gzip file if the last operation was interrupted, e.g. by a power failure.
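gzlog itself is a C header, but one property it builds on can be sketched from a script: the gzip format allows complete members to be concatenated, so appending one self-contained member per batch keeps the file valid after every write, at a cost in compression since each member starts from scratch. A Python sketch of that idea; `append_batch` and the file name are invented for illustration:

```python
import gzip

def append_batch(path: str, data: bytes) -> None:
    # Append one complete gzip member; the file stays valid after each call.
    with open(path, "ab") as f:
        f.write(gzip.compress(data, compresslevel=9))

path = "sample.log.gz"
open(path, "wb").close()  # start fresh for the demo

append_batch(path, b"2018-03-05 22:18:53+01:00\n")
append_batch(path, b"2018-03-05 22:19:23+01:00\n")

# gzip.open transparently reads across member boundaries:
with gzip.open(path, "rb") as f:
    print(f.read().decode(), end="")
```

Unlike gzlog, this shares no dictionary between batches and keeps no journal for crash repair; it only illustrates the always-valid-file idea.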


Information forwarded to bug-gzip <at> gnu.org:
bug#30719; Package gzip. (Tue, 06 Mar 2018 22:08:02 GMT) Full text and rfc822 format available.

Message #11 received at 30719 <at> debbugs.gnu.org (full text, mbox):

From: "Garreau, Alexandre" <galex-713 <at> galex-713.eu>
To: Mark Adler <madler <at> alumni.caltech.edu>
Cc: 30719 <at> debbugs.gnu.org
Subject: Re: bug#30719: Progressively compressing piped input
Date: Tue, 06 Mar 2018 22:58:56 +0100
On 05/03/2018 at 14:54, Mark Adler wrote:
> deflate has an inherent latency that accumulates enough data in order
> to efficiently emit each deflate block. You can deliberately flush
> (with zlib, not gzip), but if you do that too frequently, e.g. each
> line, then you will get lousy compression or even expansion.

Even if the main repetition is between the lines? Like if 80% of one
half of each line, and 70% of the other half, are the same from line to
line? Like in a while loop with only ping and date? I thought of it as a
very lazy way of not having to strip out all the redundant output caused
by the use of ASCII and by words or similar patterns recurring over and
over.

> I wrote something called gzlog
> (https://github.com/madler/zlib/blob/master/examples/gzlog.h
> <https://github.com/madler/zlib/blob/master/examples/gzlog.h>),
> intended to solve this problem. It can take a small amount of input,
> e.g. a line, and update the output gzip file to be complete and valid
> after each line, yet also get good compression in the long run. It
> does this by writing the lines to the log.gz file effectively
> uncompressed (deflate has a “stored” block type), until it has
> accumulated, say, 1 MB of data. Then it goes back and compresses that
> uncompressed 1 MB, again always leaving the gzip file in a valid
> state. gzlog also maintains something like a journal, which allows
> gzlog to repair the gzip file if the last operation was interrupted,
> e.g. by a power failure.

I was rather looking for something usable as a shell utility (since this
is a dirty, high-level, low-frequency, medium-term task) rather than a C
library, yet that’s quite interesting, at least in demonstrating the
flexibility of gzip…

>> #!/bin/bash
>> while ping -c1 gnu.org ; do
>>    date --rfc-3339=seconds
>>    sleep 30
>> done | gzip -9 -f | tee sample.log | zcat

Maybe the only way to go is just gzipping everything each time a log is
rotated, the standard way, if that pipe thing cannot be done even with
each line being almost the same…




Information forwarded to bug-gzip <at> gnu.org:
bug#30719; Package gzip. (Wed, 07 Mar 2018 02:13:02 GMT) Full text and rfc822 format available.

Message #14 received at 30719 <at> debbugs.gnu.org (full text, mbox):

From: Mark Adler <madler <at> alumni.caltech.edu>
To: "Garreau, Alexandre" <galex-713 <at> galex-713.eu>
Cc: 30719 <at> debbugs.gnu.org
Subject: Re: bug#30719: Progressively compressing piped input
Date: Tue, 6 Mar 2018 18:11:51 -0800
> On Mar 6, 2018, at 1:58 PM, Garreau, Alexandre <galex-713 <at> galex-713.eu> wrote:
> 
> On 05/03/2018 at 14:54, Mark Adler wrote:
>> deflate has an inherent latency that accumulates enough data in order
>> to efficiently emit each deflate block. You can deliberately flush
>> (with zlib, not gzip), but if you do that too frequently, e.g. each
>> line, then you will get lousy compression or even expansion.
> 
> Even if the main repetition is between the lines? Like if 80% of one
> half of each line, and 70% of the other half, are the same from line to
> line? Like in a while loop with only ping and date? I thought of it as a
> very lazy way of not having to strip out all the redundant output caused
> by the use of ASCII and by words or similar patterns recurring over and
> over.


Alexandre,

It has nothing to do with how much or how little or how often there is repetition. It has to do with the overhead of the header of a dynamic block that is required to describe the Huffman codes used therein. You need several thousand symbols in order to pay for the bits required for the header.
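That block-header overhead is easy to observe; here is a rough sketch with Python's zlib (the ping-like lines are synthetic) comparing one bulk stream against a sync flush after every line:

```python
import zlib

lines = [
    f"64 bytes from gnu.org: icmp_seq={i} ttl=52 time=98.{i % 10} ms\n".encode()
    for i in range(1000)
]
raw = b"".join(lines)

# One stream, flushed only at the end: blocks hold thousands of symbols.
bulk = zlib.compressobj(9)
bulk_out = bulk.compress(raw) + bulk.flush()

# Sync flush after every line: each ~60-byte line pays block overhead.
per_line = zlib.compressobj(9)
flush_out = b"".join(
    per_line.compress(line) + per_line.flush(zlib.Z_SYNC_FLUSH) for line in lines
)
flush_out += per_line.flush()

print(len(raw), len(bulk_out), len(flush_out))
```

On input like this the per-line variant comes out noticeably larger than the bulk one, though both still beat the raw size, since Z_SYNC_FLUSH preserves the sliding-window history across flushes.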

Mark





Severity set to 'wishlist' from 'normal' Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Wed, 30 Mar 2022 18:38:02 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 77 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.