GNU bug report logs - #23113
parallel gzip processes trash hard disks, need larger buffers


Package: gzip;

Reported by: "Chevreux, Bastien" <bastien.chevreux <at> dsm.com>

Date: Fri, 25 Mar 2016 18:16:01 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.



Message #26 received at 23113 <at> debbugs.gnu.org (full text, mbox):

From: "Chevreux, Bastien" <bastien.chevreux <at> dsm.com>
To: Mark Adler <madler <at> alumni.caltech.edu>
Cc: Jim Meyering <jim <at> meyering.net>,
 "23113 <at> debbugs.gnu.org" <23113 <at> debbugs.gnu.org>
Subject: RE: bug#23113: parallel gzip processes trash hard disks, need larger
 buffers
Date: Tue, 12 Apr 2016 16:55:30 +0000
Mark,

I knew about pigz, albeit not about -b, thank you for that. Together with -p 1 that would replicate gzip and implement input buffering well enough to be used in parallel pipelines (where you do not want, e.g., 40 pipelines running 40 pigz with 40 threads each).
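The invocation described above can be sketched as follows (a hypothetical helper; it assumes pigz's documented flags, where -p caps the number of worker threads and -b sets the compression block size in KiB, with pigz defaulting to 128 KiB blocks — the file name is illustrative):

```python
def pigz_argv(path, threads=1, block_kib=131072):
    """Build a pigz command line: -p caps worker threads,
    -b sets the compression block size in KiB."""
    return ["pigz", "-p", str(threads), "-b", str(block_kib), path]

# Single-threaded pigz with 128 MiB blocks, one instance per pipeline:
cmd = pigz_argv("reads.fastq", threads=1, block_kib=131072)
```

With -p 1 each pipeline gets exactly one compressor thread, so 40 parallel pipelines run 40 single-threaded compressors rather than 40 x 40 threads.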

Questions: how stable / error-proof is pigz compared to gzip? I always shied away from it, as gzip is so tried and tested that errors are unlikely ... and the zlib.net homepage does not make an "official" statement like "you should all now move to pigz, it's good and tested enough." An additional question: is there a pigzlib planned? :-)

Jim, Paul: I'd say that this thread/bug can be closed if pigz proves to be as stable / error-free as gzip. I suppose that while backporting -b to gzip could be done, it would not make much sense.

Best,
  Bastien

-- 
DSM Nutritional Products Microbia Inc | Bioinformatics
60 Westview Street | Lexington, MA 02421 | United States
Phone +1 781 259 7613 | Fax +1 781 259 0615

-----Original Message-----
From: Mark Adler [mailto:madler <at> alumni.caltech.edu] 
Sent: Sonntag, 10. April 2016 03:49
To: Chevreux, Bastien
Cc: Jim Meyering; 23113 <at> debbugs.gnu.org
Subject: Re: bug#23113: parallel gzip processes trash hard disks, need larger buffers

Bastien,

pigz (a parallel version of gzip) has a variable buffer size. The -b or --blocksize option allows up to 512 MB buffers, defaulting to 128K. See http://zlib.net/pigz/

Mark


> On Mar 29, 2016, at 4:03 PM, Chevreux, Bastien <bastien.chevreux <at> dsm.com> wrote:
> 
>> From: meyering <at> gmail.com [mailto:meyering <at> gmail.com] On Behalf Of Jim 
>> Meyering [...] However, I suggest that you consider using xz in place 
>> of gzip.
>> Not only can it compress better, it also works faster for comparable compression ratios.
> 
> xz is not a viable alternative in this case: the use case is not archiving. There is a plethora of programs out there with zlib support compiled in, and these won't work on xz-packed data. Furthermore, gzip -1 is approximately 4 times faster than xz -1 on FASTQ files (sequencing data), and the use case here is "temporary results, so ok-ish compression in a comparatively short amount of time". Gzip is ideal in that respect, as even at -1 it compresses down to ~25-35% ... and that already helps a lot when you do not need 1 TiB of hard disk but only ~350 GiB. Gzip -1 takes ~4.5 hrs, xz -1 almost a day.
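The interoperability point is easy to check with Python's standard library (a sketch with toy data; zlib-linked consumers decode gzip members via wbits = 16 + 15 = 31, but zlib has no LZMA/xz support):

```python
import gzip
import lzma
import zlib

data = b"@read1\nACGT\n+\n!!!!\n" * 1000   # toy FASTQ-like payload

gz = gzip.compress(data, compresslevel=1)
# Any zlib-linked consumer can read a gzip member (wbits=31):
assert zlib.decompress(gz, 31) == data

xz = lzma.compress(data, preset=1)
# ...whereas xz uses the LZMA container, which zlib cannot parse.
try:
    zlib.decompress(xz, 31)
except zlib.error:
    pass  # expected: not a gzip stream
```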
> 
>> That said, if you find that setting gzip.h's INBUFSIZ or OUTBUFSIZ to larger values makes a significant difference, we'd like to hear about the results and how you measured.
> 
> Changing the INBUFSIZ did not have the effect hoped for, as this is just the buffer size allocated by gzip ... but in the end it uses only 64k at most, and the calls to the file system's read() even end up requesting only 32k per call.
> 
> I traced this down through multiple layers to the function fill_window() in deflate.c, where things get really intricate, using multiple pre-set variables, defines, and memcpy()s. It became clear that the code is geared towards using a 64k buffer with a rolling window of 32k; optimised for 16-bit machines, that is.
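One hedged note on the 64k/32k layout: the DEFLATE format itself caps the LZ77 history window at 32 KiB (wbits at most 15), so the rolling window in fill_window() is a format constraint, while the size of the *input* buffer feeding it is not. A quick check with Python's zlib:

```python
import zlib

# DEFLATE's history window is capped at 32 KiB by the format itself:
# wbits may be at most 15, i.e. a 2**15-byte window.
assert zlib.MAX_WBITS == 15

# A larger input buffer is orthogonal to the window size: the
# compressor can consume arbitrarily large chunks while still
# finding matches only within the last 32 KiB.
co = zlib.compressobj(1, zlib.DEFLATED, zlib.MAX_WBITS)
out = co.compress(b"x" * 100000) + co.flush()
assert zlib.decompress(out) == b"x" * 100000
```

So larger read() buffering would not change the compressed output, only the I/O pattern.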
> 
> There are a few mentions of SMALL_MEM, MEDIUM_MEM and BIG_MEM variants via defines. However, code comments say that BIG_MEM would operate on a complete file loaded into memory ... which is a no-go for files in the range of 15 to 30 GiB. I'm not even sure the code would do what the comments say.
> 
> Long story short: I do not feel expert enough to touch said functions and change them to provide for larger input buffering. If I were forced to implement something I'd try it with an outer buffering layer, but I'm not sure it would be elegant or even efficient.
> 
> Best,
>  Bastien
> 
> PS: then again I'm toying with the idea to write a simple gzip-packer replacement which simply buffers data and passes it to zlib.
> 
> --
> DSM Nutritional Products Microbia Inc | Bioinformatics
> 60 Westview Street | Lexington, MA 02421 | United States
> Phone +1 781 259 7613 | Fax +1 781 259 0615
> 
> 
> ________________________________
> 
> DISCLAIMER:
> This e-mail is for the intended recipient only.
> If you have received it by mistake please let us know by reply and then delete it from your system; access, disclosure, copying, distribution or reliance on any of it by anyone else is prohibited.
> If you as intended recipient have received this e-mail incorrectly, please notify the sender (via e-mail) immediately.







