GNU bug report logs - #23113
parallel gzip processes trash hard disks, need larger buffers


Package: gzip

Reported by: "Chevreux, Bastien" <bastien.chevreux <at> dsm.com>

Date: Fri, 25 Mar 2016 18:16:01 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.



Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Chevreux, Bastien" <bastien.chevreux <at> dsm.com>
To: "bug-gzip <at> gnu.org" <bug-gzip <at> gnu.org>
Subject: parallel gzip processes trash hard disks, need larger buffers
Date: Fri, 25 Mar 2016 16:57:12 +0000
Hi there,

I am using gzip 1.6 to compress large files >10 GiB in parallel (Kubuntu 14.04, 12 cores). The underlying disk system (RAID 10) is able to deliver read speeds >1 GB/s (measured with flushed file caches, iostat -mx 1 100).

Here are some numbers when running gzip in parallel:
1 gzip process: the CPU is the bottleneck and its utilisation is 100%.
2 gzips in parallel: disk throughput drops to a meagre 70 MB/s and CPU utilisation per process is at ~60%.
6 gzips in parallel: disk throughput fluctuates between 50 and 60 MB/s and CPU utilisation per process is at ~18-20%.

Running 6 gzips in parallel on the same data residing on an SSD: 100% CPU utilisation per process.
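
For reference, the parallel runs are simply several independent gzip instances on separate files, roughly like this (file names are placeholders; -k keeps the originals and is available in gzip 1.6):

  # start 6 gzip processes on separate large input files
  for i in 1 2 3 4 5 6; do
    gzip -k file$i.dat &
  done
  # in another terminal, watch per-device throughput and utilisation
  iostat -mx 1 100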

A bit of googling turned up a SuperUser thread where someone already saw the same behaviour with a single disk: it normally does 125 MB/s, but running 4 gzips drops that to 25 MB/s:
http://superuser.com/questions/599329/why-is-gzip-slow-despite-cpu-and-hard-drive-performance-not-being-maxed-out

The posts there propose a workaround like this:
  buffer -s 100000 -m 10000000 -p 100 < bigfile.dat | gzip > bigfile.dat.gz

And indeed, using "buffer" resolves the thrashing problem when working on a disk system. However, "buffer" is pretty arcane (it isn't even installed by default on most Unix/Linux installations) and pretty counterintuitive.
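
For what it's worth, a similar effect can presumably be had with more widely installed tools by forcing large sequential reads, e.g. dd with a big block size (the 64M below is an arbitrary choice, not a tuned value):

  # read in 64 MiB chunks so parallel readers do not interleave small seeks
  dd if=bigfile.dat bs=64M 2>/dev/null | gzip > bigfile.dat.gz

But that, too, has to be remembered and typed for every invocation.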

Would it be possible to have bigger buffers by default (1 MB? 10 MB?), or to have a heuristic in gzip like "if the file to compress is >10 MB and free RAM is >500 MB, set up the file buffer to use 1 (10?) MB"?

Alternatively, a command-line option to set the buffer size manually?
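
Something along these lines, say (the option name is purely hypothetical, just to illustrate the idea):

  # hypothetical option, not present in gzip 1.6
  gzip --io-buffer=10M bigfile.dat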

Best,
  Bastien

--
DSM Nutritional Products Microbia Inc | Bioinformatics
60 Westview Street | Lexington, MA 02421 | United States
Phone +1 781 259 7613 | Fax +1 781 259 0615



