From unknown Sun Aug 17 19:57:19 2025 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.509 (Entity 5.509) Content-Type: text/plain; charset=utf-8 From: bug#32073 <32073@debbugs.gnu.org> To: bug#32073 <32073@debbugs.gnu.org> Subject: Status: Improvements in Grep Reply-To: bug#32073 <32073@debbugs.gnu.org> Date: Mon, 18 Aug 2025 02:57:19 +0000 retitle 32073 Improvements in Grep reassign 32073 grep submitter 32073 Sergiu Hlihor severity 32073 wishlist thanks From debbugs-submit-bounces@debbugs.gnu.org Fri Jul 06 17:31:49 2018 Received: (at submit) by debbugs.gnu.org; 6 Jul 2018 21:31:49 +0000 Received: from localhost ([127.0.0.1]:48863 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fbYKS-0002MD-DA for submit@debbugs.gnu.org; Fri, 06 Jul 2018 17:31:49 -0400 Received: from eggs.gnu.org ([208.118.235.92]:49666) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fbTYx-0003J9-SL for submit@debbugs.gnu.org; Fri, 06 Jul 2018 12:26:28 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fbTYr-000371-NQ for submit@debbugs.gnu.org; Fri, 06 Jul 2018 12:26:22 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,HTML_MESSAGE, T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:37207) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1fbTYr-00036v-K4 for submit@debbugs.gnu.org; Fri, 06 Jul 2018 12:26:21 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:40630) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fbTYq-0001Jy-E2 for bug-grep@gnu.org; Fri, 06 Jul 2018 12:26:21 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fbTYp-00035W-EW for bug-grep@gnu.org; Fri, 06 Jul 
2018 12:26:20 -0400 Received: from mail-io0-x234.google.com ([2607:f8b0:4001:c06::234]:36810) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1fbTYp-000354-7X for bug-grep@gnu.org; Fri, 06 Jul 2018 12:26:19 -0400 Received: by mail-io0-x234.google.com with SMTP id k3-v6so11350175iog.3 for ; Fri, 06 Jul 2018 09:26:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=discovergy-com.20150623.gappssmtp.com; s=20150623; h=mime-version:from:date:message-id:subject:to; bh=gsN4tVk2AbiLxUguUpnC/wUV3Nk6Fj17GCQlGhyz7UM=; b=Oo5AHAu+DxPJESB8LNkT4ZWoCgD+9xIzN56qIih5SmKyJAZBx2ItDZK471rvqSQATG iHZ3GtYgTv7sG9q6cayKkER4huRFSralDMhid3z6Xc5M80wWx5uFgDCje15arJafbEbl oM2QWzvZ7YqHwWsoAIcErxRlVkIRJjM3fYJT1mmiOuZDzVi6tZFEwdMrUL5m+AdQ2GRl gkBsCXi8BAriWnlgM51gaV2nc2vovD8w2UDZZudTGESO182VbqEOj0SAwgsj+FJWmEDP 151g9W4GTtma/7r12patcSiPStmivH9jG7sG3E/VPtYu7MDWsLWCHfXoTMRMj5/xKVo3 KLog== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=gsN4tVk2AbiLxUguUpnC/wUV3Nk6Fj17GCQlGhyz7UM=; b=HvtBTPH52sZoi2sQGyzP7ntKmSvEQOjeMbpD4NaSRE7iJ6hhXHQ4etl4/Q1D5zhpDC WZI77+Frwc3Fsv/Ksg8DNBL5aWLE9vHdwK6yGZSu2TYtot/uteKLlbJ+XJRbUCf33TON l96BHaI9RaOjTLcU52Eyh9c8rGNOsdHv2ZKvBVHi0/afUhQ9hqy3qsw91qKB5uvC60IP BPTvFPymBmt8b3EpvtWMjuK912gRR0J77D8n56qXkBdPaRmwI4pnxBMryZevSdOHCdQ2 g3Mlo061b2cTRqWFVHogUbhq3VnJep/ANsz4exsR6nNeSR938JYQ5b+s9jzYJgJJhDxx Ty1g== X-Gm-Message-State: APt69E2yZpnkBt2GtBRmO7+j/mh3LkKSf2fImwUwje97cv1Y4fB08RoW gNAqs6uT3XG+0apEQQWyxnQKjiSWyxLYhkUwIvCdKYtq X-Google-Smtp-Source: AAOMgpfQLYAwjGzKanOd0Y03aDYUSvuifqGJP849QmjL8Bxcg6XC1qo1G9LHokuzVfIFD9E2XQcaPh47omloYEXunRk= X-Received: by 2002:a6b:4e04:: with SMTP id c4-v6mr9029232iob.19.1530894377892; Fri, 06 Jul 2018 09:26:17 -0700 (PDT) MIME-Version: 1.0 Received: by 2002:a02:1b98:0:0:0:0:0 with HTTP; Fri, 6 Jul 2018 09:26:17 -0700 (PDT) From: Sergiu Hlihor Date: Fri, 6 Jul 2018 18:26:17 
+0200 Message-ID: Subject: Improvements in Grep To: bug-grep@gnu.org Content-Type: multipart/alternative; boundary="000000000000954fdf0570571f6a" X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Fri, 06 Jul 2018 17:31:47 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) --000000000000954fdf0570571f6a Content-Type: text/plain; charset="UTF-8" Hello, I'm using grep over Ubuntu Server 14.04 (Grep version 2.16). While grepping over large files I've noticed Grep is painfully slow. The bottleneck seems to be the read block which is extremely low (looks like 64KB). For large files residing over big HDD RAID arrays, this request barely reaches one drive and based on CPU usage, grep is idling more or less. Given my tests for such scenarios, a read block size of at least 512KB would be way more efficient. It's very likely that optimum would be 1MB+. Also, such increase in buffer size would also benefit slightly SSDs where maximum sequential throughput is usually achieved when reading at 256KB+ block size. If this is already possible in newer versions or configurable, I'd appreciate some hints about the new version which contains or about the way I can configure it to increase the read block size. Thanks and best regards, Sergiu --000000000000954fdf0570571f6a Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
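The block-size effect described in this report can be checked directly. The following is a minimal Python sketch, not grep's actual read loop: `read_throughput` and `compare_block_sizes` are hypothetical helper names, and the default sizes are simply the ones mentioned in the message (64 KiB, 256 KiB, 512 KiB, 1 MiB). It opens the file unbuffered, so each `read()` call issues at most one system call of the requested size.

```python
import time

MIB = 1024 * 1024


def read_throughput(path, block_size):
    """Read `path` sequentially in `block_size`-byte requests.

    Returns (bytes_read, seconds_elapsed).
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 yields a raw file object: no user-space buffer sits
    # between our read() calls and the operating system.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:  # empty bytes object signals end of file
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_block_sizes(path, sizes_kib=(64, 256, 512, 1024)):
    """Return {size_in_KiB: throughput_in_MiB_per_s} for each block size."""
    results = {}
    for kib in sizes_kib:
        nbytes, secs = read_throughput(path, kib * 1024)
        # Guard against a zero interval on very small files.
        results[kib] = nbytes / MIB / max(secs, 1e-9)
    return results
```

Calling `compare_block_sizes` on a large file returns one MiB/s figure per block size. For the comparison to say anything about the disks, the file should not already be in the page cache (for example, use a file larger than RAM); otherwise the numbers mostly reflect memory copies.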
--000000000000954fdf0570571f6a-- From debbugs-submit-bounces@debbugs.gnu.org Fri Jul 06 18:06:44 2018 Received: (at 32073) by debbugs.gnu.org; 6 Jul 2018 22:06:44 +0000 Received: from localhost ([127.0.0.1]:48878 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fbYsF-0003J2-W1 for submit@debbugs.gnu.org; Fri, 06 Jul 2018 18:06:44 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:33666) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fbYsD-0003Ij-LC for 32073@debbugs.gnu.org; Fri, 06 Jul 2018 18:06:42 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 9660D16161F; Fri, 6 Jul 2018 15:06:35 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id t3uCubr3XabO; Fri, 6 Jul 2018 15:06:34 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id E1759161625; Fri, 6 Jul 2018 15:06:34 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id nlyZyupbRODd; Fri, 6 Jul 2018 15:06:34 -0700 (PDT) Received: from [192.168.1.9] (unknown [47.154.30.119]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id A232616161F; Fri, 6 Jul 2018 15:06:34 -0700 (PDT) Subject: Re: bug#32073: Improvements in Grep To: Sergiu Hlihor , 32073@debbugs.gnu.org References: From: Paul Eggert Openpgp: preference=signencrypt Autocrypt: addr=eggert@cs.ucla.edu; prefer-encrypt=mutual; keydata= xsFNBEyAcmQBEADAAyH2xoTu7ppG5D3a8FMZEon74dCvc4+q1XA2J2tBy2pwaTqfhpxxdGA9 Jj50UJ3PD4bSUEgN8tLZ0san47l5XTAFLi2456ciSl5m8sKaHlGdt9XmAAtmXqeZVIYX/UFS 96fDzf4xhEmm/y7LbYEPQdUdxu47xA5KhTYp5bltF3WYDz1Ygd7gx07Auwp7iw7eNvnoDTAl KAl8KYDZzbDNCQGEbpY3efZIvPdeI+FWQN4W+kghy+P6au6PrIIhYraeua7XDdb2LS1en3Ss 
mE3QjqfRqI/A2ue8JMwsvXe/WK38Ezs6x74iTaqI3AFH6ilAhDqpMnd/msSESNFt76DiO1ZK QMr9amVPknjfPmJISqdhgB1DlEdw34sROf6V8mZw0xfqT6PKE46LcFefzs0kbg4GORf8vjG2 Sf1tk5eU8MBiyN/bZ03bKNjNYMpODDQQwuP84kYLkX2wBxxMAhBxwbDVZudzxDZJ1C2VXujC OJVxq2kljBM9ETYuUGqd75AW2LXrLw6+MuIsHFAYAgRr7+KcwDgBAfwhPBYX34nSSiHlmLC+ KaHLeCLF5ZI2vKm3HEeCTtlOg7xZEONgwzL+fdKo+D6SoC8RRxJKs8a3sVfI4t6CnrQzvJbB n6gxdgCu5i29J1QCYrCYvql2UyFPAK+do99/1jOXT4m2836j1wARAQABzSBQYXVsIEVnZ2Vy dCA8ZWdnZXJ0QGNzLnVjbGEuZWR1PsLBfgQTAQIAKAUCTIByZAIbAwUJEswDAAYLCQgHAwIG FQgCCQoLBBYCAwECHgECF4AACgkQ7ZfpDmKqfjRRGw/+Ij03dhYfYl/gXVRiuzV1gGrbHk+t nfrI/C7fAeoFzQ5tVgVinShaPkZo0HTPf18x6IDEdAiO8Mqo1yp0CtHmzGMCJ50o4Grgfjlr 6g/+vtEOKbhleszN2XpJvpwM2QgGvn/laTLUu8PH9aRWTs7qJJZKKKAb4sxYc92FehPu6FOD 0dDiyhlDAq4lOV2mdBpzQbiojoZzQLMQwjpgCTK2572eK9EOEQySUThXrSIz6ASenp4NYTFH s9tuJQvXk9gZDdPSl3bp+47dGxlxEWLpBIM7zIONw4ks4azgT8nvDZxA5IZHtvqBlJLBObYY 0Le61Wp0y3TlBDh2qdK8eYL426W4scEMSuig5gb8OAtQiBW6k2sGUxxeiv8ovWu8YAZgKJfu oWI+uRnMEddruY8JsoM54KaKvZikkKs2bg1ndtLVzHpJ6qFZC7QVjeHUh6/BmgvdjWPZYFTt N+KA9CWX3GQKKgN3uu988yznD7LnB98T4EUH1HA/GnfBqMV1gpzTvPc4qVQinCmIkEFp83zl +G5fCjJJ3W7ivzCnYo4KhKLpFUm97okTKR2LW3xZzEW4cLSWO387MTK3CzDOx5qe6s4a91Zu ZM/j/TQdTLDaqNn83kA4Hq48UHXYxcIh+Nd8k/3w6lFuoK0wrOFiywjLx+0ur5jmmbecBGHc 1xdhAFHOwU0ETIByZAEQAKaF678T9wyH4wjTrV1Pz3cDEoSnV/0ZUrOT37p1dcGyj/IXq1x6 70HRVahAmk0sZpYc25PF9D5GPYHFWlNjuPU96rDndXB3hedmBRhLdC4bAXjI4DV+bmdVe+q/ IMnlZRaVlm9EiMCVAR6w13sReu7qXkW9r3RwY2AzXskp/tAe4BRKr1Zmbvi2nbnQ6epEC42r Rbx0B1EhjbIQZ5JHGk24iPT7LdBgnNmos5wYjzwNlkMQD5T0Ydzhk7J+UxwA5m46mOhRDC2r FV/A0gm5TLy8DXjv/Esc4gYnYai6SQqnUEVh5LuV8YCJBnijs+Tiw71x1icmn6xGI45EugJO gec+rLypYgpVp4x0HI5T88qBRYCkxH3Kg8Qo+EWNA9A4LRQ9DX8njona0gf0s03tocK8kBN6 6UoqqPtHBnc4eMgBymCflK12eKfd2YYxnyg9cZazWA5VslvTxpm76hbg5oiAEH/Vg/8MxHyA nPhfrgwyPrmJEcVBafdspJnYQxBYNco2LFPIhlOvWh8r4at+s+M3Lb26oUTczlgdW1Sf3SDA 77BMRnF0FQyE+7AzV79MBN4ykiqaezQxtaF1Fy/tvkhffSo8u+dwG0EgJh+te38gTcISVr0G IPplLz6YhjrbHrPRF1CN5UuL9DBGjxuN35RLNVEfta6RUFlR6NctTjvrABEBAAHCwWUEGAEC 
AA8FAkyAcmQCGwwFCRLMAwAACgkQ7ZfpDmKqfjSrHA/+KzAKvTxRhA9MWNLxIyJ7S5uJ16gs T3oCjZrBKGEhKMOGX4O0GA6VOEryO7QRCCYah3oxSG38IAnNeiwJXgU9Bzkk85UGbPEd7HGF /VSeHCQwWou6jqUDTSDvn9YhNTdG0KXPM74aC+xr2Zow1O2mhXihgWKD0Dw+0LYPnUOsQ0KO FxHXXYHmRrS1OZPU59BLvc+TRhIhafSHKLwbXK+6ckkxBx6h8z5ccpG0Qs4bFhdFYnFrEieD LoGmnE2YLhdV6swJ9VNCS6pLiEohT3fm7aXm15tZOIyzMZhHRSAPblXxQ0ZSWjq8oRrcYNFx c4W1URpAkBCOYJoXvQfD5L3lqAl8TCqDUzYxhH/tJhbDdHrqHH767jaDaTB1+Talp/2AMKwc XNOdiklGxbmHVG6YGl6g8Lrbsu9NZEI4yLlHzuikthJWgz+3vZhVGyNlt+HNIoF6CjDL2omu 5cEq4RDHM44QqPk6l7O0pUvN1mT4B+S1b08RKpqm/ff015E37HNV/piIvJlxGAYz8PSfuGCB 1thMYqlmgdhd9/BabGFbGGYHA6U4/T5zqU+f6xHy1SsAQZ1MSKlLwekBIT+4/cLRGqCHjnV0 q5H/T6a7t5mPkbzSrOLSo4puj+IToNjYyYIDBWzhlA19avOa+rvUjmHtD3sFN7cXWtkGoi8b uNcby4U= Organization: UCLA Computer Science Department Message-ID: <9be5ca5d-dc30-508f-649b-5146ee85cf5e@cs.ucla.edu> Date: Fri, 6 Jul 2018 15:06:34 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 32073 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Sergiu Hlihor wrote: > Given my tests for such scenarios, a read block size of at least > 512KB would be way more efficient. Does stdio do this already? If not, why not? How could grep reasonably configure a good block size? 
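One conventional answer to the question of how a program could pick a block size is the `st_blksize` field that `stat` reports as the preferred I/O size for a file. The sketch below is illustrative only and is not what grep does: `suggested_block_size` is a hypothetical helper, and the 128 KiB floor is an assumption borrowed from the value proposed later in this thread. In practice the reported value is often small (commonly a few KiB), which is one reason a fixed floor is still needed.

```python
import os


def suggested_block_size(path, floor=128 * 1024):
    """Return a read size: the larger of the file's st_blksize and `floor`."""
    st = os.stat(path)
    # st_blksize is not available on every platform, so fall back to 0
    # and let the floor decide.
    preferred = getattr(st, "st_blksize", 0)
    return max(preferred, floor)
```

A caller would pass the result as the request size of its read loop. The hint describes the filesystem, not the RAID geometry underneath it, so it would not by itself address the array behaviour described in the original report.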
From debbugs-submit-bounces@debbugs.gnu.org Fri Jul 06 18:44:55 2018 Received: (at submit) by debbugs.gnu.org; 6 Jul 2018 22:44:56 +0000 Received: from localhost ([127.0.0.1]:48900 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fbZTD-0004NL-Kt for submit@debbugs.gnu.org; Fri, 06 Jul 2018 18:44:55 -0400 Received: from eggs.gnu.org ([208.118.235.92]:52864) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fbZTB-0004N6-Nb for submit@debbugs.gnu.org; Fri, 06 Jul 2018 18:44:53 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fbZT5-0002Kd-Ui for submit@debbugs.gnu.org; Fri, 06 Jul 2018 18:44:48 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-0.5 required=5.0 tests=BAYES_05 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:42426) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1fbZT5-0002KZ-Qk for submit@debbugs.gnu.org; Fri, 06 Jul 2018 18:44:47 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:43835) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fbZT4-0003Yf-RL for bug-grep@gnu.org; Fri, 06 Jul 2018 18:44:47 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fbZT1-0002KB-Pa for bug-grep@gnu.org; Fri, 06 Jul 2018 18:44:46 -0400 Received: from atl4mhob08.registeredsite.com ([209.17.115.46]:55668) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1fbZT1-0002Jr-K1 for bug-grep@gnu.org; Fri, 06 Jul 2018 18:44:43 -0400 Received: from mailpod.hostingplatform.com (atl4qobmail01pod2.registeredsite.com [10.30.77.35]) by atl4mhob08.registeredsite.com (8.14.4/8.14.4) with ESMTP id w66Micxx011705 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL) for ; Fri, 6 Jul 2018 18:44:38 -0400 
Received: (qmail 26434 invoked by uid 0); 6 Jul 2018 22:44:37 -0000 X-TCPREMOTEIP: 99.253.103.29 X-Authenticated-UID: dclarke@blastwave.org Received: from unknown (HELO sedna.genunix.com) (dclarke@blastwave.org@99.253.103.29) by 0 with ESMTPA; 6 Jul 2018 22:44:37 -0000 Subject: Re: bug#32073: Improvements in Grep To: bug-grep@gnu.org References: <9be5ca5d-dc30-508f-649b-5146ee85cf5e@cs.ucla.edu> From: Dennis Clarke Message-ID: Date: Fri, 6 Jul 2018 18:44:36 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0 MIME-Version: 1.0 In-Reply-To: <9be5ca5d-dc30-508f-649b-5146ee85cf5e@cs.ucla.edu> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x [fuzzy] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -6.0 (------) On 07/06/2018 06:06 PM, Paul Eggert wrote: > Sergiu Hlihor wrote: >> Given my tests for such scenarios, a read block size of at least >> 512KB would be way more efficient. > > Does stdio do this already? If not, why not? How could grep reasonably > configure a good block size? This seems to be a very specific complaint which is only of value on a very specific system and usage case. There is no way that grep could configure a "good block size" unless it were tailor built. Doesn't seem to be a reasonable RFE. In my opinion. 
Dennis From debbugs-submit-bounces@debbugs.gnu.org Fri Jul 06 20:33:37 2018 Received: (at 32073) by debbugs.gnu.org; 7 Jul 2018 00:33:37 +0000 Received: from localhost ([127.0.0.1]:48940 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fbbAO-0007QZ-Ov for submit@debbugs.gnu.org; Fri, 06 Jul 2018 20:33:36 -0400 Received: from mail-wm0-f53.google.com ([74.125.82.53]:55493) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fbbAN-0007QL-2l for 32073@debbugs.gnu.org; Fri, 06 Jul 2018 20:33:35 -0400 Received: by mail-wm0-f53.google.com with SMTP id v16-v6so16251135wmv.5 for <32073@debbugs.gnu.org>; Fri, 06 Jul 2018 17:33:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=Momz89FF7eOSa9Kgl6zghjC6oTftuhwl6RqoRaYOrMQ=; b=r5rTENOu+tSjTcdqBYCTdtODWfnwh8JkbnU5pvQQ4FKW1s8iXv2g5OCYUXzRzk8kV8 ODH33BK+AfAPwfMkzWeLWw5OCKnQNSHJIYsWa+w0kKz8gZobrqPJd8ed9itA2EkVtV5A iAXB+K+Pp/PxIRqXOJxVxKGnPNRuni/9L5iOidz9IVeVZwpsPFjhFNJVl9NBrwJu7s2d GjRngOzfufM+djuCXS4i5EEa6fucjxJz+8MVxCCaFNyLqOXfn3EezAAJYTsryJNZhmp7 kv4tiCYsRY7KlQw/J0XHBVZscFUrmBt+BWDTVOT2D/OJiVyNdnH/98srPilODCKG/+po w0aA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=Momz89FF7eOSa9Kgl6zghjC6oTftuhwl6RqoRaYOrMQ=; b=D7WGy8TmAKpGlKrGsWtuPnKctHvaGrSGSoOvwgTpinuF9fzQnCbdkZ02dem92FhTwk 9ZGVfQuziRmvkc+VtxmXgv2FHObUpiW6RemoYLyxVNALbzXEJ572OG+/EE4py8kkSIaq PxI4NjXPPU+/L+w0hj/BScRtR4JQV23yOMc+61zQVoJZlOHojpssXYoxpwoeUS0G7F0w jQt/H5Qir8CGhByO5fizJhE+yKpo/9tVSpGaCs9xlg5SRimXpPjtXSHlZljpZeQhTsB9 7O1o7AsL0OGykTSOw8LVIrVoyTs2tJf+4WJzeZJaHPWGT7OY1dNW/ez2hQMED5Ky46z8 8JlA== X-Gm-Message-State: APt69E19y1bXDLv/AKiy97SERJPdA6/AVTIL7VXFPtIKgaw5r1Zx0uXA ui2EpCHA0XU0VxPqisyeegH6khO3gnlNFifrEhDg0g== X-Google-Smtp-Source: 
AAOMgpcRnfie1Sy2x6piy4b8g+uuIEn+uAsPVNH+5N8Gv+JNGmAcwNSJDDj6rqaUSE7U/1OBWoMbu4BPlT2a8/4Shew= X-Received: by 2002:a1c:a8f:: with SMTP id 137-v6mr6676449wmk.119.1530923609175; Fri, 06 Jul 2018 17:33:29 -0700 (PDT) MIME-Version: 1.0 Received: by 2002:adf:ec4e:0:0:0:0:0 with HTTP; Fri, 6 Jul 2018 17:33:08 -0700 (PDT) In-Reply-To: References: From: Jim Meyering Date: Fri, 6 Jul 2018 17:33:08 -0700 X-Google-Sender-Auth: tlltqOQ-2sHQZaEuW_K-9CvgtBM Message-ID: Subject: Re: bug#32073: Improvements in Grep To: Sergiu Hlihor Content-Type: multipart/mixed; boundary="000000000000e75f1605705ded47" X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.5 (/) --000000000000e75f1605705ded47 Content-Type: text/plain; charset="UTF-8" On Fri, Jul 6, 2018 at 9:26 AM, Sergiu Hlihor wrote: > Hello, > I'm using grep over Ubuntu Server 14.04 (Grep version 2.16). While > grepping over large files I've noticed Grep is painfully slow. The > bottleneck seems to be the read block which is extremely low (looks like > 64KB). For large files residing over big HDD RAID arrays, this request > barely reaches one drive and based on CPU usage, grep is idling more or > less. Given my tests for such scenarios, a read block size of at least > 512KB would be way more efficient. It's very likely that optimum would be > 1MB+. Also, such increase in buffer size would also benefit slightly SSDs > where maximum sequential throughput is usually achieved when reading at > 256KB+ block size. > If this is already possible in newer versions or configurable, I'd > appreciate some hints about the new version which contains or about the way > I can configure it to increase the read block size. Thanks for raising the issue. 
This makes me think we should follow Coreutils' lead[0] and increase grep's initial buffer size from 32KiB, probably to 128KiB. I will time with the attached diff on a few systems. [0] https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=v8.22-103-g74ca6e84c --000000000000e75f1605705ded47 Content-Type: application/octet-stream; name="grep-bufsize-increase.diff" Content-Disposition: attachment; filename="grep-bufsize-increase.diff" Content-Transfer-Encoding: base64 X-Attachment-Id: f_jjaoc07a0 ZGlmZiAtLWdpdCBhL3NyYy9ncmVwLmMgYi9zcmMvZ3JlcC5jCmluZGV4IGY0YWU1ZjUuLjA0YWM5 YzkgMTAwNjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysrIGIvc3JjL2dyZXAuYwpAQCAtNzk5LDcgKzc5 OSw2IEBAIHNraXBwZWRfZmlsZSAoY2hhciBjb25zdCAqbmFtZSwgYm9vbCBjb21tYW5kX2xpbmUs IGJvb2wgaXNfZGlyKQoKIHN0YXRpYyBjaGFyICpidWZmZXI7CQkvKiBCYXNlIG9mIGJ1ZmZlci4g Ki8KIHN0YXRpYyBzaXplX3QgYnVmYWxsb2M7CQkvKiBBbGxvY2F0ZWQgYnVmZmVyIHNpemUsIGNv dW50aW5nIHNsb3AuICovCi1lbnVtIHsgSU5JVElBTF9CVUZTSVpFID0gMzI3NjggfTsgLyogSW5p dGlhbCBidWZmZXIgc2l6ZSwgbm90IGNvdW50aW5nIHNsb3AuICovCiBzdGF0aWMgaW50IGJ1ZmRl c2M7CQkvKiBGaWxlIGRlc2NyaXB0b3IuICovCiBzdGF0aWMgY2hhciAqYnVmYmVnOwkJLyogQmVn aW5uaW5nIG9mIHVzZXItdmlzaWJsZSBzdHVmZi4gKi8KIHN0YXRpYyBjaGFyICpidWZsaW07CQkv KiBMaW1pdCBvZiB1c2VyLXZpc2libGUgc3R1ZmYuICovCkBAIC04MTIsNiArODExLDkgQEAgc3Rh dGljIGJvb2wgc2tpcF9udWxzOwkJLyogU2tpcCAnXDAnIGluIGRhdGEuICAqLwogc3RhdGljIGJv b2wgc2tpcF9lbXB0eV9saW5lczsJLyogU2tpcCBlbXB0eSBsaW5lcyBpbiBkYXRhLiAgKi8KIHN0 YXRpYyB1aW50bWF4X3QgdG90YWxubDsJLyogVG90YWwgbmV3bGluZSBjb3VudCBiZWZvcmUgbGFz dG5sLiAqLwoKKy8qIEluaXRpYWwgYnVmZmVyIHNpemUsIG5vdCBjb3VudGluZyBzbG9wLiAqLwor ZW51bSB7IElOSVRJQUxfQlVGU0laRSA9IDEyOCAqIDEwMjQgfTsKKwogLyogUmV0dXJuIFZBTCBh bGlnbmVkIHRvIHRoZSBuZXh0IG11bHRpcGxlIG9mIEFMSUdOTUVOVC4gIFZBTCBjYW4gYmUKICAg IGFuIGludGVnZXIgb3IgYSBwb2ludGVyLiAgQm90aCBhcmdzIG11c3QgYmUgZnJlZSBvZiBzaWRl IGVmZmVjdHMuICAqLwogI2RlZmluZSBBTElHTl9UTyh2YWwsIGFsaWdubWVudCkgXAo= --000000000000e75f1605705ded47-- From debbugs-submit-bounces@debbugs.gnu.org Fri Jul 06 21:39:13 2018 Received: (at 
32073) by debbugs.gnu.org; 7 Jul 2018 01:39:13 +0000 Received: from localhost ([127.0.0.1]:48957 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fbcBs-0000ed-OU for submit@debbugs.gnu.org; Fri, 06 Jul 2018 21:39:13 -0400 Received: from mail-it0-f49.google.com ([209.85.214.49]:54285) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fbc4L-0000Sc-O0 for 32073@debbugs.gnu.org; Fri, 06 Jul 2018 21:31:26 -0400 Received: by mail-it0-f49.google.com with SMTP id s7-v6so18707912itb.4 for <32073@debbugs.gnu.org>; Fri, 06 Jul 2018 18:31:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=discovergy-com.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=moB9xXnht8Ivaa6H7SGXQS7xXcWeF8ztpA7QCKa2BT4=; b=bLqNZb71KPy2mG0vNyuyHEeRYm904p/g6KRsezoGV7fzUqdmYb+kf9BhNAAL2b3uNX EZS7Mkdk+wtgo787UcgZCPdzLsgB4Xx4XWz6+DdEV7GlXKDzCciLV+7xZf8CLThTVsqO ANycURMEcfIb8XOOKkywhequHiDPzuGjA+mCL8XbTQ85KlCtIy6Wi9m/UaH3DbF6MpQf m+iyBtopRtUMcO5vwaLX8jA5Z5mqzvW1z7TQrgzeOR6X0WaWp3964Rn0uRW3JU4i+nOR SUfDDvlxAM9Uv5rcCH6QXFHSKysTf6GQLABCezImn7rNgnnu0DfYsJlCAepPO/3DSwyB 2ajg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=moB9xXnht8Ivaa6H7SGXQS7xXcWeF8ztpA7QCKa2BT4=; b=TJo1fVsRWbp+azwzaVnH3qm9j43mqh06Jr6+A8x3WInMKafRmapHzxVSZe0wzqvWki 0BHrtABV03cUduLLrIAF7VuPO0JhbHPM1z/DW2MxrpbHcbdYc36CkcZ8w4anA9Ugdhy3 EFm07/0b5RWnq1A3UFDn/hkcc+jl+vx4NguDzsq2vr/pNcb65hBiVMFu5IAgsda6X6jE 5+KZz2OCcVXGBE18HKL5qXIIc+nQwy0shI3h/qVo4f4ccpy2rgk8jlhv9pW7g9/and0T kjMlgfihEsLTcKmyR7DTE3K+pwP9YRZXTXgj7eaWhds+NDExYxO3CW9tcHMFIo1+PVt7 /sHA== X-Gm-Message-State: APt69E25xAaOkhKgd8peujqWEJOl4JV0PGYTKYKkK2Fc+0CgdTE5G00N 74Sf2u9htI5ECihtHlTcygvRb72EZZ9LiVTCX+1yyw== X-Google-Smtp-Source: 
AAOMgpcEMTNY6K71KFEu+OBwvA3lpDLV0oMhfUmxaofiJR2wF4p/DaEshJq7+Vz+4+8CU1NGhLNdV/5thsc9LKdvNhY= X-Received: by 2002:a24:cf57:: with SMTP id y84-v6mr10031863itf.98.1530927080155; Fri, 06 Jul 2018 18:31:20 -0700 (PDT) MIME-Version: 1.0 Received: by 2002:a02:1b98:0:0:0:0:0 with HTTP; Fri, 6 Jul 2018 18:31:19 -0700 (PDT) In-Reply-To: References: From: Sergiu Hlihor Date: Sat, 7 Jul 2018 03:31:19 +0200 Message-ID: Subject: Re: bug#32073: Improvements in Grep To: Jim Meyering Content-Type: multipart/alternative; boundary="000000000000ca462d05705ebc23" X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 32073 X-Mailman-Approved-At: Fri, 06 Jul 2018 21:39:11 -0400 Cc: 32073@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) --000000000000ca462d05705ebc23 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
To add: the increase to 128 KiB is good, but for RAID arrays under light to medium load it is not sufficient. On a system without any load, the HDD can read ahead and always serve the next request from its buffer, thus reading at the full sequential speed of ~200 MB/s. In a RAID 10 configuration with 12 HDDs where the strip size is set to 128 KB, each HDD is hit only on every 6th request. There is enough delay between reads hitting the same drive that the read-ahead buffer often gets discarded, which basically limits the throughput to max IOPS x buffer size = ~10-20 MiB/s for 128 KiB. I have such systems in production environments, and I often see read speeds under 10 MiB/s and read await >10 ms, which means the read-ahead buffer has already been discarded. Under the same load conditions, if I read the data using utilities that can use a 512 KiB buffer size, I see read speeds varying between 50 and 400 MiB/s.
Grep has an average CPU load of 2-3% of the given machine at such low read speeds, so it could do much more if reading were optimized.

On 7 July 2018 at 02:33, Jim Meyering wrote:
> On Fri, Jul 6, 2018 at 9:26 AM, Sergiu Hlihor wrote:
> > Hello,
> > I'm using grep over Ubuntu Server 14.04 (Grep version 2.16). While
> > grepping over large files I've noticed Grep is painfully slow. The
> > bottleneck seems to be the read block which is extremely low (looks like
> > 64KB). For large files residing over big HDD RAID arrays, this request
> > barely reaches one drive and based on CPU usage, grep is idling more or
> > less. Given my tests for such scenarios, a read block size of at least
> > 512KB would be way more efficient. It's very likely that optimum would be
> > 1MB+. Also, such increase in buffer size would also benefit slightly SSDs
> > where maximum sequential throughput is usually achieved when reading at
> > 256KB+ block size.
> > If this is already possible in newer versions or configurable, I'd
> > appreciate some hints about the new version which contains or about the way
> > I can configure it to increase the read block size.
>
> Thanks for raising the issue.
> This makes me think we should follow Coreutils' lead[0] and increase
> grep's initial buffer size from 32KiB, probably to 128KiB. I will time
> with the attached diff on a few systems.
>
> [0] https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=v8.22-103-g74ca6e84c

--
_____________________________________________
Senior Software Engineer & Team leader
Phone: +49 (0) 6221 7787-481
Email: sh@discovergy.com
*Discovergy GmbH*
_____________________________________________
Court of registration: Amtsgericht Aachen HRB 15391
Managing directors: Ralf Esser | Bernhard Seidl | Nikolaus Starzacher
This e-mail and any attached files are intended only for the recipient named above and may contain confidential information. If you are not the intended recipient, any distribution, forwarding, or copying is prohibited. If you have received this e-mail in error, please return it or notify the sender immediately using the contact details above, and delete this message promptly. Thank you.
--000000000000ca462d05705ebc23 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
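The bound stated in this message (throughput limited to max IOPS x request size once the drives' read-ahead no longer helps) can be checked with a little arithmetic. In the sketch below, the IOPS figures of 80 and 160 are assumptions back-derived from the quoted ~10-20 MiB/s at 128 KiB; they are not measurements from the thread, and `iops_bound_throughput` is a hypothetical helper name.

```python
KIB = 1024
MIB = 1024 * KIB


def iops_bound_throughput(iops, request_bytes):
    """Throughput in MiB/s when every request costs one random I/O."""
    return iops * request_bytes / MIB


# Assumed IOPS range (80-160) that reproduces the ~10-20 MiB/s figure
# quoted for 128 KiB requests.
for iops in (80, 160):
    for kib in (128, 512, 1024):
        rate = iops_bound_throughput(iops, kib * KIB)
        print(f"{iops} IOPS x {kib} KiB -> {rate:.1f} MiB/s")
```

Under this simple model the bound grows linearly with the request size, so going from 128 KiB to 512 KiB raises it from 10-20 MiB/s to 40-80 MiB/s. The model ignores read-ahead hits and requests that span several drives, which may be why the observed 50-400 MiB/s range reaches higher than the model predicts.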
--000000000000ca462d05705ebc23-- From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 02:53:03 2020 Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 07:53:03 +0000 Received: from localhost ([127.0.0.1]:35593 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imYoR-000097-DS for submit@debbugs.gnu.org; Wed, 01 Jan 2020 02:53:03 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:49318) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imYoO-00008c-SA for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 02:53:01 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 4988716008F; Tue, 31 Dec 2019 23:52:55 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id Gxmu9XNl4O-w; Tue, 31 Dec 2019 23:52:54 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 9C3A716022A; Tue, 31 Dec 2019 23:52:54 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id EswyY5VL8zaA; Tue, 31 Dec 2019 23:52:54 -0800 (PST) Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 6D60516008F; Tue, 31 Dec 2019 23:52:54 -0800 (PST) To: Sergiu Hlihor From: Paul Eggert Organization: UCLA Computer Science Department Subject: Re: Improvements in Grep (Bug#32073) Message-ID: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> Date: Tue, 31 Dec 2019 23:52:54 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, Dennis Clarke , 
Jim Meyering X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) > This makes me think we should follow Coreutils' lead[0] and increase > grep's initial buffer size from 32KiB, probably to 128KiB. I see that Jim later installed a patch increasing it to 96 KiB. Whatever number is chosen, it's "wrong" for some configuration. And I suppose the particular configuration that Sergiu Hlihor mentioned could be tweaked so that it worked better with grep (and with other programs). I'm inclined to mark this bug report as a wishlist item, in the sense that it'd be nice if grep and/or the OS could pick buffer sizes more intelligently (though it's not clear how grep and/or the OS could go about this). From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 02:53:29 2020 Received: (at control) by debbugs.gnu.org; 1 Jan 2020 07:53:29 +0000 Received: from localhost ([127.0.0.1]:35596 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imYor-00009n-M7 for submit@debbugs.gnu.org; Wed, 01 Jan 2020 02:53:29 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:49386) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imYop-00009Y-8O for control@debbugs.gnu.org; Wed, 01 Jan 2020 02:53:28 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id D537616008F for ; Tue, 31 Dec 2019 23:53:19 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 3PAb7AcBSNSW for ; Tue, 31 Dec 2019 23:53:19 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 41FB716022A for ; Tue, 31 Dec 2019 23:53:19 -0800 (PST) X-Virus-Scanned: amavisd-new at 
zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id qk7gh82mTcLL for ; Tue, 31 Dec 2019 23:53:19 -0800 (PST) Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 24AEC16008F for ; Tue, 31 Dec 2019 23:53:19 -0800 (PST) To: control@debbugs.gnu.org From: Paul Eggert Subject: 32073 is wishlist Organization: UCLA Computer Science Department Message-ID: <4ce2bf47-cf95-a1c9-92cd-a351983cd23f@cs.ucla.edu> Date: Tue, 31 Dec 2019 23:53:18 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) severity 32073 wishlist From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 04:15:37 2020 Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 09:15:37 +0000 Received: from localhost ([127.0.0.1]:35621 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ima6K-0003yV-Uc for submit@debbugs.gnu.org; Wed, 01 Jan 2020 04:15:37 -0500 Received: from mail-io1-f50.google.com ([209.85.166.50]:34884) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ima6I-0003yF-QY for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 04:15:36 -0500 Received: by mail-io1-f50.google.com with SMTP id v18so35842758iol.2 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 01:15:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=discovergy-com.20150623.gappssmtp.com; s=20150623; 
h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=8gHSYNXJGtNZ3y8e9Nw8xBL74BT4jrQGosamVRHAPwE=; b=RHY9/EZNzOLQbXdM8yE7Em+XryBaYVTCsL6kzppIApUrQBPapaZIJ+YLRTFKJayFQ5 zmrCvfR4WiNuREW6XOV3bU590JIE3dcFucwcjYuHFQRB3vsA7728et+Xkxfz3I+JinAj kUWosCOKB+hgpJLZfYI5V/GS3pE6lgfqgDmYtR0ywh4e7yMcdCV7ar1YzcggMSnC0qjl 41d03g7n5dWawEmvqedFvgX0njyaojVViK7++X+q43XLrSvMC2GzLay8RHiLdE+BLir9 1jxuH13Y+oBsiqwA+wk5X/cxjdqXbvYm685Yyr0QIjaxIVx4ScPaJPZ0AmWLG7Lj6/7W yf6Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=8gHSYNXJGtNZ3y8e9Nw8xBL74BT4jrQGosamVRHAPwE=; b=a3vUjVYpkkDSnTuJg6qU2hPm2rzPsoe28rpKH9L15sRe78vtWhFXzOxdZgb8kchlH+ 4fRWCRRUR4pjAMNGUq1o+ZRG4N06jmApjKC3b3lafFsk5VIb6S+or5V+xIljQcLwF9EF kw1jnOf3gs4qjTFOG7LZHcWY8mtgmef01YYJ4fhj4AwhkY2lJRdoaorZnf8xS4H8/s83 pppQgvZCmA5J8QSKcnMLaU2/80k2rAvVjwa+vB5gABKR6c8pGXxzVxysyUb4DkyZPtR3 ww4/tJljbviR29fqVNTARspTgTpLGWwhbuuhKx1ZdFF+aisvS/Z3kN+LQQfbxIIaEOY6 llBQ== X-Gm-Message-State: APjAAAVudOcOVwHsECbWShBB7sRiFc0qJqC5DetlwIBP1zWr62tZiCBq A9No8KAFnKOj9/qQRQabPR7sLP3AdHjtW0WrYIRhYA== X-Google-Smtp-Source: APXvYqyqv9lqodjCG3Jkez2qtpMRbIhjIlGYB6bhFu6so69gdZ5KPSB8/uvgzj4GZl/qdlbNcZmgcB8fLcP/2+wj+38= X-Received: by 2002:a02:864b:: with SMTP id e69mr58953496jai.83.1577870129071; Wed, 01 Jan 2020 01:15:29 -0800 (PST) MIME-Version: 1.0 References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> In-Reply-To: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> From: Sergiu Hlihor Date: Wed, 1 Jan 2020 10:15:16 +0100 Message-ID: Subject: Re: Improvements in Grep (Bug#32073) To: Paul Eggert Content-Type: multipart/alternative; boundary="0000000000008b9f5e059b1084ee" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, Dennis Clarke , Jim Meyering X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) --0000000000008b9f5e059b1084ee Content-Type: text/plain; charset="UTF-8" This topic is getting more and more frustrating. If you rely on OS, then you are at the mercy of whatever read ahead configuration you have. And read ahead is typically 128KB so does not help that much. A HDD RAID 10 array with 12 disks and a strip size of 128KB reaches the maximum read throughput if read block size is 6 * 128 = 768KB. When issuing read requests with 128KB , you only hit one HDD, having 1/6 read throughput. With flash the same. A state of the art SSD that can do 5GB/s reads can actually do around 1GB/s or less at 128KB block size. Why is so hard to understand how hardware works and the fact that you need huge block sizes to actually read at full speed? Why not just exposing the read buffer size as a configurable parameter, then anyone can just tune it as needed? 96KB is purely retarded. On Wed, 1 Jan 2020 at 08:52, Paul Eggert wrote: > > This makes me think we should follow Coreutils' lead[0] and increase > > grep's initial buffer size from 32KiB, probably to 128KiB. > > I see that Jim later installed a patch increasing it to 96 KiB. > > Whatever number is chosen, it's "wrong" for some configuration. And I > suppose > the particular configuration that Sergiu Hlihor mentioned could be tweaked > so > that it worked better with grep (and with other programs). > > I'm inclined to mark this bug report as a wishlist item, in the sense that > it'd > be nice if grep and/or the OS could pick buffer sizes more intelligently > (though > it's not clear how grep and/or the OS could go about this). > --0000000000008b9f5e059b1084ee Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
--0000000000008b9f5e059b1084ee-- From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 06:19:34 2020 Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 11:19:34 +0000 Received: from localhost ([127.0.0.1]:35683 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imc2H-0006qy-PG for submit@debbugs.gnu.org; Wed, 01 Jan 2020 06:19:34 -0500 Received: from freefriends.org ([96.88.95.60]:44578) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imc2F-0006qq-G8 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 06:19:32 -0500 X-Envelope-From: arnold@skeeve.com Received: from freefriends.org (freefriends.org [96.88.95.60]) by freefriends.org (8.14.7/8.14.7) with ESMTP id 001BJN5u027995 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 1 Jan 2020 04:19:23 -0700 Received: (from arnold@localhost) by freefriends.org (8.14.7/8.14.7/Submit) id 001BJMYA027994; Wed, 1 Jan 2020 04:19:22 -0700 From: arnold@skeeve.com Message-Id: <202001011119.001BJMYA027994@freefriends.org> X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to arnold@skeeve.com using -f Date: Wed, 01 Jan 2020 04:19:22 -0700 To: sh@discovergy.com, eggert@cs.ucla.edu Subject: Re: bug#32073: Improvements in Grep (Bug#32073) References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> In-Reply-To: User-Agent: Heirloom mailx 12.5 7/5/10 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Score: 0.1 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.9 (/) As a quite serious question, how is someone writing user-level code supposed to be able to figure out the right buffer size for a particular file, and to do so 
portably? ("Show me the code.") Gawk bases its reads on the st_blksize member in struct stat. That will typically be something like 4K - not nearly enough, given your description below. Arnold Sergiu Hlihor wrote: > This topic is getting more and more frustrating. If you rely on OS, then > you are at the mercy of whatever read ahead configuration you have. And > read ahead is typically 128KB so does not help that much. A HDD RAID 10 > array with 12 disks and a strip size of 128KB reaches the maximum read > throughput if read block size is 6 * 128 = 768KB. When issuing read > requests with 128KB , you only hit one HDD, having 1/6 read throughput. > With flash the same. A state of the art SSD that can do 5GB/s reads can > actually do around 1GB/s or less at 128KB block size. Why is so hard to > understand how hardware works and the fact that you need huge block sizes > to actually read at full speed? Why not just exposing the read buffer size > as a configurable parameter, then anyone can just tune it as needed? 96KB > is purely retarded. > > On Wed, 1 Jan 2020 at 08:52, Paul Eggert wrote: > > > > This makes me think we should follow Coreutils' lead[0] and increase > > > grep's initial buffer size from 32KiB, probably to 128KiB. > > > > I see that Jim later installed a patch increasing it to 96 KiB. > > > > Whatever number is chosen, it's "wrong" for some configuration. And I > > suppose > > the particular configuration that Sergiu Hlihor mentioned could be tweaked > > so > > that it worked better with grep (and with other programs). > > > > I'm inclined to mark this bug report as a wishlist item, in the sense that > > it'd > > be nice if grep and/or the OS could pick buffer sizes more intelligently > > (though > > it's not clear how grep and/or the OS could go about this). 
> > From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 06:27:57 2020 Received: (at submit) by debbugs.gnu.org; 1 Jan 2020 11:27:57 +0000 Received: from localhost ([127.0.0.1]:35689 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imcAO-00077b-V7 for submit@debbugs.gnu.org; Wed, 01 Jan 2020 06:27:57 -0500 Received: from lists.gnu.org ([209.51.188.17]:46445) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imcAM-00077P-VV for submit@debbugs.gnu.org; Wed, 01 Jan 2020 06:27:55 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:37966) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1imcAL-0007y8-Jv for bug-grep@gnu.org; Wed, 01 Jan 2020 06:27:54 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.1 required=5.0 tests=BAYES_50,RCVD_IN_DNSWL_LOW, URIBL_BLOCKED autolearn=disabled version=3.3.2 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1imcAK-0000fx-G8 for bug-grep@gnu.org; Wed, 01 Jan 2020 06:27:53 -0500 Received: from wout2-smtp.messagingengine.com ([64.147.123.25]:53503) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1imcAK-0000cX-6N for bug-grep@gnu.org; Wed, 01 Jan 2020 06:27:52 -0500 Received: from compute1.internal (compute1.nyi.internal [10.202.2.41]) by mailout.west.internal (Postfix) with ESMTP id 2567A44F for ; Wed, 1 Jan 2020 06:27:50 -0500 (EST) Received: from imap34 ([10.202.2.84]) by compute1.internal (MEProxy); Wed, 01 Jan 2020 06:27:50 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to:x-me-proxy :x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm1; bh=3FPE13 sLv9H+a6dWQRcMgOBbn4EKJMJWiX4CmxajVgQ=; b=jFIOhRXG5TxSZfp8sSbsYf atLO6F0EBVwJYVgqpV/PMbFcbDL2NxxGv61We/kSEGFAmWgRqA528MvU6sUnVs8J 
tUU/yq2kUq9SJZy7FfUvbF/mBFZnM5y48hEeE0I60qKPmHxr7Tf1MhLOKeK6Tf+9 LdVh4fZq+LDjbe5BaJBcteOMUids9+LWeT1wh8J+kyeqKDQc3mSf6KPmGqYcCC1Z xlVDjql840uOD33Dc3hNGLwGBYm/6AWbDmRwXArH8EwTQQHfopWf5YdQ5qW64AVL mySB2nVL/IFaWGISNxvBNej/1ervduOtlMel4YIJLSH0+BFKRdP1S16dwKeJ8kyQ == X-ME-Sender: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedufedrvdefledgvdejucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfqfgfvpdfurfetoffkrfgpnffqhgen uceurghilhhouhhtmecufedttdenucenucfjughrpefofgggkfgjfhffhffvufgtsehttd ertderredtnecuhfhrohhmpedfrfgruhhlucflrggtkhhsohhnfdcuoehpjhesuhhsrgdr nhgvtheqnecurfgrrhgrmhepmhgrihhlfhhrohhmpehpjhesuhhsrgdrnhgvthenucevlh hushhtvghrufhiiigvpedt X-ME-Proxy: Received: by mailuser.nyi.internal (Postfix, from userid 501) id 5E5C11460061; Wed, 1 Jan 2020 06:27:49 -0500 (EST) X-Mailer: MessagingEngine.com Webmail Interface User-Agent: Cyrus-JMAP/3.1.7-694-gd5bab98-fmstable-20191218v1 Mime-Version: 1.0 Message-Id: In-Reply-To: References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> Date: Wed, 01 Jan 2020 05:26:04 -0600 From: "Paul Jackson" To: bug-grep@gnu.org Subject: Re: bug#32073: Improvements in Grep (Bug#32073) Content-Type: text/plain X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 64.147.123.25 X-Spam-Score: -1.6 (-) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.6 (--) >> Why not just exposing the read buffer size as a configurable parameter ... Take a look at the (and I quote) "Hairy buffering mechanism for grep" input buffering code in the grep source file grep-3.3/src/grep.c, then you tell me why it's not a runtime variable parameter . 
In other words, the input (and output) i/o buffering and performance tuning for various situations and kinds of files has been tuned and refined over many years. Doing something to the code, such as making buffer size a run time adjustable parameter, would probably not be easy, would risk making one usage of grep slower in order to make some other usage faster, and would risk some nasty bugs. -- Paul Jackson pj@usa.net From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 14:07:11 2020 Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 19:07:11 +0000 Received: from localhost ([127.0.0.1]:37583 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imjKo-0006V4-Aa for submit@debbugs.gnu.org; Wed, 01 Jan 2020 14:07:11 -0500 Received: from mail-il1-f169.google.com ([209.85.166.169]:47082) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imjKm-0006Us-9C for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 14:07:09 -0500 Received: by mail-il1-f169.google.com with SMTP id t17so32599947ilm.13 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 11:07:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=discovergy-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=GE5BFtR8fRKW8SyVt2Lsf4LSkbRzd93/7WiEMgl11gY=; b=wLcj4eRokODQL0JaGa9c6cAKCu7JfsifRZmzw4C7SXX44Gq6qvCLR3b4mAPh/l+UMS VJtMKBnP5BTRlNwsYtNlGgi0CSXPRFTsIAkfQ8lrqBEEW1IfX7uEBCmL3CF28vSbeB// gMcALYDiBGg853Ma2cuTs5epE4zWXpYU+giu6yabLP2U63D37ERXXON9PRheQS7ZyXKZ 6nkO1Ke1MiyBHx3cx0unMYYEeesQLZOIQQJjXN9XP5ZDvrpTwC+NrPBpHKGIRWeA8YBn CgjZajjSxDV3mC4V4zrr8EMB4tYv4y5VebZ6EISUtPKNZ77r2Sx10pP5/x1fBHnWE7rh jt4A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=GE5BFtR8fRKW8SyVt2Lsf4LSkbRzd93/7WiEMgl11gY=; b=Io4LA7W5lP/3oOKN9v6oTszdXzyWwiM5t8cwr8UNetOImddPwsWACTUZrDxz3fa5GU 
EvVQab9f6IGIRlGqhJ5QgSIDY3iwqZUhDnIWaXL24kLGrdj1LDnwD4kX9sWrB7zrKf5x q7iypdVIKlpVpcpgPCDqHGodSsecsmwq6lZyMGLeTojrFImwqK81vFr8MXND06UDWmQJ pjHMBEeX9tqpOHVNX+gh4CyXErHgdsWHmQLrlFMcvDoVZpAGSgzKbCGaVrlomgO3crNy EWkY9N18muh4DfbXmS+g3jqh77DvrRB9kSIWnqUkwMjw3r34Z5k2XV8H6QHU/QbH5MLz 5KxA== X-Gm-Message-State: APjAAAW/eUaiM5HWABC1RF84tL87+fcjejLYjs9oxjD1Fqozy87K0EPi w644Ffmtoe5cEW6dGBPBPeLkhsfcsiR5P3pbCfudHmTb2MQ= X-Google-Smtp-Source: APXvYqz3EOZRxGhoBIXUl1bc21QJ1eX+eLgDIJhaZSMVcJq8KpODZtuA8VFHuuYcMl86wEhSgL1v7c3L5rMyGkbvsAc= X-Received: by 2002:a92:2804:: with SMTP id l4mr66440415ilf.136.1577905622626; Wed, 01 Jan 2020 11:07:02 -0800 (PST) MIME-Version: 1.0 References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <202001011119.001BJMYA027994@freefriends.org> In-Reply-To: <202001011119.001BJMYA027994@freefriends.org> From: Sergiu Hlihor Date: Wed, 1 Jan 2020 20:06:39 +0100 Message-ID: Subject: Re: bug#32073: Improvements in Grep (Bug#32073) To: arnold@skeeve.com Content-Type: multipart/alternative; boundary="000000000000204a27059b18c80b" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) --000000000000204a27059b18c80b Content-Type: text/plain; charset="UTF-8" Arnold, there is no need to write user code, it is already done in benchmarks. One of the standard benchmarks when testing HDDs and SSDs is read throughput vs block size and at different queue depths. Take a look at this" https://www.servethehome.com/wp-content/uploads/2019/12/Corsair-Force-MP600-1TB-ATTO.jpg . In this benchmark, at queue depth 4 and 128KB block size, the SSD was not yet able to achieve the maximum throughput 5GB/s. 
Moreover, if you extrapolate the results to a queue depth of 1, you get about 1.2GB/s out of over 5GB/s theoretical. Therefore, for this particular model you need to issue read requests with a block size of at least 512KB to achieve maximum throughput.

With hard drives I already explained the issue. I have a production server where the HDD RAID array can theoretically do 2.5GB/s, and I see sustained read speeds over 500MB/s when large block sizes are used for reads, yet when I use grep, I get a practical bandwidth of 20 to 50 MB/s. Moreover, when it comes to HDDs the math is quite simple, and here it is for a standard HDD at 7200 RPM, 240MB/s:

7200 RPM => 120 revolutions per second
240 MB/s at 120 revolutions per second => 2MB per revolution
One revolution time = 1000/120 => 8.33 ms
Read throughput per ms = 240KB

Worst case scenario: each read request requires a full revolution to reach the data (head positioning is done concurrently and can be ignored).

Seek time: 8.33ms
At 96KB:
 - Read time: 0.4ms
 - Total read latency = 8.33 + 0.4 = 8.73ms, read throughput = 1000 / 8.73 * 96KB = 11MB/s
At 512KB:
 - Read time: 2.13ms
 - Total read latency = 8.33 + 2.13 = 10.46ms, read throughput = 1000 / 10.46 * 512KB = 49MB/s

In practice average seek latencies are 4.16ms, so throughput is up to about double. This is the cold hard reality. In practice, when each one of you is testing, you are very likely deceived by testing on *one HDD, on an idle system* where you don't have anything else consuming IO in the background, like a database. In such an ideal scenario you do see 240MB/s, because HDDs also read ahead: by the time the data is transferred over the interface and consumed, the next chunk is in the buffer and can be delivered with an apparent seek time of 0. This means the first read takes 4ms and the next ones take 0.1ms. With *a HDD RAID array on a server where your IO is always at 50% load*, if you have a stripe size of 128KB or more, you are hitting one drive at a time, each one with a penalty of 4.16ms.
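As a rough sketch of the arithmetic above, the following Python snippet models a drive where every read request first waits for some fraction of a revolution and then transfers at the media rate. The function name, its parameters, and the extra block sizes in the loop are illustrative assumptions and are not part of grep or of any tool mentioned in this thread; the model also ignores drive-level read-ahead, caching, and queueing.

```python
def hdd_throughput_mb_s(block_kb, rpm=7200, media_mb_s=240, rotations_waited=1.0):
    """Estimate throughput (MB/s) when every read pays a rotational delay.

    rotations_waited=1.0 is the worst case from the message (a full
    revolution per request); 0.5 is the average case (half a revolution).
    Decimal units are used throughout: 1 MB = 1000 KB.
    """
    revolution_ms = 60_000 / rpm                  # 7200 RPM -> 8.33 ms per revolution
    wait_ms = rotations_waited * revolution_ms    # delay before the data reaches the head
    transfer_ms = block_kb / media_mb_s           # 240 MB/s equals 240 KB per ms
    reads_per_second = 1000 / (wait_ms + transfer_ms)
    return reads_per_second * block_kb / 1000     # KB/s -> MB/s


if __name__ == "__main__":
    for block_kb in (96, 128, 512, 768):
        worst = hdd_throughput_mb_s(block_kb)
        average = hdd_throughput_mb_s(block_kb, rotations_waited=0.5)
        print(f"{block_kb:4d} KB: worst case {worst:5.1f} MB/s, "
              f"average case {average:5.1f} MB/s")
```

Under this model a 96KB request comes out at about 11 MB/s in the worst case and a 512KB request at about 49 MB/s, so larger requests amortize the fixed rotational delay over more data, which is the point the figures above are making.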
And due to the constant load, by the time you hit the first HDD again, the read-ahead buffer maintained by the HDD itself has also been discarded, so all reads go directly to the physical medium. If, however, you hit all HDDs at the same time, you will benefit from the HDD's read-ahead for at least one or more cycles, thus getting reads with apparent 0 latency and a much higher average bandwidth. The cost of reading from all HDDs at the same time is the potential for adding extra latency for all other running applications. This is why the value should be configurable, so that the best value can be set based on the hardware.

The block-size issue for IO operations is widespread across Linux tools such as rsync and cp, and it's only getting worse, to the extent that in my company we are considering writing our own tools for something that should have worked out of the box.

One side issue, which I have to mention as I'm not aware of the implementation details: as we are getting into GB/s territory, reading is best done in its own thread, which then serves the output to the processing thread. With SSDs that can do multiple GB/s this matters.

On Wed, 1 Jan 2020 at 12:19, wrote: > As a quite serious question, how is someone writing user-level code > supposed to be able to figure out the right buffer size for a particular > file, and to do so portably? ("Show me the code.") > > Gawk bases its reads on the st_blksize member in struct stat. That will > typically be something like 4K - not nearly enough, given your description > below. > > Arnold > > Sergiu Hlihor wrote: > > > This topic is getting more and more frustrating. If you rely on OS, then > > you are at the mercy of whatever read ahead configuration you have. And > > read ahead is typically 128KB so does not help that much. A HDD RAID 10 > > array with 12 disks and a strip size of 128KB reaches the maximum read > > throughput if read block size is 6 * 128 = 768KB. 
When issuing read > > requests with 128KB , you only hit one HDD, having 1/6 read throughput. > > With flash the same. A state of the art SSD that can do 5GB/s reads can > > actually do around 1GB/s or less at 128KB block size. Why is so hard to > > understand how hardware works and the fact that you need huge block sizes > > to actually read at full speed? Why not just exposing the read buffer > size > > as a configurable parameter, then anyone can just tune it as needed? 96KB > > is purely retarded. > > > > On Wed, 1 Jan 2020 at 08:52, Paul Eggert wrote: > > > > > > This makes me think we should follow Coreutils' lead[0] and increase > > > > grep's initial buffer size from 32KiB, probably to 128KiB. > > > > > > I see that Jim later installed a patch increasing it to 96 KiB. > > > > > > Whatever number is chosen, it's "wrong" for some configuration. And I > > > suppose > > > the particular configuration that Sergiu Hlihor mentioned could be > tweaked > > > so > > > that it worked better with grep (and with other programs). > > > > > > I'm inclined to mark this bug report as a wishlist item, in the sense > that > > > it'd > > > be nice if grep and/or the OS could pick buffer sizes more > intelligently > > > (though > > > it's not clear how grep and/or the OS could go about this). > > > > --000000000000204a27059b18c80b Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
--000000000000204a27059b18c80b-- From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 14:43:04 2020 Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 19:43:04 +0000 Received: from localhost ([127.0.0.1]:37595 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imjtY-0007LK-0Q for submit@debbugs.gnu.org; Wed, 01 Jan 2020 14:43:04 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:43186) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imjtV-0007Kk-3q for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 14:43:01 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id BF9D2160052; Wed, 1 Jan 2020 11:42:54 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id xp1ZcUe4sLgB; Wed, 1 Jan 2020 11:42:54 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 1B474160054; Wed, 1 Jan 2020 11:42:54 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id K68Jkv66INS6; Wed, 1 Jan 2020 11:42:54 -0800 (PST) Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id E1E10160052; Wed, 1 Jan 2020 11:42:53 -0800 (PST) Subject: Re: Improvements in Grep (Bug#32073) To: Sergiu Hlihor References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu> Date: Wed, 1 Jan 2020 11:42:53 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) 
X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, Dennis Clarke , Jim Meyering X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) On 1/1/20 1:15 AM, Sergiu Hlihor wrote: > If you rely on OS, then > you are at the mercy of whatever read ahead configuration you have. Right, and whatever changes you make to the OS and its read-ahead configuration will work for all applications, not just for 'grep'. So, change the OS to do that. There shouldn't be a need to change 'grep' in particular (or 'cp' in particular, or 'awk' in particular, etc.). > The issue of large > block sizes for IO operations is widespread across all tools from Linux, > like rsync or cp and its only getting worse Quite right. And it would be painful to have to modify all those tools, and to maintain those modifications. So modify the OS instead. Scheduling read-ahead is really the OS's job anyway. 
From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 15:04:59 2020 Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 20:04:59 +0000 Received: from localhost ([127.0.0.1]:37607 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imkEk-0007qg-W0 for submit@debbugs.gnu.org; Wed, 01 Jan 2020 15:04:59 -0500 Received: from mail-il1-f174.google.com ([209.85.166.174]:38191) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imkEi-0007qS-HO for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 15:04:57 -0500 Received: by mail-il1-f174.google.com with SMTP id f5so32700534ilq.5 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 12:04:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=discovergy-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=f3gPqw//sPzxPArZaLCn5qCkkS0muRBetlNjDRv9cFw=; b=PtSs/uSm5aVIUCaZXg5w2ZCQsGUa5lQcpOH77ANuNNf+2piUl9tePpfnUfa+N231b4 LN4/iPcDPDuxS0SIErtA/9cOBH/lAoggtTqhmsze0Itxtal1Q9rl/k8kp8VqGzZQpQob Ug/YVEttA1WULSbvtaLmx1SjBtb/oyt+GX5JZGxYNo9Ww3dc7YmUWz2t358Kk8eHku4n AAuP6kIkhOBQGZrqMzVe6dGCeElWKUgInkinYqpWWinD5gPCuIskIA2m6WHb/ZzWtYJO Bg3i9doIJ05U5BZhHYJqmkAV0+RhRClx2oYc0GcSnvtQFY0w8BnZ0HwT6ojKsICI+GOj Npwg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=f3gPqw//sPzxPArZaLCn5qCkkS0muRBetlNjDRv9cFw=; b=TTMmEUMp/2qYER9W7/OH0NJfNkVCjbQ93a4SZ7lKU/VMOUdrH4ntOlQ9Amyk0MU/v1 3RsWIXbLs3d5Bvod84nNtN3oRc6770kVemblTN0zGh591o2vySDfEU7lFqo/SN++ugiw BFq+RXTDSXQdUrhRnmBlhWSeWncdp2Zwyye0U5zGxiT7oc7gzH9rck9fxd7lIUXd5zV5 qYdiLaSYKrJKhjB0ursaf6rybkB+EzQntUFGQodz0ImJBmSAGPvVKXbP4gEKmsEpz3ZQ KWsWsQ+MbGam9m5Lz5hvB3M1Nk7epL+P+v4dWFHy7xqAqF3gcW/QJnnkNCea6nvBcBot uwzg== X-Gm-Message-State: APjAAAUfpXpEeR5HZA5G+rM0bsnSPtOv+/39pEfYmiBjqtAWV0/L+AwN QTe0uc5cRMdRzLiUpQyGFaZaG5fYHUKEpK6C02kW/Q== X-Google-Smtp-Source: 
APXvYqyXJkilmFlV4mF5TVSrUwHwx6bE4fHBFq6X4gjgNZsVhsoYnsx/3nyr/iAeOnF8GKBQd+dW7wxgbdEBWGsQysk= X-Received: by 2002:a92:ce09:: with SMTP id b9mr64895585ilo.219.1577909091082; Wed, 01 Jan 2020 12:04:51 -0800 (PST) MIME-Version: 1.0 References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu> In-Reply-To: <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu> From: Sergiu Hlihor Date: Wed, 1 Jan 2020 21:04:39 +0100 Message-ID: Subject: Re: Improvements in Grep (Bug#32073) To: Paul Eggert Content-Type: multipart/alternative; boundary="000000000000dcbab1059b199639" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, Dennis Clarke , Jim Meyering X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) --000000000000dcbab1059b199639 Content-Type: text/plain; charset="UTF-8" Paul, I have to correct you. On a production server you have usually a mix of applications many times including databases. For databases, having a read ahead means one IO less since usually database access patterns are random reads. Here actually best is to disable completely read ahead. In fact, I do have to say that probably best is to disable completely read ahead and let applications deal with it, either in an automatic fashion, like reading the optimal IO block size from device or in a configurable way with defaults good enough for today's servers. If you now configure the OS to do a read ahead hitting all HDDs then you induce potentially unnecessary IO load for all applications which use it, which when having HDDs is totally unacceptable. That's why the best is to be application specific and ideally configured to use optimal IO block size. So no, letting OS to do it is stupid. 
On Wed, 1 Jan 2020 at 20:42, Paul Eggert wrote: > On 1/1/20 1:15 AM, Sergiu Hlihor wrote: > > If you rely on OS, then > > you are at the mercy of whatever read ahead configuration you have. > > Right, and whatever changes you make to the OS and its read-ahead > configuration > will work for all applications, not just for 'grep'. So, change the OS to > do > that. There shouldn't be a need to change 'grep' in particular (or 'cp' in > particular, or 'awk' in particular, etc.). > > > The issue of large > > block sizes for IO operations is widespread across all tools from Linux, > > like rsync or cp and its only getting worse > > Quite right. And it would be painful to have to modify all those tools, > and to > maintain those modifications. So modify the OS instead. Scheduling > read-ahead is > really the OS's job anyway. > --000000000000dcbab1059b199639 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Paul, I have to correct you. On a production server you have usually a mix of applications many times including databases. For databases, having a read ahead means one IO less since usually database access patterns are random reads. Here actually best is to disable completely read ahead. In fact, I do have to say that probably best is to disable completely read ahead and let applications deal with it, either in an automatic fashion, like reading the optimal IO block size from device or in a configurable way with defaults good enough for today's servers. If you now configure the OS to do a read ahead hitting all HDDs then you induce potentially unnecessary IO load for all applications which use it, which when having HDDs is totally unacceptable. That's why the best is to be application specific and ideally configured to use optimal IO block size.

So no, letting OS to do it is stupid.

On Wed, 1 Jan 2020 at 20:42, Paul Eggert <eggert@cs.ucla.edu> wrote:
On 1/1/20 1:15 AM, Sergiu Hlihor wrote:
> If you rely on OS, then
> you are at the mercy of whatever read ahead configuration you have.

Right, and whatever changes you make to the OS and its read-ahead configuration
will work for all applications, not just for 'grep'. So, change the OS to do
that. There shouldn't be a need to change 'grep' in particular (or 'cp' in
particular, or 'awk' in particular, etc.).

> The issue of large
> block sizes for IO operations is widespread across all tools from Linux,
> like rsync or cp and its only getting worse

Quite right. And it would be painful to have to modify all those tools, and to
maintain those modifications. So modify the OS instead. Scheduling read-ahead is
really the OS's job anyway.

--000000000000dcbab1059b199639-- From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 15:24:36 2020 Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 20:24:36 +0000 Received: from localhost ([127.0.0.1]:37619 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imkXj-0008Ir-VL for submit@debbugs.gnu.org; Wed, 01 Jan 2020 15:24:36 -0500 Received: from freefriends.org ([96.88.95.60]:49340) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imkXi-0008Ik-D9 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 15:24:34 -0500 X-Envelope-From: arnold@skeeve.com Received: from freefriends.org (freefriends.org [96.88.95.60]) by freefriends.org (8.14.7/8.14.7) with ESMTP id 001KOQ9E012802 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 1 Jan 2020 13:24:27 -0700 Received: (from arnold@localhost) by freefriends.org (8.14.7/8.14.7/Submit) id 001KOQMn012801; Wed, 1 Jan 2020 13:24:26 -0700 From: arnold@skeeve.com Message-Id: <202001012024.001KOQMn012801@freefriends.org> X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to arnold@skeeve.com using -f Date: Wed, 01 Jan 2020 13:24:26 -0700 To: sh@discovergy.com, arnold@skeeve.com Subject: Re: bug#32073: Improvements in Grep (Bug#32073) References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <202001011119.001BJMYA027994@freefriends.org> In-Reply-To: User-Agent: Heirloom mailx 12.5 7/5/10 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Score: 0.1 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, eggert@cs.ucla.edu X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.9 (/) Hi. 
Sergiu Hlihor wrote: > Arnold, there is no need to write user code, it is already done in > benchmarks. One of the standard benchmarks when testing HDDs and SSDs is > read throughput vs block size and at different queue depths. I think you're misunderstanding me, or I am misunderstanding you. As the gawk maintainer, I can choose the buffer size to use every time I issue a read(2) system call for any given input file. Gawk currently uses the smaller of (a) the file's size or (b) the st_blksize member of the struct stat array. If I understand you correctly, this is "not enough"; gawk (grep, cp, etc.) should all use an optimal buffer size that depends upon the underlying storage hardware where the file is located. So far, so good, except for: How do I determine what that number is? I cannot run a benchmark before opening each and every file. I don't know of a system call that will give me that number. (If there is, please point me to it.) Do you just want a command line option or environment variable that you, as the application user, can set? If the latter, it happens that gawk will let you set AWKBUFSIZE and it will use whatever number you supply for doing reads. (This is even documented.) 
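The policy described above, together with the kind of override that AWKBUFSIZE provides, can be sketched as follows. This is not gawk's or grep's code: the function name, the READ_BUFSIZE variable, and the 4096-byte fallback are hypothetical, chosen only to illustrate picking the smaller of the file size and st_blksize unless the user supplies a number.

```python
import os
import sys


def choose_read_size(path, env_var="READ_BUFSIZE", environ=None):
    """Pick a read buffer size for `path`.

    Hypothetical policy: a positive integer in `env_var` wins; otherwise
    use the smaller of the file size and the st_blksize hint.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(env_var)
    if override is not None:
        try:
            size = int(override)
        except ValueError:
            size = 0  # ignore values that are not integers
        if size > 0:
            return size

    st = os.stat(path)
    # st_blksize is not available on every platform; 4096 is an assumed fallback.
    blksize = getattr(st, "st_blksize", 0) or 4096
    if st.st_size > 0:
        return min(st.st_size, blksize)
    return blksize


if __name__ == "__main__" and len(sys.argv) > 1:
    print(choose_read_size(sys.argv[1]))
```

The sketch also shows the limitation raised in this message: st_blksize is only a filesystem hint, and nothing in struct stat reports a stripe width or a device's preferred request size, so a user-supplied value is the only way this code learns about the hardware.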
HTH, Arnold From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 16:02:49 2020 Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 21:02:49 +0000 Received: from localhost ([127.0.0.1]:37654 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1iml8j-0000lx-3d for submit@debbugs.gnu.org; Wed, 01 Jan 2020 16:02:49 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:48738) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1iml8g-0000lf-Gi for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 16:02:47 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 1D85D160052; Wed, 1 Jan 2020 13:02:39 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id td9dIdw_GCN9; Wed, 1 Jan 2020 13:02:38 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 7A6AA160054; Wed, 1 Jan 2020 13:02:38 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id N3e1b83al3QG; Wed, 1 Jan 2020 13:02:38 -0800 (PST) Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 51D6C160052; Wed, 1 Jan 2020 13:02:38 -0800 (PST) Subject: Re: bug#32073: Improvements in Grep (Bug#32073) To: Sergiu Hlihor References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <0c596c01-3a43-2651-7de8-50d92ae195a4@cs.ucla.edu> Date: Wed, 1 Jan 2020 13:02:38 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US 
Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) On 1/1/20 12:04 PM, Sergiu Hlihor wrote: > That's why the best is to be application specific That doesn't mean that one should have to modify every application. One could instead modify the OS so that it uses different read-ahead heuristics for different classes of applications. This should be easier to manage than modifying every individual application. From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 16:46:15 2020 Received: (at submit) by debbugs.gnu.org; 1 Jan 2020 21:46:15 +0000 Received: from localhost ([127.0.0.1]:37671 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imlol-0001m9-FE for submit@debbugs.gnu.org; Wed, 01 Jan 2020 16:46:15 -0500 Received: from lists.gnu.org ([209.51.188.17]:36204) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imlok-0001m2-Fw for submit@debbugs.gnu.org; Wed, 01 Jan 2020 16:46:14 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:41740) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1imloi-0006aO-S5 for bug-grep@gnu.org; Wed, 01 Jan 2020 16:46:14 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.1 required=5.0 tests=BAYES_50,RCVD_IN_DNSWL_LOW, URIBL_BLOCKED autolearn=disabled version=3.3.2 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1imloh-0007Jc-Mw for bug-grep@gnu.org; Wed, 01 Jan 2020 16:46:12 -0500 Received: from out3-smtp.messagingengine.com ([66.111.4.27]:42797) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) 
id 1imloh-0007J0-EU for bug-grep@gnu.org; Wed, 01 Jan 2020 16:46:11 -0500 Received: from compute1.internal (compute1.nyi.internal [10.202.2.41]) by mailout.nyi.internal (Postfix) with ESMTP id BB67E2234B for ; Wed, 1 Jan 2020 16:46:10 -0500 (EST) Received: from imap34 ([10.202.2.84]) by compute1.internal (MEProxy); Wed, 01 Jan 2020 16:46:10 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to:x-me-proxy :x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm1; bh=1lZVA+ i/aNbISUaTQxnlsayXO9m5ai4v70uzaoJnjf8=; b=h4D19IOsSFh+M6g73+sQnr QJG90tT+P2IiguwhZhb1Ft+nsk5aE/8bGTNpL3vOcKJspn2deBc/jEbiLX9Gp2qe DOzYXhVUH6OGVvHnIGulN9GUguvgqNfbt9UC5vqdkr6jLuXK9RyT6pyTrD38acU6 RmmdYhMOVi6F89BVZApfBhtsbiePo3ERZfNauGOEeGqpE5FQ6B7Rg6J42akfU7/J w3Fh5UZ2zPeBILfSh56hlaY69HAGwaI0GFb8iwZIrXhs6eTLJg1lyipZwV1jCn3i Y9KKzGRr89E2NV6ZnEELGqkL8mOJr0iUFhtq1e3AiDeHdd/SEiaFHOkJrvmyzuOQ == X-ME-Sender: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedufedrvdefledgudehvdcutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfgh necuuegrihhlohhuthemuceftddtnecunecujfgurhepofgfggfkjghffffhvffutgesth dtredtreertdenucfhrhhomhepfdfrrghulhculfgrtghkshhonhdfuceophhjsehushgr rdhnvghtqeenucfrrghrrghmpehmrghilhhfrhhomhepphhjsehushgrrdhnvghtnecuve hluhhsthgvrhfuihiivgeptd X-ME-Proxy: Received: by mailuser.nyi.internal (Postfix, from userid 501) id 3B42A1460061; Wed, 1 Jan 2020 16:46:10 -0500 (EST) X-Mailer: MessagingEngine.com Webmail Interface User-Agent: Cyrus-JMAP/3.1.7-694-gd5bab98-fmstable-20191218v1 Mime-Version: 1.0 Message-Id: In-Reply-To: References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu> Date: Wed, 01 Jan 2020 15:45:54 -0600 From: "Paul Jackson" To: bug-grep@gnu.org Subject: Re: bug#32073: Improvements in Grep (Bug#32073) Content-Type: text/plain X-detected-operating-system: by eggs.gnu.org: 
GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 66.111.4.27 X-Spam-Score: -1.6 (-) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.6 (--) >From my old Unix fart view point, Paul (the other Paul) is herding a hundred GNU cats, small command line utilities, many of which date their origins back to the 1970's, many of which have over the years grown their own internal i/o routines with specific performance specializations, but few of which have much in the way of user customizable i/o blocking and read-ahead customizations. Except for the last decade, those commands spent almost their entire lives running off spinning rust platters, which grew (immensely) in size over the years, but which did not change much in other performance characteristics. Those commands are in general not well suited to adapting to provide maximally optimal performance across the recent generation of storage devices, with their much more varied performance characteristics. I'm guessing that Sergiu has some specific needs that it seems that grep meets, except that grep (like its hundred cat siblings) lacks the tunable i/o characteristics needed to get maximum performance across a rapidly evolving variety of these more recent kinds of storage. What I've done in situations such as I suspect Sergiu finds himself in is to code up a custom utility, that met my specific needs, when I had higher performance demands, while continuing to make extensive use of the general purpose classic Unix/Linux command line utilities that Paul E. now herds. I can't imagine that it would make sense to attempt to recode a hundred classic GNU utilities to each be intelligently adaptable goats/pigs/cats/dogs/cows/bison/... depending on the i/o terrain they were running on. 
Many many thanks to Paul E. for herding these cats all these many years. I hope my weird comments to not cause him even the slightest distress. (The word "cat" above refers to four legged felines, not to the concatenate command line utility.) -- Paul Jackson pj@usa.net From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 19:51:20 2020 Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 00:51:21 +0000 Received: from localhost ([127.0.0.1]:37827 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imohs-0003uy-Ks for submit@debbugs.gnu.org; Wed, 01 Jan 2020 19:51:20 -0500 Received: from mail-wr1-f67.google.com ([209.85.221.67]:38805) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imohq-0003ul-0s for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 19:51:18 -0500 Received: by mail-wr1-f67.google.com with SMTP id y17so37907645wrh.5 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 16:51:17 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=+G2u1+TYKprGWirYES7YqULTKruNQnNn0zOaJRjBkZQ=; b=DJpehv2iK45sVa2pCWwsjuhscGIgi7Vi+JiDPqoKcUIob746N1bEwKed6Zz4uBTo9J 79eVt33udjV2xpDpaBAbUI0+JClV5SM+w5iEsbV0baXoAD+PkggvHJlJyD4hIVd2kP4O O0dkvRo5s161Ji2xmGe4jjxgLfiZs1Tlbt1ZM4yEdEJ/XvBYVJa1fMgNdtC4bHDmtth8 EcfBurLtE+kUPbjWpdJJ223Xz9gRhcVjLod4RgxiZCFORQDSHSQmGkwjHQGytLv2NjD+ xotEpLdrbGq5KeOyV4w0qtm/f0wbzXmHQE3rgggERy8/QH+o63Pu6RGSU76ILOYZ8Enk GflA== X-Gm-Message-State: APjAAAUvGuD5mRkDz7QdxDJYD/M5KrwuoEZQvDe9YthH/7iwGE96hNvJ zm+pZHTKXMSPjejYb6YEoyW91XAaUVX51yl+rUE= X-Google-Smtp-Source: APXvYqx0LgCp9iag1ABv/x0dAx+wgdhnP1u3ZR3gJKeHrLjJoDlAoey26zheesXxnwVnWVakWy8OeBIjdeOkucRiy2M= X-Received: by 2002:a5d:670a:: with SMTP id o10mr82667154wru.227.1577926272259; Wed, 01 Jan 2020 16:51:12 -0800 (PST) MIME-Version: 1.0 References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> 
<299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu> In-Reply-To: From: Jim Meyering Date: Wed, 1 Jan 2020 16:51:00 -0800 Message-ID: Subject: Re: Improvements in Grep (Bug#32073) To: Sergiu Hlihor Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, Paul Eggert , Dennis Clarke X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.5 (/) On Wed, Jan 1, 2020 at 12:04 PM Sergiu Hlihor wrote: > Paul, I have to correct you. On a production server you have usually a mi= x of applications many times including databases. For databases, having a r= ead ahead means one IO less since usually database access patterns are rand= om reads. Here actually best is to disable completely read ahead. In fact, = I do have to say that probably best is to disable completely read ahead and= let applications deal with it, either in an automatic fashion, like readin= g the optimal IO block size from device or in a configurable way with defa= ults good enough for today's servers. If you now configure the OS to do a r= ead ahead hitting all HDDs then you induce potentially unnecessary IO load = for all applications which use it, which when having HDDs is totally unacce= ptable. That's why the best is to be application specific and ideally confi= gured to use optimal IO block size. > > So no, letting OS to do it is stupid. > > On Wed, 1 Jan 2020 at 20:42, Paul Eggert wrote: >> >> On 1/1/20 1:15 AM, Sergiu Hlihor wrote: >> > If you rely on OS, then >> > you are at the mercy of whatever read ahead configuration you have. >> >> Right, and whatever changes you make to the OS and its read-ahead config= uration >> will work for all applications, not just for 'grep'. 
So, change the OS t= o do >> that. There shouldn't be a need to change 'grep' in particular (or 'cp' = in >> particular, or 'awk' in particular, etc.). >> >> > The issue of large >> > block sizes for IO operations is widespread across all tools from Linu= x, >> > like rsync or cp and its only getting worse >> >> Quite right. And it would be painful to have to modify all those tools, = and to >> maintain those modifications. So modify the OS instead. Scheduling read-= ahead is >> really the OS's job anyway. Hi Sergiu, If you would like to help make grep use larger buffer sizes, please run and report benchmarks measuring how much of a difference it would make, at least for your hardware. Here are some of the tests I ran to justify raising it from ~32k to ~96k: https://lists.gnu.org/archive/html/grep-devel/2018-10/msg00002.html From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 20:04:17 2020 Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 01:04:17 +0000 Received: from localhost ([127.0.0.1]:37835 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imouO-0004E8-SV for submit@debbugs.gnu.org; Wed, 01 Jan 2020 20:04:17 -0500 Received: from mail-io1-f47.google.com ([209.85.166.47]:44138) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imouM-0004Dv-Q6 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 20:04:15 -0500 Received: by mail-io1-f47.google.com with SMTP id b10so36954283iof.11 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 17:04:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=discovergy-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=uG0zcpsFJMb40VAoaqgG4yBxN9fQ8HWatjcq1WBdwuI=; b=MrT5OWrM9nJE49cTUjxs8k/CxT7nbY4ZeVQEGSTjEnMFfbQgATGf6icSTcK75Z88No nNl+qTwFLLBjZattlCjmMwjNt8ZavrfHuQJQJUOMBpTmDoB6y+kw/Hp3G5lBJ5zuSawo EgkmrtKl6uGtcn+GLpXN0/U+qbL7M2RfFYL30m0JYOBRix5Yt95amdM6LpKCvddxzao8 
nXZRyxNjdFAEBlTNx2e9ItM8eCid8K/Yu+gbtEl6aMmyh5FuwU7GaMLAjGGObUIGWqkc jiMxWWi+Zp/GIXeZKmkeOuZwGz8xt9iuBOC6w/J19PbEJagxok0z8tZD2+9n/HZuWW9E HHJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=uG0zcpsFJMb40VAoaqgG4yBxN9fQ8HWatjcq1WBdwuI=; b=nUyERW3t+ZcnzUWltGBTcmkQR4kKsjbsyF320UpOEb5933zi5sJoVEw7z0JDP1rDfV FTeC7XyNHNGI7zX8rQnDkOhKs+tPCFRX4SomGFhkhIFuuEJtT4/IQpPGFpRIsuicQifn +hPRNqytX/ulsOZJL5Le0w8fTXV03dHuosziGZqMBPDJsG824Czh51KM0ijQf+VaEYXY 3QsP9zH3EufrihVbr0jprdN/b43SMG7JsgGJUa1NL1pDcGpUJ1z0KAlrEiFptwzDtaTQ JhYwBaCQcWUuTW97ch2C4GPYKSXlMCHKoPoYuufa8T2zMwi/+UMknLMiip3u+qcHP9sT QQLA== X-Gm-Message-State: APjAAAUBj/1bVqH9LaAq+VlcVHbDLEec28q59Knq8uW/Ze9lRNRmT/M+ p+Zbru5g2e8Y9UmFhGS3x4ih5Z7nQhhWzjs0qBtrMQ== X-Google-Smtp-Source: APXvYqxTMks5Ajdq0T7iEzgJjZoWJOPNuJpms5hMMgKkp0KY1CrYSQcHExV4liYzv2ybrV18DBFtVhqQwDumVAvR104= X-Received: by 2002:a5e:8505:: with SMTP id i5mr50080878ioj.158.1577927049287; Wed, 01 Jan 2020 17:04:09 -0800 (PST) MIME-Version: 1.0 References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu> In-Reply-To: From: Sergiu Hlihor Date: Thu, 2 Jan 2020 02:03:58 +0100 Message-ID: Subject: Re: Improvements in Grep (Bug#32073) To: Jim Meyering Content-Type: multipart/alternative; boundary="000000000000412dba059b1dc5f9" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, Paul Eggert , Dennis Clarke X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) --000000000000412dba059b1dc5f9 Content-Type: text/plain; charset="UTF-8" Hi Jim, The system for which this hurts me the most is an Ubuntu 14.04 where I'd need to run it as a separate 
binary. As I'm not familiar with the way it's built, is there any guidelines of how to build it from sources? I'd happy build it with ever larger block sizes and test. On Thu, 2 Jan 2020 at 01:51, Jim Meyering wrote: > On Wed, Jan 1, 2020 at 12:04 PM Sergiu Hlihor wrote: > > Paul, I have to correct you. On a production server you have usually a > mix of applications many times including databases. For databases, having a > read ahead means one IO less since usually database access patterns are > random reads. Here actually best is to disable completely read ahead. In > fact, I do have to say that probably best is to disable completely read > ahead and let applications deal with it, either in an automatic fashion, > like reading the optimal IO block size from device or in a configurable > way with defaults good enough for today's servers. If you now configure the > OS to do a read ahead hitting all HDDs then you induce potentially > unnecessary IO load for all applications which use it, which when having > HDDs is totally unacceptable. That's why the best is to be application > specific and ideally configured to use optimal IO block size. > > > > So no, letting OS to do it is stupid. > > > > On Wed, 1 Jan 2020 at 20:42, Paul Eggert wrote: > >> > >> On 1/1/20 1:15 AM, Sergiu Hlihor wrote: > >> > If you rely on OS, then > >> > you are at the mercy of whatever read ahead configuration you have. > >> > >> Right, and whatever changes you make to the OS and its read-ahead > configuration > >> will work for all applications, not just for 'grep'. So, change the OS > to do > >> that. There shouldn't be a need to change 'grep' in particular (or 'cp' > in > >> particular, or 'awk' in particular, etc.). > >> > >> > The issue of large > >> > block sizes for IO operations is widespread across all tools from > Linux, > >> > like rsync or cp and its only getting worse > >> > >> Quite right. 
And it would be painful to have to modify all those tools, > and to > >> maintain those modifications. So modify the OS instead. Scheduling > read-ahead is > >> really the OS's job anyway. > > Hi Sergiu, > > If you would like to help make grep use larger buffer sizes, please > run and report benchmarks measuring how much of a difference it would > make, at least for your hardware. Here are some of the tests I ran to > justify raising it from ~32k to ~96k: > https://lists.gnu.org/archive/html/grep-devel/2018-10/msg00002.html > --000000000000412dba059b1dc5f9 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Jim,
The system for which this hurts me the most is an Ubuntu 14.04 where I'd need to run it as a separate binary. As I'm not familiar with the way it's built, are there any guidelines on how to build it from sources? I'd happily build it with ever larger block sizes and test.

On Thu, 2 Jan 2020 at 01:51, Jim Meyering <jim@meyering.net> wrote:
On Wed, Jan 1, 2020 at 12:04 PM Sergiu Hlihor <sh@discovergy.com> wrote:
> Paul, I have to correct you. On a production server you have usually a mix of applications many times including databases. For databases, having a read ahead means one IO less since usually database access patterns are random reads. Here actually best is to disable completely read ahead. In fact, I do have to say that probably best is to disable completely read ahead and let applications deal with it, either in an automatic fashion, like reading the optimal IO block size from device or in a configurable way with defaults good enough for today's servers. If you now configure the OS to do a read ahead hitting all HDDs then you induce potentially unnecessary IO load for all applications which use it, which when having HDDs is totally unacceptable. That's why the best is to be application specific and ideally configured to use optimal IO block size.
>
> So no, letting OS to do it is stupid.
>
> On Wed, 1 Jan 2020 at 20:42, Paul Eggert <eggert@cs.ucla.edu> wrote:
>>
>> On 1/1/20 1:15 AM, Sergiu Hlihor wrote:
>> > If you rely on OS, then
>> > you are at the mercy of whatever read ahead configuration you have.
>>
>> Right, and whatever changes you make to the OS and its read-ahead configuration
>> will work for all applications, not just for 'grep'. So, change the OS to do
>> that. There shouldn't be a need to change 'grep' in particular (or 'cp' in
>> particular, or 'awk' in particular, etc.).
>>
>> > The issue of large
>> > block sizes for IO operations is widespread across all tools from Linux,
>> > like rsync or cp and its only getting worse
>>
>> Quite right. And it would be painful to have to modify all those tools, and to
>> maintain those modifications. So modify the OS instead. Scheduling read-ahead is
>> really the OS's job anyway.

Hi Sergiu,

If you would like to help make grep use larger buffer sizes, please
run and report benchmarks measuring how much of a difference it would
make, at least for your hardware. Here are some of the tests I ran to
justify raising it from ~32k to ~96k:
https://lists.gnu.org/archive/htm= l/grep-devel/2018-10/msg00002.html
--000000000000412dba059b1dc5f9-- From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 20:28:36 2020 Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 01:28:37 +0000 Received: from localhost ([127.0.0.1]:37880 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1impHw-0006mb-MV for submit@debbugs.gnu.org; Wed, 01 Jan 2020 20:28:36 -0500 Received: from mail-wr1-f47.google.com ([209.85.221.47]:36433) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1impHu-0006mO-T9 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 20:28:35 -0500 Received: by mail-wr1-f47.google.com with SMTP id z3so37948657wru.3 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 17:28:34 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=eU99TcSbcI2iR4jYDl47FXSwzC2QYmBjYtFB7wsDhlY=; b=bdN0ieBVyX4L9runq/UiI8jXpfthpTXNvsdZ2TI/tLLgSaEQS4J2fFufVfjOb0vpUA gYjg/rDRm0YwzSt7f+Hl7gqU5IFYQDqbMVAo/K6lUaF7i1Z4kxxq3ycNb+9LqYgWiniQ GlNwmffzXafOBCpH0fWP6TcQXRdMkSZqs43t8n6tY5eog95v3G8m6di57iszM7iG6pOE +mO8bRCnFo+q3SYmZ2jStV0+8ffaQ2Zxz73deIuiqEHvtATRWE4jDeJsnfI80saRircX TTEhGdVuwFWqtMa1fzyzEUuEABE7a73qAJDFPk2NB9ovcgwHfo/etGYtfsN7KSPnARO2 YHjA== X-Gm-Message-State: APjAAAXIIyeRpJMZAZQvcCyCRXpt7S6SX9b1QIlnC/7btxPsXWgwFFle zTCQ5tYxN0Px15Iw5wpjDnr/z8/pJXoWwOid6Bs= X-Google-Smtp-Source: APXvYqySTEjnPE8tRJ0a5sSTVSgL2avrAP8ymb5OUQxtAPFVBAkSIbmHnq3zRcia95N5M7QjqAOZAcPcDQJzboNRLyE= X-Received: by 2002:adf:8b4f:: with SMTP id v15mr50952033wra.231.1577928509167; Wed, 01 Jan 2020 17:28:29 -0800 (PST) MIME-Version: 1.0 References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu> In-Reply-To: From: Jim Meyering Date: Wed, 1 Jan 2020 17:28:17 -0800 Message-ID: Subject: Re: Improvements in Grep (Bug#32073) To: Sergiu Hlihor Content-Type: text/plain; charset="UTF-8" 
Content-Transfer-Encoding: quoted-printable X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, Paul Eggert , Dennis Clarke X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.5 (/)

On Wed, Jan 1, 2020 at 5:04 PM Sergiu Hlihor wrote:
> The system for which this hurts me the most is an Ubuntu 14.04 where I'd need to run it as a separate binary. As I'm not familiar with the way it's built, is there any guidelines of how to build it from sources? I'd happy build it with ever larger block sizes and test.

Something like the following should work:
(if you want to be more careful than most, also download the .sig file,
https://meyering.net/grep/grep-3.3.49-3f11.tar.xz.sig, and use that to
verify the .xz file is the same one I signed -- do that before running
./configure)

  wget https://meyering.net/grep/grep-3.3.49-3f11.tar.xz
  xz -dc grep-3.3.49-3f11.tar.xz|tar xf -
  cd grep-3.3.49-3f11
  ./configure && make

From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 23:21:01 2020 Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 04:21:01 +0000 Received: from localhost ([127.0.0.1]:37955 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imryn-0004QT-39 for submit@debbugs.gnu.org; Wed, 01 Jan 2020 23:21:01 -0500 Received: from mail-io1-f43.google.com ([209.85.166.43]:43901) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imryk-0004QG-PH for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 23:20:59 -0500 Received: by mail-io1-f43.google.com with SMTP id n21so35648420ioo.10 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 20:20:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=discovergy-com.20150623.gappssmtp.com; s=20150623;
h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=0u51IBPkym3O6ykR49I8bLd9JMefMP+5ICUxZiip4Tw=; b=v+WVgp+D4fgtEt10xknIenXIjr60rQjOvNOoiMYa6yEqgKEv3LTii3ID3nYAX3CGX0 k8yXPHyLCTHtoI4CmpxGVVy2IacDN8saeFEzX9TSab+FLKThCv337qlpjyTAmEj0Oj1D MVv+4/z91VO48qgwttZsg8P60cOUAMHA+43+XU3DYIkxn2SDfLWj7t34uaPvfjViIu3y N0/TWIy+6oIyFmtceAmjaRoFN9uAzlG+Itf+iGpRoIeJgmWFuRYl4p02qklcBnYTdCRP wu8WVmPPuu92uTusgcLDS+8ksi7zbvSPT4SfP8vL9T4GDnUQUA/kMZK1Cq0/AyRjkZwc Fasw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=0u51IBPkym3O6ykR49I8bLd9JMefMP+5ICUxZiip4Tw=; b=LYMTHw6kc5wq90ze4bPkbQdQeGfsddaBz0PON3DKg+Rl5CaZ83CBypwJAG1dt8DPMW dpqMLeml7BP2Ljwio/I3dNEwqYlfED5jqqEmuCBmFyN61e0vuHgDljJBlj89+exmzAW0 j0PDPtZ38jrEZKFT/Dk7I8aCoHkg71SopYjH0rTrm72AQzLZtr3Zwy5VVa7p3ZUFN0F0 rN0CCIJApguXuhQQwAjs3RSYEc87gsqyzwsx9a4mIVFXxAdLxXM8gQt9KUIeLQ89toiw fEACZFHGkWC9bv+BgntjKxIluPdtN1wOvoA6ZwOsKQGCTdzS7+AxrjYYiLciBuZp/W+s 2zFA== X-Gm-Message-State: APjAAAUt2XKOUDzF3DzPKfPJ+3DLn8RoQSk+Hd9nDjYAAwCHybQuPNNC IPN48at9KM3PIkRWT5jJmm50aPJgT0GFECOnbYsYBw== X-Google-Smtp-Source: APXvYqwA2MW4Qn+8n/GFNc3PI8pxJNjrUq6DpUNgUnJtlCyCEKbNjz4Mvydu5U2VNiU9EsWifZhf25+0rS7Z+lAby38= X-Received: by 2002:a6b:fe0f:: with SMTP id x15mr50887247ioh.219.1577938853127; Wed, 01 Jan 2020 20:20:53 -0800 (PST) MIME-Version: 1.0 References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <202001011119.001BJMYA027994@freefriends.org> <202001012024.001KOQMn012801@freefriends.org> In-Reply-To: <202001012024.001KOQMn012801@freefriends.org> From: Sergiu Hlihor Date: Thu, 2 Jan 2020 05:20:32 +0100 Message-ID: Subject: Re: bug#32073: Improvements in Grep (Bug#32073) To: arnold@skeeve.com Content-Type: multipart/alternative; boundary="000000000000d1810d059b208482" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, Paul Eggert X-BeenThere: 
debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-)

--000000000000d1810d059b208482 Content-Type: text/plain; charset="UTF-8"

Hi Arnold,
If AWKBUFSIZE translates to the disk IO request size, then it is already what is needed. However, it's a little annoying.

Regarding optimal settings, the benchmark actually tells you what is optimal. Let's assume grep or any other tool can process 3GB/s in memory. If your device can serve 5GB/s, then you can saturate the CPU. If, however, the device needs at least X as block size to reach the maximum throughput, then that's what you have to use. Plain and simple. And as I said, when going into GB territory, reads at the application level have to be asynchronous.
If you look at benchmarking tools like Atto, you see the graphs clearly and see the scaling for SSDs. And it just happens that the value good for SSDs (minimum 512KB) also benefits HDD RAID arrays with strip sizes smaller than 512KB. With HDD RAID arrays it unfortunately does get complicated, because you have to know the number of disks and the strip size. I, for example, always use tune2fs and set those parameters when formatting the partition. This could just as well be a configurable OS parameter per drive, and based on the location of the file, the right value could be used. But I have to admit that this would add exponential complexity with diminishing returns versus just setting a buffer size of 1MB (which will cover both current and future SSDs).

Also, I'm not too fond of heuristics or any other smartness at the IO level in the Linux IO stack. I work with large databases (as a user) and have discussed the Linux IO stack with database developers. The common opinion is that the Linux IO stack got out of control and nobody actually has a good overview anymore. And I tend to agree. Linux needs an IO stack that is as lean as possible and lets the applications decide what to do, since at the application level you know your usage pattern. I already had to fine-tune the database because of it.

On Wed, 1 Jan 2020 at 21:24, <arnold@skeeve.com> wrote:

> Hi.
>
> Sergiu Hlihor wrote:
>
> > Arnold, there is no need to write user code, it is already done in
> > benchmarks. One of the standard benchmarks when testing HDDs and SSDs is
> > read throughput vs block size and at different queue depths.
>
> I think you're misunderstanding me, or I am misunderstanding you.
>
> As the gawk maintainer, I can choose the buffer size to use every time
> I issue a read(2) system call for any given input file. Gawk currently
> uses the smaller of (a) the file's size or (b) the st_blksize member of
> the struct stat array.
>
> If I understand you correctly, this is "not enough"; gawk (grep,
> cp, etc.) should all use an optimal buffer size that depends upon the
> underlying storage hardware where the file is located.
>
> So far, so good, except for: How do I determine what that number is?
> I cannot run a benchmark before opening each and every file. I don't
> know of a system call that will give me that number. (If there is,
> please point me to it.)
>
> Do you just want a command line option or environment variable
> that you, as the application user, can set?
>
> If the latter, it happens that gawk will let you set AWKBUFSIZE and
> it will use whatever number you supply for doing reads. (This is
> even documented.)
>
> HTH,
>
> Arnold
>
--000000000000d1810d059b208482 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
--000000000000d1810d059b208482-- From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 02 02:20:50 2020 Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 07:20:50 +0000 Received: from localhost ([127.0.0.1]:38029 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imumo-0000fn-6G for submit@debbugs.gnu.org; Thu, 02 Jan 2020 02:20:50 -0500 Received: from freefriends.org ([96.88.95.60]:53654) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1imumm-0000fa-K1 for 32073@debbugs.gnu.org; Thu, 02 Jan 2020 02:20:49 -0500 X-Envelope-From: arnold@skeeve.com Received: from freefriends.org (freefriends.org [96.88.95.60]) by freefriends.org (8.14.7/8.14.7) with ESMTP id 0027KfTt032105 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 2 Jan 2020 00:20:42 -0700 Received: (from arnold@localhost) by freefriends.org (8.14.7/8.14.7/Submit) id 0027Kf58032104; Thu, 2 Jan 2020 00:20:41 -0700 From: arnold@skeeve.com Message-Id: <202001020720.0027Kf58032104@freefriends.org> X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to arnold@skeeve.com using -f Date: Thu, 02 Jan 2020 00:20:41 -0700 To: sh@discovergy.com, arnold@skeeve.com Subject: Re: bug#32073: Improvements in Grep (Bug#32073) References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <202001011119.001BJMYA027994@freefriends.org> <202001012024.001KOQMn012801@freefriends.org> In-Reply-To: User-Agent: Heirloom mailx 12.5 7/5/10 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Score: 0.1 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, eggert@cs.ucla.edu X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.9 (/) Hi. 
Sergiu Hlihor wrote: > Hi Arnold, > If AWKBUFSIZE translates to disk IO request size then it is already what > its needed. However it's a little annoying. How would you make it less annoying? Thanks, Arnold From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 02 10:32:18 2020 Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 15:32:18 +0000 Received: from localhost ([127.0.0.1]:39931 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1in2SQ-0000DQ-EE for submit@debbugs.gnu.org; Thu, 02 Jan 2020 10:32:18 -0500 Received: from mail-io1-f50.google.com ([209.85.166.50]:36201) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1in2SN-0000DC-3G for 32073@debbugs.gnu.org; Thu, 02 Jan 2020 10:32:16 -0500 Received: by mail-io1-f50.google.com with SMTP id r13so28557954ioa.3 for <32073@debbugs.gnu.org>; Thu, 02 Jan 2020 07:32:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=discovergy-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=BevTY+c4smzn2KUxkNgUFfbSAViXE48Qmf658fuB92o=; b=XLhNTwy47ZUKUj/A4BYhoI/sTeSVwylWUb+Y65hEhL5S0g1YxQ/24NOoCNRsIg8+Rw cztN/FJDuW+tohcTqYACqG04Pl4F3ByOQeLDZasef2Ud0kPv7ZmAh4jH3EtzTTmgoKxJ oZXhn9af9LZS3RvHDUUoEU3JPWl/rDsCq4+jlR8LCQFmwVujQ0+vnyUtGm9Fg3/Halur ToB3ic0sZ1AEvwK02CRbGnA4ElGjrpVNryXlnwrgBiv/jhrqOfW3VOos2ec54XXrVCZ3 +zxyE/L4BOszTL29VJY77qIhlcStWdIbBtacEbSs2+bjPa/63SQixcgySFQtniYsn9OI B9zg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=BevTY+c4smzn2KUxkNgUFfbSAViXE48Qmf658fuB92o=; b=Fgjz29MIH/suYsJVVJhAmXDJuyIZlDjXZ4KSv+clJLXQUhVwaXo2Gu/uqGWqzVpaMw 6QqUCcqoQS1+qyB47h4X+ulg9cuXPg2NWfApFrBMZKvLW8X8DFzgxlQdmb6XMfvqZKtM zdGU7wBqkwlC+ACMJIwLEMcjqxFlvR2iE+lHk6YN2rsGDeNXmu38+1/m23sZja1UJoRQ WQTvCIg025LanSOGuAhxbfImoQL8HGBu9gP/CskK7Qpy7nq83RYKoT6bRnhJqCPpTrQM 
iDYOmii1vzuzMnRiQ/HMgePuTJfwH0607fqX85dzu/lpzMsTZNzJFQu9Xvg7zqa93kdw mVpQ== X-Gm-Message-State: APjAAAX1ynWa7AmVyXAzjm1u/mJw8pMzPNZ6uvo/iG5BaZjlvvFuLW+E 25ftWKXYKbtwBBUI0EhTloQzUnFOvBWnvZi4gt9j6A== X-Google-Smtp-Source: APXvYqzI+p8JfCpdKVndPcWm9cECB9mtpftL72M7CCzjwB13j9uXxfWR8Z3o23+tYxOf5RxmkjjvrNbwajxXuW08cZg= X-Received: by 2002:a05:6602:25d3:: with SMTP id d19mr44659590iop.217.1577979129475; Thu, 02 Jan 2020 07:32:09 -0800 (PST) MIME-Version: 1.0 References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <202001011119.001BJMYA027994@freefriends.org> <202001012024.001KOQMn012801@freefriends.org> <202001020720.0027Kf58032104@freefriends.org> In-Reply-To: <202001020720.0027Kf58032104@freefriends.org> From: Sergiu Hlihor Date: Thu, 2 Jan 2020 16:31:57 +0100 Message-ID: Subject: Re: bug#32073: Improvements in Grep (Bug#32073) To: arnold@skeeve.com Content-Type: multipart/alternative; boundary="00000000000079c8d7059b29e5b6" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-)

--00000000000079c8d7059b29e5b6 Content-Type: text/plain; charset="UTF-8"

Hi Arnold,
Annoying in the sense that you have to specify it with every usage. In a company where you have 10+ developers grepping over various logs, each one has to remember to add the extra parameter. It would be easier to have some kind of global configuration that the system admin can set, so that developers can forget about it. But as I said, a large default is very likely enough.

On Thu, 2 Jan 2020 at 08:20, <arnold@skeeve.com> wrote:

> Hi.
>
> Sergiu Hlihor wrote:
>
> > Hi Arnold,
> > If AWKBUFSIZE translates to disk IO request size then it is already what
> > its needed. However it's a little annoying.
>
> How would you make it less annoying?
>
> Thanks,
>
> Arnold
>
--00000000000079c8d7059b29e5b6 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
--00000000000079c8d7059b29e5b6-- From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 02 10:36:53 2020 Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 15:36:53 +0000 Received: from localhost ([127.0.0.1]:39941 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1in2Wr-0000KO-1D for submit@debbugs.gnu.org; Thu, 02 Jan 2020 10:36:53 -0500 Received: from freefriends.org ([96.88.95.60]:57776) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1in2Wp-0000KG-7K for 32073@debbugs.gnu.org; Thu, 02 Jan 2020 10:36:51 -0500 X-Envelope-From: arnold@skeeve.com Received: from freefriends.org (freefriends.org [96.88.95.60]) by freefriends.org (8.14.7/8.14.7) with ESMTP id 002FadFL014662 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 2 Jan 2020 08:36:40 -0700 Received: (from arnold@localhost) by freefriends.org (8.14.7/8.14.7/Submit) id 002FadBN014661; Thu, 2 Jan 2020 08:36:39 -0700 From: arnold@skeeve.com Message-Id: <202001021536.002FadBN014661@freefriends.org> X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to arnold@skeeve.com using -f Date: Thu, 02 Jan 2020 08:36:39 -0700 To: sh@discovergy.com, arnold@skeeve.com Subject: Re: bug#32073: Improvements in Grep (Bug#32073) References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu> <202001011119.001BJMYA027994@freefriends.org> <202001012024.001KOQMn012801@freefriends.org> <202001020720.0027Kf58032104@freefriends.org> In-Reply-To: User-Agent: Heirloom mailx 12.5 7/5/10 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Score: 0.2 (/) X-Debbugs-Envelope-To: 32073 Cc: 32073@debbugs.gnu.org, eggert@cs.ucla.edu X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.8 (/) OK, 
thanks for the input.

Arnold

Sergiu Hlihor wrote:

> Hi Arnold,
> Annoying in the sense that you have to specify it with every usage. In a
> company where you have 10+ developers grepping over various logs, each one
> has to remember to add the extra parameter. Easier would be to have some
> kind of global configuration that the system admin can set and developers
> forget about it. But as I said, large default is very likely enough.
>
>
>
> On Thu, 2 Jan 2020 at 08:20, <arnold@skeeve.com> wrote:
>
> > Hi.
> >
> > Sergiu Hlihor wrote:
> >
> > > Hi Arnold,
> > > If AWKBUFSIZE translates to disk IO request size then it is already what
> > > its needed. However it's a little annoying.
> >
> > How would you make it less annoying?
> >
> > Thanks,
> >
> > Arnold
> >
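
For gawk, the site-wide default discussed in this exchange could perhaps be approximated with a login-shell profile snippet. This is only a sketch under assumptions that the thread does not state: that login shells on the machine source files under /etc/profile.d/, that the file name awkbufsize.sh (hypothetical) is unused, and that 1 MiB, the figure suggested earlier in the thread, suits the hardware. It covers gawk only, through the AWKBUFSIZE variable Arnold mentions; nothing in the thread indicates that grep reads a comparable variable.

```shell
# Hypothetical /etc/profile.d/awkbufsize.sh; the file name is an assumption.
# Give gawk a site-wide default read size unless the user already chose one.
# 1048576 bytes = 1 MiB, the buffer size suggested earlier in the thread.
: "${AWKBUFSIZE:=1048576}"
export AWKBUFSIZE
```

Because the assignment only applies when the variable is unset or empty, a developer can still override it for a single command by prefixing it, for example with AWKBUFSIZE=65536 in front of the gawk invocation.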