From unknown Sun Aug 17 19:57:19 2025
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.509 (Entity 5.509)
Content-Type: text/plain; charset=utf-8
From: bug#32073 <32073@debbugs.gnu.org>
To: bug#32073 <32073@debbugs.gnu.org>
Subject: Status: Improvements in Grep
Reply-To: bug#32073 <32073@debbugs.gnu.org>
Date: Mon, 18 Aug 2025 02:57:19 +0000

retitle 32073 Improvements in Grep
reassign 32073 grep
submitter 32073 Sergiu Hlihor <sh@discovergy.com>
severity 32073 wishlist

thanks


From debbugs-submit-bounces@debbugs.gnu.org Fri Jul 06 17:31:49 2018
Received: (at submit) by debbugs.gnu.org; 6 Jul 2018 21:31:49 +0000
Received: from localhost ([127.0.0.1]:48863 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1fbYKS-0002MD-DA
	for submit@debbugs.gnu.org; Fri, 06 Jul 2018 17:31:49 -0400
Received: from eggs.gnu.org ([208.118.235.92]:49666)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sh@discovergy.com>) id 1fbTYx-0003J9-SL
 for submit@debbugs.gnu.org; Fri, 06 Jul 2018 12:26:28 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <sh@discovergy.com>) id 1fbTYr-000371-NQ
 for submit@debbugs.gnu.org; Fri, 06 Jul 2018 12:26:22 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,HTML_MESSAGE,
 T_DKIM_INVALID autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:37207)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <sh@discovergy.com>) id 1fbTYr-00036v-K4
 for submit@debbugs.gnu.org; Fri, 06 Jul 2018 12:26:21 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:40630)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <sh@discovergy.com>) id 1fbTYq-0001Jy-E2
 for bug-grep@gnu.org; Fri, 06 Jul 2018 12:26:21 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <sh@discovergy.com>) id 1fbTYp-00035W-EW
 for bug-grep@gnu.org; Fri, 06 Jul 2018 12:26:20 -0400
Received: from mail-io0-x234.google.com ([2607:f8b0:4001:c06::234]:36810)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
 (Exim 4.71) (envelope-from <sh@discovergy.com>) id 1fbTYp-000354-7X
 for bug-grep@gnu.org; Fri, 06 Jul 2018 12:26:19 -0400
Received: by mail-io0-x234.google.com with SMTP id k3-v6so11350175iog.3
 for <bug-grep@gnu.org>; Fri, 06 Jul 2018 09:26:18 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=discovergy-com.20150623.gappssmtp.com; s=20150623;
 h=mime-version:from:date:message-id:subject:to;
 bh=gsN4tVk2AbiLxUguUpnC/wUV3Nk6Fj17GCQlGhyz7UM=;
 b=Oo5AHAu+DxPJESB8LNkT4ZWoCgD+9xIzN56qIih5SmKyJAZBx2ItDZK471rvqSQATG
 iHZ3GtYgTv7sG9q6cayKkER4huRFSralDMhid3z6Xc5M80wWx5uFgDCje15arJafbEbl
 oM2QWzvZ7YqHwWsoAIcErxRlVkIRJjM3fYJT1mmiOuZDzVi6tZFEwdMrUL5m+AdQ2GRl
 gkBsCXi8BAriWnlgM51gaV2nc2vovD8w2UDZZudTGESO182VbqEOj0SAwgsj+FJWmEDP
 151g9W4GTtma/7r12patcSiPStmivH9jG7sG3E/VPtYu7MDWsLWCHfXoTMRMj5/xKVo3
 KLog==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:from:date:message-id:subject:to;
 bh=gsN4tVk2AbiLxUguUpnC/wUV3Nk6Fj17GCQlGhyz7UM=;
 b=HvtBTPH52sZoi2sQGyzP7ntKmSvEQOjeMbpD4NaSRE7iJ6hhXHQ4etl4/Q1D5zhpDC
 WZI77+Frwc3Fsv/Ksg8DNBL5aWLE9vHdwK6yGZSu2TYtot/uteKLlbJ+XJRbUCf33TON
 l96BHaI9RaOjTLcU52Eyh9c8rGNOsdHv2ZKvBVHi0/afUhQ9hqy3qsw91qKB5uvC60IP
 BPTvFPymBmt8b3EpvtWMjuK912gRR0J77D8n56qXkBdPaRmwI4pnxBMryZevSdOHCdQ2
 g3Mlo061b2cTRqWFVHogUbhq3VnJep/ANsz4exsR6nNeSR938JYQ5b+s9jzYJgJJhDxx
 Ty1g==
X-Gm-Message-State: APt69E2yZpnkBt2GtBRmO7+j/mh3LkKSf2fImwUwje97cv1Y4fB08RoW
 gNAqs6uT3XG+0apEQQWyxnQKjiSWyxLYhkUwIvCdKYtq
X-Google-Smtp-Source: AAOMgpfQLYAwjGzKanOd0Y03aDYUSvuifqGJP849QmjL8Bxcg6XC1qo1G9LHokuzVfIFD9E2XQcaPh47omloYEXunRk=
X-Received: by 2002:a6b:4e04:: with SMTP id c4-v6mr9029232iob.19.1530894377892; 
 Fri, 06 Jul 2018 09:26:17 -0700 (PDT)
MIME-Version: 1.0
Received: by 2002:a02:1b98:0:0:0:0:0 with HTTP;
 Fri, 6 Jul 2018 09:26:17 -0700 (PDT)
From: Sergiu Hlihor <sh@discovergy.com>
Date: Fri, 6 Jul 2018 18:26:17 +0200
Message-ID: <CAD-3cdeVqR_pvxSmayD=5tDpi8Cpze_ck64gssgoYvjV98No9g@mail.gmail.com>
Subject: Improvements in Grep
To: bug-grep@gnu.org
Content-Type: multipart/alternative; boundary="000000000000954fdf0570571f6a"
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
 recognized.
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Fri, 06 Jul 2018 17:31:47 -0400
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

--000000000000954fdf0570571f6a
Content-Type: text/plain; charset="UTF-8"

Hello,
     I'm using grep on Ubuntu Server 14.04 (grep version 2.16). While
grepping large files I've noticed that grep is painfully slow. The
bottleneck seems to be the read block size, which is extremely small
(it looks like 64KB). For large files residing on big HDD RAID arrays,
such a request barely reaches one drive and, judging by CPU usage, grep
is mostly idle. Given my tests for such scenarios, a read block size of at
least 512KB would be way more efficient. It's very likely that the optimum
would be 1MB+. Such an increase in buffer size would also slightly benefit
SSDs, where maximum sequential throughput is usually achieved when reading
at 256KB+ block sizes.
     If this is already possible in newer versions or is configurable, I'd
appreciate a hint about which version supports it or how I can configure
grep to increase the read block size.

Thanks and best regards,
Sergiu
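
For readers who want to check this claim on their own hardware, a
minimal sketch of a sequential reader with a caller-chosen block size
follows. It is not grep's code: the name read_with_block_size and its
parameters are made up for illustration, and it uses only POSIX
open/read/close. Timing it at 64KB, 512KB and 1MB on a cold cache would
show whether a larger request size helps on a given array.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper, not part of grep: read PATH sequentially using
   BLOCK_SIZE-byte read() requests.  Return the number of bytes read,
   or -1 on error.  *CALLS receives the number of read() calls that
   returned data.  EINTR is not retried in this sketch.  */
long long
read_with_block_size (const char *path, size_t block_size, size_t *calls)
{
  assert (block_size > 0);
  assert (calls != NULL);

  char *buf = malloc (block_size);
  if (!buf)
    return -1;

  int fd = open (path, O_RDONLY);
  if (fd < 0)
    {
      perror (path);
      free (buf);
      return -1;
    }

  long long total = 0;
  *calls = 0;
  for (;;)
    {
      ssize_t n = read (fd, buf, block_size);
      if (n < 0)
        {
          total = -1;           /* read error */
          break;
        }
      if (n == 0)
        break;                  /* end of file */
      total += n;
      ++*calls;
    }

  close (fd);
  free (buf);
  return total;
}
```

The call count is returned alongside the byte count so that the effect
of the block size on the number of requests is visible directly.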


--000000000000954fdf0570571f6a--


From debbugs-submit-bounces@debbugs.gnu.org Fri Jul 06 18:06:44 2018
Received: (at 32073) by debbugs.gnu.org; 6 Jul 2018 22:06:44 +0000
Received: from localhost ([127.0.0.1]:48878 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1fbYsF-0003J2-W1
	for submit@debbugs.gnu.org; Fri, 06 Jul 2018 18:06:44 -0400
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:33666)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@cs.ucla.edu>) id 1fbYsD-0003Ij-LC
 for 32073@debbugs.gnu.org; Fri, 06 Jul 2018 18:06:42 -0400
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 9660D16161F;
 Fri,  6 Jul 2018 15:06:35 -0700 (PDT)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id t3uCubr3XabO; Fri,  6 Jul 2018 15:06:34 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id E1759161625;
 Fri,  6 Jul 2018 15:06:34 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id nlyZyupbRODd; Fri,  6 Jul 2018 15:06:34 -0700 (PDT)
Received: from [192.168.1.9] (unknown [47.154.30.119])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id A232616161F;
 Fri,  6 Jul 2018 15:06:34 -0700 (PDT)
Subject: Re: bug#32073: Improvements in Grep
To: Sergiu Hlihor <sh@discovergy.com>, 32073@debbugs.gnu.org
References: <CAD-3cdeVqR_pvxSmayD=5tDpi8Cpze_ck64gssgoYvjV98No9g@mail.gmail.com>
From: Paul Eggert <eggert@cs.ucla.edu>
Openpgp: preference=signencrypt
Autocrypt: addr=eggert@cs.ucla.edu; prefer-encrypt=mutual; keydata=
 xsFNBEyAcmQBEADAAyH2xoTu7ppG5D3a8FMZEon74dCvc4+q1XA2J2tBy2pwaTqfhpxxdGA9
 Jj50UJ3PD4bSUEgN8tLZ0san47l5XTAFLi2456ciSl5m8sKaHlGdt9XmAAtmXqeZVIYX/UFS
 96fDzf4xhEmm/y7LbYEPQdUdxu47xA5KhTYp5bltF3WYDz1Ygd7gx07Auwp7iw7eNvnoDTAl
 KAl8KYDZzbDNCQGEbpY3efZIvPdeI+FWQN4W+kghy+P6au6PrIIhYraeua7XDdb2LS1en3Ss
 mE3QjqfRqI/A2ue8JMwsvXe/WK38Ezs6x74iTaqI3AFH6ilAhDqpMnd/msSESNFt76DiO1ZK
 QMr9amVPknjfPmJISqdhgB1DlEdw34sROf6V8mZw0xfqT6PKE46LcFefzs0kbg4GORf8vjG2
 Sf1tk5eU8MBiyN/bZ03bKNjNYMpODDQQwuP84kYLkX2wBxxMAhBxwbDVZudzxDZJ1C2VXujC
 OJVxq2kljBM9ETYuUGqd75AW2LXrLw6+MuIsHFAYAgRr7+KcwDgBAfwhPBYX34nSSiHlmLC+
 KaHLeCLF5ZI2vKm3HEeCTtlOg7xZEONgwzL+fdKo+D6SoC8RRxJKs8a3sVfI4t6CnrQzvJbB
 n6gxdgCu5i29J1QCYrCYvql2UyFPAK+do99/1jOXT4m2836j1wARAQABzSBQYXVsIEVnZ2Vy
 dCA8ZWdnZXJ0QGNzLnVjbGEuZWR1PsLBfgQTAQIAKAUCTIByZAIbAwUJEswDAAYLCQgHAwIG
 FQgCCQoLBBYCAwECHgECF4AACgkQ7ZfpDmKqfjRRGw/+Ij03dhYfYl/gXVRiuzV1gGrbHk+t
 nfrI/C7fAeoFzQ5tVgVinShaPkZo0HTPf18x6IDEdAiO8Mqo1yp0CtHmzGMCJ50o4Grgfjlr
 6g/+vtEOKbhleszN2XpJvpwM2QgGvn/laTLUu8PH9aRWTs7qJJZKKKAb4sxYc92FehPu6FOD
 0dDiyhlDAq4lOV2mdBpzQbiojoZzQLMQwjpgCTK2572eK9EOEQySUThXrSIz6ASenp4NYTFH
 s9tuJQvXk9gZDdPSl3bp+47dGxlxEWLpBIM7zIONw4ks4azgT8nvDZxA5IZHtvqBlJLBObYY
 0Le61Wp0y3TlBDh2qdK8eYL426W4scEMSuig5gb8OAtQiBW6k2sGUxxeiv8ovWu8YAZgKJfu
 oWI+uRnMEddruY8JsoM54KaKvZikkKs2bg1ndtLVzHpJ6qFZC7QVjeHUh6/BmgvdjWPZYFTt
 N+KA9CWX3GQKKgN3uu988yznD7LnB98T4EUH1HA/GnfBqMV1gpzTvPc4qVQinCmIkEFp83zl
 +G5fCjJJ3W7ivzCnYo4KhKLpFUm97okTKR2LW3xZzEW4cLSWO387MTK3CzDOx5qe6s4a91Zu
 ZM/j/TQdTLDaqNn83kA4Hq48UHXYxcIh+Nd8k/3w6lFuoK0wrOFiywjLx+0ur5jmmbecBGHc
 1xdhAFHOwU0ETIByZAEQAKaF678T9wyH4wjTrV1Pz3cDEoSnV/0ZUrOT37p1dcGyj/IXq1x6
 70HRVahAmk0sZpYc25PF9D5GPYHFWlNjuPU96rDndXB3hedmBRhLdC4bAXjI4DV+bmdVe+q/
 IMnlZRaVlm9EiMCVAR6w13sReu7qXkW9r3RwY2AzXskp/tAe4BRKr1Zmbvi2nbnQ6epEC42r
 Rbx0B1EhjbIQZ5JHGk24iPT7LdBgnNmos5wYjzwNlkMQD5T0Ydzhk7J+UxwA5m46mOhRDC2r
 FV/A0gm5TLy8DXjv/Esc4gYnYai6SQqnUEVh5LuV8YCJBnijs+Tiw71x1icmn6xGI45EugJO
 gec+rLypYgpVp4x0HI5T88qBRYCkxH3Kg8Qo+EWNA9A4LRQ9DX8njona0gf0s03tocK8kBN6
 6UoqqPtHBnc4eMgBymCflK12eKfd2YYxnyg9cZazWA5VslvTxpm76hbg5oiAEH/Vg/8MxHyA
 nPhfrgwyPrmJEcVBafdspJnYQxBYNco2LFPIhlOvWh8r4at+s+M3Lb26oUTczlgdW1Sf3SDA
 77BMRnF0FQyE+7AzV79MBN4ykiqaezQxtaF1Fy/tvkhffSo8u+dwG0EgJh+te38gTcISVr0G
 IPplLz6YhjrbHrPRF1CN5UuL9DBGjxuN35RLNVEfta6RUFlR6NctTjvrABEBAAHCwWUEGAEC
 AA8FAkyAcmQCGwwFCRLMAwAACgkQ7ZfpDmKqfjSrHA/+KzAKvTxRhA9MWNLxIyJ7S5uJ16gs
 T3oCjZrBKGEhKMOGX4O0GA6VOEryO7QRCCYah3oxSG38IAnNeiwJXgU9Bzkk85UGbPEd7HGF
 /VSeHCQwWou6jqUDTSDvn9YhNTdG0KXPM74aC+xr2Zow1O2mhXihgWKD0Dw+0LYPnUOsQ0KO
 FxHXXYHmRrS1OZPU59BLvc+TRhIhafSHKLwbXK+6ckkxBx6h8z5ccpG0Qs4bFhdFYnFrEieD
 LoGmnE2YLhdV6swJ9VNCS6pLiEohT3fm7aXm15tZOIyzMZhHRSAPblXxQ0ZSWjq8oRrcYNFx
 c4W1URpAkBCOYJoXvQfD5L3lqAl8TCqDUzYxhH/tJhbDdHrqHH767jaDaTB1+Talp/2AMKwc
 XNOdiklGxbmHVG6YGl6g8Lrbsu9NZEI4yLlHzuikthJWgz+3vZhVGyNlt+HNIoF6CjDL2omu
 5cEq4RDHM44QqPk6l7O0pUvN1mT4B+S1b08RKpqm/ff015E37HNV/piIvJlxGAYz8PSfuGCB
 1thMYqlmgdhd9/BabGFbGGYHA6U4/T5zqU+f6xHy1SsAQZ1MSKlLwekBIT+4/cLRGqCHjnV0
 q5H/T6a7t5mPkbzSrOLSo4puj+IToNjYyYIDBWzhlA19avOa+rvUjmHtD3sFN7cXWtkGoi8b
 uNcby4U=
Organization: UCLA Computer Science Department
Message-ID: <9be5ca5d-dc30-508f-649b-5146ee85cf5e@cs.ucla.edu>
Date: Fri, 6 Jul 2018 15:06:34 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.8.0
MIME-Version: 1.0
In-Reply-To: <CAD-3cdeVqR_pvxSmayD=5tDpi8Cpze_ck64gssgoYvjV98No9g@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 32073
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

Sergiu Hlihor wrote:
> Given my tests for such scenarios, a read block size of at least
> 512KB would be way more efficient.

Does stdio do this already? If not, why not? How could grep reasonably configure 
a good block size?
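
One possible answer to the last question, offered only as a sketch:
POSIX exposes a per-file hint, st_blksize in struct stat, and a program
could take that hint and clamp it between a fixed minimum and maximum.
The helper names clamp_block_size and block_size_for_fd and the two
bounds are assumptions for illustration; nothing here is taken from
grep's source, and on many filesystems st_blksize is small and says
little about an underlying RAID layout.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Illustrative bounds, not values used by grep.  */
enum { BLOCK_FLOOR = 128 * 1024, BLOCK_CEILING = 4 * 1024 * 1024 };

/* Clamp a filesystem hint into [MIN_SIZE, MAX_SIZE].  */
size_t
clamp_block_size (size_t hint, size_t min_size, size_t max_size)
{
  assert (min_size > 0 && min_size <= max_size);
  if (hint < min_size)
    return min_size;
  if (hint > max_size)
    return max_size;
  return hint;
}

/* Pick a read size for FD from st_blksize; fall back to the floor
   if fstat fails or reports a non-positive hint.  */
size_t
block_size_for_fd (int fd)
{
  struct stat st;
  if (fstat (fd, &st) != 0 || st.st_blksize <= 0)
    return BLOCK_FLOOR;
  return clamp_block_size ((size_t) st.st_blksize,
                           BLOCK_FLOOR, BLOCK_CEILING);
}
```

The floor matters more than the hint in practice: it keeps requests
reasonably large even when the filesystem reports a small value.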


From debbugs-submit-bounces@debbugs.gnu.org Fri Jul 06 18:44:55 2018
Received: (at submit) by debbugs.gnu.org; 6 Jul 2018 22:44:56 +0000
Received: from localhost ([127.0.0.1]:48900 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1fbZTD-0004NL-Kt
	for submit@debbugs.gnu.org; Fri, 06 Jul 2018 18:44:55 -0400
Received: from eggs.gnu.org ([208.118.235.92]:52864)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <dclarke@blastwave.org>) id 1fbZTB-0004N6-Nb
 for submit@debbugs.gnu.org; Fri, 06 Jul 2018 18:44:53 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <dclarke@blastwave.org>) id 1fbZT5-0002Kd-Ui
 for submit@debbugs.gnu.org; Fri, 06 Jul 2018 18:44:48 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=-0.5 required=5.0 tests=BAYES_05 autolearn=disabled
 version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:42426)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <dclarke@blastwave.org>)
 id 1fbZT5-0002KZ-Qk
 for submit@debbugs.gnu.org; Fri, 06 Jul 2018 18:44:47 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:43835)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <dclarke@blastwave.org>) id 1fbZT4-0003Yf-RL
 for bug-grep@gnu.org; Fri, 06 Jul 2018 18:44:47 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <dclarke@blastwave.org>) id 1fbZT1-0002KB-Pa
 for bug-grep@gnu.org; Fri, 06 Jul 2018 18:44:46 -0400
Received: from atl4mhob08.registeredsite.com ([209.17.115.46]:55668)
 by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <dclarke@blastwave.org>)
 id 1fbZT1-0002Jr-K1
 for bug-grep@gnu.org; Fri, 06 Jul 2018 18:44:43 -0400
Received: from mailpod.hostingplatform.com
 (atl4qobmail01pod2.registeredsite.com [10.30.77.35])
 by atl4mhob08.registeredsite.com (8.14.4/8.14.4) with ESMTP id w66Micxx011705
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL)
 for <bug-grep@gnu.org>; Fri, 6 Jul 2018 18:44:38 -0400
Received: (qmail 26434 invoked by uid 0); 6 Jul 2018 22:44:37 -0000
X-TCPREMOTEIP: 99.253.103.29
X-Authenticated-UID: dclarke@blastwave.org
Received: from unknown (HELO sedna.genunix.com)
 (dclarke@blastwave.org@99.253.103.29)
 by 0 with ESMTPA; 6 Jul 2018 22:44:37 -0000
Subject: Re: bug#32073: Improvements in Grep
To: bug-grep@gnu.org
References: <CAD-3cdeVqR_pvxSmayD=5tDpi8Cpze_ck64gssgoYvjV98No9g@mail.gmail.com>
 <9be5ca5d-dc30-508f-649b-5146ee85cf5e@cs.ucla.edu>
From: Dennis Clarke <dclarke@blastwave.org>
Message-ID: <d2b7c614-4be5-167e-fce0-3e27d9ce5771@blastwave.org>
Date: Fri, 6 Jul 2018 18:44:36 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.8.0
MIME-Version: 1.0
In-Reply-To: <9be5ca5d-dc30-508f-649b-5146ee85cf5e@cs.ucla.edu>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x [fuzzy]
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -5.0 (-----)
X-Debbugs-Envelope-To: submit
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -6.0 (------)

On 07/06/2018 06:06 PM, Paul Eggert wrote:
> Sergiu Hlihor wrote:
>> Given my tests for such scenarios, a read block size of at least
>> 512KB would be way more efficient.
> 
> Does stdio do this already? If not, why not? How could grep reasonably 
> configure a good block size?

This seems to be a very specific complaint that is only of value on a
very specific system and use case.  There is no way that grep could
configure a "good block size" unless it were tailor-built.  It doesn't
seem to be a reasonable RFE, in my opinion.

Dennis


From debbugs-submit-bounces@debbugs.gnu.org Fri Jul 06 20:33:37 2018
Received: (at 32073) by debbugs.gnu.org; 7 Jul 2018 00:33:37 +0000
Received: from localhost ([127.0.0.1]:48940 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1fbbAO-0007QZ-Ov
	for submit@debbugs.gnu.org; Fri, 06 Jul 2018 20:33:36 -0400
Received: from mail-wm0-f53.google.com ([74.125.82.53]:55493)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <meyering@gmail.com>) id 1fbbAN-0007QL-2l
 for 32073@debbugs.gnu.org; Fri, 06 Jul 2018 20:33:35 -0400
Received: by mail-wm0-f53.google.com with SMTP id v16-v6so16251135wmv.5
 for <32073@debbugs.gnu.org>; Fri, 06 Jul 2018 17:33:35 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:sender:in-reply-to:references:from:date:message-id
 :subject:to:cc;
 bh=Momz89FF7eOSa9Kgl6zghjC6oTftuhwl6RqoRaYOrMQ=;
 b=r5rTENOu+tSjTcdqBYCTdtODWfnwh8JkbnU5pvQQ4FKW1s8iXv2g5OCYUXzRzk8kV8
 ODH33BK+AfAPwfMkzWeLWw5OCKnQNSHJIYsWa+w0kKz8gZobrqPJd8ed9itA2EkVtV5A
 iAXB+K+Pp/PxIRqXOJxVxKGnPNRuni/9L5iOidz9IVeVZwpsPFjhFNJVl9NBrwJu7s2d
 GjRngOzfufM+djuCXS4i5EEa6fucjxJz+8MVxCCaFNyLqOXfn3EezAAJYTsryJNZhmp7
 kv4tiCYsRY7KlQw/J0XHBVZscFUrmBt+BWDTVOT2D/OJiVyNdnH/98srPilODCKG/+po
 w0aA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:sender:in-reply-to:references:from
 :date:message-id:subject:to:cc;
 bh=Momz89FF7eOSa9Kgl6zghjC6oTftuhwl6RqoRaYOrMQ=;
 b=D7WGy8TmAKpGlKrGsWtuPnKctHvaGrSGSoOvwgTpinuF9fzQnCbdkZ02dem92FhTwk
 9ZGVfQuziRmvkc+VtxmXgv2FHObUpiW6RemoYLyxVNALbzXEJ572OG+/EE4py8kkSIaq
 PxI4NjXPPU+/L+w0hj/BScRtR4JQV23yOMc+61zQVoJZlOHojpssXYoxpwoeUS0G7F0w
 jQt/H5Qir8CGhByO5fizJhE+yKpo/9tVSpGaCs9xlg5SRimXpPjtXSHlZljpZeQhTsB9
 7O1o7AsL0OGykTSOw8LVIrVoyTs2tJf+4WJzeZJaHPWGT7OY1dNW/ez2hQMED5Ky46z8
 8JlA==
X-Gm-Message-State: APt69E19y1bXDLv/AKiy97SERJPdA6/AVTIL7VXFPtIKgaw5r1Zx0uXA
 ui2EpCHA0XU0VxPqisyeegH6khO3gnlNFifrEhDg0g==
X-Google-Smtp-Source: AAOMgpcRnfie1Sy2x6piy4b8g+uuIEn+uAsPVNH+5N8Gv+JNGmAcwNSJDDj6rqaUSE7U/1OBWoMbu4BPlT2a8/4Shew=
X-Received: by 2002:a1c:a8f:: with SMTP id
 137-v6mr6676449wmk.119.1530923609175; 
 Fri, 06 Jul 2018 17:33:29 -0700 (PDT)
MIME-Version: 1.0
Received: by 2002:adf:ec4e:0:0:0:0:0 with HTTP;
 Fri, 6 Jul 2018 17:33:08 -0700 (PDT)
In-Reply-To: <CAD-3cdeVqR_pvxSmayD=5tDpi8Cpze_ck64gssgoYvjV98No9g@mail.gmail.com>
References: <CAD-3cdeVqR_pvxSmayD=5tDpi8Cpze_ck64gssgoYvjV98No9g@mail.gmail.com>
From: Jim Meyering <jim@meyering.net>
Date: Fri, 6 Jul 2018 17:33:08 -0700
X-Google-Sender-Auth: tlltqOQ-2sHQZaEuW_K-9CvgtBM
Message-ID: <CA+8g5KFkFjPKLKLAeu8EiiU+pKsu89VKsvbRzc94_0xGShadZA@mail.gmail.com>
Subject: Re: bug#32073: Improvements in Grep
To: Sergiu Hlihor <sh@discovergy.com>
Content-Type: multipart/mixed; boundary="000000000000e75f1605705ded47"
X-Spam-Score: 0.5 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -0.5 (/)

--000000000000e75f1605705ded47
Content-Type: text/plain; charset="UTF-8"

On Fri, Jul 6, 2018 at 9:26 AM, Sergiu Hlihor <sh@discovergy.com> wrote:
> Hello,
>      I'm using grep on Ubuntu Server 14.04 (grep version 2.16). While
> grepping large files I've noticed that grep is painfully slow. The
> bottleneck seems to be the read block size, which is extremely small
> (it looks like 64KB). For large files residing on big HDD RAID arrays,
> such a request barely reaches one drive and, judging by CPU usage, grep
> is mostly idle. Given my tests for such scenarios, a read block size of at
> least 512KB would be way more efficient. It's very likely that the optimum
> would be 1MB+. Such an increase in buffer size would also slightly benefit
> SSDs, where maximum sequential throughput is usually achieved when reading
> at 256KB+ block sizes.
>      If this is already possible in newer versions or is configurable, I'd
> appreciate a hint about which version supports it or how I can configure
> grep to increase the read block size.

Thanks for raising the issue.
This makes me think we should follow Coreutils' lead[0] and increase
grep's initial buffer size from 32KiB, probably to 128KiB. I will time
with the attached diff on a few systems.

[0] https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=v8.22-103-g74ca6e84c
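
For reference, the attached diff (base64-encoded below) changes
INITIAL_BUFSIZE in src/grep.c from 32768 to 128 * 1024. The sketch
below only illustrates what that means for the number of read requests.
The names OLD_INITIAL_BUFSIZE, NEW_INITIAL_BUFSIZE and reads_needed are
hypothetical, and it assumes every read fills the buffer completely,
which real reads do not guarantee.

```c
#include <assert.h>
#include <stdint.h>

enum { OLD_INITIAL_BUFSIZE = 32768 };       /* value removed by the diff */
enum { NEW_INITIAL_BUFSIZE = 128 * 1024 };  /* value added by the diff */

/* Number of full-buffer reads needed to cover FILE_SIZE bytes with a
   buffer of BUFSIZE bytes (rounded up).  Simplified model: every read
   is assumed to return a full buffer except possibly the last.  */
uint64_t
reads_needed (uint64_t file_size, uint64_t bufsize)
{
  assert (bufsize > 0);
  return file_size / bufsize + (file_size % bufsize != 0);
}
```

Under that assumption, a 1 GiB file takes 32768 requests at the old
size and 8192 at the new one.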

--000000000000e75f1605705ded47
Content-Type: application/octet-stream; name="grep-bufsize-increase.diff"
Content-Disposition: attachment; filename="grep-bufsize-increase.diff"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_jjaoc07a0

ZGlmZiAtLWdpdCBhL3NyYy9ncmVwLmMgYi9zcmMvZ3JlcC5jCmluZGV4IGY0YWU1ZjUuLjA0YWM5
YzkgMTAwNjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysrIGIvc3JjL2dyZXAuYwpAQCAtNzk5LDcgKzc5
OSw2IEBAIHNraXBwZWRfZmlsZSAoY2hhciBjb25zdCAqbmFtZSwgYm9vbCBjb21tYW5kX2xpbmUs
IGJvb2wgaXNfZGlyKQoKIHN0YXRpYyBjaGFyICpidWZmZXI7CQkvKiBCYXNlIG9mIGJ1ZmZlci4g
Ki8KIHN0YXRpYyBzaXplX3QgYnVmYWxsb2M7CQkvKiBBbGxvY2F0ZWQgYnVmZmVyIHNpemUsIGNv
dW50aW5nIHNsb3AuICovCi1lbnVtIHsgSU5JVElBTF9CVUZTSVpFID0gMzI3NjggfTsgLyogSW5p
dGlhbCBidWZmZXIgc2l6ZSwgbm90IGNvdW50aW5nIHNsb3AuICovCiBzdGF0aWMgaW50IGJ1ZmRl
c2M7CQkvKiBGaWxlIGRlc2NyaXB0b3IuICovCiBzdGF0aWMgY2hhciAqYnVmYmVnOwkJLyogQmVn
aW5uaW5nIG9mIHVzZXItdmlzaWJsZSBzdHVmZi4gKi8KIHN0YXRpYyBjaGFyICpidWZsaW07CQkv
KiBMaW1pdCBvZiB1c2VyLXZpc2libGUgc3R1ZmYuICovCkBAIC04MTIsNiArODExLDkgQEAgc3Rh
dGljIGJvb2wgc2tpcF9udWxzOwkJLyogU2tpcCAnXDAnIGluIGRhdGEuICAqLwogc3RhdGljIGJv
b2wgc2tpcF9lbXB0eV9saW5lczsJLyogU2tpcCBlbXB0eSBsaW5lcyBpbiBkYXRhLiAgKi8KIHN0
YXRpYyB1aW50bWF4X3QgdG90YWxubDsJLyogVG90YWwgbmV3bGluZSBjb3VudCBiZWZvcmUgbGFz
dG5sLiAqLwoKKy8qIEluaXRpYWwgYnVmZmVyIHNpemUsIG5vdCBjb3VudGluZyBzbG9wLiAqLwor
ZW51bSB7IElOSVRJQUxfQlVGU0laRSA9IDEyOCAqIDEwMjQgfTsKKwogLyogUmV0dXJuIFZBTCBh
bGlnbmVkIHRvIHRoZSBuZXh0IG11bHRpcGxlIG9mIEFMSUdOTUVOVC4gIFZBTCBjYW4gYmUKICAg
IGFuIGludGVnZXIgb3IgYSBwb2ludGVyLiAgQm90aCBhcmdzIG11c3QgYmUgZnJlZSBvZiBzaWRl
IGVmZmVjdHMuICAqLwogI2RlZmluZSBBTElHTl9UTyh2YWwsIGFsaWdubWVudCkgXAo=
--000000000000e75f1605705ded47--


From debbugs-submit-bounces@debbugs.gnu.org Fri Jul 06 21:39:13 2018
Received: (at 32073) by debbugs.gnu.org; 7 Jul 2018 01:39:13 +0000
Received: from localhost ([127.0.0.1]:48957 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1fbcBs-0000ed-OU
	for submit@debbugs.gnu.org; Fri, 06 Jul 2018 21:39:13 -0400
Received: from mail-it0-f49.google.com ([209.85.214.49]:54285)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sh@discovergy.com>) id 1fbc4L-0000Sc-O0
 for 32073@debbugs.gnu.org; Fri, 06 Jul 2018 21:31:26 -0400
Received: by mail-it0-f49.google.com with SMTP id s7-v6so18707912itb.4
 for <32073@debbugs.gnu.org>; Fri, 06 Jul 2018 18:31:25 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=discovergy-com.20150623.gappssmtp.com; s=20150623;
 h=mime-version:in-reply-to:references:from:date:message-id:subject:to
 :cc; bh=moB9xXnht8Ivaa6H7SGXQS7xXcWeF8ztpA7QCKa2BT4=;
 b=bLqNZb71KPy2mG0vNyuyHEeRYm904p/g6KRsezoGV7fzUqdmYb+kf9BhNAAL2b3uNX
 EZS7Mkdk+wtgo787UcgZCPdzLsgB4Xx4XWz6+DdEV7GlXKDzCciLV+7xZf8CLThTVsqO
 ANycURMEcfIb8XOOKkywhequHiDPzuGjA+mCL8XbTQ85KlCtIy6Wi9m/UaH3DbF6MpQf
 m+iyBtopRtUMcO5vwaLX8jA5Z5mqzvW1z7TQrgzeOR6X0WaWp3964Rn0uRW3JU4i+nOR
 SUfDDvlxAM9Uv5rcCH6QXFHSKysTf6GQLABCezImn7rNgnnu0DfYsJlCAepPO/3DSwyB
 2ajg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:in-reply-to:references:from:date
 :message-id:subject:to:cc;
 bh=moB9xXnht8Ivaa6H7SGXQS7xXcWeF8ztpA7QCKa2BT4=;
 b=TJo1fVsRWbp+azwzaVnH3qm9j43mqh06Jr6+A8x3WInMKafRmapHzxVSZe0wzqvWki
 0BHrtABV03cUduLLrIAF7VuPO0JhbHPM1z/DW2MxrpbHcbdYc36CkcZ8w4anA9Ugdhy3
 EFm07/0b5RWnq1A3UFDn/hkcc+jl+vx4NguDzsq2vr/pNcb65hBiVMFu5IAgsda6X6jE
 5+KZz2OCcVXGBE18HKL5qXIIc+nQwy0shI3h/qVo4f4ccpy2rgk8jlhv9pW7g9/and0T
 kjMlgfihEsLTcKmyR7DTE3K+pwP9YRZXTXgj7eaWhds+NDExYxO3CW9tcHMFIo1+PVt7
 /sHA==
X-Gm-Message-State: APt69E25xAaOkhKgd8peujqWEJOl4JV0PGYTKYKkK2Fc+0CgdTE5G00N
 74Sf2u9htI5ECihtHlTcygvRb72EZZ9LiVTCX+1yyw==
X-Google-Smtp-Source: AAOMgpcEMTNY6K71KFEu+OBwvA3lpDLV0oMhfUmxaofiJR2wF4p/DaEshJq7+Vz+4+8CU1NGhLNdV/5thsc9LKdvNhY=
X-Received: by 2002:a24:cf57:: with SMTP id
 y84-v6mr10031863itf.98.1530927080155; 
 Fri, 06 Jul 2018 18:31:20 -0700 (PDT)
MIME-Version: 1.0
Received: by 2002:a02:1b98:0:0:0:0:0 with HTTP;
 Fri, 6 Jul 2018 18:31:19 -0700 (PDT)
In-Reply-To: <CA+8g5KFkFjPKLKLAeu8EiiU+pKsu89VKsvbRzc94_0xGShadZA@mail.gmail.com>
References: <CAD-3cdeVqR_pvxSmayD=5tDpi8Cpze_ck64gssgoYvjV98No9g@mail.gmail.com>
 <CA+8g5KFkFjPKLKLAeu8EiiU+pKsu89VKsvbRzc94_0xGShadZA@mail.gmail.com>
From: Sergiu Hlihor <sh@discovergy.com>
Date: Sat, 7 Jul 2018 03:31:19 +0200
Message-ID: <CAD-3cdf6upYf6NjgFTZGHXbz6b-e6wCw+1A=LT8VMZxnK5q-6w@mail.gmail.com>
Subject: Re: bug#32073: Improvements in Grep
To: Jim Meyering <jim@meyering.net>
Content-Type: multipart/alternative; boundary="000000000000ca462d05705ebc23"
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 32073
X-Mailman-Approved-At: Fri, 06 Jul 2018 21:39:11 -0400
Cc: 32073@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

--000000000000ca462d05705ebc23
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

To add: the increase to 128KiB is good, but for RAID arrays under light to
medium load it is not sufficient. On a system without any load, the HDD
can read ahead and always serve the next request from its buffer, thus
reading at the full sequential speed of ~200MB/s. In a RAID 10
configuration with 12 HDDs where the strip size is set to 128KB, each HDD
is hit only on every 6th request. There is enough delay between reads
hitting the same drive that the read-ahead buffer often gets discarded,
which basically limits the throughput to max IOPS x buffer size =3D
~10-20MiB/s for 128KiB.
I have such systems in production environments and I often see read speeds
under 10MiB/s and read await >10ms, which means that the read-ahead buffer
has already been discarded. Under the same load conditions, if I read the
data using utilities which can use a 512KiB buffer size, I see read speeds
varying between 50 and 400MiB/s. Grep has an average CPU load of 2-3% of
the given machine at such low read rates, so it could do much more if
reading were optimized.
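
The arithmetic in the message above can be written out as a small
sketch. The 80 and 160 IOPS figures used in the note after the code are
back-calculated from the 10-20MiB range quoted for 128KiB requests, not
measured, and the model (one random I/O per request, 12 drives treated
as 6 mirrored pairs striped round-robin) is a simplification. The
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in MiB/s when every request costs one random I/O:
   IOPS requests per second, each REQUEST_KIB KiB long.  */
double
throughput_mib_per_s (double iops, double request_kib)
{
  assert (iops >= 0 && request_kib >= 0);
  return iops * request_kib / 1024.0;
}

/* Index of the stripe member (mirrored pair) that serves the byte at
   OFFSET, for strips of STRIP_SIZE bytes laid out round-robin over
   MEMBERS members.  */
unsigned
stripe_member (uint64_t offset, uint64_t strip_size, unsigned members)
{
  assert (strip_size > 0 && members > 0);
  return (unsigned) ((offset / strip_size) % members);
}
```

With these assumptions, 80-160 IOPS at 128KiB gives 10-20 MiB/s, and
the same IOPS at 512KiB gives 40-80 MiB/s, which overlaps the lower end
of the 50-400MiB range reported above. Consecutive 128KiB requests land
on members 0 through 5 and only then return to member 0.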

On 7 July 2018 at 02:33, Jim Meyering <jim@meyering.net> wrote:

> On Fri, Jul 6, 2018 at 9:26 AM, Sergiu Hlihor <sh@discovergy.com> wrote:
> > Hello,
> >      I'm using grep on Ubuntu Server 14.04 (grep version 2.16). While
> > grepping large files I've noticed that grep is painfully slow. The
> > bottleneck seems to be the read block size, which is extremely small
> > (it looks like 64KB). For large files residing on big HDD RAID arrays,
> > such a request barely reaches one drive and, judging by CPU usage,
> > grep is mostly idle. Given my tests for such scenarios, a read block
> > size of at least 512KB would be way more efficient. It's very likely
> > that the optimum would be 1MB+. Such an increase in buffer size would
> > also slightly benefit SSDs, where maximum sequential throughput is
> > usually achieved when reading at 256KB+ block sizes.
> >      If this is already possible in newer versions or is configurable,
> > I'd appreciate a hint about which version supports it or how I can
> > configure grep to increase the read block size.
>
> Thanks for raising the issue.
> This makes me think we should follow Coreutils' lead[0] and increase
> grep's initial buffer size from 32KiB, probably to 128KiB. I will time
> with the attached diff on a few systems.
>
> [0] https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=3D
> v8.22-103-g74ca6e84c
>


-- 
_____________________________________________

Senior Software Engineer & Team leader

Phone: +49 (0) 6221 7787-481

Email: sh@discovergy.com

*Discovergy GmbH*
_____________________________________________

Court of registration: Amtsgericht Aachen, HRB 15391

Managing directors: Ralf Esser | Bernhard Seidl | Nikolaus Starzacher
This e-mail and any attached files are intended only for the recipient named
above and may contain confidential information. If you are not the intended
recipient, any distribution, forwarding, or copying is prohibited. If you have
received this e-mail in error, please return it or notify the sender
immediately using the contact details above, and delete this message
promptly. Thank you.

--000000000000ca462d05705ebc23--


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 02:53:03 2020
Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 07:53:03 +0000
Received: from localhost ([127.0.0.1]:35593 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imYoR-000097-DS
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 02:53:03 -0500
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:49318)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@cs.ucla.edu>) id 1imYoO-00008c-SA
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 02:53:01 -0500
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 4988716008F;
 Tue, 31 Dec 2019 23:52:55 -0800 (PST)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id Gxmu9XNl4O-w; Tue, 31 Dec 2019 23:52:54 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 9C3A716022A;
 Tue, 31 Dec 2019 23:52:54 -0800 (PST)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id EswyY5VL8zaA; Tue, 31 Dec 2019 23:52:54 -0800 (PST)
Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com
 [23.242.74.103])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 6D60516008F;
 Tue, 31 Dec 2019 23:52:54 -0800 (PST)
To: Sergiu Hlihor <sh@discovergy.com>
From: Paul Eggert <eggert@cs.ucla.edu>
Organization: UCLA Computer Science Department
Subject: Re: Improvements in Grep (Bug#32073)
Message-ID: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
Date: Tue, 31 Dec 2019 23:52:54 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.2.2
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, Dennis Clarke <dclarke@blastwave.org>,
 Jim Meyering <jim@meyering.net>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

> This makes me think we should follow Coreutils' lead[0] and increase
> grep's initial buffer size from 32KiB, probably to 128KiB.

I see that Jim later installed a patch increasing it to 96 KiB.

Whatever number is chosen, it's "wrong" for some configuration. And I suppose
the particular configuration that Sergiu Hlihor mentioned could be tweaked so
that it worked better with grep (and with other programs).

I'm inclined to mark this bug report as a wishlist item, in the sense that it'd
be nice if grep and/or the OS could pick buffer sizes more intelligently (though
it's not clear how grep and/or the OS could go about this).


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 02:53:29 2020
Received: (at control) by debbugs.gnu.org; 1 Jan 2020 07:53:29 +0000
Received: from localhost ([127.0.0.1]:35596 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imYor-00009n-M7
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 02:53:29 -0500
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:49386)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@cs.ucla.edu>) id 1imYop-00009Y-8O
 for control@debbugs.gnu.org; Wed, 01 Jan 2020 02:53:28 -0500
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id D537616008F
 for <control@debbugs.gnu.org>; Tue, 31 Dec 2019 23:53:19 -0800 (PST)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id 3PAb7AcBSNSW for <control@debbugs.gnu.org>;
 Tue, 31 Dec 2019 23:53:19 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 41FB716022A
 for <control@debbugs.gnu.org>; Tue, 31 Dec 2019 23:53:19 -0800 (PST)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id qk7gh82mTcLL for <control@debbugs.gnu.org>;
 Tue, 31 Dec 2019 23:53:19 -0800 (PST)
Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com
 [23.242.74.103])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 24AEC16008F
 for <control@debbugs.gnu.org>; Tue, 31 Dec 2019 23:53:19 -0800 (PST)
To: control@debbugs.gnu.org
From: Paul Eggert <eggert@cs.ucla.edu>
Subject: 32073 is wishlist
Organization: UCLA Computer Science Department
Message-ID: <4ce2bf47-cf95-a1c9-92cd-a351983cd23f@cs.ucla.edu>
Date: Tue, 31 Dec 2019 23:53:18 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.2.2
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: control
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

severity 32073 wishlist


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 04:15:37 2020
Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 09:15:37 +0000
Received: from localhost ([127.0.0.1]:35621 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1ima6K-0003yV-Uc
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 04:15:37 -0500
Received: from mail-io1-f50.google.com ([209.85.166.50]:34884)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sh@discovergy.com>) id 1ima6I-0003yF-QY
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 04:15:36 -0500
Received: by mail-io1-f50.google.com with SMTP id v18so35842758iol.2
 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 01:15:34 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=discovergy-com.20150623.gappssmtp.com; s=20150623;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=8gHSYNXJGtNZ3y8e9Nw8xBL74BT4jrQGosamVRHAPwE=;
 b=RHY9/EZNzOLQbXdM8yE7Em+XryBaYVTCsL6kzppIApUrQBPapaZIJ+YLRTFKJayFQ5
 zmrCvfR4WiNuREW6XOV3bU590JIE3dcFucwcjYuHFQRB3vsA7728et+Xkxfz3I+JinAj
 kUWosCOKB+hgpJLZfYI5V/GS3pE6lgfqgDmYtR0ywh4e7yMcdCV7ar1YzcggMSnC0qjl
 41d03g7n5dWawEmvqedFvgX0njyaojVViK7++X+q43XLrSvMC2GzLay8RHiLdE+BLir9
 1jxuH13Y+oBsiqwA+wk5X/cxjdqXbvYm685Yyr0QIjaxIVx4ScPaJPZ0AmWLG7Lj6/7W
 yf6Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=8gHSYNXJGtNZ3y8e9Nw8xBL74BT4jrQGosamVRHAPwE=;
 b=a3vUjVYpkkDSnTuJg6qU2hPm2rzPsoe28rpKH9L15sRe78vtWhFXzOxdZgb8kchlH+
 4fRWCRRUR4pjAMNGUq1o+ZRG4N06jmApjKC3b3lafFsk5VIb6S+or5V+xIljQcLwF9EF
 kw1jnOf3gs4qjTFOG7LZHcWY8mtgmef01YYJ4fhj4AwhkY2lJRdoaorZnf8xS4H8/s83
 pppQgvZCmA5J8QSKcnMLaU2/80k2rAvVjwa+vB5gABKR6c8pGXxzVxysyUb4DkyZPtR3
 ww4/tJljbviR29fqVNTARspTgTpLGWwhbuuhKx1ZdFF+aisvS/Z3kN+LQQfbxIIaEOY6
 llBQ==
X-Gm-Message-State: APjAAAVudOcOVwHsECbWShBB7sRiFc0qJqC5DetlwIBP1zWr62tZiCBq
 A9No8KAFnKOj9/qQRQabPR7sLP3AdHjtW0WrYIRhYA==
X-Google-Smtp-Source: APXvYqyqv9lqodjCG3Jkez2qtpMRbIhjIlGYB6bhFu6so69gdZ5KPSB8/uvgzj4GZl/qdlbNcZmgcB8fLcP/2+wj+38=
X-Received: by 2002:a02:864b:: with SMTP id e69mr58953496jai.83.1577870129071; 
 Wed, 01 Jan 2020 01:15:29 -0800 (PST)
MIME-Version: 1.0
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
In-Reply-To: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
From: Sergiu Hlihor <sh@discovergy.com>
Date: Wed, 1 Jan 2020 10:15:16 +0100
Message-ID: <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
Subject: Re: Improvements in Grep (Bug#32073)
To: Paul Eggert <eggert@cs.ucla.edu>
Content-Type: multipart/alternative; boundary="0000000000008b9f5e059b1084ee"
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, Dennis Clarke <dclarke@blastwave.org>,
 Jim Meyering <jim@meyering.net>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

--0000000000008b9f5e059b1084ee
Content-Type: text/plain; charset="UTF-8"

This topic is getting more and more frustrating. If you rely on the OS, then
you are at the mercy of whatever read-ahead configuration you have. And
read-ahead is typically 128KB, so it does not help that much. An HDD RAID 10
array with 12 disks and a strip size of 128KB reaches its maximum read
throughput if the read block size is 6 * 128 = 768KB. When issuing read
requests of 128KB, you only hit one HDD, getting 1/6 of the read throughput.
The same holds for flash: a state-of-the-art SSD that can do 5GB/s reads can
actually do around 1GB/s or less at a 128KB block size. Why is it so hard to
understand how hardware works and the fact that you need huge block sizes
to actually read at full speed? Why not just expose the read buffer size
as a configurable parameter, so that anyone can tune it as needed? 96KB
is far too small.
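
The stripe arithmetic above can be sketched as follows. This is only an
illustration under stated assumptions: requests are aligned to strip
boundaries, and each mirror pair of the RAID 10 array counts as one data
drive, which is how the 12-disk example arrives at 6 strips. The function
names are made up for this sketch.

```python
def full_stripe_kib(disks: int, strip_kib: int, mirror_copies: int = 2) -> int:
    """Size of one full stripe: one strip per mirror group."""
    if disks % mirror_copies:
        raise ValueError("disk count must be a multiple of the mirror copies")
    return (disks // mirror_copies) * strip_kib


def data_drives_touched(request_kib: int, strip_kib: int, data_drives: int) -> int:
    """Number of mirror groups an aligned request of this size spans."""
    strips = -(-request_kib // strip_kib)  # ceiling division
    return min(strips, data_drives)


stripe = full_stripe_kib(disks=12, strip_kib=128)
print("full stripe:", stripe, "KiB")
for request in (96, 128, 512, stripe):
    touched = data_drives_touched(request, 128, 6)
    print(f"{request} KiB request touches {touched} of 6 data drives")
```

Under these assumptions a 128 KiB request keeps one of six data drives busy,
while a 768 KiB request spans all six.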

On Wed, 1 Jan 2020 at 08:52, Paul Eggert <eggert@cs.ucla.edu> wrote:

> > This makes me think we should follow Coreutils' lead[0] and increase
> > grep's initial buffer size from 32KiB, probably to 128KiB.
>
> I see that Jim later installed a patch increasing it to 96 KiB.
>
> Whatever number is chosen, it's "wrong" for some configuration. And I suppose
> the particular configuration that Sergiu Hlihor mentioned could be tweaked so
> that it worked better with grep (and with other programs).
>
> I'm inclined to mark this bug report as a wishlist item, in the sense that it'd
> be nice if grep and/or the OS could pick buffer sizes more intelligently (though
> it's not clear how grep and/or the OS could go about this).
>

--0000000000008b9f5e059b1084ee--


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 06:19:34 2020
Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 11:19:34 +0000
Received: from localhost ([127.0.0.1]:35683 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imc2H-0006qy-PG
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 06:19:34 -0500
Received: from freefriends.org ([96.88.95.60]:44578)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <arnold@skeeve.com>) id 1imc2F-0006qq-G8
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 06:19:32 -0500
X-Envelope-From: arnold@skeeve.com
Received: from freefriends.org (freefriends.org [96.88.95.60])
 by freefriends.org (8.14.7/8.14.7) with ESMTP id 001BJN5u027995
 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); 
 Wed, 1 Jan 2020 04:19:23 -0700
Received: (from arnold@localhost)
 by freefriends.org (8.14.7/8.14.7/Submit) id 001BJMYA027994;
 Wed, 1 Jan 2020 04:19:22 -0700
From: arnold@skeeve.com
Message-Id: <202001011119.001BJMYA027994@freefriends.org>
X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to
 arnold@skeeve.com using -f
Date: Wed, 01 Jan 2020 04:19:22 -0700
To: sh@discovergy.com, eggert@cs.ucla.edu
Subject: Re: bug#32073: Improvements in Grep (Bug#32073)
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
In-Reply-To: <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
User-Agent: Heirloom mailx 12.5 7/5/10
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Spam-Score: 0.1 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -0.9 (/)

As a quite serious question, how is someone writing user-level code
supposed to be able to figure out the right buffer size for a particular
file, and to do so portably? ("Show me the code.")

Gawk bases its reads on the st_blksize member in struct stat.  That will
typically be something like 4K - not nearly enough, given your description
below.
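
For reference, the st_blksize lookup described above looks roughly like this
when done from Python rather than C. The fallback value and the
read_size_hint() policy (round a 128 KiB floor up to a multiple of the block
size) are assumptions made for this sketch; they are not what gawk or grep do.

```python
import os
import tempfile


def preferred_block_size(path: str, fallback: int = 4096) -> int:
    """Return st_blksize for path, or fallback where the field is missing."""
    size = getattr(os.stat(path), "st_blksize", 0)
    return size if size > 0 else fallback


def read_size_hint(path: str, floor: int = 128 * 1024) -> int:
    """Round floor up to a whole multiple of the preferred block size."""
    block = preferred_block_size(path)
    return -(-floor // block) * block  # ceiling division, then scale back up


with tempfile.NamedTemporaryFile() as tmp:
    print("st_blksize:", preferred_block_size(tmp.name))
    print("read size hint:", read_size_hint(tmp.name))
```

The value reported depends on the file system and says nothing about the
RAID geometry underneath it, which is the portability problem raised above.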

Arnold

Sergiu Hlihor <sh@discovergy.com> wrote:

> This topic is getting more and more frustrating. If you rely on the OS, then
> you are at the mercy of whatever read-ahead configuration you have. And
> read-ahead is typically 128KB, so it does not help that much. An HDD RAID 10
> array with 12 disks and a strip size of 128KB reaches its maximum read
> throughput if the read block size is 6 * 128 = 768KB. When issuing read
> requests of 128KB, you only hit one HDD, getting 1/6 of the read throughput.
> The same holds for flash: a state-of-the-art SSD that can do 5GB/s reads can
> actually do around 1GB/s or less at a 128KB block size. Why is it so hard to
> understand how hardware works and the fact that you need huge block sizes
> to actually read at full speed? Why not just expose the read buffer size
> as a configurable parameter, so that anyone can tune it as needed? 96KB
> is far too small.
>
> On Wed, 1 Jan 2020 at 08:52, Paul Eggert <eggert@cs.ucla.edu> wrote:
>
> > > This makes me think we should follow Coreutils' lead[0] and increase
> > > grep's initial buffer size from 32KiB, probably to 128KiB.
> >
> > I see that Jim later installed a patch increasing it to 96 KiB.
> >
> > Whatever number is chosen, it's "wrong" for some configuration. And I suppose
> > the particular configuration that Sergiu Hlihor mentioned could be tweaked so
> > that it worked better with grep (and with other programs).
> >
> > I'm inclined to mark this bug report as a wishlist item, in the sense that it'd
> > be nice if grep and/or the OS could pick buffer sizes more intelligently (though
> > it's not clear how grep and/or the OS could go about this).
> >


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 06:27:57 2020
Received: (at submit) by debbugs.gnu.org; 1 Jan 2020 11:27:57 +0000
Received: from localhost ([127.0.0.1]:35689 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imcAO-00077b-V7
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 06:27:57 -0500
Received: from lists.gnu.org ([209.51.188.17]:46445)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <pj@usa.net>) id 1imcAM-00077P-VV
 for submit@debbugs.gnu.org; Wed, 01 Jan 2020 06:27:55 -0500
Received: from eggs.gnu.org ([2001:470:142:3::10]:37966)
 by lists.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <pj@usa.net>) id 1imcAL-0007y8-Jv
 for bug-grep@gnu.org; Wed, 01 Jan 2020 06:27:54 -0500
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.1 required=5.0 tests=BAYES_50,RCVD_IN_DNSWL_LOW,
 URIBL_BLOCKED autolearn=disabled version=3.3.2
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <pj@usa.net>) id 1imcAK-0000fx-G8
 for bug-grep@gnu.org; Wed, 01 Jan 2020 06:27:53 -0500
Received: from wout2-smtp.messagingengine.com ([64.147.123.25]:53503)
 by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <pj@usa.net>) id 1imcAK-0000cX-6N
 for bug-grep@gnu.org; Wed, 01 Jan 2020 06:27:52 -0500
Received: from compute1.internal (compute1.nyi.internal [10.202.2.41])
 by mailout.west.internal (Postfix) with ESMTP id 2567A44F
 for <bug-grep@gnu.org>; Wed,  1 Jan 2020 06:27:50 -0500 (EST)
Received: from imap34 ([10.202.2.84])
 by compute1.internal (MEProxy); Wed, 01 Jan 2020 06:27:50 -0500
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=
 messagingengine.com; h=content-type:date:from:in-reply-to
 :message-id:mime-version:references:subject:to:x-me-proxy
 :x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm1; bh=3FPE13
 sLv9H+a6dWQRcMgOBbn4EKJMJWiX4CmxajVgQ=; b=jFIOhRXG5TxSZfp8sSbsYf
 atLO6F0EBVwJYVgqpV/PMbFcbDL2NxxGv61We/kSEGFAmWgRqA528MvU6sUnVs8J
 tUU/yq2kUq9SJZy7FfUvbF/mBFZnM5y48hEeE0I60qKPmHxr7Tf1MhLOKeK6Tf+9
 LdVh4fZq+LDjbe5BaJBcteOMUids9+LWeT1wh8J+kyeqKDQc3mSf6KPmGqYcCC1Z
 xlVDjql840uOD33Dc3hNGLwGBYm/6AWbDmRwXArH8EwTQQHfopWf5YdQ5qW64AVL
 mySB2nVL/IFaWGISNxvBNej/1ervduOtlMel4YIJLSH0+BFKRdP1S16dwKeJ8kyQ
 ==
X-ME-Sender: <xms:NYIMXhbhPRcCmdkpGXxnM6RFlGNP8sT50ATVRU7DjQeMQw--sWuJVA>
X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedufedrvdefledgvdejucetufdoteggodetrfdotf
 fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfqfgfvpdfurfetoffkrfgpnffqhgen
 uceurghilhhouhhtmecufedttdenucenucfjughrpefofgggkfgjfhffhffvufgtsehttd
 ertderredtnecuhfhrohhmpedfrfgruhhlucflrggtkhhsohhnfdcuoehpjhesuhhsrgdr
 nhgvtheqnecurfgrrhgrmhepmhgrihhlfhhrohhmpehpjhesuhhsrgdrnhgvthenucevlh
 hushhtvghrufhiiigvpedt
X-ME-Proxy: <xmx:NYIMXt65eVy_B0YMmKgbR9ZMpE1FfGJx45Mfq72ns2w4jX0V93Pwng>
 <xmx:NYIMXsQulPtDXZFUjlULVZMQ22vjAKn40HcUqhy6qNiWiDVjtK4frA>
 <xmx:NYIMXtvAgvB-piXUe9Bc9YgW4mbhfQ7I9zZRwfGAs8OrOXYtjuEo5Q>
 <xmx:NYIMXqF_ko3gryJj2-oPXJOL3Zz4bgKTGQkQzqkrJMyG2EIbjX2KXg>
Received: by mailuser.nyi.internal (Postfix, from userid 501)
 id 5E5C11460061; Wed,  1 Jan 2020 06:27:49 -0500 (EST)
X-Mailer: MessagingEngine.com Webmail Interface
User-Agent: Cyrus-JMAP/3.1.7-694-gd5bab98-fmstable-20191218v1
Mime-Version: 1.0
Message-Id: <a59adc1e-64af-44bd-b3aa-8821a7fe354b@www.fastmail.com>
In-Reply-To: <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
Date: Wed, 01 Jan 2020 05:26:04 -0600
From: "Paul Jackson" <pj@usa.net>
To: bug-grep@gnu.org
Subject: Re: bug#32073: Improvements in Grep (Bug#32073)
Content-Type: text/plain
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
 [fuzzy]
X-Received-From: 64.147.123.25
X-Spam-Score: -1.6 (-)
X-Debbugs-Envelope-To: submit
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -2.6 (--)

>>  Why not just expose the read buffer size as a configurable parameter ...

Take a look at the (and I quote) "Hairy buffering mechanism for grep"
input buffering code in the grep source file grep-3.3/src/grep.c, then
you tell me why it's not a runtime variable parameter <grin>.

In other words, the input (and output) i/o buffering and performance
tuning for various situations and kinds of files has been tuned and
refined over many years.  Doing something to the code, such as
making buffer size a run time adjustable parameter, would probably
not be easy, would risk making one usage of grep slower in order
to make some other usage faster, and would risk some nasty bugs.
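
To make the trade-off concrete, here is a minimal sketch of a line scanner
whose read size is a run-time parameter. It is not grep's buffering code and
makes no claim about how grep works internally; the SCAN_BUFSIZE_KIB
environment variable and both function names are hypothetical. Even this toy
version has to carry partial lines across buffer boundaries, which hints at
why the real code is delicate.

```python
import io
import os

DEFAULT_BUFSIZE = 96 * 1024  # the default size mentioned in this thread


def buffer_size_from_env(var: str = "SCAN_BUFSIZE_KIB") -> int:
    """Read a buffer size in KiB from a (hypothetical) environment variable."""
    raw = os.environ.get(var)
    if raw is None:
        return DEFAULT_BUFSIZE
    kib = int(raw)
    if kib <= 0:
        raise ValueError(f"{var} must be a positive number of KiB")
    return kib * 1024


def count_matching_lines(stream, needle: bytes, bufsize: int) -> int:
    """Count lines containing needle, reading at most bufsize bytes per call."""
    count = 0
    tail = b""  # partial line left over from the previous read
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        data = tail + chunk
        cut = data.rfind(b"\n") + 1  # 0 when the buffer holds no newline
        if cut:
            # Drop the final newline so split() yields only complete lines.
            for line in data[:cut - 1].split(b"\n"):
                if needle in line:
                    count += 1
        tail = data[cut:]
    if tail and needle in tail:  # last line without a trailing newline
        count += 1
    return count


sample = b"foo\nbar foo\nbaz\nfoo"
for size in (1, 3, buffer_size_from_env()):
    print(size, count_matching_lines(io.BytesIO(sample), b"foo", size))
```

The result must be the same for every buffer size; only the number and size
of the read calls change.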

-- 
                Paul Jackson
                pj@usa.net


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 14:07:11 2020
Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 19:07:11 +0000
Received: from localhost ([127.0.0.1]:37583 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imjKo-0006V4-Aa
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 14:07:11 -0500
Received: from mail-il1-f169.google.com ([209.85.166.169]:47082)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sh@discovergy.com>) id 1imjKm-0006Us-9C
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 14:07:09 -0500
Received: by mail-il1-f169.google.com with SMTP id t17so32599947ilm.13
 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 11:07:08 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=discovergy-com.20150623.gappssmtp.com; s=20150623;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=GE5BFtR8fRKW8SyVt2Lsf4LSkbRzd93/7WiEMgl11gY=;
 b=wLcj4eRokODQL0JaGa9c6cAKCu7JfsifRZmzw4C7SXX44Gq6qvCLR3b4mAPh/l+UMS
 VJtMKBnP5BTRlNwsYtNlGgi0CSXPRFTsIAkfQ8lrqBEEW1IfX7uEBCmL3CF28vSbeB//
 gMcALYDiBGg853Ma2cuTs5epE4zWXpYU+giu6yabLP2U63D37ERXXON9PRheQS7ZyXKZ
 6nkO1Ke1MiyBHx3cx0unMYYEeesQLZOIQQJjXN9XP5ZDvrpTwC+NrPBpHKGIRWeA8YBn
 CgjZajjSxDV3mC4V4zrr8EMB4tYv4y5VebZ6EISUtPKNZ77r2Sx10pP5/x1fBHnWE7rh
 jt4A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=GE5BFtR8fRKW8SyVt2Lsf4LSkbRzd93/7WiEMgl11gY=;
 b=Io4LA7W5lP/3oOKN9v6oTszdXzyWwiM5t8cwr8UNetOImddPwsWACTUZrDxz3fa5GU
 EvVQab9f6IGIRlGqhJ5QgSIDY3iwqZUhDnIWaXL24kLGrdj1LDnwD4kX9sWrB7zrKf5x
 q7iypdVIKlpVpcpgPCDqHGodSsecsmwq6lZyMGLeTojrFImwqK81vFr8MXND06UDWmQJ
 pjHMBEeX9tqpOHVNX+gh4CyXErHgdsWHmQLrlFMcvDoVZpAGSgzKbCGaVrlomgO3crNy
 EWkY9N18muh4DfbXmS+g3jqh77DvrRB9kSIWnqUkwMjw3r34Z5k2XV8H6QHU/QbH5MLz
 5KxA==
X-Gm-Message-State: APjAAAW/eUaiM5HWABC1RF84tL87+fcjejLYjs9oxjD1Fqozy87K0EPi
 w644Ffmtoe5cEW6dGBPBPeLkhsfcsiR5P3pbCfudHmTb2MQ=
X-Google-Smtp-Source: APXvYqz3EOZRxGhoBIXUl1bc21QJ1eX+eLgDIJhaZSMVcJq8KpODZtuA8VFHuuYcMl86wEhSgL1v7c3L5rMyGkbvsAc=
X-Received: by 2002:a92:2804:: with SMTP id l4mr66440415ilf.136.1577905622626; 
 Wed, 01 Jan 2020 11:07:02 -0800 (PST)
MIME-Version: 1.0
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <202001011119.001BJMYA027994@freefriends.org>
In-Reply-To: <202001011119.001BJMYA027994@freefriends.org>
From: Sergiu Hlihor <sh@discovergy.com>
Date: Wed, 1 Jan 2020 20:06:39 +0100
Message-ID: <CAD-3cdeVbf3TVwFyj7NFd5d5_gTXugTb8_=x9aTjGE4+ufHggQ@mail.gmail.com>
Subject: Re: bug#32073: Improvements in Grep (Bug#32073)
To: arnold@skeeve.com
Content-Type: multipart/alternative; boundary="000000000000204a27059b18c80b"
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, Paul Eggert <eggert@cs.ucla.edu>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

--000000000000204a27059b18c80b
Content-Type: text/plain; charset="UTF-8"

Arnold, there is no need to write user code; it is already done in
benchmarks. One of the standard benchmarks when testing HDDs and SSDs is
read throughput vs. block size at different queue depths. Take a look
at this:
https://www.servethehome.com/wp-content/uploads/2019/12/Corsair-Force-MP600-1TB-ATTO.jpg
In this benchmark, at queue depth 4 and 128KB block size, the SSD was not
yet able to achieve its maximum throughput of 5GB/s. Moreover, if you
extrapolate the results to a queue depth of 1, you get about 1.2GB/s out
of over 5GB/s theoretical. Therefore, for this particular model you need to
issue read requests with a block size of at least 512KB to achieve maximum
throughput. With hard drives I already explained the issue. I have a
production server where the HDD RAID array can theoretically do 2.5GB/s and
I see sustained read speeds over 500MB/s when large block sizes are used
for reads, yet when I use grep, I get a practical bandwidth of 20 to 50
MB/s. Moreover, when it comes to HDDs the math is quite simple. Here it
is for a standard HDD at 7200 RPM, 240MB/s:
7200 RPM => 120 revolutions per second
240 MB/s at 120 revolutions per second => 2MB per revolution
One revolution time = 1000/120 => 8.33 ms
Read throughput per ms = 240KB

Worst case scenario: each read request requires a full revolution to reach
the data (head positioning is done concurrently and can be ignored).
Seek time (one full revolution): 8.33ms
At 96KB:
 - Read time: 0.4ms
 - Total read latency = 8.33 + 0.4 = 8.73ms, read throughput = 1000 /
8.73 * 96KB = 11MB/s
At 512KB:
 - Read time: 2.13ms
 - Total read latency = 8.33 + 2.13 = 10.46ms, read throughput = 1000 /
10.46 * 512KB = 49MB/s
In practice the average seek latency is 4.16ms, so throughput is up to
double. This is the cold hard reality. When each one of you is testing, you
are very likely deceived by testing on *one HDD, on an idle system* where
nothing else, such as a database, is consuming IO in the background. In
such an ideal scenario you do see 240MB/s, because HDDs also read ahead,
and by the time the data is transferred over the interface and consumed,
the next chunk is in the buffer and can be delivered with apparently zero
seek time. This means the first read takes 4ms and the next ones take
0.1ms. With a *HDD RAID array on a server where your IO is always at 50%
load*, if you have a stripe size of 128KB or more, you are hitting one
drive at a time, each one with a penalty of 4.16ms. And due to the constant
load, by the time you hit the first HDD again, the read-ahead buffer
maintained by the HDD itself has been discarded, so all reads go directly
to the physical medium. If, however, you hit all HDDs at the same time, you
will benefit from the HDD's read ahead for at least one or more cycles,
thus having reads with apparently zero latency and a much higher average
bandwidth. The cost of reading from all HDDs at the same time is potential
extra latency for all other running applications. This is why the value
should be configurable, so that the best value can be set based on the
hardware. The issue of large block sizes for IO operations is widespread
across all tools from Linux, like rsync or cp, and it's only getting worse,
to an extent where in my company we are considering writing our own tools
for something that should have worked out of the box. One side issue, which
I have to mention as I'm not aware of the implementation details: as we are
getting into GB/s territory, reading is best done in its own thread, which
then serves the output to the processing thread. With SSDs that can do
multiple GB/s this matters.
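
The per-request arithmetic above can be reproduced with a small model.
This is only a sketch under the same simplifying assumptions as the
message (a fixed positioning latency paid before every read, a constant
240MB/s media rate, decimal units); the function and constant names are
illustrative and do not come from grep or any other tool.

```python
def hdd_read_throughput_mb_s(block_kb, latency_ms, media_rate_mb_s=240.0):
    """Sustained MB/s when every read of block_kb pays latency_ms first."""
    # 1 MB/s equals 1 KB/ms in decimal units, so KB / (MB/s) yields ms.
    transfer_ms = block_kb / media_rate_mb_s
    reads_per_second = 1000.0 / (latency_ms + transfer_ms)
    return reads_per_second * block_kb / 1000.0


FULL_REVOLUTION_MS = 60_000.0 / 7200          # 8.33 ms at 7200 RPM
AVERAGE_LATENCY_MS = FULL_REVOLUTION_MS / 2   # half a revolution on average

for block_kb in (96, 128, 512, 768):
    worst = hdd_read_throughput_mb_s(block_kb, FULL_REVOLUTION_MS)
    average = hdd_read_throughput_mb_s(block_kb, AVERAGE_LATENCY_MS)
    print(f"{block_kb:4d}KB: worst {worst:5.1f} MB/s, average {average:5.1f} MB/s")
```

The model ignores drive read-ahead and competing load, so it only shows
how strongly the fixed latency dominates small requests.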


On Wed, 1 Jan 2020 at 12:19, <arnold@skeeve.com> wrote:

> As a quite serious question, how is someone writing user-level code
> supposed to be able to figure out the right buffer size for a particular
> file, and to do so portably? ("Show me the code.")
>
> Gawk bases its reads on the st_blksize member in struct stat.  That will
> typically be something like 4K - not nearly enough, given your description
> below.
>
> Arnold
>
> Sergiu Hlihor <sh@discovergy.com> wrote:
>
> > This topic is getting more and more frustrating. If you rely on OS, then
> > you are at the mercy of whatever read ahead configuration you have. And
> > read ahead is typically 128KB so does not help that much. A HDD RAID 10
> > array with 12 disks and a strip size of 128KB reaches the maximum read
> > throughput if read block size is 6 * 128 = 768KB. When issuing read
> > requests with 128KB , you only hit one HDD, having 1/6 read throughput.
> > With flash the same. A state of the art SSD that can do 5GB/s reads can
> > actually do around 1GB/s or less at 128KB block size. Why is so hard to
> > understand how hardware works and the fact that you need huge block sizes
> > to actually read at full speed? Why not just exposing the read buffer size
> > as a configurable parameter, then anyone can just tune it as needed? 96KB
> > is purely retarded.
> >
> > On Wed, 1 Jan 2020 at 08:52, Paul Eggert <eggert@cs.ucla.edu> wrote:
> >
> > > > This makes me think we should follow Coreutils' lead[0] and increase
> > > > grep's initial buffer size from 32KiB, probably to 128KiB.
> > >
> > > I see that Jim later installed a patch increasing it to 96 KiB.
> > >
> > > Whatever number is chosen, it's "wrong" for some configuration. And I
> > > suppose the particular configuration that Sergiu Hlihor mentioned could
> > > be tweaked so that it worked better with grep (and with other programs).
> > >
> > > I'm inclined to mark this bug report as a wishlist item, in the sense
> > > that it'd be nice if grep and/or the OS could pick buffer sizes more
> > > intelligently (though it's not clear how grep and/or the OS could go
> > > about this).
> > >
>

--000000000000204a27059b18c80b--


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 14:43:04 2020
Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 19:43:04 +0000
Received: from localhost ([127.0.0.1]:37595 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imjtY-0007LK-0Q
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 14:43:04 -0500
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:43186)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@cs.ucla.edu>) id 1imjtV-0007Kk-3q
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 14:43:01 -0500
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id BF9D2160052;
 Wed,  1 Jan 2020 11:42:54 -0800 (PST)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id xp1ZcUe4sLgB; Wed,  1 Jan 2020 11:42:54 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 1B474160054;
 Wed,  1 Jan 2020 11:42:54 -0800 (PST)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id K68Jkv66INS6; Wed,  1 Jan 2020 11:42:54 -0800 (PST)
Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com
 [23.242.74.103])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id E1E10160052;
 Wed,  1 Jan 2020 11:42:53 -0800 (PST)
Subject: Re: Improvements in Grep (Bug#32073)
To: Sergiu Hlihor <sh@discovergy.com>
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
From: Paul Eggert <eggert@cs.ucla.edu>
Organization: UCLA Computer Science Department
Message-ID: <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu>
Date: Wed, 1 Jan 2020 11:42:53 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.2.2
MIME-Version: 1.0
In-Reply-To: <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, Dennis Clarke <dclarke@blastwave.org>,
 Jim Meyering <jim@meyering.net>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

On 1/1/20 1:15 AM, Sergiu Hlihor wrote:
> If you rely on OS, then
> you are at the mercy of whatever read ahead configuration you have.

Right, and whatever changes you make to the OS and its read-ahead configuration
will work for all applications, not just for 'grep'. So, change the OS to do
that. There shouldn't be a need to change 'grep' in particular (or 'cp' in
particular, or 'awk' in particular, etc.).

> The issue of large
> block sizes for IO operations is widespread across all tools from Linux,
> like rsync or cp and its only getting worse

Quite right. And it would be painful to have to modify all those tools, and to
maintain those modifications. So modify the OS instead. Scheduling read-ahead is
really the OS's job anyway.


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 15:04:59 2020
Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 20:04:59 +0000
Received: from localhost ([127.0.0.1]:37607 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imkEk-0007qg-W0
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 15:04:59 -0500
Received: from mail-il1-f174.google.com ([209.85.166.174]:38191)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sh@discovergy.com>) id 1imkEi-0007qS-HO
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 15:04:57 -0500
Received: by mail-il1-f174.google.com with SMTP id f5so32700534ilq.5
 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 12:04:56 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=discovergy-com.20150623.gappssmtp.com; s=20150623;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=f3gPqw//sPzxPArZaLCn5qCkkS0muRBetlNjDRv9cFw=;
 b=PtSs/uSm5aVIUCaZXg5w2ZCQsGUa5lQcpOH77ANuNNf+2piUl9tePpfnUfa+N231b4
 LN4/iPcDPDuxS0SIErtA/9cOBH/lAoggtTqhmsze0Itxtal1Q9rl/k8kp8VqGzZQpQob
 Ug/YVEttA1WULSbvtaLmx1SjBtb/oyt+GX5JZGxYNo9Ww3dc7YmUWz2t358Kk8eHku4n
 AAuP6kIkhOBQGZrqMzVe6dGCeElWKUgInkinYqpWWinD5gPCuIskIA2m6WHb/ZzWtYJO
 Bg3i9doIJ05U5BZhHYJqmkAV0+RhRClx2oYc0GcSnvtQFY0w8BnZ0HwT6ojKsICI+GOj
 Npwg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=f3gPqw//sPzxPArZaLCn5qCkkS0muRBetlNjDRv9cFw=;
 b=TTMmEUMp/2qYER9W7/OH0NJfNkVCjbQ93a4SZ7lKU/VMOUdrH4ntOlQ9Amyk0MU/v1
 3RsWIXbLs3d5Bvod84nNtN3oRc6770kVemblTN0zGh591o2vySDfEU7lFqo/SN++ugiw
 BFq+RXTDSXQdUrhRnmBlhWSeWncdp2Zwyye0U5zGxiT7oc7gzH9rck9fxd7lIUXd5zV5
 qYdiLaSYKrJKhjB0ursaf6rybkB+EzQntUFGQodz0ImJBmSAGPvVKXbP4gEKmsEpz3ZQ
 KWsWsQ+MbGam9m5Lz5hvB3M1Nk7epL+P+v4dWFHy7xqAqF3gcW/QJnnkNCea6nvBcBot
 uwzg==
X-Gm-Message-State: APjAAAUfpXpEeR5HZA5G+rM0bsnSPtOv+/39pEfYmiBjqtAWV0/L+AwN
 QTe0uc5cRMdRzLiUpQyGFaZaG5fYHUKEpK6C02kW/Q==
X-Google-Smtp-Source: APXvYqyXJkilmFlV4mF5TVSrUwHwx6bE4fHBFq6X4gjgNZsVhsoYnsx/3nyr/iAeOnF8GKBQd+dW7wxgbdEBWGsQysk=
X-Received: by 2002:a92:ce09:: with SMTP id b9mr64895585ilo.219.1577909091082; 
 Wed, 01 Jan 2020 12:04:51 -0800 (PST)
MIME-Version: 1.0
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu>
In-Reply-To: <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu>
From: Sergiu Hlihor <sh@discovergy.com>
Date: Wed, 1 Jan 2020 21:04:39 +0100
Message-ID: <CAD-3cdeARpf+yBqSf0uF00Y3z6xrRksjz-5CarqrgPiEXnH_Mw@mail.gmail.com>
Subject: Re: Improvements in Grep (Bug#32073)
To: Paul Eggert <eggert@cs.ucla.edu>
Content-Type: multipart/alternative; boundary="000000000000dcbab1059b199639"
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, Dennis Clarke <dclarke@blastwave.org>,
 Jim Meyering <jim@meyering.net>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

--000000000000dcbab1059b199639
Content-Type: text/plain; charset="UTF-8"

Paul, I have to correct you. On a production server you usually have a mix
of applications, often including databases. For databases, having a
read ahead means one IO less since usually database access patterns are
random reads. Here it is actually best to disable read ahead completely. In
fact, I have to say that it is probably best to disable read ahead
completely and let applications deal with it, either automatically, for
example by reading the optimal IO block size from the device, or in a
configurable way with defaults good enough for today's servers. If you now
configure the OS to do a read ahead that hits all HDDs, then you induce
potentially unnecessary IO load for all applications that use it, which
with HDDs is totally unacceptable. That's why it is best to be application
specific and ideally configured to use the optimal IO block size.
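
As a rough illustration of reading an optimal IO block size from the
device: on Linux the block layer publishes a per-device hint in sysfs as
queue/optimal_io_size. The sketch below is Linux-specific and is not a
portable system call. Many devices report 0 (no preference), and files on
filesystems without a backing block device have no such entry. The
function name and the sysfs_root parameter are hypothetical, added only
to keep the example self-contained and testable.

```python
import os


def optimal_io_size(path, sysfs_root="/sys"):
    """Return the sysfs optimal I/O size hint in bytes for path, or None."""
    st = os.stat(path)
    dev_dir = os.path.join(
        sysfs_root, "dev", "block",
        f"{os.major(st.st_dev)}:{os.minor(st.st_dev)}",
    )
    # A whole disk has queue/ directly; for a partition it sits one level up.
    for candidate in (dev_dir, os.path.join(dev_dir, "..")):
        hint_file = os.path.join(candidate, "queue", "optimal_io_size")
        try:
            with open(hint_file) as f:
                value = int(f.read().strip())
        except (OSError, ValueError):
            continue
        return value or None  # 0 means the device states no preference
    return None
```

A caller would still need a fallback (for example a fixed default or a
user-supplied value) whenever this returns None.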

So no, letting the OS do it is stupid.

On Wed, 1 Jan 2020 at 20:42, Paul Eggert <eggert@cs.ucla.edu> wrote:

> On 1/1/20 1:15 AM, Sergiu Hlihor wrote:
> > If you rely on OS, then
> > you are at the mercy of whatever read ahead configuration you have.
>
> Right, and whatever changes you make to the OS and its read-ahead
> configuration will work for all applications, not just for 'grep'. So,
> change the OS to do that. There shouldn't be a need to change 'grep' in
> particular (or 'cp' in particular, or 'awk' in particular, etc.).
>
> > The issue of large
> > block sizes for IO operations is widespread across all tools from Linux,
> > like rsync or cp and its only getting worse
>
> Quite right. And it would be painful to have to modify all those tools,
> and to maintain those modifications. So modify the OS instead. Scheduling
> read-ahead is really the OS's job anyway.
>

--000000000000dcbab1059b199639--


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 15:24:36 2020
Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 20:24:36 +0000
Received: from localhost ([127.0.0.1]:37619 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imkXj-0008Ir-VL
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 15:24:36 -0500
Received: from freefriends.org ([96.88.95.60]:49340)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <arnold@skeeve.com>) id 1imkXi-0008Ik-D9
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 15:24:34 -0500
X-Envelope-From: arnold@skeeve.com
Received: from freefriends.org (freefriends.org [96.88.95.60])
 by freefriends.org (8.14.7/8.14.7) with ESMTP id 001KOQ9E012802
 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); 
 Wed, 1 Jan 2020 13:24:27 -0700
Received: (from arnold@localhost)
 by freefriends.org (8.14.7/8.14.7/Submit) id 001KOQMn012801;
 Wed, 1 Jan 2020 13:24:26 -0700
From: arnold@skeeve.com
Message-Id: <202001012024.001KOQMn012801@freefriends.org>
X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to
 arnold@skeeve.com using -f
Date: Wed, 01 Jan 2020 13:24:26 -0700
To: sh@discovergy.com, arnold@skeeve.com
Subject: Re: bug#32073: Improvements in Grep (Bug#32073)
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <202001011119.001BJMYA027994@freefriends.org>
 <CAD-3cdeVbf3TVwFyj7NFd5d5_gTXugTb8_=x9aTjGE4+ufHggQ@mail.gmail.com>
In-Reply-To: <CAD-3cdeVbf3TVwFyj7NFd5d5_gTXugTb8_=x9aTjGE4+ufHggQ@mail.gmail.com>
User-Agent: Heirloom mailx 12.5 7/5/10
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Spam-Score: 0.1 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, eggert@cs.ucla.edu
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -0.9 (/)

Hi.

Sergiu Hlihor <sh@discovergy.com> wrote:

> Arnold, there is no need to write user code, it is already done in
> benchmarks. One of the standard benchmarks when testing HDDs and SSDs is
> read throughput vs block size and at different queue depths.

I think you're misunderstanding me, or I am misunderstanding you.

As the gawk maintainer, I can choose the buffer size to use every time
I issue a read(2) system call for any given input file.  Gawk currently
uses the smaller of (a) the file's size or (b) the st_blksize member of
struct stat.

If I understand you correctly, this is "not enough"; gawk (grep,
cp, etc.) should all use an optimal buffer size that depends upon the
underlying storage hardware where the file is located.

So far, so good, except for: How do I determine what that number is?
I cannot run a benchmark before opening each and every file. I don't
know of a system call that will give me that number. (If there is,
please point me to it.)

Do you just want a command line option or environment variable
that you, as the application user, can set?

If the latter, it happens that gawk will let you set AWKBUFSIZE and
it will use whatever number you supply for doing reads. (This is
even documented.)
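
The policy described above (a user-supplied override, otherwise the
smaller of the file size and st_blksize) could be sketched roughly as
follows. This is not gawk's actual code. READ_BUFSIZE is a hypothetical
variable name standing in for something like AWKBUFSIZE, and the 4096
fallback is an assumption for platforms that do not report st_blksize.

```python
import os
import stat


def choose_read_size(fd, env_var="READ_BUFSIZE", environ=os.environ):
    """Pick a read size: valid env override, else min(file size, st_blksize)."""
    override = environ.get(env_var, "")
    if override.isdigit() and int(override) > 0:
        return int(override)
    st = os.fstat(fd)
    blksize = getattr(st, "st_blksize", 0) or 4096  # assumed fallback
    # Only regular files have a meaningful size to compare against.
    if stat.S_ISREG(st.st_mode) and 0 < st.st_size < blksize:
        return st.st_size
    return blksize


def read_all(fd):
    """Read fd to EOF using the chosen size for every read call."""
    size = choose_read_size(fd)
    chunks = []
    while True:
        chunk = os.read(fd, size)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)
```

With such an override the user, who knows the storage layout, supplies
the number, and the program does not have to discover it.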

HTH,

Arnold


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 16:02:49 2020
Received: (at 32073) by debbugs.gnu.org; 1 Jan 2020 21:02:49 +0000
Received: from localhost ([127.0.0.1]:37654 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1iml8j-0000lx-3d
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 16:02:49 -0500
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:48738)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@cs.ucla.edu>) id 1iml8g-0000lf-Gi
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 16:02:47 -0500
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 1D85D160052;
 Wed,  1 Jan 2020 13:02:39 -0800 (PST)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id td9dIdw_GCN9; Wed,  1 Jan 2020 13:02:38 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 7A6AA160054;
 Wed,  1 Jan 2020 13:02:38 -0800 (PST)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id N3e1b83al3QG; Wed,  1 Jan 2020 13:02:38 -0800 (PST)
Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com
 [23.242.74.103])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 51D6C160052;
 Wed,  1 Jan 2020 13:02:38 -0800 (PST)
Subject: Re: bug#32073: Improvements in Grep (Bug#32073)
To: Sergiu Hlihor <sh@discovergy.com>
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu>
 <CAD-3cdeARpf+yBqSf0uF00Y3z6xrRksjz-5CarqrgPiEXnH_Mw@mail.gmail.com>
From: Paul Eggert <eggert@cs.ucla.edu>
Organization: UCLA Computer Science Department
Message-ID: <0c596c01-3a43-2651-7de8-50d92ae195a4@cs.ucla.edu>
Date: Wed, 1 Jan 2020 13:02:38 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.2.2
MIME-Version: 1.0
In-Reply-To: <CAD-3cdeARpf+yBqSf0uF00Y3z6xrRksjz-5CarqrgPiEXnH_Mw@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

On 1/1/20 12:04 PM, Sergiu Hlihor wrote:

> That's why the best is to be application specific

That doesn't mean that one should have to modify every application. One could
instead modify the OS so that it uses different read-ahead heuristics for
different classes of applications. This should be easier to manage than
modifying every individual application.


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 16:46:15 2020
Received: (at submit) by debbugs.gnu.org; 1 Jan 2020 21:46:15 +0000
Received: from localhost ([127.0.0.1]:37671 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imlol-0001m9-FE
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 16:46:15 -0500
Received: from lists.gnu.org ([209.51.188.17]:36204)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <pj@usa.net>) id 1imlok-0001m2-Fw
 for submit@debbugs.gnu.org; Wed, 01 Jan 2020 16:46:14 -0500
Received: from eggs.gnu.org ([2001:470:142:3::10]:41740)
 by lists.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <pj@usa.net>) id 1imloi-0006aO-S5
 for bug-grep@gnu.org; Wed, 01 Jan 2020 16:46:14 -0500
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.1 required=5.0 tests=BAYES_50,RCVD_IN_DNSWL_LOW,
 URIBL_BLOCKED autolearn=disabled version=3.3.2
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <pj@usa.net>) id 1imloh-0007Jc-Mw
 for bug-grep@gnu.org; Wed, 01 Jan 2020 16:46:12 -0500
Received: from out3-smtp.messagingengine.com ([66.111.4.27]:42797)
 by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <pj@usa.net>) id 1imloh-0007J0-EU
 for bug-grep@gnu.org; Wed, 01 Jan 2020 16:46:11 -0500
Received: from compute1.internal (compute1.nyi.internal [10.202.2.41])
 by mailout.nyi.internal (Postfix) with ESMTP id BB67E2234B
 for <bug-grep@gnu.org>; Wed,  1 Jan 2020 16:46:10 -0500 (EST)
Received: from imap34 ([10.202.2.84])
 by compute1.internal (MEProxy); Wed, 01 Jan 2020 16:46:10 -0500
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=
 messagingengine.com; h=content-type:date:from:in-reply-to
 :message-id:mime-version:references:subject:to:x-me-proxy
 :x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm1; bh=1lZVA+
 i/aNbISUaTQxnlsayXO9m5ai4v70uzaoJnjf8=; b=h4D19IOsSFh+M6g73+sQnr
 QJG90tT+P2IiguwhZhb1Ft+nsk5aE/8bGTNpL3vOcKJspn2deBc/jEbiLX9Gp2qe
 DOzYXhVUH6OGVvHnIGulN9GUguvgqNfbt9UC5vqdkr6jLuXK9RyT6pyTrD38acU6
 RmmdYhMOVi6F89BVZApfBhtsbiePo3ERZfNauGOEeGqpE5FQ6B7Rg6J42akfU7/J
 w3Fh5UZ2zPeBILfSh56hlaY69HAGwaI0GFb8iwZIrXhs6eTLJg1lyipZwV1jCn3i
 Y9KKzGRr89E2NV6ZnEELGqkL8mOJr0iUFhtq1e3AiDeHdd/SEiaFHOkJrvmyzuOQ
 ==
X-ME-Sender: <xms:IhMNXpWnRJ8_D0d6Q75r6NYCiOBc9wRyE33jvNlp5eppKTmzDWcT6A>
X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedufedrvdefledgudehvdcutefuodetggdotefrod
 ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfgh
 necuuegrihhlohhuthemuceftddtnecunecujfgurhepofgfggfkjghffffhvffutgesth
 dtredtreertdenucfhrhhomhepfdfrrghulhculfgrtghkshhonhdfuceophhjsehushgr
 rdhnvghtqeenucfrrghrrghmpehmrghilhhfrhhomhepphhjsehushgrrdhnvghtnecuve
 hluhhsthgvrhfuihiivgeptd
X-ME-Proxy: <xmx:IhMNXhdD0kzmSybXt2dzC1gQokhSxjkTsH3J3riofFD4b23xIQrFBg>
 <xmx:IhMNXkyatBxRUL1f-XHhfNs5ux4dRQa1ZXLldBSWuZ-OEzSQZEBPdQ>
 <xmx:IhMNXkGC57X5lY8sT2gC5JPxg_gYNc_apfsbQvWFBF2Xunw9FkhrtQ>
 <xmx:IhMNXijfb0OrKNYVO-2PbTaHwlcr35ywIVbCFoifMjmbUVg0Xcisaw>
Received: by mailuser.nyi.internal (Postfix, from userid 501)
 id 3B42A1460061; Wed,  1 Jan 2020 16:46:10 -0500 (EST)
X-Mailer: MessagingEngine.com Webmail Interface
User-Agent: Cyrus-JMAP/3.1.7-694-gd5bab98-fmstable-20191218v1
Mime-Version: 1.0
Message-Id: <a0744545-50e1-4e11-b200-2fac405c7260@www.fastmail.com>
In-Reply-To: <CAD-3cdeARpf+yBqSf0uF00Y3z6xrRksjz-5CarqrgPiEXnH_Mw@mail.gmail.com>
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu>
 <CAD-3cdeARpf+yBqSf0uF00Y3z6xrRksjz-5CarqrgPiEXnH_Mw@mail.gmail.com>
Date: Wed, 01 Jan 2020 15:45:54 -0600
From: "Paul Jackson" <pj@usa.net>
To: bug-grep@gnu.org
Subject: Re: bug#32073: Improvements in Grep (Bug#32073)
Content-Type: text/plain
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
 [fuzzy]
X-Received-From: 66.111.4.27
X-Spam-Score: -1.6 (-)
X-Debbugs-Envelope-To: submit
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -2.6 (--)

>From my old Unix fart view point, Paul (the other Paul)
is herding a hundred GNU cats, small command line utilities,
many of which date their origins back to the 1970's, many of
which have over the years grown their own internal i/o routines
with specific performance specializations, but few of which
have much in the way of user-customizable i/o blocking and
read-ahead settings.

Except for the last decade, those commands spent almost
their entire lives running off spinning rust platters, which
grew (immensely) in size over the years, but which did not
change much in other performance characteristics. 

Those commands are in general not well suited to adapting to
provide maximally optimal performance across the recent
generation of storage devices, with their much more varied
performance characteristics.

I'm guessing that Sergiu has some specific needs that grep seems
to meet, except that grep (like its hundred cat siblings)
lacks the tunable i/o characteristics needed to get maximum
performance across a rapidly evolving variety of these more
recent kinds of storage.

What I've done in situations such as I suspect Sergiu finds
himself in is to code up a custom utility, that met my specific
needs, when I had higher performance demands, while
continuing to make extensive use of the general purpose
classic Unix/Linux command line utilities that Paul E. now
herds.

I can't imagine that it would make sense to attempt to recode
a hundred classic GNU utilities to each be intelligently adaptable
goats/pigs/cats/dogs/cows/bison/... depending on the i/o
terrain they were running on.

Many many thanks to Paul E. for herding these cats all these
many years.  I hope my weird comments do not cause him even
the slightest distress.

(The word "cat" above refers to four legged felines, not to the
concatenate command line utility.)

-- 
                Paul Jackson
                pj@usa.net


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 19:51:20 2020
Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 00:51:21 +0000
Received: from localhost ([127.0.0.1]:37827 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imohs-0003uy-Ks
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 19:51:20 -0500
Received: from mail-wr1-f67.google.com ([209.85.221.67]:38805)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <meyering@gmail.com>) id 1imohq-0003ul-0s
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 19:51:18 -0500
Received: by mail-wr1-f67.google.com with SMTP id y17so37907645wrh.5
 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 16:51:17 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc:content-transfer-encoding;
 bh=+G2u1+TYKprGWirYES7YqULTKruNQnNn0zOaJRjBkZQ=;
 b=DJpehv2iK45sVa2pCWwsjuhscGIgi7Vi+JiDPqoKcUIob746N1bEwKed6Zz4uBTo9J
 79eVt33udjV2xpDpaBAbUI0+JClV5SM+w5iEsbV0baXoAD+PkggvHJlJyD4hIVd2kP4O
 O0dkvRo5s161Ji2xmGe4jjxgLfiZs1Tlbt1ZM4yEdEJ/XvBYVJa1fMgNdtC4bHDmtth8
 EcfBurLtE+kUPbjWpdJJ223Xz9gRhcVjLod4RgxiZCFORQDSHSQmGkwjHQGytLv2NjD+
 xotEpLdrbGq5KeOyV4w0qtm/f0wbzXmHQE3rgggERy8/QH+o63Pu6RGSU76ILOYZ8Enk
 GflA==
X-Gm-Message-State: APjAAAUvGuD5mRkDz7QdxDJYD/M5KrwuoEZQvDe9YthH/7iwGE96hNvJ
 zm+pZHTKXMSPjejYb6YEoyW91XAaUVX51yl+rUE=
X-Google-Smtp-Source: APXvYqx0LgCp9iag1ABv/x0dAx+wgdhnP1u3ZR3gJKeHrLjJoDlAoey26zheesXxnwVnWVakWy8OeBIjdeOkucRiy2M=
X-Received: by 2002:a5d:670a:: with SMTP id o10mr82667154wru.227.1577926272259; 
 Wed, 01 Jan 2020 16:51:12 -0800 (PST)
MIME-Version: 1.0
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu>
 <CAD-3cdeARpf+yBqSf0uF00Y3z6xrRksjz-5CarqrgPiEXnH_Mw@mail.gmail.com>
In-Reply-To: <CAD-3cdeARpf+yBqSf0uF00Y3z6xrRksjz-5CarqrgPiEXnH_Mw@mail.gmail.com>
From: Jim Meyering <jim@meyering.net>
Date: Wed, 1 Jan 2020 16:51:00 -0800
Message-ID: <CA+8g5KEEqcTjV3k+50y4SNhUrrhwO4ACtUuM5PDeRHaaBRAKBg@mail.gmail.com>
Subject: Re: Improvements in Grep (Bug#32073)
To: Sergiu Hlihor <sh@discovergy.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.5 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, Paul Eggert <eggert@cs.ucla.edu>,
 Dennis Clarke <dclarke@blastwave.org>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -0.5 (/)

On Wed, Jan 1, 2020 at 12:04 PM Sergiu Hlihor <sh@discovergy.com> wrote:
> Paul, I have to correct you. On a production server you have usually a mi=
x of applications many times including databases. For databases, having a r=
ead ahead means one IO less since usually database access patterns are rand=
om reads. Here actually best is to disable completely read ahead. In fact, =
I do have to say that probably best is to disable completely read ahead and=
 let applications deal with it, either in an automatic fashion, like readin=
g the optimal IO block size from device  or in a configurable way with defa=
ults good enough for today's servers. If you now configure the OS to do a r=
ead ahead hitting all HDDs then you induce potentially unnecessary IO load =
for all applications which use it, which when having HDDs is totally unacce=
ptable. That's why the best is to be application specific and ideally confi=
gured to use optimal IO block size.
>
> So no, letting OS to do it is stupid.
>
> On Wed, 1 Jan 2020 at 20:42, Paul Eggert <eggert@cs.ucla.edu> wrote:
>>
>> On 1/1/20 1:15 AM, Sergiu Hlihor wrote:
>> > If you rely on OS, then
>> > you are at the mercy of whatever read ahead configuration you have.
>>
>> Right, and whatever changes you make to the OS and its read-ahead config=
uration
>> will work for all applications, not just for 'grep'. So, change the OS t=
o do
>> that. There shouldn't be a need to change 'grep' in particular (or 'cp' =
in
>> particular, or 'awk' in particular, etc.).
>>
>> > The issue of large
>> > block sizes for IO operations is widespread across all tools from Linu=
x,
>> > like rsync or cp and its only getting worse
>>
>> Quite right. And it would be painful to have to modify all those tools, =
and to
>> maintain those modifications. So modify the OS instead. Scheduling read-=
ahead is
>> really the OS's job anyway.

Hi Sergiu,

If you would like to help make grep use larger buffer sizes, please
run and report benchmarks measuring how much of a difference it would
make, at least for your hardware. Here are some of the tests I ran to
justify raising it from ~32k to ~96k:
https://lists.gnu.org/archive/html/grep-devel/2018-10/msg00002.html


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 20:04:17 2020
Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 01:04:17 +0000
Received: from localhost ([127.0.0.1]:37835 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imouO-0004E8-SV
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 20:04:17 -0500
Received: from mail-io1-f47.google.com ([209.85.166.47]:44138)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sh@discovergy.com>) id 1imouM-0004Dv-Q6
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 20:04:15 -0500
Received: by mail-io1-f47.google.com with SMTP id b10so36954283iof.11
 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 17:04:14 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=discovergy-com.20150623.gappssmtp.com; s=20150623;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=uG0zcpsFJMb40VAoaqgG4yBxN9fQ8HWatjcq1WBdwuI=;
 b=MrT5OWrM9nJE49cTUjxs8k/CxT7nbY4ZeVQEGSTjEnMFfbQgATGf6icSTcK75Z88No
 nNl+qTwFLLBjZattlCjmMwjNt8ZavrfHuQJQJUOMBpTmDoB6y+kw/Hp3G5lBJ5zuSawo
 EgkmrtKl6uGtcn+GLpXN0/U+qbL7M2RfFYL30m0JYOBRix5Yt95amdM6LpKCvddxzao8
 nXZRyxNjdFAEBlTNx2e9ItM8eCid8K/Yu+gbtEl6aMmyh5FuwU7GaMLAjGGObUIGWqkc
 jiMxWWi+Zp/GIXeZKmkeOuZwGz8xt9iuBOC6w/J19PbEJagxok0z8tZD2+9n/HZuWW9E
 HHJw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=uG0zcpsFJMb40VAoaqgG4yBxN9fQ8HWatjcq1WBdwuI=;
 b=nUyERW3t+ZcnzUWltGBTcmkQR4kKsjbsyF320UpOEb5933zi5sJoVEw7z0JDP1rDfV
 FTeC7XyNHNGI7zX8rQnDkOhKs+tPCFRX4SomGFhkhIFuuEJtT4/IQpPGFpRIsuicQifn
 +hPRNqytX/ulsOZJL5Le0w8fTXV03dHuosziGZqMBPDJsG824Czh51KM0ijQf+VaEYXY
 3QsP9zH3EufrihVbr0jprdN/b43SMG7JsgGJUa1NL1pDcGpUJ1z0KAlrEiFptwzDtaTQ
 JhYwBaCQcWUuTW97ch2C4GPYKSXlMCHKoPoYuufa8T2zMwi/+UMknLMiip3u+qcHP9sT
 QQLA==
X-Gm-Message-State: APjAAAUBj/1bVqH9LaAq+VlcVHbDLEec28q59Knq8uW/Ze9lRNRmT/M+
 p+Zbru5g2e8Y9UmFhGS3x4ih5Z7nQhhWzjs0qBtrMQ==
X-Google-Smtp-Source: APXvYqxTMks5Ajdq0T7iEzgJjZoWJOPNuJpms5hMMgKkp0KY1CrYSQcHExV4liYzv2ybrV18DBFtVhqQwDumVAvR104=
X-Received: by 2002:a5e:8505:: with SMTP id i5mr50080878ioj.158.1577927049287; 
 Wed, 01 Jan 2020 17:04:09 -0800 (PST)
MIME-Version: 1.0
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu>
 <CAD-3cdeARpf+yBqSf0uF00Y3z6xrRksjz-5CarqrgPiEXnH_Mw@mail.gmail.com>
 <CA+8g5KEEqcTjV3k+50y4SNhUrrhwO4ACtUuM5PDeRHaaBRAKBg@mail.gmail.com>
In-Reply-To: <CA+8g5KEEqcTjV3k+50y4SNhUrrhwO4ACtUuM5PDeRHaaBRAKBg@mail.gmail.com>
From: Sergiu Hlihor <sh@discovergy.com>
Date: Thu, 2 Jan 2020 02:03:58 +0100
Message-ID: <CAD-3cddJmwBTqozvJcJerc8tRXcv0-2Pf0aePe2yhkJaSOY+vA@mail.gmail.com>
Subject: Re: Improvements in Grep (Bug#32073)
To: Jim Meyering <jim@meyering.net>
Content-Type: multipart/alternative; boundary="000000000000412dba059b1dc5f9"
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, Paul Eggert <eggert@cs.ucla.edu>,
 Dennis Clarke <dclarke@blastwave.org>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

--000000000000412dba059b1dc5f9
Content-Type: text/plain; charset="UTF-8"

Hi Jim,
The system for which this hurts me the most is an Ubuntu 14.04 where I'd
need to run it as a separate binary. As I'm not familiar with the way it's
built, are there any guidelines on how to build it from source? I'd happily
build it with ever larger block sizes and test.

On Thu, 2 Jan 2020 at 01:51, Jim Meyering <jim@meyering.net> wrote:

> On Wed, Jan 1, 2020 at 12:04 PM Sergiu Hlihor <sh@discovergy.com> wrote:
> > Paul, I have to correct you. On a production server you have usually a
> mix of applications many times including databases. For databases, having a
> read ahead means one IO less since usually database access patterns are
> random reads. Here actually best is to disable completely read ahead. In
> fact, I do have to say that probably best is to disable completely read
> ahead and let applications deal with it, either in an automatic fashion,
> like reading the optimal IO block size from device  or in a configurable
> way with defaults good enough for today's servers. If you now configure the
> OS to do a read ahead hitting all HDDs then you induce potentially
> unnecessary IO load for all applications which use it, which when having
> HDDs is totally unacceptable. That's why the best is to be application
> specific and ideally configured to use optimal IO block size.
> >
> > So no, letting OS to do it is stupid.
> >
> > On Wed, 1 Jan 2020 at 20:42, Paul Eggert <eggert@cs.ucla.edu> wrote:
> >>
> >> On 1/1/20 1:15 AM, Sergiu Hlihor wrote:
> >> > If you rely on OS, then
> >> > you are at the mercy of whatever read ahead configuration you have.
> >>
> >> Right, and whatever changes you make to the OS and its read-ahead
> configuration
> >> will work for all applications, not just for 'grep'. So, change the OS
> to do
> >> that. There shouldn't be a need to change 'grep' in particular (or 'cp'
> in
> >> particular, or 'awk' in particular, etc.).
> >>
> >> > The issue of large
> >> > block sizes for IO operations is widespread across all tools from
> Linux,
> >> > like rsync or cp and its only getting worse
> >>
> >> Quite right. And it would be painful to have to modify all those tools,
> and to
> >> maintain those modifications. So modify the OS instead. Scheduling
> read-ahead is
> >> really the OS's job anyway.
>
> Hi Sergiu,
>
> If you would like to help make grep use larger buffer sizes, please
> run and report benchmarks measuring how much of a difference it would
> make, at least for your hardware. Here are some of the tests I ran to
> justify raising it from ~32k to ~96k:
> https://lists.gnu.org/archive/html/grep-devel/2018-10/msg00002.html
>

--000000000000412dba059b1dc5f9
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div dir=3D"ltr"><div>Hi Jim,</div><div>The system for whi=
ch this hurts me the most is an Ubuntu 14.04 where I&#39;d need to run it a=
s a separate binary. As I&#39;m not familiar with the way it&#39;s built, are=
 there any guidelines on how to build it from source? I&#39;d happily build=
 it with ever larger block sizes and test.</div></div><br><div class=3D"gma=
il_quote"><div dir=3D"ltr" class=3D"gmail_attr">On Thu, 2 Jan 2020 at 01:51=
, Jim Meyering &lt;<a href=3D"mailto:jim@meyering.net">jim@meyering.net</a>=
&gt; wrote:<br></div><blockquote class=3D"gmail_quote" style=3D"margin:0px =
0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On W=
ed, Jan 1, 2020 at 12:04 PM Sergiu Hlihor &lt;<a href=3D"mailto:sh@discover=
gy.com" target=3D"_blank">sh@discovergy.com</a>&gt; wrote:<br>
&gt; Paul, I have to correct you. On a production server you have usually a=
 mix of applications many times including databases. For databases, having =
a read ahead means one IO less since usually database access patterns are r=
andom reads. Here actually best is to disable completely read ahead. In fac=
t, I do have to say that probably best is to disable completely read ahead =
and let applications deal with it, either in an automatic fashion, like rea=
ding the optimal IO block size from device=C2=A0 or in a configurable way w=
ith defaults good enough for today&#39;s servers. If you now configure the =
OS to do a read ahead hitting all HDDs then you induce potentially unnecess=
ary IO load for all applications which use it, which when having HDDs is to=
tally unacceptable. That&#39;s why the best is to be application specific a=
nd ideally configured to use optimal IO block size.<br>
&gt;<br>
&gt; So no, letting OS to do it is stupid.<br>
&gt;<br>
&gt; On Wed, 1 Jan 2020 at 20:42, Paul Eggert &lt;<a href=3D"mailto:eggert@=
cs.ucla.edu" target=3D"_blank">eggert@cs.ucla.edu</a>&gt; wrote:<br>
&gt;&gt;<br>
&gt;&gt; On 1/1/20 1:15 AM, Sergiu Hlihor wrote:<br>
&gt;&gt; &gt; If you rely on OS, then<br>
&gt;&gt; &gt; you are at the mercy of whatever read ahead configuration you=
 have.<br>
&gt;&gt;<br>
&gt;&gt; Right, and whatever changes you make to the OS and its read-ahead =
configuration<br>
&gt;&gt; will work for all applications, not just for &#39;grep&#39;. So, c=
hange the OS to do<br>
&gt;&gt; that. There shouldn&#39;t be a need to change &#39;grep&#39; in pa=
rticular (or &#39;cp&#39; in<br>
&gt;&gt; particular, or &#39;awk&#39; in particular, etc.).<br>
&gt;&gt;<br>
&gt;&gt; &gt; The issue of large<br>
&gt;&gt; &gt; block sizes for IO operations is widespread across all tools =
from Linux,<br>
&gt;&gt; &gt; like rsync or cp and its only getting worse<br>
&gt;&gt;<br>
&gt;&gt; Quite right. And it would be painful to have to modify all those t=
ools, and to<br>
&gt;&gt; maintain those modifications. So modify the OS instead. Scheduling=
 read-ahead is<br>
&gt;&gt; really the OS&#39;s job anyway.<br>
<br>
Hi Sergiu,<br>
<br>
If you would like to help make grep use larger buffer sizes, please<br>
run and report benchmarks measuring how much of a difference it would<br>
make, at least for your hardware. Here are some of the tests I ran to<br>
justify raising it from ~32k to ~96k:<br>
<a href=3D"https://lists.gnu.org/archive/html/grep-devel/2018-10/msg00002.h=
tml" rel=3D"noreferrer" target=3D"_blank">https://lists.gnu.org/archive/htm=
l/grep-devel/2018-10/msg00002.html</a><br>
</blockquote></div></div>

--000000000000412dba059b1dc5f9--


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 20:28:36 2020
Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 01:28:37 +0000
Received: from localhost ([127.0.0.1]:37880 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1impHw-0006mb-MV
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 20:28:36 -0500
Received: from mail-wr1-f47.google.com ([209.85.221.47]:36433)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <meyering@gmail.com>) id 1impHu-0006mO-T9
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 20:28:35 -0500
Received: by mail-wr1-f47.google.com with SMTP id z3so37948657wru.3
 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 17:28:34 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc:content-transfer-encoding;
 bh=eU99TcSbcI2iR4jYDl47FXSwzC2QYmBjYtFB7wsDhlY=;
 b=bdN0ieBVyX4L9runq/UiI8jXpfthpTXNvsdZ2TI/tLLgSaEQS4J2fFufVfjOb0vpUA
 gYjg/rDRm0YwzSt7f+Hl7gqU5IFYQDqbMVAo/K6lUaF7i1Z4kxxq3ycNb+9LqYgWiniQ
 GlNwmffzXafOBCpH0fWP6TcQXRdMkSZqs43t8n6tY5eog95v3G8m6di57iszM7iG6pOE
 +mO8bRCnFo+q3SYmZ2jStV0+8ffaQ2Zxz73deIuiqEHvtATRWE4jDeJsnfI80saRircX
 TTEhGdVuwFWqtMa1fzyzEUuEABE7a73qAJDFPk2NB9ovcgwHfo/etGYtfsN7KSPnARO2
 YHjA==
X-Gm-Message-State: APjAAAXIIyeRpJMZAZQvcCyCRXpt7S6SX9b1QIlnC/7btxPsXWgwFFle
 zTCQ5tYxN0Px15Iw5wpjDnr/z8/pJXoWwOid6Bs=
X-Google-Smtp-Source: APXvYqySTEjnPE8tRJ0a5sSTVSgL2avrAP8ymb5OUQxtAPFVBAkSIbmHnq3zRcia95N5M7QjqAOZAcPcDQJzboNRLyE=
X-Received: by 2002:adf:8b4f:: with SMTP id v15mr50952033wra.231.1577928509167; 
 Wed, 01 Jan 2020 17:28:29 -0800 (PST)
MIME-Version: 1.0
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <299d76d3-09d2-8c4d-3b1f-0b2205c03db7@cs.ucla.edu>
 <CAD-3cdeARpf+yBqSf0uF00Y3z6xrRksjz-5CarqrgPiEXnH_Mw@mail.gmail.com>
 <CA+8g5KEEqcTjV3k+50y4SNhUrrhwO4ACtUuM5PDeRHaaBRAKBg@mail.gmail.com>
 <CAD-3cddJmwBTqozvJcJerc8tRXcv0-2Pf0aePe2yhkJaSOY+vA@mail.gmail.com>
In-Reply-To: <CAD-3cddJmwBTqozvJcJerc8tRXcv0-2Pf0aePe2yhkJaSOY+vA@mail.gmail.com>
From: Jim Meyering <jim@meyering.net>
Date: Wed, 1 Jan 2020 17:28:17 -0800
Message-ID: <CA+8g5KGUswVjrpiEF-rOXcf9Rgcd6PcrkhByYBYhnneeVB3sSQ@mail.gmail.com>
Subject: Re: Improvements in Grep (Bug#32073)
To: Sergiu Hlihor <sh@discovergy.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.5 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, Paul Eggert <eggert@cs.ucla.edu>,
 Dennis Clarke <dclarke@blastwave.org>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -0.5 (/)

On Wed, Jan 1, 2020 at 5:04 PM Sergiu Hlihor <sh@discovergy.com> wrote:
> The system for which this hurts me the most is an Ubuntu 14.04 where I'd =
need to run it as a separate binary. As I'm not familiar with the way it's =
built, is there any guidelines of how to build it from sources? I'd happy b=
uild it with ever larger block sizes and test.

Something like the following should work: (if you want to be more
careful than most, also download the .sig file,
https://meyering.net/grep/grep-3.3.49-3f11.tar.xz.sig, and use that to
verify the .xz file is the same one I signed -- do that before running
./configure)

wget https://meyering.net/grep/grep-3.3.49-3f11.tar.xz
xz -dc grep-3.3.49-3f11.tar.xz|tar xf -
cd grep-3.3.49-3f11
./configure && make


From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 01 23:21:01 2020
Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 04:21:01 +0000
Received: from localhost ([127.0.0.1]:37955 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imryn-0004QT-39
	for submit@debbugs.gnu.org; Wed, 01 Jan 2020 23:21:01 -0500
Received: from mail-io1-f43.google.com ([209.85.166.43]:43901)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sh@discovergy.com>) id 1imryk-0004QG-PH
 for 32073@debbugs.gnu.org; Wed, 01 Jan 2020 23:20:59 -0500
Received: by mail-io1-f43.google.com with SMTP id n21so35648420ioo.10
 for <32073@debbugs.gnu.org>; Wed, 01 Jan 2020 20:20:58 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=discovergy-com.20150623.gappssmtp.com; s=20150623;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=0u51IBPkym3O6ykR49I8bLd9JMefMP+5ICUxZiip4Tw=;
 b=v+WVgp+D4fgtEt10xknIenXIjr60rQjOvNOoiMYa6yEqgKEv3LTii3ID3nYAX3CGX0
 k8yXPHyLCTHtoI4CmpxGVVy2IacDN8saeFEzX9TSab+FLKThCv337qlpjyTAmEj0Oj1D
 MVv+4/z91VO48qgwttZsg8P60cOUAMHA+43+XU3DYIkxn2SDfLWj7t34uaPvfjViIu3y
 N0/TWIy+6oIyFmtceAmjaRoFN9uAzlG+Itf+iGpRoIeJgmWFuRYl4p02qklcBnYTdCRP
 wu8WVmPPuu92uTusgcLDS+8ksi7zbvSPT4SfP8vL9T4GDnUQUA/kMZK1Cq0/AyRjkZwc
 Fasw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=0u51IBPkym3O6ykR49I8bLd9JMefMP+5ICUxZiip4Tw=;
 b=LYMTHw6kc5wq90ze4bPkbQdQeGfsddaBz0PON3DKg+Rl5CaZ83CBypwJAG1dt8DPMW
 dpqMLeml7BP2Ljwio/I3dNEwqYlfED5jqqEmuCBmFyN61e0vuHgDljJBlj89+exmzAW0
 j0PDPtZ38jrEZKFT/Dk7I8aCoHkg71SopYjH0rTrm72AQzLZtr3Zwy5VVa7p3ZUFN0F0
 rN0CCIJApguXuhQQwAjs3RSYEc87gsqyzwsx9a4mIVFXxAdLxXM8gQt9KUIeLQ89toiw
 fEACZFHGkWC9bv+BgntjKxIluPdtN1wOvoA6ZwOsKQGCTdzS7+AxrjYYiLciBuZp/W+s
 2zFA==
X-Gm-Message-State: APjAAAUt2XKOUDzF3DzPKfPJ+3DLn8RoQSk+Hd9nDjYAAwCHybQuPNNC
 IPN48at9KM3PIkRWT5jJmm50aPJgT0GFECOnbYsYBw==
X-Google-Smtp-Source: APXvYqwA2MW4Qn+8n/GFNc3PI8pxJNjrUq6DpUNgUnJtlCyCEKbNjz4Mvydu5U2VNiU9EsWifZhf25+0rS7Z+lAby38=
X-Received: by 2002:a6b:fe0f:: with SMTP id x15mr50887247ioh.219.1577938853127; 
 Wed, 01 Jan 2020 20:20:53 -0800 (PST)
MIME-Version: 1.0
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <202001011119.001BJMYA027994@freefriends.org>
 <CAD-3cdeVbf3TVwFyj7NFd5d5_gTXugTb8_=x9aTjGE4+ufHggQ@mail.gmail.com>
 <202001012024.001KOQMn012801@freefriends.org>
In-Reply-To: <202001012024.001KOQMn012801@freefriends.org>
From: Sergiu Hlihor <sh@discovergy.com>
Date: Thu, 2 Jan 2020 05:20:32 +0100
Message-ID: <CAD-3cdcn1sf0qv1=nAfg1wLR=VeK=iZXv6_OY4PPaAxjavC6Yg@mail.gmail.com>
Subject: Re: bug#32073: Improvements in Grep (Bug#32073)
To: arnold@skeeve.com
Content-Type: multipart/alternative; boundary="000000000000d1810d059b208482"
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, Paul Eggert <eggert@cs.ucla.edu>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

--000000000000d1810d059b208482
Content-Type: text/plain; charset="UTF-8"

Hi Arnold,
If AWKBUFSIZE translates to the disk IO request size, then it is already
what is needed. However, it's a little annoying.

Regarding optimal settings, the benchmark actually tells you what is
optimal. Let's assume grep or any other tool can process 3GB/s in memory.
If your device can serve 5GB/s, then you can saturate the CPU. If, however,
the device needs a block size of at least X to reach its maximum
throughput, then that's what you have to use. Plain and simple. And as I
said, once you get into GB/s territory, reads at the application level have
to be asynchronous. If you look at benchmarking tools like ATTO, the graphs
clearly show the scaling for SSDs. It just so happens that the value that
is good for SSDs (minimum 512KB) also benefits HDD RAID arrays with strip
sizes smaller than 512KB. With HDD RAID arrays it unfortunately does get
complicated, because you have to know the number of disks and the strip
size. I, for example, always use tune2fs to set those parameters when
formatting the partition. This could just as well be a configurable OS
parameter per drive, so that the right value is used based on the location
of the file. But I have to admit that this would add exponential complexity
with diminishing returns versus just setting a buffer size of 1MB (which
will cover both current and future SSDs).
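
To make the "throughput vs block size" measurement concrete, here is a
minimal sketch in Python. It is not taken from grep, gawk, or any benchmark
tool; the function names are made up for illustration. It also does not
bypass the page cache, so on a warm cache it measures memory copies rather
than the device. Treat the numbers as a rough guide only.

```python
import os
import tempfile
import time


def read_throughput(path, block_size):
    """Read the whole file sequentially, block_size bytes per request.

    Returns (bytes_read, elapsed_seconds). The page cache is NOT bypassed.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 yields a raw file object, so each read() is issued to the
    # OS directly instead of going through Python's own buffer.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def sweep(path, block_sizes):
    """Return {block_size: MiB/s} for each candidate block size."""
    results = {}
    for bs in block_sizes:
        nbytes, secs = read_throughput(path, bs)
        results[bs] = nbytes / (1024 * 1024) / max(secs, 1e-9)
    return results


if __name__ == "__main__":
    # Hypothetical demo: a small temporary file stands in for a real log.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        sizes = [4096, 65536, 524288, 1048576]
        for bs, rate in sweep(tmp.name, sizes).items():
            print(f"{bs:>8} bytes/read: {rate:10.1f} MiB/s")
    finally:
        os.unlink(tmp.name)
```

On a real device one would run this against a file much larger than RAM
(or drop caches first) and pick the smallest block size at which the rate
stops improving.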

Also, I'm not too fond of heuristics or any other smartness at the IO level
in the Linux IO stack. I work with large databases (as a user) and have
discussed the Linux IO stack with database developers. The common opinion
is that the Linux IO stack has gotten out of control and nobody actually
has a good overview anymore. And I tend to agree. Linux needs an IO stack
that is as lean as possible and lets the applications decide what to do,
since at the application level you know your usage pattern. I have already
had to fine-tune the database because of it.

On Wed, 1 Jan 2020 at 21:24, <arnold@skeeve.com> wrote:

> Hi.
>
> Sergiu Hlihor <sh@discovergy.com> wrote:
>
> > Arnold, there is no need to write user code, it is already done in
> > benchmarks. One of the standard benchmarks when testing HDDs and SSDs is
> > read throughput vs block size and at different queue depths.
>
> I think you're misunderstanding me, or I am misunderstanding you.
>
> As the gawk maintainer, I can choose the buffer size to use every time
> I issue a read(2) system call for any given input file.  Gawk currently
> uses the smaller of (a) the file's size or (b) the st_blksize member of
> struct stat.
>
> If I understand you correctly, this is "not enough"; gawk (grep,
> cp, etc.) should all use an optimal buffer size that depends upon the
> underlying storage hardware where the file is located.
>
> So far, so good, except for: How do I determine what that number is?
> I cannot run a benchmark before opening each and every file. I don't
> know of a system call that will give me that number. (If there is,
> please point me to it.)
>
> Do you just want a command line option or environment variable
> that you, as the application user, can set?
>
> If the latter, it happens that gawk will let you set AWKBUFSIZE and
> it will use whatever number you supply for doing reads. (This is
> even documented.)
>
> HTH,
>
> Arnold
>
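
The buffer-size policy described in the quoted message (the smaller of the
file's size and st_blksize, unless the user supplies a number) can be
sketched as follows. This is an illustration in Python, not gawk's actual C
code; the function name, the fallback of 4096 bytes, and the handling of
empty files and invalid values are assumptions.

```python
import os


def choose_read_size(path, environ=None):
    """Pick a read(2) request size for path.

    Policy (as described in the thread): use the number in AWKBUFSIZE if
    the user set one, else the smaller of the file size and st_blksize.
    """
    environ = os.environ if environ is None else environ
    st = os.stat(path)

    # User-supplied override. Ignoring invalid values is an assumption of
    # this sketch, not documented gawk behaviour.
    override = environ.get("AWKBUFSIZE", "")
    if override.isascii() and override.isdigit() and int(override) > 0:
        return int(override)

    # st_blksize is not available on every platform; 4096 is an assumed
    # fallback for this sketch.
    blksize = getattr(st, "st_blksize", 0) or 4096

    # Assumption: for empty or special files, fall back to the block size.
    if st.st_size > 0:
        return min(st.st_size, blksize)
    return blksize
```

With this policy a multi-gigabyte log file is still read in st_blksize
chunks (often 4096 bytes) unless the override is set, which is the
behaviour being questioned in this thread.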

--000000000000d1810d059b208482--


From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 02 02:20:50 2020
Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 07:20:50 +0000
Received: from localhost ([127.0.0.1]:38029 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1imumo-0000fn-6G
	for submit@debbugs.gnu.org; Thu, 02 Jan 2020 02:20:50 -0500
Received: from freefriends.org ([96.88.95.60]:53654)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <arnold@skeeve.com>) id 1imumm-0000fa-K1
 for 32073@debbugs.gnu.org; Thu, 02 Jan 2020 02:20:49 -0500
X-Envelope-From: arnold@skeeve.com
Received: from freefriends.org (freefriends.org [96.88.95.60])
 by freefriends.org (8.14.7/8.14.7) with ESMTP id 0027KfTt032105
 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); 
 Thu, 2 Jan 2020 00:20:42 -0700
Received: (from arnold@localhost)
 by freefriends.org (8.14.7/8.14.7/Submit) id 0027Kf58032104;
 Thu, 2 Jan 2020 00:20:41 -0700
From: arnold@skeeve.com
Message-Id: <202001020720.0027Kf58032104@freefriends.org>
X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to
 arnold@skeeve.com using -f
Date: Thu, 02 Jan 2020 00:20:41 -0700
To: sh@discovergy.com, arnold@skeeve.com
Subject: Re: bug#32073: Improvements in Grep (Bug#32073)
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <202001011119.001BJMYA027994@freefriends.org>
 <CAD-3cdeVbf3TVwFyj7NFd5d5_gTXugTb8_=x9aTjGE4+ufHggQ@mail.gmail.com>
 <202001012024.001KOQMn012801@freefriends.org>
 <CAD-3cdcn1sf0qv1=nAfg1wLR=VeK=iZXv6_OY4PPaAxjavC6Yg@mail.gmail.com>
In-Reply-To: <CAD-3cdcn1sf0qv1=nAfg1wLR=VeK=iZXv6_OY4PPaAxjavC6Yg@mail.gmail.com>
User-Agent: Heirloom mailx 12.5 7/5/10
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Spam-Score: 0.1 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, eggert@cs.ucla.edu
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -0.9 (/)

Hi.

Sergiu Hlihor <sh@discovergy.com> wrote:

> Hi Arnold,
> If AWKBUFSIZE translates to the disk IO request size, then it is already
> what is needed. However, it's a little annoying.

How would you make it less annoying?

Thanks,

Arnold


From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 02 10:32:18 2020
Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 15:32:18 +0000
Received: from localhost ([127.0.0.1]:39931 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1in2SQ-0000DQ-EE
	for submit@debbugs.gnu.org; Thu, 02 Jan 2020 10:32:18 -0500
Received: from mail-io1-f50.google.com ([209.85.166.50]:36201)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sh@discovergy.com>) id 1in2SN-0000DC-3G
 for 32073@debbugs.gnu.org; Thu, 02 Jan 2020 10:32:16 -0500
Received: by mail-io1-f50.google.com with SMTP id r13so28557954ioa.3
 for <32073@debbugs.gnu.org>; Thu, 02 Jan 2020 07:32:15 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=discovergy-com.20150623.gappssmtp.com; s=20150623;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc; bh=BevTY+c4smzn2KUxkNgUFfbSAViXE48Qmf658fuB92o=;
 b=XLhNTwy47ZUKUj/A4BYhoI/sTeSVwylWUb+Y65hEhL5S0g1YxQ/24NOoCNRsIg8+Rw
 cztN/FJDuW+tohcTqYACqG04Pl4F3ByOQeLDZasef2Ud0kPv7ZmAh4jH3EtzTTmgoKxJ
 oZXhn9af9LZS3RvHDUUoEU3JPWl/rDsCq4+jlR8LCQFmwVujQ0+vnyUtGm9Fg3/Halur
 ToB3ic0sZ1AEvwK02CRbGnA4ElGjrpVNryXlnwrgBiv/jhrqOfW3VOos2ec54XXrVCZ3
 +zxyE/L4BOszTL29VJY77qIhlcStWdIbBtacEbSs2+bjPa/63SQixcgySFQtniYsn9OI
 B9zg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=BevTY+c4smzn2KUxkNgUFfbSAViXE48Qmf658fuB92o=;
 b=Fgjz29MIH/suYsJVVJhAmXDJuyIZlDjXZ4KSv+clJLXQUhVwaXo2Gu/uqGWqzVpaMw
 6QqUCcqoQS1+qyB47h4X+ulg9cuXPg2NWfApFrBMZKvLW8X8DFzgxlQdmb6XMfvqZKtM
 zdGU7wBqkwlC+ACMJIwLEMcjqxFlvR2iE+lHk6YN2rsGDeNXmu38+1/m23sZja1UJoRQ
 WQTvCIg025LanSOGuAhxbfImoQL8HGBu9gP/CskK7Qpy7nq83RYKoT6bRnhJqCPpTrQM
 iDYOmii1vzuzMnRiQ/HMgePuTJfwH0607fqX85dzu/lpzMsTZNzJFQu9Xvg7zqa93kdw
 mVpQ==
X-Gm-Message-State: APjAAAX1ynWa7AmVyXAzjm1u/mJw8pMzPNZ6uvo/iG5BaZjlvvFuLW+E
 25ftWKXYKbtwBBUI0EhTloQzUnFOvBWnvZi4gt9j6A==
X-Google-Smtp-Source: APXvYqzI+p8JfCpdKVndPcWm9cECB9mtpftL72M7CCzjwB13j9uXxfWR8Z3o23+tYxOf5RxmkjjvrNbwajxXuW08cZg=
X-Received: by 2002:a05:6602:25d3:: with SMTP id
 d19mr44659590iop.217.1577979129475; 
 Thu, 02 Jan 2020 07:32:09 -0800 (PST)
MIME-Version: 1.0
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <202001011119.001BJMYA027994@freefriends.org>
 <CAD-3cdeVbf3TVwFyj7NFd5d5_gTXugTb8_=x9aTjGE4+ufHggQ@mail.gmail.com>
 <202001012024.001KOQMn012801@freefriends.org>
 <CAD-3cdcn1sf0qv1=nAfg1wLR=VeK=iZXv6_OY4PPaAxjavC6Yg@mail.gmail.com>
 <202001020720.0027Kf58032104@freefriends.org>
In-Reply-To: <202001020720.0027Kf58032104@freefriends.org>
From: Sergiu Hlihor <sh@discovergy.com>
Date: Thu, 2 Jan 2020 16:31:57 +0100
Message-ID: <CAD-3cdfjGEWtH-GZLn6Ss2Wjj3HaCETMDEBP7A1m6p4xSc-7Ug@mail.gmail.com>
Subject: Re: bug#32073: Improvements in Grep (Bug#32073)
To: arnold@skeeve.com
Content-Type: multipart/alternative; boundary="00000000000079c8d7059b29e5b6"
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, Paul Eggert <eggert@cs.ucla.edu>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

--00000000000079c8d7059b29e5b6
Content-Type: text/plain; charset="UTF-8"

Hi Arnold,
It is annoying in the sense that you have to specify it on every use. In a
company where you have 10+ developers grepping over various logs, each one
has to remember to add the extra parameter. It would be easier to have some
kind of global configuration that the system admin can set, so that
developers can forget about it. But as I said, a large default is very
likely enough.
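
A global setting with a large default could be resolved roughly as in the
sketch below. Everything named here is hypothetical: the environment
variable EXAMPLE_TOOL_BUFSIZE, the file /etc/example-tool/bufsize, and the
function names are placeholders, not options that grep or gawk provide.

```python
import os

DEFAULT_BUFSIZE = 1024 * 1024  # the 1MB default suggested in this thread


def parse_size(text):
    """Return a positive integer parsed from text, or None if invalid."""
    text = text.strip()
    if text.isascii() and text.isdigit() and int(text) > 0:
        return int(text)
    return None


def resolve_bufsize(environ=None,
                    config_path="/etc/example-tool/bufsize",  # hypothetical
                    env_name="EXAMPLE_TOOL_BUFSIZE"):         # hypothetical
    """Resolve the read buffer size: environment, then admin file, then default."""
    environ = os.environ if environ is None else environ

    # 1. Per-user or per-invocation override.
    size = parse_size(environ.get(env_name, ""))
    if size is not None:
        return size

    # 2. System-wide value that the admin sets once for all users.
    try:
        with open(config_path, encoding="ascii") as f:
            size = parse_size(f.read())
        if size is not None:
            return size
    except (OSError, UnicodeDecodeError):
        pass  # missing or unreadable file: fall through to the default

    # 3. Large built-in default, so most users never need to care.
    return DEFAULT_BUFSIZE
```

The ordering is a design choice: an individual can still override the
admin's value for a one-off run, while everyone else gets the site-wide
setting or the default without remembering an extra parameter.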


On Thu, 2 Jan 2020 at 08:20, <arnold@skeeve.com> wrote:

> Hi.
>
> Sergiu Hlihor <sh@discovergy.com> wrote:
>
> > Hi Arnold,
> > If AWKBUFSIZE translates to the disk IO request size, then it is already
> > what is needed. However, it's a little annoying.
>
> How would you make it less annoying?
>
> Thanks,
>
> Arnold
>

--00000000000079c8d7059b29e5b6--


From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 02 10:36:53 2020
Received: (at 32073) by debbugs.gnu.org; 2 Jan 2020 15:36:53 +0000
Received: from localhost ([127.0.0.1]:39941 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1in2Wr-0000KO-1D
	for submit@debbugs.gnu.org; Thu, 02 Jan 2020 10:36:53 -0500
Received: from freefriends.org ([96.88.95.60]:57776)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <arnold@skeeve.com>) id 1in2Wp-0000KG-7K
 for 32073@debbugs.gnu.org; Thu, 02 Jan 2020 10:36:51 -0500
X-Envelope-From: arnold@skeeve.com
Received: from freefriends.org (freefriends.org [96.88.95.60])
 by freefriends.org (8.14.7/8.14.7) with ESMTP id 002FadFL014662
 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); 
 Thu, 2 Jan 2020 08:36:40 -0700
Received: (from arnold@localhost)
 by freefriends.org (8.14.7/8.14.7/Submit) id 002FadBN014661;
 Thu, 2 Jan 2020 08:36:39 -0700
From: arnold@skeeve.com
Message-Id: <202001021536.002FadBN014661@freefriends.org>
X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to
 arnold@skeeve.com using -f
Date: Thu, 02 Jan 2020 08:36:39 -0700
To: sh@discovergy.com, arnold@skeeve.com
Subject: Re: bug#32073: Improvements in Grep (Bug#32073)
References: <5608aabb-ae0e-38e0-8c26-443f764cb53a@cs.ucla.edu>
 <CAD-3cdd_r=fV0L2Pw8hQMZAWSot3M12bvR93LY4m7zoaCXijtg@mail.gmail.com>
 <202001011119.001BJMYA027994@freefriends.org>
 <CAD-3cdeVbf3TVwFyj7NFd5d5_gTXugTb8_=x9aTjGE4+ufHggQ@mail.gmail.com>
 <202001012024.001KOQMn012801@freefriends.org>
 <CAD-3cdcn1sf0qv1=nAfg1wLR=VeK=iZXv6_OY4PPaAxjavC6Yg@mail.gmail.com>
 <202001020720.0027Kf58032104@freefriends.org>
 <CAD-3cdfjGEWtH-GZLn6Ss2Wjj3HaCETMDEBP7A1m6p4xSc-7Ug@mail.gmail.com>
In-Reply-To: <CAD-3cdfjGEWtH-GZLn6Ss2Wjj3HaCETMDEBP7A1m6p4xSc-7Ug@mail.gmail.com>
User-Agent: Heirloom mailx 12.5 7/5/10
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Spam-Score: 0.2 (/)
X-Debbugs-Envelope-To: 32073
Cc: 32073@debbugs.gnu.org, eggert@cs.ucla.edu
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -0.8 (/)

OK, thanks for the input.

Arnold

Sergiu Hlihor <sh@discovergy.com> wrote:

> Hi Arnold,
> It is annoying in the sense that you have to specify it on every use. In a
> company where you have 10+ developers grepping over various logs, each one
> has to remember to add the extra parameter. It would be easier to have some
> kind of global configuration that the system admin can set, so that
> developers can forget about it. But as I said, a large default is very
> likely enough.
>
>
>
> On Thu, 2 Jan 2020 at 08:20, <arnold@skeeve.com> wrote:
>
> > Hi.
> >
> > Sergiu Hlihor <sh@discovergy.com> wrote:
> >
> > > Hi Arnold,
> > > If AWKBUFSIZE translates to the disk IO request size, then it is
> > > already what is needed. However, it's a little annoying.
> >
> > How would you make it less annoying?
> >
> > Thanks,
> >
> > Arnold
> >