GNU bug report logs - #73721
grep perf docs barely mention mem usage
Reported by: <mark.yagnatinsky <at> barclays.com>
Date: Wed, 9 Oct 2024 17:03:02 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 73721 in the body.
You can then email your comments to 73721 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#73721; Package grep. (Wed, 09 Oct 2024 17:03:02 GMT)
Acknowledgement sent to <mark.yagnatinsky <at> barclays.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 09 Oct 2024 17:03:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
This page: https://www.gnu.org/software/grep/manual/html_node/Performance.html
discusses how to get good performance out of grep.
That is nice, but the perf advice focuses almost entirely on speed.
I'm more interested in how to avoid excessive memory usage.
After a bit of research, it seems that once upon a time, grep used mmap where possible, but it no longer does this.
Thus, peak memory usage will be proportional to the length of the longest line in the file.
Thus, if I use the "-z multiline hack" to search across lines, grep will read the whole file into memory.
Thus, if I try this on a huge file, I will likely have a bad time.
(e.g., a 5 gig file would fail in a 32-bit grep, and would increase memory pressure on the system on a 64-bit grep.)
Is the above about right?
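
The claim about the longest line can be illustrated with a small model. This is a minimal sketch, not grep's actual buffering code: the function name `peak_buffer_size`, the 4096-byte chunk size, and the sample data are all assumptions made for illustration. It reads a stream in fixed-size chunks, discards every complete record, and reports how large the buffer had to grow. With newline as the delimiter the buffer stays near one chunk; with NUL as the delimiter (the "-z" case on a file that contains no NUL bytes) the whole input becomes a single record and must be held at once.

```python
import io


def peak_buffer_size(stream, delimiter=b"\n", chunk_size=4096):
    """Scan delimiter-terminated records; return (record_count, peak_buffer_bytes).

    Hypothetical model of a record-oriented reader, not grep's real code.
    """
    buffer = bytearray()
    peak = 0
    records = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        peak = max(peak, len(buffer))
        # Complete records can be dropped once scanned; only the
        # unfinished tail has to stay in memory.
        last = buffer.rfind(delimiter)
        if last != -1:
            records += buffer.count(delimiter, 0, last + 1)
            del buffer[: last + 1]
    if buffer:
        records += 1  # final record without a trailing delimiter
    return records, peak


data = b"short line\n" * 10_000  # 110,000 bytes, no NUL bytes

by_line = peak_buffer_size(io.BytesIO(data), b"\n")
by_nul = peak_buffer_size(io.BytesIO(data), b"\0")

print("newline-delimited:", by_line)
print("NUL-delimited:    ", by_nul)
```

In this model the peak is bounded by roughly one chunk plus one record, so the record length is what dominates once records grow beyond the chunk size.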
This message is for information purposes only. It is not a recommendation, advice, offer or solicitation to buy or sell a product or service, nor an official confirmation of any transaction. It is directed at persons who are professionals and is intended for the recipient(s) only. It is not directed at retail customers. This message is subject to the terms at: https://www.ib.barclays/disclosures/web-and-email-disclaimer.html.
For important disclosures, please see: https://www.ib.barclays/disclosures/sales-and-trading-disclaimer.html regarding marketing commentary from Barclays Sales and/or Trading desks, who are active market participants; https://www.ib.barclays/disclosures/barclays-global-markets-disclosures.html regarding our standard terms for Barclays Investment Bank where we trade with you in principal-to-principal wholesale markets transactions; and in respect to Barclays Research, including disclosures relating to specific issuers, see: https://publicresearch.barclays.com.
__________________________________________________________________________________
If you are incorporated or operating in Australia, read these important disclosures: https://www.ib.barclays/disclosures/important-disclosures-asia-pacific.html.
__________________________________________________________________________________
For more details about how we use personal information, see our privacy notice: https://www.ib.barclays/disclosures/personal-information-use.html.
__________________________________________________________________________________
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Wed, 09 Oct 2024 19:04:02 GMT)
Notification sent to <mark.yagnatinsky <at> barclays.com>: bug acknowledged by developer. (Wed, 09 Oct 2024 19:04:02 GMT)
Message #10 received at 73721-done <at> debbugs.gnu.org (full text, mbox):
On 2024-10-09 10:01, mark.yagnatinsky--- via Bug reports for GNU grep wrote:
> After a bit of research, it seems that once upon a time, grep used mmap where possible, but it no longer does this.
> Thus, peak memory usage will be proportional to the length of the longest line in the file.
> Thus, if I use the "-z multiline hack" to search across lines, grep will read the whole file into memory.
> Thus, if I try this on a huge file, I will likely have a bad time.
> (e.g., a 5 gig file would fail in a 32-bit grep, and would increase memory pressure on the system on a 64-bit grep.)
>
> Is the above about right?
Sounds right.
mmap likely wouldn't help much. As I recall, it typically made 'grep'
slower.
Information forwarded to bug-grep <at> gnu.org: bug#73721; Package grep. (Wed, 09 Oct 2024 21:13:02 GMT)
Message #13 received at 73721-done <at> debbugs.gnu.org (full text, mbox):
Thanks!!
Re: wouldn't help much: maybe it wouldn't help with speed.
But how could it possibly not help with memory usage?
In particular, the "commit charge" is surely far lower, right?
With a read() based approach, grep needs to explicitly allocate an array, and as far as the kernel knows, this array has arbitrary contents.
(In fact, the array matches a certain file on disk, but the kernel can't know that.)
If memory gets tight, then if the kernel wants to evict this array from RAM, it must find swap space for it in the page file or whatever.
With mmap(), the kernel just needs to set up a bit of book-keeping to note that "this virtual address range is backed by this file on disk".
If memory gets tight, it can simply evict the range from RAM, since it knows it can reconstruct it perfectly later.
Or am I missing something?
Thanks again for responding so fast!
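
The two approaches contrasted in this message can be sketched side by side. This is only an illustration under stated assumptions: the helper names `find_with_read` and `find_with_mmap` and the sample file are hypothetical, and Python's `mmap` module stands in for the underlying system call. The first helper copies the file into a private buffer owned by the process; the second asks the kernel for a read-only, file-backed view and searches that. Both return the same offset. The sketch does not measure commit charge or paging behaviour, which depend on the operating system.

```python
import mmap
import os
import tempfile


def find_with_read(path, needle):
    """Copy the whole file into a private buffer, then search it."""
    with open(path, "rb") as f:
        data = f.read()
    return data.find(needle)


def find_with_mmap(path, needle):
    """Search a read-only, file-backed mapping without an explicit copy."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view.find(needle)


# Hypothetical sample file: the needle starts at byte offset 1000.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1000 + b"needle" + b"y" * 1000)
    sample_path = tmp.name

try:
    offset_read = find_with_read(sample_path, b"needle")
    offset_mmap = find_with_mmap(sample_path, b"needle")
finally:
    os.remove(sample_path)

print(offset_read, offset_mmap)
```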
Information forwarded to bug-grep <at> gnu.org: bug#73721; Package grep. (Thu, 10 Oct 2024 04:12:02 GMT)
Message #16 received at 73721-done <at> debbugs.gnu.org (full text, mbox):
On 2024-10-09 14:12, mark.yagnatinsky <at> barclays.com wrote:
> With mmap(), the kernel just needs to set up a bit of book-keeping
Yes, I'm sure that in some cases mmap would be a win. However, all in
all it was found not to be, and I expect that the hassle of maintaining
the mmap variant wasn't worth whatever benefits it provided. As I recall
'grep' wants to modify its input buffers for other efficiency reasons,
and mmap CoW likely won't be a performance win over page-aligned read.
We could be proved wrong by someone taking the time to resurrect --mmap
and doing performance measurements of realistic benchmarks. I don't have
the time for that, myself.
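
The copy-on-write behaviour mentioned above can be shown in a few lines. This is a sketch under assumptions, not a reconstruction of what grep did: the helper name `modify_private_mapping` and the sample file are hypothetical, and Python's `mmap.ACCESS_COPY` is used as the private, copy-on-write mapping mode. Writing into such a mapping changes only the process's own copy of the touched page, so the file on disk is untouched, and that page is no longer simply backed by the file.

```python
import mmap
import os
import tempfile


def modify_private_mapping(path):
    """Overwrite the first byte of a copy-on-write mapping.

    Returns (bytes seen through the mapping, bytes still on disk).
    """
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as view:
            view[0:1] = b"\0"  # touches a private copy of the page
            in_memory = bytes(view[:5])
    with open(path, "rb") as f:
        on_disk = f.read(5)
    return in_memory, on_disk


# Hypothetical sample file for the demonstration.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\n")
    sample_path = tmp.name

try:
    in_memory, on_disk = modify_private_mapping(sample_path)
finally:
    os.remove(sample_path)

print(in_memory, on_disk)
```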
Information forwarded to bug-grep <at> gnu.org: bug#73721; Package grep. (Thu, 10 Oct 2024 16:19:02 GMT)
Message #19 received at 73721-done <at> debbugs.gnu.org (full text, mbox):
Okay, that's a surprise. I had assumed the buffer would be mapped read-only.
If indeed it needs to modify the buffer then even the memory gains start to look marginal.
Thanks for explaining where my assumptions were wrong, and thanks also for confirming the parts that were right. I now know that the use-case I have in mind is likely better served by sed and not by grep.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 08 Nov 2024 12:24:08 GMT)
This bug report was last modified 278 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.