GNU bug report logs - #24858
URGENT: Question about grep


Package: grep;

Reported by: Greta <romano.greta <at> gmail.com>

Date: Wed, 2 Nov 2016 15:35:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24858 in the body.
You can then email your comments to 24858 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-grep <at> gnu.org:
bug#24858; Package grep. (Wed, 02 Nov 2016 15:35:02 GMT)

Acknowledgement sent to Greta <romano.greta <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 02 Nov 2016 15:35:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Greta <romano.greta <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: URGENT: Question about grep
Date: Wed, 2 Nov 2016 15:53:45 +0100
Dear grep developer,

I am Greta Romano and I need your help as soon as possible.

I want to use the grep command to search for a string of 6 characters
in every line of a file (a biological file of DNA nucleotides).

The problem is that I need to search for these 6 characters only within
the first 30 characters of each line. Here is an example:

String to search: GTGTCA

File:

>HWI-ST740:1:C2GCJACXX:1:1101:1279:1825 1:N:0:
_/NGACGCTCTGACCTTGGGGCTGGTCGGGG/__A_TGCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCCACANNNCCAAGCTTCCNNNNNNNNNNNNNNNNNNN
>HWI-ST740:1:C2GCJACXX:1:1101:1349:1847 1:N:0:
_/NTTAGATGAGGGAAACATCTGCATCAAGTT/__G_TTTATCTGTGACAACAAGTGTTGTTCCACTGCCAAAGAGTTTCTTATAATAAAACAATCGGGGTGGCACNNNNN

I want the search to be done only in the underlined characters.  So
what do I have to add to the grep command to set the limit of 30 characters?

Thank you very much

Best regards

Dr. Greta Romano


Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Wed, 02 Nov 2016 15:50:01 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Wed, 02 Nov 2016 15:50:02 GMT)

Notification sent to Greta <romano.greta <at> gmail.com>:
bug acknowledged by developer. (Wed, 02 Nov 2016 15:50:02 GMT)

Message #12 received at 24858-done <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Greta <romano.greta <at> gmail.com>, 24858-done <at> debbugs.gnu.org
Subject: Re: bug#24858: URGENT: Question about grep
Date: Wed, 2 Nov 2016 10:49:30 -0500
tag 24858 notabug
thanks

On 11/02/2016 09:53 AM, Greta wrote:

> String to search: GTGTCA
> 
> File:
> 
>>HWI-ST740:1:C2GCJACXX:1:1101:1279:1825 1:N:0:
> _/NGACGCTCTGACCTTGGGGCTGGTCGGGG/__A_TGCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCCACANNNCCAAGCTTCCNNNNNNNNNNNNNNNNNNN
> 
>>HWI-ST740:1:C2GCJACXX:1:1101:1349:1847 1:N:0:
> _/NTTAGATGAGGGAAACATCTGCATCAAGTT/__G_TTTATCTGTGACAACAAGTGTTGTTCCACTGCCAAAGAGTTTCTTATAATAAAACAATCGGGGTGGCACNNNNN
> 
> 
> I want that the research is done only in the underline characters.

Underlining doesn't show up in plain text mail (and we prefer plain text
over html bloat on the mailing list).  But I think your point still came
across.

>  So
> what I have to add in grep command to put the limit of 30 characters?

You can't do it with grep.  But you can do it with sed or awk.  Use the
right tool for the job at hand :)

Let's strip your example down to a smaller test case: I want to search
for a one-byte string '1', but only in the first 3 bytes of each line.
With grep, it is not possible; the pattern matches anywhere in the line:

$ printf '012000001\n345000001\n' | grep 1
012000001
345000001

But with sed, we can copy the entire line to hold space, truncate the
line in pattern space, then do the search; if successful, print the line
stored in hold space:

$ printf '012000001\n345000001\n' | \
  sed -n 'h; s/^\(.\{3\}\).*/\1/; /1/ { x;p }'
012000001

And I'll leave the awk program as an exercise for the reader.
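
For readers who want to compare, here is one way the awk version might look. It is only a sketch: the function name first_n_match and its two parameters are made up for illustration, and it assumes the search string is a fixed string with no backslashes (awk processes escape sequences in -v values).

```shell
# Print whole lines whose first N characters contain a fixed string.
# substr() extracts the leading window; index() returns a nonzero
# position (true) when the string occurs inside that window, and awk's
# default action prints the unmodified line.
first_n_match() {
  # $1 = window width, $2 = fixed string to find (hypothetical parameters)
  awk -v n="$1" -v s="$2" 'index(substr($0, 1, n), s)'
}

# Same test case as the sed example: look for '1' in the first 3 characters.
printf '012000001\n345000001\n' | first_n_match 3 1
```

On the two-line test input this prints only 012000001. For the original question the call would be along the lines of first_n_match 30 GTGTCA < filename.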

Therefore, I'm tagging this as not a bug.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-grep <at> gnu.org:
bug#24858; Package grep. (Wed, 02 Nov 2016 15:53:02 GMT)

Message #15 received at 24858 <at> debbugs.gnu.org (full text, mbox):

From: Bruce Dubbs <bruce.dubbs <at> gmail.com>
To: Greta <romano.greta <at> gmail.com>, 24858 <at> debbugs.gnu.org
Subject: Re: bug#24858: URGENT: Question about grep
Date: Wed, 2 Nov 2016 10:52:39 -0500
Greta wrote:
> Dear grep developer,
>
> I am Greta Romano and I need your help as soon as possible.
>
> I want to use grep command to search a string of 6 characters in every
> line of a file (biological file with DNA nucleotide).
>
> The problem is that I need to search these 6 characters in the first 30
> characters of each line. I report you an example:
>
> String to search: GTGTCA
>
> File:
>
>  >HWI-ST740:1:C2GCJACXX:1:1101:1279:1825 1:N:0:
> _/NGACGCTCTGACCTTGGGGCTGGTCGGGG/__A_TGCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCCACANNNCCAAGCTTCCNNNNNNNNNNNNNNNNNNN
>
>  >HWI-ST740:1:C2GCJACXX:1:1101:1349:1847 1:N:0:
> _/NTTAGATGAGGGAAACATCTGCATCAAGTT/__G_TTTATCTGTGACAACAAGTGTTGTTCCACTGCCAAAGAGTTTCTTATAATAAAACAATCGGGGTGGCACNNNNN
>
>
> I want that the research is done only in the underline characters.  So
> what I have to add in grep command to put the limit of 30 characters?

cut -c 1-30 filename | grep ACGTAC

  -- Bruce






Information forwarded to bug-grep <at> gnu.org:
bug#24858; Package grep. (Wed, 02 Nov 2016 17:01:02 GMT)

Message #18 received at 24858 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Bruce Dubbs <bruce.dubbs <at> gmail.com>, Greta <romano.greta <at> gmail.com>,
 24858 <at> debbugs.gnu.org
Subject: Re: bug#24858: URGENT: Question about grep
Date: Wed, 2 Nov 2016 12:00:21 -0500
On 11/02/2016 10:52 AM, Bruce Dubbs wrote:

>>
>> I want that the research is done only in the underline characters.  So
>> what I have to add in grep command to put the limit of 30 characters?
> 
> cut -c 1-30 filename | grep ACGTAC

That works if you are only interested in seeing the first 30 characters
of a given line, rather than printing the entire line when the match was
only within the first 30 characters.  If you need to map back to the
entire line, you can use some sort of decorate-search-undecorate
algorithm to keep the search portion still under grep, but at that
point, it's probably easier to just write it all in a language that can
do it in a single pass.
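
As a rough illustration of what decorate-search-undecorate could look like here (a sketch, not a tested recipe for the real data): prepend the first N characters plus a tab to each line, let grep match only in front of that tab, then cut the decoration off again. The function name search_prefix and its parameters are hypothetical, and the sketch assumes neither the input lines nor the pattern contain tab characters.

```shell
# A literal tab, used as the separator between decoration and original line.
tab=$(printf '\t')

search_prefix() {
  # $1 = window width, $2 = grep pattern (hypothetical parameters)
  # Decorate:   emit "<first N chars><TAB><whole line>".
  # Search:     the pattern must match before the first tab, i.e. inside
  #             the decoration only.
  # Undecorate: drop field 1 (the decoration), keep the original line.
  awk -v n="$1" '{ print substr($0, 1, n) "\t" $0 }' |
    grep "^[^$tab]*$2" |
    cut -f 2-
}

# Only the first line has ACGTAC within its first 8 characters.
printf 'xxACGTACxxxx\nxxxxxxxxACGTAC\n' | search_prefix 8 ACGTAC
```

This prints xxACGTACxxxx in full and drops the second line, whose match lies outside the 8-character window.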

I guess I should also mention that if you know your lines are a fixed
width (say for example that every line is exactly 80 characters), then
you can exploit that using just grep to find a match only in the first
30 characters by explicitly spelling out the fixed-width remainder of
the line as an anchor:

grep 'ACGTAC.*.\{50\}$' filename

Sadly, the two example lines you printed were not the same length, so I
don't think it helps for your case.
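
For what it's worth, the fixed-width trick can be checked on a scaled-down case. The sketch below assumes every line is exactly 10 characters wide and that only the first 6 may contain the two-character string AC; the function name fixed_width_demo is invented for the example.

```shell
# Every input line is assumed to be exactly 10 characters wide.
# Requiring at least 4 characters (10 - 6) between the end of the match
# and the end of the line keeps the match inside the first 6 characters.
fixed_width_demo() {
  grep 'AC.*.\{4\}$'
}

# Line 1: AC sits in columns 5-6, followed by 4 characters.
# Line 2: AC sits in columns 6-7, followed by only 3 characters.
printf 'xxxxACxxxx\nxxxxxACxxx\n' | fixed_width_demo
```

Only xxxxACxxxx is printed, because the second line leaves fewer than 4 characters after the match.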

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-grep <at> gnu.org:
bug#24858; Package grep. (Wed, 02 Nov 2016 17:26:01 GMT)

Message #21 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Jackson <pj <at> usa.net>
To: bug-grep <at> gnu.org
Subject: Re: bug#24858: URGENT: Question about grep
Date: Wed, 02 Nov 2016 12:24:58 -0500
Greta asked:
>> So what I have to add in grep command to put the limit of 30 characters?

Eric replied:
>> You can't do it with grep. 

Bruce suggested:
>> cut -c 1-30 filename | grep ACGTAC

Using the following grep command seems to work for me, and is about
40% faster, in terms of user CPU time spent, on my system, using a large
dataset I have (some web server logs)  than using cut and grep in a pipeline,
as the extra CPU cost of the more complex grep expression is more than
compensated for by the reduced copying of the datastream:

grep -E '^.{0,30}GTGTCA'

===

A custom C program could make this dramatically faster, especially if:

it avoided using stdio or any other form of line buffering that copied
each line of data within the application,

it used raw read(2) calls,

it used strchr(3) calls to scan to the end of the current line (hence the start
of the next line), and

it used a mix of strchr and unaligned word compares, say of the 4 bytes
"ACGT", then the 2 bytes "AC", which can be done on CPUs supporting
unaligned word compares.

Finding a programmer who can code that might be difficult, and
such optimization would only make sense if you're burning lots of
CPU time or project time, on this particular scan.

-- 
                Paul Jackson
                pj <at> usa.net




Information forwarded to bug-grep <at> gnu.org:
bug#24858; Package grep. (Wed, 02 Nov 2016 17:30:02 GMT)

Message #24 received at 24858 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Paul Jackson <pj <at> usa.net>, 24858 <at> debbugs.gnu.org
Subject: Re: bug#24858: URGENT: Question about grep
Date: Wed, 2 Nov 2016 12:29:15 -0500
On 11/02/2016 12:24 PM, Paul Jackson wrote:
> Greta asked:
>>> So what I have to add in grep command to put the limit of 30 characters?
> 
> Eric replied:
>>> You can't do it with grep. 
> 
> Bruce suggested:
>>> cut -c 1-30 filename | grep ACGTAC
> 
> Using the following grep command seems to work for me, and is about
> 40% faster, in terms of user CPU time spent, on my system, using a large
> dataset I have (some web server logs)  than using cut and grep in a pipeline,
> as the extra CPU cost of the more complex grep expression is more than
> compensated for by the reduced copying of the datastream:
> 
> grep -E '^.{0,30}GTGTCA'

That searches up to 36 characters.  If you want to limit it to just the
first 30, you need '^.{0,24}GTGTCA', since a 6-character match that ends
by the 30th character can have at most 24 characters before it.
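
The offset arithmetic generalizes to (window width) minus (length of the search string). A small sketch that computes it rather than hard-coding 24 follows; the function name grep_in_window and its parameters are hypothetical, and it assumes the search string is a literal with no regex metacharacters and is no longer than the window.

```shell
grep_in_window() {
  # $1 = window width, $2 = literal search string (hypothetical parameters)
  # Largest number of characters allowed before the match:
  # window width minus string length, e.g. 30 - 6 = 24 for GTGTCA.
  max_offset=$(( $1 - ${#2} ))
  grep -E "^.{0,${max_offset}}$2"
}

# Line 1: 24 filler characters, so GTGTCA occupies characters 25-30.
# Line 2: 25 filler characters, so GTGTCA occupies characters 26-31.
printf '%024dGTGTCAtail\n%025dGTGTCAtail\n' 0 0 | grep_in_window 30 GTGTCA
```

Only the first test line is printed; the second is rejected because its match runs one character past the 30-character window.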

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 01 Dec 2016 12:24:04 GMT)

This bug report was last modified 8 years and 253 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.