GNU bug report logs - #24858
URGENT: Question about grep
Reported by: Greta <romano.greta <at> gmail.com>
Date: Wed, 2 Nov 2016 15:35:02 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24858 in the body.
You can then email your comments to 24858 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#24858; Package grep. (Wed, 02 Nov 2016 15:35:02 GMT)
Acknowledgement sent to Greta <romano.greta <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 02 Nov 2016 15:35:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Dear grep developer,
I am Greta Romano and I need your help as soon as possible.
I want to use grep command to search a string of 6 characters in every
line of a file (biological file with DNA nucleotide).
The problem is that I need to search these 6 characters in the first 30
characters of each line. I report you an example:
String to search: GTGTCA
File:
>HWI-ST740:1:C2GCJACXX:1:1101:1279:1825 1:N:0:
_/NGACGCTCTGACCTTGGGGCTGGTCGGGG/__A_TGCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCCACANNNCCAAGCTTCCNNNNNNNNNNNNNNNNNNN
>HWI-ST740:1:C2GCJACXX:1:1101:1349:1847 1:N:0:
_/NTTAGATGAGGGAAACATCTGCATCAAGTT/__G_TTTATCTGTGACAACAAGTGTTGTTCCACTGCCAAAGAGTTTCTTATAATAAAACAATCGGGGTGGCACNNNNN
I want that the research is done only in the underline characters. So
what I have to add in grep command to put the limit of 30 characters?
Thank you very much
Best regards
Dr. Greta Romano
Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Wed, 02 Nov 2016 15:50:01 GMT)
Reply sent to Eric Blake <eblake <at> redhat.com>: You have taken responsibility. (Wed, 02 Nov 2016 15:50:02 GMT)
Notification sent to Greta <romano.greta <at> gmail.com>: bug acknowledged by developer. (Wed, 02 Nov 2016 15:50:02 GMT)
Message #12 received at 24858-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
tag 24858 notabug
thanks
On 11/02/2016 09:53 AM, Greta wrote:
> String to search: GTGTCA
>
> File:
>
>>HWI-ST740:1:C2GCJACXX:1:1101:1279:1825 1:N:0:
> _/NGACGCTCTGACCTTGGGGCTGGTCGGGG/__A_TGCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCCACANNNCCAAGCTTCCNNNNNNNNNNNNNNNNNNN
>
>>HWI-ST740:1:C2GCJACXX:1:1101:1349:1847 1:N:0:
> _/NTTAGATGAGGGAAACATCTGCATCAAGTT/__G_TTTATCTGTGACAACAAGTGTTGTTCCACTGCCAAAGAGTTTCTTATAATAAAACAATCGGGGTGGCACNNNNN
>
>
> I want that the research is done only in the underline characters.
Underlining doesn't show up in plain text mail (and we prefer plain text
over html bloat on the mailing list). But I think your point still made
it across.
> So
> what I have to add in grep command to put the limit of 30 characters?
You can't do it with grep. But you can do it with sed or awk. Use the
right tool for the job at hand :)
Let's strip your example down to a smaller test case: I want to search
for a one-byte string '1', but only in the first 3 bytes of each line.
With grep, it is not possible; the pattern matches anywhere in the line:
$ printf '012000001\n345000001\n' | grep 1
012000001
345000001
But with sed, we can copy the entire line to hold space, truncate the
line in pattern space, then do the search; if successful, print the line
stored in hold space:
$ printf '012000001\n345000001\n' | \
sed -n 'h; s/^\(.\{3\}\).*/\1/; /1/ { x;p }'
012000001
And I'll leave the awk program as an exercise for the reader.
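For readers who want a starting point, here is one possible awk sketch of the same idea. It assumes the search string is a literal (not a regular expression) and that the window is counted in characters from the start of each line; match_prefix is a hypothetical helper name used only for illustration.

```shell
# Print every input line whose first N characters contain a literal string.
# match_prefix is a hypothetical helper name, not part of grep or awk.
# $1 = literal string to search for, $2 = window width N; reads stdin.
match_prefix() {
    # substr() cuts the window, index() returns 0 when the string is absent
    awk -v pat="$1" -v n="$2" 'index(substr($0, 1, n), pat) > 0'
}

# Same small test case as above: only the first line has '1' in its first 3 bytes.
printf '012000001\n345000001\n' | match_prefix 1 3
```

For the original question this would be called as match_prefix GTGTCA 30 < filename.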
Therefore, I'm tagging this as not a bug.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#24858; Package grep. (Wed, 02 Nov 2016 15:53:02 GMT)
Message #15 received at 24858 <at> debbugs.gnu.org (full text, mbox):
Greta wrote:
> Dear grep developer,
>
> I am Greta Romano and I need your help as soon as possible.
>
> I want to use grep command to search a string of 6 characters in every
> line of a file (biological file with DNA nucleotide).
>
> The problem is that I need to search these 6 characters in the first 30
> characters of each line. I report you an example:
>
> String to search: GTGTCA
>
> File:
>
> >HWI-ST740:1:C2GCJACXX:1:1101:1279:1825 1:N:0:
> _/NGACGCTCTGACCTTGGGGCTGGTCGGGG/__A_TGCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCCACANNNCCAAGCTTCCNNNNNNNNNNNNNNNNNNN
>
> >HWI-ST740:1:C2GCJACXX:1:1101:1349:1847 1:N:0:
> _/NTTAGATGAGGGAAACATCTGCATCAAGTT/__G_TTTATCTGTGACAACAAGTGTTGTTCCACTGCCAAAGAGTTTCTTATAATAAAACAATCGGGGTGGCACNNNNN
>
>
> I want that the research is done only in the underline characters. So
> what I have to add in grep command to put the limit of 30 characters?
cut -c 1-30 filename | grep ACGTAC
-- Bruce
Information forwarded to bug-grep <at> gnu.org: bug#24858; Package grep. (Wed, 02 Nov 2016 17:01:02 GMT)
Message #18 received at 24858 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 11/02/2016 10:52 AM, Bruce Dubbs wrote:
>>
>> I want that the research is done only in the underline characters. So
>> what I have to add in grep command to put the limit of 30 characters?
>
> cut -c 1-30 filename | grep ACGTAC
That works if you are only interested in seeing the first 30 characters
of a given line, rather than printing the entire line when the match was
only within the first 30 characters. If you need to map back to the
entire line, you can use some sort of decorate-search-undecorate
algorithm to keep the search portion still under grep, but at that
point, it's probably easier to just write it all in a language that can
do it in a single pass.
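One way the decorate-search-undecorate idea could look, as a rough sketch only: it assumes the input contains no tab characters (the tab is used as the decoration separator) and that the pattern is a basic regular expression without anchors. prefix_grep is a hypothetical helper name.

```shell
# Decorate-search-undecorate sketch; prefix_grep is a hypothetical name.
# Assumes no tab characters in the input.
# $1 = unanchored basic regular expression, $2 = window width; reads stdin.
prefix_grep() {
    tab=$(printf '\t')
    # decorate: "<first N chars><TAB><whole line>"
    awk -v n="$2" '{ print substr($0, 1, n) "\t" $0 }' |
        # search: the match must end before the first tab
        grep "^[^$tab]*$1" |
        # undecorate: drop the copied prefix, keep the original line
        cut -f 2-
}

printf '012000001\n345000001\n' | prefix_grep 1 3
```

The search step stays in grep, at the cost of two extra processes and an extra copy of each line prefix.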
I guess I should also mention that if you know your lines are a fixed
width (say for example that every line is exactly 80 characters), then
you can exploit that using just grep to find a match only in the first
30 characters by explicitly spelling out the fixed-width remainder of
the line as an anchor:
grep 'ACGTAC.*.\{50\}$' filename
Sadly, the two example lines you printed were not the same length, so I
don't think it helps for your case.
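To see the fixed-width trick on the smaller test case from earlier in this thread (every line exactly 9 characters, search limited to the first 3), the remainder to spell out is at least 9 - 3 = 6 characters. fixed_width_demo is a hypothetical wrapper name used only so the command can be reused.

```shell
# Fixed-width anchor on 9-character lines: a '1' that ends within the
# first 3 characters is followed by at least 6 more characters.
# fixed_width_demo is a hypothetical name; reads stdin.
fixed_width_demo() {
    grep '1.*.\{6\}$'
}

printf '012000001\n345000001\n' | fixed_width_demo
```

Only 012000001 is selected; in 345000001 the sole '1' is the last character, so no 6 characters can follow it.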
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#24858; Package grep. (Wed, 02 Nov 2016 17:26:01 GMT)
Message #21 received at submit <at> debbugs.gnu.org (full text, mbox):
Greta asked:
>> So what I have to add in grep command to put the limit of 30 characters?
Eric replied:
>> You can't do it with grep.
Bruce suggested:
>> cut -c 1-30 filename | grep ACGTAC
Using the following grep command seems to work for me, and is about
40% faster, in terms of user CPU time spent, on my system, using a large
dataset I have (some web server logs) than using cut and grep in a pipeline,
as the extra CPU cost of the more complex grep expression is more than
compensated for by the reduced copying of the datastream:
grep -E '^.{0,30}GTGTCA'
===
A custom C program could make this dramatically faster, especially if:
it avoided using stdio or any other form of line buffering that copied
each line of data within the application,
it used raw read(2) calls,
it used strchr(3) calls to scan to the end of the current line (hence the start
of the next line), and
it used a mix of strchr and unaligned word compares, say of the 4 bytes
"ACGT", then the 2 bytes "AC", which can be done on CPUs supporting
unaligned word compares.
Finding a programmer who can code that might be difficult, and
such optimization would only make sense if you're burning lots of
CPU time or project time, on this particular scan.
--
Paul Jackson
pj <at> usa.net
Information forwarded to bug-grep <at> gnu.org: bug#24858; Package grep. (Wed, 02 Nov 2016 17:30:02 GMT)
Message #24 received at 24858 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 11/02/2016 12:24 PM, Paul Jackson wrote:
> Greta asked:
>>> So what I have to add in grep command to put the limit of 30 characters?
>
> Eric replied:
>>> You can't do it with grep.
>
> Bruce suggested:
>>> cut -c 1-30 filename | grep ACGTAC
>
> Using the following grep command seems to work for me, and is about
> 40% faster, in terms of user CPU time spent, on my system, using a large
> dataset I have (some web server logs) than using cut and grep in a pipeline,
> as the extra CPU cost of the more complex grep expression is more than
> compensated for by the reduced copying of the datastream:
>
> grep -E '^.{0,30}GTGTCA'
That searches up to 36 characters. If you want to limit it to just the
first 30, you need '^.{0,24}GTGTCA', since a 6-character match has to
start no later than the 25th character (that is, after at most 24
leading characters) to end within the first 30.
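The arithmetic generalizes: for a window of N characters and a literal pattern of length L, the leading part may be at most N - L characters. A small sketch, assuming the pattern contains no regex metacharacters and N is at least L; grep_first_n is a hypothetical helper name.

```shell
# Match a literal pattern only if it lies entirely within the first N characters.
# grep_first_n is a hypothetical name. Assumes $1 has no regex metacharacters
# and $2 >= length of $1. Reads stdin.
grep_first_n() {
    max_lead=$(( $2 - ${#1} ))      # 30 - 6 = 24 for GTGTCA
    grep -E "^.{0,$max_lead}$1"
}

# Line 1: GTGTCA occupies characters 25-30 (inside the window).
# Line 2: GTGTCA occupies characters 26-31 (spills past character 30).
printf '%024dGTGTCA\n%025dGTGTCA\n' 0 0 | grep_first_n GTGTCA 30
```

Only the first line is printed; with '^.{0,30}GTGTCA' both lines would match.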
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 01 Dec 2016 12:24:04 GMT)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.