From unknown Wed Jun 25 00:26:01 2025
From: Ole Tange
Date: Wed, 5 Dec 2012 15:11:32 +0100
Subject: bug#13089: Wish: split every n'th into n pipes
To: 13089@debbugs.gnu.org
X-Debbugs-Original-To: bug-coreutils@gnu.org
X-GNU-PR-Message: report 13089
X-GNU-PR-Package: coreutils

I often have data that can be processed in parallel.

It would be great if split --filter could look at every n'th line
instead of chunking into n chunks:

  cat bigfile | split --every-nth -n 8 --filter "grep foo"

The above should start 8 greps and give each a line in round robin manner.

Ideally it should be possible to do so non-blocking so if some lines
take longer for one instance of grep, then the rest of the greps are
not blocked.

/Ole

From unknown Wed Jun 25 00:26:01 2025
From: Pádraig Brady
Date: Wed, 05 Dec 2012 18:52:29 +0000
Subject: bug#13089: Wish: split every n'th into n pipes
To: Ole Tange
Cc: 13089@debbugs.gnu.org
Message-ID: <50BF97ED.6050004@draigBrady.com>
X-GNU-PR-Message: followup 13089
X-GNU-PR-Package: coreutils

tag 13089 + notabug
close 13089

On 12/05/2012 02:11 PM, Ole Tange wrote:
> I often have data that can be processed in parallel.
>
> It would be great if split --filter could look at every n'th line
> instead of chunking into n chunks:
>
> cat bigfile | split --every-nth -n 8 --filter "grep foo"
>
> The above should start 8 greps and give each a line in round robin manner.
>
> Ideally it should be possible to do so non-blocking so if some lines
> take longer for one instance of grep, then the rest of the greps are
> not blocked.

So that's mostly supported already (notice the r/ below):

  $ seq 8000 | split -n r/8 --filter='wc -l' | uniq -c
        8 1000

The concurrency is achieved through standard I/O buffers
between split and the filters (note also the -u split option).

I'm not sure non-blocking I/O would be of much benefit,
since the filters will be the same, and if we did that,
then we'd have to worry about internal buffering in split.
We had a similar question about tee, yesterday, and I
think the answer is the same here, that the complexity
doesn't seem warranted for such edge cases.

thanks,
Pádraig.
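
[Editorial sketch, not part of the original message.] The round-robin
behaviour described in the reply above can be made visible on a small
input. This assumes a GNU split that supports both -n r/N and --filter;
the wrapper name rr_split is hypothetical and only used for
illustration. Each filter joins the lines it receives, so the output
shows which lines went to which process. The filters run concurrently,
so the result is sorted to make the order stable.

```shell
export LC_ALL=C

# rr_split N FILTER: hypothetical wrapper (not a coreutils command).
# Feeds stdin line by line, round-robin, to N copies of FILTER.
rr_split() {
    split -n "r/$1" --filter="$2"
}

# Each of the 4 filters joins its own lines with spaces.
# Filter 1 receives lines 1, 5, 9; filter 2 receives 2, 6, 10; and so on.
out=$(seq 12 | rr_split 4 "paste -sd' ' -" | sort)
printf '%s\n' "$out"
```

With 12 input lines and 4 filters, each filter should report three
lines spaced four apart, which is the "every n'th line" distribution
asked for in the original report.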

From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 14:31:15 2012
From: Pádraig Brady
Date: Wed, 05 Dec 2012 19:30:59 +0000
Subject: bug#13089: Wish: split every n'th into n pipes
To: control@debbugs.gnu.org
Message-ID: <50BFA0F3.8030901@draigBrady.com>

tag 13089 + notabug
close 13089

From unknown Wed Jun 25 00:26:01 2025
From: Pádraig Brady
Date: Thu, 06 Dec 2012 13:02:34 +0000
Subject: bug#13089: Wish: split every n'th into n pipes
To: Ole Tange
Cc: 13089@debbugs.gnu.org
Message-ID: <50C0976A.8020503@draigBrady.com>
References: <50BF97ED.6050004@draigBrady.com>
In-Reply-To: <50BF97ED.6050004@draigBrady.com>
X-GNU-PR-Message: followup 13089
X-GNU-PR-Package: coreutils
X-GNU-PR-Keywords: notabug

On 12/06/2012 12:20 PM, Ole Tange wrote:
> On Thu, Dec 6, 2012 at 12:41 PM, Pádraig Brady wrote:
>> On 12/06/2012 11:25 AM, Pádraig Brady wrote:
>>> On 12/06/2012 12:06 AM, Ole Tange wrote:
>>>>
>>>> Do you have a similar reference:
>>>>
>>>> * if each record is k lines (e.g. 4 lines as is the case in FASTQ files)
>>>> * If each record has a record separator (e.g. > in FASTA files)
>>>
>>> I'd probably preprocess first to a single line:
>>>
>>> The following may not be robust or efficient.
>>> I suspect there may be tools already to efficiently
>>> parse fast[aq] to a single line:
>>>
>>> fastalines(){ sed -n '/^>/!{H;$!b};s/$/\x00/;x;1b;s/\n//g;p'; }
>>> fastqlines(){ sed -n '/^@/!{H;$!b};s/$/\x00/;x;1b;s/\n//g;p'; }
>>>
>>> Then use like:
>>>
>>> fasta_source | fastalines |
>>> split -n r/8 --filter='tr "\0" "\n" | process_fasta'
>
> Here you assume that the quality score never reaches '@'. You cannot
> do that, because it sometimes reaches @. The only thing you can be
> sure of is every record is 4 lines.

Sure. I mentioned they might not be robust.
These may be better:

  fastalines(){ sed '1!s/^>/\x00&/' | tr '\n\0' '\0\n'; }
  fastqlines(){ paste -d $'\1' - - - - | tr '\1' '\0'; }

> I was hoping for a general solution that would work no matter the
> content. Your solution breaks if the content contains \0 (NULs are not
> in FAST[AQ] files, but may be in other formats).

Fair point, but you can use the general technique of transforming
(encoding) NULs to something else before processing, in the
unlikely case they're present in the input.

> Do you see support coming for n-line records in split?

Given the above options, probably not.
Maybe we could add support for --zero-terminated to
treat \0 as the delimiter rather than \n,
which might simplify the postprocessing required?

> Do you see support coming for records split on regexp in split?

Given the complexity, probably not.
Regexps would be better maintained within sed etc.,
which could do the annotation for later splitting.
Note also the `csplit` util, but I don't see us updating
that to support a fixed number of outputs like `split` either.

cheers,
Pádraig.
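
[Editorial sketch, not part of the original message.] The fixed-size
record technique discussed in this thread can be tried end to end on a
tiny FASTQ-like sample. This is only an illustration under stated
assumptions: GNU split with -n r/N and --filter, GNU tr (octal
escapes), and input that contains neither NUL nor \001 bytes. The
helper name pack4 and the sample data are made up for the example. The
quality line deliberately starts with '@' to show that counting lines,
rather than matching on '@', keeps records intact.

```shell
export LC_ALL=C

# \001 is used as a temporary joiner because paste cannot take NUL as a
# delimiter; it is assumed not to occur in the data.
SOH=$(printf '\001')

# pack4 (hypothetical helper): put each 4-line record on one line,
# with the original newlines replaced by NUL.
pack4() { paste -d "$SOH" - - - - | tr '\001' '\000'; }

# Three made-up 4-line records; the quality line starts with '@'.
records=$(printf '@r%s\nACGT\n+\n@II@\n' 1 2 3)

# Distribute whole records round-robin to 2 filters. Each filter
# restores the newlines, keeps every 4th line starting at line 1
# (the record headers), and joins them with commas.
out=$(printf '%s\n' "$records" | pack4 |
      split -n r/2 --filter='tr "\000" "\n" | awk "NR % 4 == 1" | paste -sd, -' |
      sort)
printf '%s\n' "$out"
```

If the records stay aligned, one filter should report the headers of
records 1 and 3 and the other the header of record 2. For data that may
contain NUL or \001, the bytes would first need to be encoded to
something else, as suggested in the reply above.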