From unknown Mon Aug 18 11:25:38 2025 X-Loop: help-debbugs@gnu.org Subject: bug#15077: Bug in Join Resent-From: CDR Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Mon, 12 Aug 2013 16:15:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 15077 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: 15077@debbugs.gnu.org X-Debbugs-Original-To: bug-coreutils@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.13763240965059 (code B ref -1); Mon, 12 Aug 2013 16:15:02 +0000 Received: (at submit) by debbugs.gnu.org; 12 Aug 2013 16:14:56 +0000 Received: from localhost ([127.0.0.1]:55026 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1V8ulk-0001JX-7j for submit@debbugs.gnu.org; Mon, 12 Aug 2013 12:14:56 -0400 Received: from eggs.gnu.org ([208.118.235.92]:42723) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1V8ukk-0001H2-AZ for submit@debbugs.gnu.org; Mon, 12 Aug 2013 12:13:54 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V8ukc-0002pH-JY for submit@debbugs.gnu.org; Mon, 12 Aug 2013 12:13:49 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM, HTML_MESSAGE,T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:59428) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V8ukc-0002pB-GT for submit@debbugs.gnu.org; Mon, 12 Aug 2013 12:13:46 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:44428) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V8ukZ-0005Un-JT for bug-coreutils@gnu.org; Mon, 12 Aug 2013 12:13:46 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V8ukT-0002lm-7C for bug-coreutils@gnu.org; Mon, 12 Aug 2013 12:13:43 -0400 Received: from mail-wi0-x230.google.com 
([2a00:1450:400c:c05::230]:53303) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V8ukT-0002lG-0q for bug-coreutils@gnu.org; Mon, 12 Aug 2013 12:13:37 -0400 Received: by mail-wi0-f176.google.com with SMTP id f14so1909072wiw.15 for ; Mon, 12 Aug 2013 09:13:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=u4/j7qgVghDXhcugEYrZay7HDxMOTlbyPeZJpP5/VGE=; b=iD8+KeVswh8CtOhc2l0wl74apUgBb6mbmY//7uRibrZl96RE8ggi/Lryrtfpq9ixqj 8ep8ujRhcsHzFWEaxsaCVGo0mgkDBSRAeN6F1O0QqL70zlUlS1XMNEXi6gsLJnEI7gb+ lKIY2+Q2fnnw1tKdak2U/pCUopqb8LLF0LM32Xr8hhoIuQVI7fkJU24JwTylGWNWF01G FIajk3ZBxLkUwR6XR/Nw2Zt+2eBR4AgCDZUGMAIC6LKcmtwMvA6mVc1iuo+tJOofOl+e 88oKkR53Z0F52mkOgs4KpXV6OQMYYNMapUUtHXc5Ao6ctABA4C8/CQ6AbcQb3mKSKtKJ e/IA== MIME-Version: 1.0 X-Received: by 10.194.22.41 with SMTP id a9mr13102070wjf.16.1376324015786; Mon, 12 Aug 2013 09:13:35 -0700 (PDT) Received: by 10.180.99.167 with HTTP; Mon, 12 Aug 2013 09:13:35 -0700 (PDT) Date: Mon, 12 Aug 2013 12:13:35 -0400 Message-ID: From: CDR Content-Type: multipart/alternative; boundary=047d7b5d86950e90c304e3c26804 X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -2.7 (--) X-Mailman-Approved-At: Mon, 12 Aug 2013 12:14:54 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.7 (--)

--047d7b5d86950e90c304e3c26804 Content-Type: text/plain; charset=ISO-8859-1

Dear Friends

The option -a 1 or -a 2 is not correct.
Suppose you have two comma separated value files with 10 records each, joined on the first column.
The second file has 6 records where there is no match on the joining column (first).
Let's call the files today.txt and yesterday.txt

If you use this command
"join -t, -1 1 -2 1 -a 2 -o 2.1 today.txt yesterday.txt"
it should give ONLY the non pairing lines from yesterday. Instead, it prints all the lines from yesterday.

I am copying here the contents of both files.

cat today.txt
2012054455,8624520081,6529,0,201
2012067075,2013106025,6214,0,201
2012087388,8623689800,6214,2,201
2012088887,8623689800,6214,0,201
2012120319,9739789996,392A,0,201
2012121177,9739789996,392A,0,201
2012122869,2013700000,6006,0,201
2012140209,2013700000,6006,0,201
2012143002,2012339982,6529,0,201
2012149116,2012339982,6529,0,201

cat yesterday.txt
2012067075,2019269533,6664,0,201
2012087388,2012320000,6006,0,201
2012088887,8624520081,6529,0,201
2012140209,9733360000,392A,0,201
2012204272,2019269533,6664,0,201
2012226151,2018209998,954F,0,201
2012299682,2018209998,954F,0,201
2012324322,9733360000,392A,0,201
2012334444,2017809469,6664,0,201
2012389608,2012320000,6006,0,201

--047d7b5d86950e90c304e3c26804 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
--047d7b5d86950e90c304e3c26804-- From unknown Mon Aug 18 11:25:38 2025 X-Loop: help-debbugs@gnu.org Subject: bug#15077: Clarification References: In-Reply-To: Resent-From: CDR Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Mon, 12 Aug 2013 18:33:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 15077 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: 15077@debbugs.gnu.org Received: via spool by 15077-submit@debbugs.gnu.org id=B15077.137633232423057 (code B ref 15077); Mon, 12 Aug 2013 18:33:01 +0000 Received: (at 15077) by debbugs.gnu.org; 12 Aug 2013 18:32:04 +0000 Received: from localhost ([127.0.0.1]:55206 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1V8wuS-0005zp-8a for submit@debbugs.gnu.org; Mon, 12 Aug 2013 14:32:04 -0400 Received: from mail-wi0-f180.google.com ([209.85.212.180]:57309) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1V8wuP-0005zK-Rt for 15077@debbugs.gnu.org; Mon, 12 Aug 2013 14:32:02 -0400 Received: by mail-wi0-f180.google.com with SMTP id f14so2071434wiw.1 for <15077@debbugs.gnu.org>; Mon, 12 Aug 2013 11:31:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=Rqe6OlO1ZW1FIKT2O3qm1vlseN1GdEwQvlvJNRQMz/I=; b=YvyHn28X2iui9hBlQU3+PYmZnZgdiok+D4R9Qw1C78On9jt+xUmfbJRXB9b+H5wrPR YjKxoLRR1dVxDdBGx/V4SYyqxplBXQ1nYt5A4ppzr3yE16uvfcUIFOGDe29028w1yTYn v7VUpR/SSAgNoixwRPV2EIm5PTwmUrPO9HocEY4M4OC0gAe9fwgJBVVf0mhFBnm97Qdl Ijt+VxdZgwwa/mqzvgsVbkqXmz7015PeLeG93BqBRJjE7Xrf6zsFxybi8rpNJQXeF6AV IS9Ryr9zkvSmM9xbpDKVAkzFG2MABT8PJALNmo5lDAp0lkdPVpqdK7r2k+vjqWXbUy5e jWOg== MIME-Version: 1.0 X-Received: by 10.180.198.79 with SMTP id ja15mr118374wic.36.1376332315961; Mon, 12 Aug 2013 11:31:55 -0700 (PDT) Received: by 10.180.99.167 with HTTP; Mon, 12 Aug 2013 11:31:55 -0700 (PDT) Date: Mon, 12 Aug 2013 14:31:55 -0400 Message-ID: From: CDR 
Content-Type: multipart/alternative; boundary=047d7b624252c92ef604e3c45606 X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

--047d7b624252c92ef604e3c45606 Content-Type: text/plain; charset=ISO-8859-1

I just found out that the "v" option does what I need. So in my opinion, the "a" option is useless, for it gives you no new information.
In terms of new functionality, the "-o" option, format, should allow to add arbitrary data, like ",A", "4", etc., in addition to the list of fields (2.1 1.1 etc.)

Yours

Federico Alves

--047d7b624252c92ef604e3c45606 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
--047d7b624252c92ef604e3c45606-- From unknown Mon Aug 18 11:25:38 2025 X-Loop: help-debbugs@gnu.org Subject: bug#15077: Clarification Resent-From: Assaf Gordon Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Tue, 13 Aug 2013 00:02:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 15077 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: CDR , 15077@debbugs.gnu.org Received: via spool by 15077-submit@debbugs.gnu.org id=B15077.137635207124737 (code B ref 15077); Tue, 13 Aug 2013 00:02:02 +0000 Received: (at 15077) by debbugs.gnu.org; 13 Aug 2013 00:01:11 +0000 Received: from localhost ([127.0.0.1]:55624 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1V922w-0006Qv-0J for submit@debbugs.gnu.org; Mon, 12 Aug 2013 20:01:10 -0400 Received: from mail-ie0-f181.google.com ([209.85.223.181]:58405) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1V922s-0006QJ-4m for 15077@debbugs.gnu.org; Mon, 12 Aug 2013 20:01:06 -0400 Received: by mail-ie0-f181.google.com with SMTP id x14so8823895ief.26 for <15077@debbugs.gnu.org>; Mon, 12 Aug 2013 17:01:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:subject:references :in-reply-to:content-type:content-transfer-encoding; bh=9BygAHEPRZj1VNhke5IKvjWswXBKydxlrxrKM2MVSe0=; b=S7kXIhkxL5D67gaW3u7iMexQJUg7nTMC8IGYqnPJEpBbZp1CXjfHhxBVV6VQxKu4WR f8InEwvvKnf3hh5r/J5PmwAAngO55qaNZ73rcOY1QRgJaJc6U/dtaEusY4KZt0C1Gp3/ cTy/KBHnIkVOWFnJfiFQZROw2/4GcN4e70ITE2B4TgioTgK0RLUO+7HdF9tWG544rh25 JBEv/JJBlnQ5Y+WKLCMD0qN+IHf7qlfPp4OaUzjlKkO8R/L9+N9SEFPA4FdlUWO11sbG Ff0JiWg7ZkJ+ufr8U5ZsMZVmEVHOG55jLuHjNi2SxK4L2gZoEV7lesUnMhQ1BT8R96yR Y5PQ== X-Received: by 10.50.1.78 with SMTP id 14mr842693igk.60.1376352059988; Mon, 12 Aug 2013 17:00:59 -0700 (PDT) Received: from [192.168.0.30] (S0106602ad08d3937.cg.shawcable.net. 
[184.64.154.197]) by mx.google.com with ESMTPSA id y9sm727802iga.9.2013.08.12.17.00.58 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 12 Aug 2013 17:00:59 -0700 (PDT) Message-ID: <52097812.6070509@gmail.com> Date: Mon, 12 Aug 2013 18:04:34 -0600 From: Assaf Gordon User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130630 Icedove/17.0.7 MIME-Version: 1.0 References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

Hello Federico Alves,

On 08/12/2013 12:31 PM, CDR wrote:
> I just found out that the "v" option does what I need. So in my opinion,
> the "a" option is useless, for it gives you no new information.

I'm glad to hear you found the combination of options that works for you.

I would humbly disagree that the "-a" option is useless - it simply does something different than what you need.
Especially when combined with the output specifier ("-o") - the output of "join" is indeed not what you wanted.

When used without "-o", the "-a1/2" options allow you to see which keys are common to both files and which keys are just in one file.
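To make the difference concrete, here is a minimal sketch contrasting "-a 2" with "-v 2" (the "v" option from the quoted message). The files a.txt and b.txt and their keys are invented for illustration and are not the files from the report; they are already sorted on the join field, which "join" expects.

```shell
#!/bin/sh
# Hypothetical sample data, not the files from the report.
tmp=$(mktemp -d)
printf '%s\n' 'k1,a' 'k2,b' 'k4,d' > "$tmp/a.txt"
printf '%s\n' 'k2,x' 'k3,y' 'k4,z' > "$tmp/b.txt"

# -a 2: the paired lines PLUS the unpairable lines from file 2.
with_a=$(join -t, -a 2 "$tmp/a.txt" "$tmp/b.txt")

# -v 2: ONLY the unpairable lines from file 2 (paired output is suppressed).
with_v=$(join -t, -v 2 "$tmp/a.txt" "$tmp/b.txt")

printf '%s\n' '-a 2:' "$with_a"
printf '%s\n' '-v 2:' "$with_v"
rm -rf "$tmp"
```

With this input, "-a 2" reports the two paired keys (k2, k4) together with k3, while "-v 2" reports k3 alone.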
Example:
The following "join" will show which lines are common (will have 9 fields) and which lines are only in the second file ("-a 2"):
---
$ join -t, -1 1 -2 1 -a 2 today.txt yesterday.txt
2012067075,2013106025,6214,0,201,2019269533,6664,0,201
2012087388,8623689800,6214,2,201,2012320000,6006,0,201
2012088887,8623689800,6214,0,201,8624520081,6529,0,201
2012140209,2013700000,6006,0,201,9733360000,392A,0,201
2012204272,2019269533,6664,0,201
2012226151,2018209998,954F,0,201
2012299682,2018209998,954F,0,201
2012324322,9733360000,392A,0,201
2012334444,2017809469,6664,0,201
2012389608,2012320000,6006,0,201
---

If you don't care about the other fields, and just want to see the keys, using "-o 0,1.1,2.1" will give:
---
$ join -t, -1 1 -2 1 -a 2 -o 0,1.1,2.1 today.txt yesterday.txt
2012067075,2012067075,2012067075
2012087388,2012087388,2012087388
2012088887,2012088887,2012088887
2012140209,2012140209,2012140209
2012204272,,2012204272
2012226151,,2012226151
2012299682,,2012299682
2012324322,,2012324322
2012334444,,2012334444
2012389608,,2012389608
---
This again quickly shows that lines with an empty second field exist only in the second file.

You can combine "-a 1" and "-a 2", to show all combinations of items in both files:
---
$ join -t, -1 1 -2 1 -a 1 -a 2 -o 0,1.1,2.1 today.txt yesterday.txt
2012054455,2012054455,
2012067075,2012067075,2012067075
2012087388,2012087388,2012087388
2012088887,2012088887,2012088887
2012120319,2012120319,
2012121177,2012121177,
2012122869,2012122869,
2012140209,2012140209,2012140209
2012143002,2012143002,
2012149116,2012149116,
2012204272,,2012204272
2012226151,,2012226151
2012299682,,2012299682
2012324322,,2012324322
2012334444,,2012334444
2012389608,,2012389608
---

In this example, all lines have three fields:
First field is the combined key, and is always non-empty.
Second field is non-empty if the key exists in the first file.
Third field is non-empty if the key exists in the second file.
(and thus, if both second and third fields are non-empty, the key is common to both files).

> In terms of new functionality, the "-o" option, format, should allow to add
> arbitrary data, like ",A", "4", etc., in addition to the list of fields
> (2.1 1.1 etc.)

I would suggest using a different program (perhaps awk or sed), downstream from the "join" program to add any additional information you need.

Consider combining it with "-o auto" (new in join version 8.10) that will maintain the column ordering of the combined input files, and will allow you to easily add information.

Example with "-a 1 -a 2" AND "-o auto":
---
$ join -t, -1 1 -2 1 -a 1 -a 2 -o auto today.txt yesterday.txt
2012054455,8624520081,6529,0,201,,,,
2012067075,2013106025,6214,0,201,2019269533,6664,0,201
2012087388,8623689800,6214,2,201,2012320000,6006,0,201
2012088887,8623689800,6214,0,201,8624520081,6529,0,201
2012120319,9739789996,392A,0,201,,,,
2012121177,9739789996,392A,0,201,,,,
2012122869,2013700000,6006,0,201,,,,
2012140209,2013700000,6006,0,201,9733360000,392A,0,201
2012143002,2012339982,6529,0,201,,,,
2012149116,2012339982,6529,0,201,,,,
2012204272,,,,,2019269533,6664,0,201
2012226151,,,,,2018209998,954F,0,201
2012299682,,,,,2018209998,954F,0,201
2012324322,,,,,9733360000,392A,0,201
2012334444,,,,,2017809469,6664,0,201
2012389608,,,,,2012320000,6006,0,201
---

In this example, all lines have nine fields, and are easy to parse:
1. The common key
2-5 - The four fields from the first file (possibly empty)
6-9 - The four fields from the second file (possibly empty).

Adding AWK on the output of "join" is now easy, because the fields are in fixed order.
For example, adding "AA" as a first field and "44" as the last field:
---
$ join -t, -1 1 -2 1 -a 1 -a 2 -o auto today.txt yesterday.txt |
    awk -F, -v OFS=, '{print "AA", $0, "44"}'
AA,2012054455,8624520081,6529,0,201,,,,,44
AA,2012067075,2013106025,6214,0,201,2019269533,6664,0,201,44
AA,2012087388,8623689800,6214,2,201,2012320000,6006,0,201,44
AA,2012088887,8623689800,6214,0,201,8624520081,6529,0,201,44
AA,2012120319,9739789996,392A,0,201,,,,,44
AA,2012121177,9739789996,392A,0,201,,,,,44
AA,2012122869,2013700000,6006,0,201,,,,,44
AA,2012140209,2013700000,6006,0,201,9733360000,392A,0,201,44
AA,2012143002,2012339982,6529,0,201,,,,,44
AA,2012149116,2012339982,6529,0,201,,,,,44
AA,2012204272,,,,,2019269533,6664,0,201,44
AA,2012226151,,,,,2018209998,954F,0,201,44
AA,2012299682,,,,,2018209998,954F,0,201,44
AA,2012324322,,,,,9733360000,392A,0,201,44
AA,2012334444,,,,,2017809469,6664,0,201,44
AA,2012389608,,,,,2012320000,6006,0,201,44
---

Or something a little more informative:
---
$ join -t, -1 1 -2 1 -a 1 -a 2 -o auto today.txt yesterday.txt |
    awk -F, -v OFS=, '
      $2=="" && $6!="" { print $1, "Yesterday" }
      $2!="" && $6=="" { print $1, "Today" }
      $2!="" && $6!="" { print $1, "Both" }'
2012054455,Today
2012067075,Both
2012087388,Both
2012088887,Both
2012120319,Today
2012121177,Today
2012122869,Today
2012140209,Both
2012143002,Today
2012149116,Today
2012204272,Yesterday
2012226151,Yesterday
2012299682,Yesterday
2012324322,Yesterday
2012334444,Yesterday
2012389608,Yesterday
---

Hope this helps,
 -gordon

From unknown Mon Aug 18 11:25:38 2025 X-Loop: help-debbugs@gnu.org Subject: bug#15077: Clarification Resent-From: Assaf Gordon Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Tue, 13 Aug 2013 03:00:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 15077 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: CDR Cc: 15077@debbugs.gnu.org, Coreutils Received: via spool by 15077-submit@debbugs.gnu.org
id=B15077.137636274710757 (code B ref 15077); Tue, 13 Aug 2013 03:00:02 +0000 Received: (at 15077) by debbugs.gnu.org; 13 Aug 2013 02:59:07 +0000 Received: from localhost ([127.0.0.1]:55822 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1V94p9-0002nR-DK for submit@debbugs.gnu.org; Mon, 12 Aug 2013 22:59:07 -0400 Received: from mail-ie0-f173.google.com ([209.85.223.173]:56205) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1V94p6-0002mt-Cq for 15077@debbugs.gnu.org; Mon, 12 Aug 2013 22:59:05 -0400 Received: by mail-ie0-f173.google.com with SMTP id k5so9497866iea.4 for <15077@debbugs.gnu.org>; Mon, 12 Aug 2013 19:58:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=zSpD5HStQcaE8JtZBXpshHiHntnohM+MBXxVyBuz3iw=; b=lw8P+bIX8D+LIHKl2VIuN+otdUvfD674w/Qfl/imvJcxKMdSukyncUEXTzguECk+iB XGWVd+l7FCTdGs6KiTtyjYYCrodbIBfOTMI3gRQNO6seS71LNUa5rMZLhkgOIbJLv6NQ IFUkA4b+xfEG7Y9JPKB0ZCx07ma4iwVxu32V9oL4O3Cr4P3LBRpn5pEfQnD0Kc6NDZ15 MY9U3udUFchxpNpqM1V2urcEikT7BW3bxxprJ+TTZmDG7lZF1yDG4m2G0rOaV35nyzsx 1nUuyzVZYa2fRkfFq3fSdvbMKRSufrb0HEvG8brerwsu6XSTjnAd7RLSCbvRUGs+Zjdr lEHA== X-Received: by 10.50.39.84 with SMTP id n20mr1298814igk.14.1376362738587; Mon, 12 Aug 2013 19:58:58 -0700 (PDT) Received: from [10.255.134.204] ([173.183.194.88]) by mx.google.com with ESMTPSA id ir7sm1649388igb.8.2013.08.12.19.58.56 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 12 Aug 2013 19:58:57 -0700 (PDT) Message-ID: <5209A1C8.30900@gmail.com> Date: Mon, 12 Aug 2013 21:02:32 -0600 From: Assaf Gordon User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130630 Icedove/17.0.7 MIME-Version: 1.0 References: <52097812.6070509@gmail.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -0.7 (/) 
X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

(CC'ing the list so that others could comment)

Hello Federico,

On 08/12/2013 06:50 PM, CDR wrote:
> How do I get latest, latest version, even beta, of join, sort, etc?

I would not recommend using "beta" or "development" versions of GNU coreutils for production code, just to be on the safe side.

The stable releases are available as source code here:
http://ftp.gnu.org/gnu/coreutils/
With more details here:
http://www.gnu.org/software/coreutils/

> One thing that I suggest is to change sort, comm and join to use more
> than one core. I had to use a commercial version of sort because the
> "regular" version takes forever to sort a 15G file. The commercial
> version is called nsort and it uses all the cores in the machines and
> also you may add a flag to give the program a huge memory block. It
> works like ten times faster than the "regular" sort.

Starting with sort version 8.6, sort can use multiple cores to improve sorting speed (see the "--parallel" parameter).
Sort also supports the "--buffer-size" parameter to explicitly specify how much memory to use.

I'm not familiar with "nsort" and cannot comment on nsort vs GNU sort's speeds. I believe that on modern hardware, sorting 15G should take a few minutes at most, not "forever" - but that depends on many factors (e.g. cores, memory, disk, etc.).

"join" operates on sorted input, and as such, requires very little CPU and memory. I do not think much can be gained from making "join" multi-threaded. I believe the same applies to "comm".

> I am using "comm" a lot for business problem that involves comparing
> daily files that have 550 MM records. I find it extremely slow. Do
> you have any suggestions?
>

Others could perhaps comment on ways to improve performance when using GNU coreutils.
I'd assume it very much depends on the technical details of what you're comparing - perhaps there are ways to improve the workflow.
The first step is usually to isolate the real bottleneck (e.g. CPU, Memory, Disk speed, Algorithm, etc.)

regards,
 -gordon

From debbugs-submit-bounces@debbugs.gnu.org Thu Oct 11 17:37:07 2018 Received: (at control) by debbugs.gnu.org; 11 Oct 2018 21:37:07 +0000 Received: from localhost ([127.0.0.1]:45626 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gAidm-0005cN-Uj for submit@debbugs.gnu.org; Thu, 11 Oct 2018 17:37:07 -0400 Received: from mail-it1-f181.google.com ([209.85.166.181]:36979) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gAidm-0005bu-8t for control@debbugs.gnu.org; Thu, 11 Oct 2018 17:37:06 -0400 Received: by mail-it1-f181.google.com with SMTP id e74-v6so15446416ita.2 for ; Thu, 11 Oct 2018 14:37:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=to:from:message-id:date:user-agent:mime-version:content-language :content-transfer-encoding; bh=s2nrocY+sVknyfjQC6DgqfaCBvEnaRhPYDHJ56PMoo4=; b=Fggi+eXuVoP+171HN846MN4N4E4L4rm2J0qKBP/EAEeOps5a7452RSHf/PcUV5IgCI ZkEcFoiGK3yFgpAvRCCx3gxDfFJrTHjpQWEQHaqvxdFo3fR15Z46W5XW9tKlZqUtz6hA lfwFyGF9KfvS8rPspy0/xMOQSQZvf919PZdMCMWEBb2sUTl0bmN7dBnkcrxK5/sqmJrY wa60cvCxWK88/XWOcUKHkdBDZwjErjHihG5JpgyNAIToahyB1e31E85uiuTo18VyBv79 UQcESLx4LrkbxcIgWsjR6RMP7n4HUDYeWkrkXvuuBn2ipcqPMxQtoWTqtju/Nq2AWMNw RIOQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:to:from:message-id:date:user-agent:mime-version :content-language:content-transfer-encoding; bh=s2nrocY+sVknyfjQC6DgqfaCBvEnaRhPYDHJ56PMoo4=; b=DeGK9ua8ptjPefJMH83xFxCBKy/ISPBEIOLz3jD4cKNzKBFxVGDUJCcTSSak+mqHQJ B3X0Y4IdpNFlau2SIe3WCbKViPTjCovzLtZrHFoQSx1rwluAML9MnCFBkxkuMkuy9Uf3
5SEBF8kuRSKal2ZioZd73BNeHAbutqNVt3RMjaRQekQDLyggTemYOyFV4XyJr+kLLYcb WqHD6KHsX7BvJadtZ5LlRts+HV9ckblGaQAUorVYnIAEo8TKgPG9HWWZ9Jm1Rim++34p 3/5rhpmieDNUZD047nQrwYHj1/VYVF9AiwGul6xD/bVesd+D/xrapQVx/heJGOyG+ggG E5HA== X-Gm-Message-State: ABuFfogdVK5opAAuOzaPLO+7tzo5BZ1gLw1m5qrJxv9tpOitGXDx/6oD uDfnnJGVebYpx+0znRZqhcfF1GhB X-Google-Smtp-Source: ACcGV61uS+P9sPlkKa07mCcMgykgCIVw4+h3cp5aZD+bxcTaOd9fA7BAAlKiQz1ykGiVNe5bFID4nQ== X-Received: by 2002:a24:5a50:: with SMTP id v77-v6mr2753651ita.172.1539293820423; Thu, 11 Oct 2018 14:37:00 -0700 (PDT) Received: from tomato.housegordon.com (moose.housegordon.com. [184.68.105.38]) by smtp.googlemail.com with ESMTPSA id e19-v6sm11655871ioc.10.2018.10.11.14.36.58 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 11 Oct 2018 14:36:58 -0700 (PDT) To: control@debbugs.gnu.org From: Assaf Gordon Message-ID: <2d8cab49-9bd1-b27f-0b2a-d89fe57e6e4b@gmail.com> Date: Thu, 11 Oct 2018 15:36:57 -0600 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Spam-Score: 2.0 (++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: tags 15077 notabug close 15077 stop [...] 
Content analysis details: (2.0 points, 10.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
-0.0 SPF_PASS               SPF: sender matches SPF record
 0.0 FREEMAIL_FROM          Sender email is commonly abused enduser mail provider
                            (assafgordon[at]gmail.com)
-0.0 RCVD_IN_MSPIKE_H2      RBL: Average reputation (+2)
                            [209.85.166.181 listed in wl.mailspike.net]
-0.0 RCVD_IN_DNSWL_NONE     RBL: Sender listed at http://www.dnswl.org/, no trust
                            [209.85.166.181 listed in list.dnswl.org]
 1.8 MISSING_SUBJECT        Missing Subject: header
 0.2 NO_SUBJECT             Extra score for no subject

X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+)

tags 15077 notabug
close 15077
stop
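
The performance advice earlier in the thread (sort once with "--parallel" and "--buffer-size", then let "comm" or "join" stream over the sorted files) can be sketched roughly as follows. The file names, the thread count (2) and the buffer size (64M) are placeholders chosen for illustration, not values recommended in the thread, and the tiny inputs stand in for large daily files. Both options are GNU sort extensions.

```shell
#!/bin/sh
# Hypothetical sample data standing in for two large daily files.
tmp=$(mktemp -d)
printf '%s\n' 'c' 'a' 'b' > "$tmp/today.raw"
printf '%s\n' 'd' 'b' 'c' > "$tmp/yesterday.raw"

# Use one collation order so sort and comm agree on line ordering.
LC_ALL=C
export LC_ALL

# --parallel and --buffer-size are GNU sort options; 2 and 64M are placeholders.
sort --parallel=2 --buffer-size=64M -o "$tmp/today.sorted" "$tmp/today.raw"
sort --parallel=2 --buffer-size=64M -o "$tmp/yesterday.sorted" "$tmp/yesterday.raw"

# comm -13 suppresses columns 1 and 3, leaving lines found only in the second file.
only_yesterday=$(comm -13 "$tmp/today.sorted" "$tmp/yesterday.sorted")
printf '%s\n' "$only_yesterday"
rm -rf "$tmp"
```

Setting LC_ALL=C for both steps is a design choice in this sketch: it keeps the two tools on the same byte-wise ordering, which avoids "input is not in sorted order" complaints caused by mismatched locales.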