From unknown Sat Sep 13 11:13:16 2025
Subject: bug#42340: "join" reports that "sort"ed input is not sorted
From: Beth Andres-Beck
Date: Sun, 12 Jul 2020 16:57:41 -0700
To: 42340@debbugs.gnu.org
X-Debbugs-Original-To: bug-coreutils@gnu.org
X-GNU-PR-Message: report 42340
X-GNU-PR-Package: coreutils
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="00000000000095a3a605aa47579d"

--00000000000095a3a605aa47579d
Content-Type: text/plain; charset="UTF-8"

In trying to use `join` with `sort` I discovered odd behavior: even after
running a file through `sort` using the same delimiter, `join` would still
complain that it was out of order.

The field I am sorting on is IP addresses, which means that depending on
which digits are zero they can be of different lengths, and the fields
include periods as well as alpha-numeric characters.

Here is a way to reproduce the problem:

> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
> join -t, a.txt b.txt
  join: b.txt:2: is not sorted: 1.1.1,b

The expected behavior would be that if a file has been sorted by "sort" it
will also be considered sorted by join.

---
I traced this back to what I believe to be a bug in sort.c when sorting on
a field other than the last field, where the original pointer is being
incremented one further than it ought to be.

On line 1675 it will always increment the pointer one position beyond the
delimiter unless the field is the last field. If both `eword` and `echar`
are 0 we incremented `eword` on line 1661.

Later when we use keylim (where the limfield value is stored) to calculate
the length of the field, it will include the delimiter in the comparison.
We can illustrate that the problem is including the delimiter because the
following case runs correctly without error:

> printf '1.1.1Z2\n1.1.12Z2\n1.1.2Z1' | sort -tZ > a.txt
> printf '1.1.12Za\n1.1.1Zb\n1.1.21Zc' | sort -tZ > b.txt
> join -tZ a.txt b.txt

In join.c, by contrast, we are comparing the contents of the keys without
the delimiter (on join.c:283 we call extract_field with `ptr` pointing to
the start of the key and len defined as `sep - ptr`, where `sep` is the
position of the tab character).

Cases illustrating the bug in sort:

> printf '12,\n1,\n' | sort -t, -k1
1,
12,

> printf '12,a\n1,a\n' | sort -t, -k1
12,a
1,a

Thank you,
Beth Andres-Beck

--00000000000095a3a605aa47579d
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--00000000000095a3a605aa47579d--

From unknown Sat Sep 13 11:13:16 2025
Subject: bug#42340: "join" reports that "sort"ed input is not sorted
From: Assaf Gordon
Date: Mon, 13 Jul 2020 00:58:32 -0600
To: Beth Andres-Beck, 42340@debbugs.gnu.org
X-GNU-PR-Message: followup 42340
X-GNU-PR-Package: coreutils
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit

tags 42340 notabug
close 42340
stop

Hello,

On 2020-07-12 5:57 p.m., Beth Andres-Beck wrote:
> In trying to use `join` with `sort` I discovered odd behavior: even after
> running a file through `sort` using the same delimiter, `join` would still
> complain that it was out of order.
[...]
> Here is a way to reproduce the problem:
>
>> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
>> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
>> join -t, a.txt b.txt
>   join: b.txt:2: is not sorted: 1.1.1,b
>
> The expected behavior would be that if a file has been sorted by "sort" it
> will also be considered sorted by join.
[...]
> I traced this back to what I believe to be a bug in sort.c

This is not a bug in sort or join, just a side-effect of your system's
locale on the sorting results.

By forcing a C locale with "LC_ALL=C" (meaning simple ASCII order),
the files are ordered in the same way 'join' expected them to be:

   $ printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | LC_ALL=C sort -t, > a.txt
   $ printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | LC_ALL=C sort -t, > b.txt
   $ join -t, a.txt b.txt
   1.1.1,2,b
   1.1.12,2,a

---

More details:
I'm going to assume your system uses some locale based on UTF-8.
You can check it by running 'locale', e.g. on my system:

    $ locale
    LANG=en_CA.utf8
    LANGUAGE=en_CA:en
    LC_CTYPE="en_CA.utf8"
    ..
    ..

Under most UTF-8 locales, punctuation characters are *ignored* in the
compared input lines. This might be confusing and non-intuitive, but
that's the way most systems have been working for many years (locale
ordering is defined in the GNU C Library, and coreutils has no way to
change it).

Observe the following:

    $ printf '12,a\n1,b\n' | LC_ALL=en_CA.utf8 sort
    12,a
    1,b

    $ printf '12,a\n1,b\n' | LC_ALL=C sort
    1,b
    12,a

With a UTF-8 locale, the comma character is ignored, and then "12a"
appears before "1b" (since the character '2' comes before the character
'b').

With "C" locale, forcing ASCII or "byte comparison", punctuation
characters are not ignored, and "1,b" appears before "12,a" (because
the ASCII value of the comma ',' is 44, which is smaller than the ASCII
value of the digit '2', which is 50).
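
The byte-order argument above can be checked with a short sketch. It assumes
a POSIX shell with `sort`, `join` and `mktemp` available; the scratch
directory and the variable names (`comma`, `two`, `joined`) are only
illustrative. It pins LC_ALL=C on both `sort` and `join` so that both tools
use the same ordering, and sorts on the join field only (-k1,1):

```shell
# POSIX printf: a leading quote in the argument yields the character's code.
comma=$(printf '%d' "',")
two=$(printf '%d' "'2")
echo "comma=$comma two=$two"

# Sort both inputs on field 1 only, in the C locale, then join in the C locale.
tmp=$(mktemp -d)
printf '1.1.1,2\n1.1.12,2\n1.1.2,1\n' | LC_ALL=C sort -t, -k1,1 > "$tmp/a.txt"
printf '1.1.12,a\n1.1.1,b\n1.1.21,c\n' | LC_ALL=C sort -t, -k1,1 > "$tmp/b.txt"
joined=$(LC_ALL=C join -t, "$tmp/a.txt" "$tmp/b.txt")
echo "$joined"
rm -r "$tmp"
```

Using the same locale setting for every tool in the pipeline is the design
choice that matters here; the C locale is just the simplest one that is
always available.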
---

Somewhat related:
Your sort command defines the delimiter ("-t,") but does not define
which columns to sort by; sort then uses the entire input line - and
there's no need to specify delimiter at all.

---

As such, I'm closing this as "not a bug", but discussion can continue by
replying to this thread.

regards,
  - assaf

From unknown Sat Sep 13 11:13:16 2025
Subject: bug#42340: Fwd: bug#42340: "join" reports that "sort"ed input is not sorted
From: Beth Andres-Beck
Date: Wed, 15 Jul 2020 13:12:24 -0700
To: 42340@debbugs.gnu.org
X-GNU-PR-Message: followup 42340
X-GNU-PR-Package: coreutils
X-GNU-PR-Keywords: notabug
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="00000000000072fdae05aa808b83"

--00000000000072fdae05aa808b83
Content-Type: text/plain; charset="UTF-8"

If that is the intended behavior, the bug is that:

> printf '12,\n1,\n' | sort -t, -k1 -s
1,
12,

does _not_ take the remainder of the line into account, and only sorts on
the initial field, prioritizing length.

It is at the very least unexpected that adding an `a` to the end of both
lines would change the sort order of those lines:

> printf '12,a\n1,a\n' | sort -t, -k1 -s
12,a
1,a

On Sun, Jul 12, 2020 at 11:58 PM Assaf Gordon wrote:

> tags 42340 notabug
> close 42340
> stop
>
> Hello,
>
> On 2020-07-12 5:57 p.m., Beth Andres-Beck wrote:
> > In trying to use `join` with `sort` I discovered odd behavior: even after
> > running a file through `sort` using the same delimiter, `join` would still
> > complain that it was out of order.
> [...]
> > Here is a way to reproduce the problem:
> >
> >> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
> >> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
> >> join -t, a.txt b.txt
> >   join: b.txt:2: is not sorted: 1.1.1,b
> >
> > The expected behavior would be that if a file has been sorted by "sort" it
> > will also be considered sorted by join.
> [...]
> > I traced this back to what I believe to be a bug in sort.c
>
> This is not a bug in sort or join, just a side-effect of the locale on
> your system on the sorting results.
>
> By forcing a C locale with "LC_ALL=C" (meaning simple ASCII order),
> the files are ordered in the same way 'join' expected them to be:
>
>   $ printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | LC_ALL=C sort -t, > a.txt
>   $ printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | LC_ALL=C sort -t, > b.txt
>   $ join -t, a.txt b.txt
>   1.1.1,2,b
>   1.1.12,2,a
>
> ---
>
> More details:
> I'm going to assume your system uses some locale based on UTF-8.
> You can check it by running 'locale', e.g. on my system:
>    $ locale
>    LANG=en_CA.utf8
>    LANGUAGE=en_CA:en
>    LC_CTYPE="en_CA.utf8"
>    ..
>    ..
>
> Under most UTF-8 locales, punctuation characters are *ignored* in the
> compared input lines. This might be confusing and non-intuitive, but
> that's the way most systems have been working for many years (locale
> ordering is defined in the GNU C Library, and coreutils has no way to
> change it).
>
> Observe the following:
>
>    $ printf '12,a\n1,b\n' | LC_ALL=en_CA.utf8 sort
>    12,a
>    1,b
>
>    $ printf '12,a\n1,b\n' | LC_ALL=C sort
>    1,b
>    12,a
>
> With a UTF-8 locale, the comma character is ignored, and then "12a"
> appears before "1b" (since the character '2' comes before the character
> 'b').
>
> With "C" locale, forcing ASCII or "byte comparison", punctuation
> characters are not ignored, and "1,b" appears before "12,a" (because
> the comma ',' ASCII value is 44, which is smaller than the ASCII value
> of the digit '2').
>
> ---
>
> Somewhat related:
> Your sort command defines the delimiter ("-t,") but does not define
> which columns to sort by; sort then uses the entire input line - and
> there's no need to specify delimiter at all.
>
> ---
>
> As such, I'm closing this as "not a bug", but discussion can continue by
> replying to this thread.
>
> regards,
>   - assaf

--00000000000072fdae05aa808b83
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--00000000000072fdae05aa808b83--

From unknown Sat Sep 13 11:13:16 2025
Subject: bug#42340: Fwd: bug#42340: "join" reports that "sort"ed input is not sorted
From: Assaf Gordon
Message-ID: <4014c9bb-44cb-28a7-7858-551790f20bcd@gmail.com>
Date: Wed, 15 Jul 2020 18:38:25 -0600
To: Beth Andres-Beck, 42340@debbugs.gnu.org
X-GNU-PR-Message: followup 42340
X-GNU-PR-Package: coreutils
X-GNU-PR-Keywords: notabug
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit

Hello,

On 2020-07-15 2:12 p.m., Beth Andres-Beck wrote:
> If that is the intended behavior, the bug is that:
>> printf '12,\n1,\n' | sort -t, -k1 -s
> 1,
> 12,
>
> does _not_ take the remainder of the line into account, and only sorts on
> the initial field, prioritizing length.
>
> It is at the very least unexpected that adding an `a` to the end of both
> lines would change the sort order of those lines:
>> printf '12,a\n1,a\n' | sort -t, -k1 -s
> 12,a
> 1,a
>

Not a bug, just an incomplete usage :)

sort's -k/--key parameter takes two values (the second being optional):
the first and last column to use as the key.
If the second value is omitted (as in your case), then the key runs from
the starting field to the end of the line.

And so:
"sort -k1,1" means take the first *and only the first* field as the key.
"sort -k1" means take the first field until the end of the line as the key.
"sort -k1,3" means take the first, second and third fields as the single key.
"sort -k1,1 -k2,2 -k3,3" means take the first field as the first key,
second field as the second key, and third field as the third key.

---

The "--debug" option can help illustrate what sort is doing,
by adding underscore characters to show which characters are being
used as keys in each line.

Consider the following:

   $ printf '12,\n1,\n' | sort -t, -k1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,
   __
   12,
   ___

   $ printf '12,\n1,\n' | sort -t, -k1,1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,
   _
   12,
   __

In the first example, "-k1" means from the first field till the end of
the line, so the underscores include the "," character.
In the second example, "-k1,1" means only the first field, and the
comma is not used.

Now consider your second case of adding an "a" at the end of each line:

   $ printf '12,a\n1,a\n' | sort -t, -k1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   12,a
   ____
   1,a
   ___

   $ printf '12,a\n1,a\n' | sort -t, -k1,1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,a
   _
   12,a
   __

In the first example, "-k1" means: from first field until the end of
the line, and so the entire string "12,a" is compared against "1,a".
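
The difference between "-k1" and "-k1,1" can also be seen without any
UTF-8 locale installed. The following is a minimal sketch, assuming a POSIX
shell and a `sort` that supports -t and -k; it uses ':' as a made-up
delimiter because in byte order ':' (58) sorts after the digits, so including
the delimiter in the key visibly changes the result. The variable names are
only illustrative:

```shell
# Key runs from field 1 to end of line, so ':' takes part in the comparison:
# "12:a" vs "1:a" differ at the second byte, '2' (50) vs ':' (58).
whole_line_key=$(printf '12:a\n1:a\n' | LC_ALL=C sort -t: -k1)

# Key is field 1 only: "12" vs "1", and the shorter prefix sorts first.
first_field_key=$(printf '12:a\n1:a\n' | LC_ALL=C sort -t: -k1,1)

echo "with -k1:"
echo "$whole_line_key"
echo "with -k1,1:"
echo "$first_field_key"
```

With "-k1" the line "12:a" comes first; with "-k1,1" the line "1:a" comes
first, which is the ordering `join` expects for its join field.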

**AND**, because the locale is a "utf-8" locale, punctuation characters
are ignored (as mentioned in the previous email in this thread).
So in the "-k1" case the strings effectively compared are "12a" vs "1a".
The character "2" sorts before the character "a", and therefore
"12a" appears before "1a".

If we force C locale, then the order is reversed:

   $ printf '12,a\n1,a\n' | LC_ALL=C sort -t, -k1 -s --debug
   sort: using simple byte comparison
   1,a
   ___
   12,a
   ____

Because now punctuation characters are used, and the ASCII value of ","
(44) is smaller than the ASCII value of "2" (50).

**HOWEVER**, this result of using "LC_ALL=C" together with "-k1" is only
correct by a happy accident :)
It is still very likely that "-k1" is not what you wanted - you probably
meant to do "-k1,1".

---

Lastly, the "-s/--stable" option in the above contrived examples is
superfluous - it doesn't affect the output order because there are no
equal field values (i.e. "1" vs "12").

A slightly better example to illustrate how "-s" affects ordering is this:

   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1
   1,a
   2,b
   2,x

   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 -s
   1,a
   2,x
   2,b

Here, "1" comes before "2" - that's obvious.
But should "2,b" come before "2,x"?

If we do not use "-s/--stable", then "sort" ALSO does one additional
comparison of the entire line as a last step (hence "sort --help" says
"[disable] last-resort comparison" about "-s/--stable").
The substring ",b" comes before ",x" - therefore "2,b" appears first.

If we add "-s/--stable", the last comparison step of the entire line is
skipped, and the lines of "2" appear in the order they were in the input
(hence - "stable").
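
The same last-resort comparison also shows up when checking whether a file
is already sorted. The sketch below assumes GNU sort's "-c" (check) mode
and a POSIX shell; the sample data and variable names are made up for
illustration. The input is ordered by the first field only, which is all
that `join` looks at:

```shell
# Ordered by field 1 only; the two "2" lines are in input order, not line order.
data='1,a
2,x
2,b'

# Without -s the check also applies the whole-line comparison on equal keys,
# so "2,x" followed by "2,b" is expected to be reported as disorder.
if printf '%s\n' "$data" | LC_ALL=C sort -t, -k1,1 -c 2>/dev/null; then
  plain_check=sorted
else
  plain_check=disorder
fi

# With -s only the key (field 1) is compared.
if printf '%s\n' "$data" | LC_ALL=C sort -t, -k1,1 -s -c 2>/dev/null; then
  stable_check=sorted
else
  stable_check=disorder
fi

echo "without -s: $plain_check"
echo "with -s:    $stable_check"
```

So when using "sort -c" as a pre-flight check for `join`, adding "-s"
keeps the check limited to the join field.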

By using "--debug" we can see the additional comparison step (indicated
by additional underscore lines):

   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,a
   _
   ___
   2,b
   _
   ___
   2,x
   _
   ___

   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,a
   _
   2,x
   _
   2,b
   _

---

Hope this helps.

regards,
  - assaf