From unknown Wed Jun 25 00:21:46 2025 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.509 (Entity 5.509) Content-Type: text/plain; charset=utf-8 From: bug#12783 <12783@debbugs.gnu.org> To: bug#12783 <12783@debbugs.gnu.org> Subject: Status: info for sort has an illogical example Reply-To: bug#12783 <12783@debbugs.gnu.org> Date: Wed, 25 Jun 2025 07:21:46 +0000 retitle 12783 info for sort has an illogical example reassign 12783 coreutils submitter 12783 "Kevin O'Gorman" severity 12783 normal tag 12783 moreinfo thanks From debbugs-submit-bounces@debbugs.gnu.org Thu Nov 01 20:14:15 2012 Received: (at submit) by debbugs.gnu.org; 2 Nov 2012 00:14:15 +0000 Received: from localhost ([127.0.0.1]:43678 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TU4tq-0002TP-U7 for submit@debbugs.gnu.org; Thu, 01 Nov 2012 20:14:15 -0400 Received: from eggs.gnu.org ([208.118.235.92]:40390) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TU3S6-0000QX-Hz for submit@debbugs.gnu.org; Thu, 01 Nov 2012 18:41:31 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1TU3PQ-0003e7-M0 for submit@debbugs.gnu.org; Thu, 01 Nov 2012 18:38:45 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM, HTML_MESSAGE,RCVD_IN_DNSWL_HI,T_DKIM_INVALID autolearn=unavailable version=3.3.2 Received: from lists.gnu.org ([208.118.235.17]:39436) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1TU3PQ-0003e2-I9 for submit@debbugs.gnu.org; Thu, 01 Nov 2012 18:38:44 -0400 Received: from eggs.gnu.org ([208.118.235.92]:58594) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1TU3PP-0001Kg-Fh for bug-coreutils@gnu.org; Thu, 01 Nov 2012 18:38:44 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) 
id 1TU3PO-0003dX-DP for bug-coreutils@gnu.org; Thu, 01 Nov 2012 18:38:43 -0400 Received: from mail-bk0-f41.google.com ([209.85.214.41]:62359) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1TU3PO-0003dP-6Y for bug-coreutils@gnu.org; Thu, 01 Nov 2012 18:38:42 -0400 Received: by mail-bk0-f41.google.com with SMTP id jm1so1273259bkc.0 for ; Thu, 01 Nov 2012 15:38:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=+UTjOxLlCkmYvnXYtdXr4Hj7fFAHQOvj6DrNeXEfH+8=; b=AQYKZCB0fdcoYtFvdVu3ds4ILcc8+zrQoeVegmGgiu87t6h8ABspXIIGA/hUIgz5UM xTvDQOJoGtH5HORkNv2wHEpeqjAA5iJuH2c9XP0nTJ3i/Ld7UPkB9etaDYygJrUVbuZA wmKFecV7j3WXnTNA2G1jKLpiw0qBzshcYaVT7DS1PSfFmACwA+8QyGU6yfMxfr5VO6HT HPbemeVBWsd1f1JzqL599WSKL24RnqK+BVRLygtABOvnO//0kAujVFNtFlZZwT10qffX OBDZOOXkvPAtOa2ZkQHb1zjUN8RAoRayQue3Hf4ntQTfraWbSY/J2kSC/yRqT7bFMF0y X6RQ== MIME-Version: 1.0 Received: by 10.204.131.75 with SMTP id w11mr12936712bks.111.1351809520608; Thu, 01 Nov 2012 15:38:40 -0700 (PDT) Received: by 10.204.41.72 with HTTP; Thu, 1 Nov 2012 15:38:40 -0700 (PDT) Date: Thu, 1 Nov 2012 15:38:40 -0700 Message-ID: Subject: info for sort has an illogical example From: "Kevin O'Gorman" To: bug-coreutils@gnu.org Content-Type: multipart/alternative; boundary=00151747c0c247a7b204cd76aefa X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x [fuzzy] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 208.118.235.17 X-Spam-Score: -3.4 (---) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Thu, 01 Nov 2012 20:14:14 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -3.4 (---) --00151747c0c247a7b204cd76aefa Content-Type: text/plain; charset=UTF-8 This is 
just a quibble, but I really want to understand locales and sort.
In the footnotes of the info page for sort invocation, I find the
following (reformatted and numbered):

A. In that case, set the `LC_ALL' environment variable to `C'.
B. Note that setting only `LC_COLLATE' has two problems.
B1. First, it is ineffective if `LC_ALL' is also set.
B2. Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
    `LC_CTYPE' is unset) is set to an incompatible value.
B2x. For example, you get undefined behavior if `LC_CTYPE' is
     `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.

The example in B2x is illogical, since A and B together mean we're
setting LC_COLLATE to C, not some random value like en_US.UTF-8.
I want to know whether LC_COLLATE=C can be messed up by an LC_CTYPE
setting, or by anything besides LC_ALL. I'm writing software that will
use sort extensively in unknown environments, and I'd like to keep all
adjustments as localized as possible. So far, setting the collating
sequence to POSIX is all that I need; no other locale adjustments.

--
Kevin O'Gorman

programmer, n. an organism that transmutes caffeine into software.
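
The "localized adjustment" asked about in this message can be sketched as follows. This is only an illustration under stated assumptions: a `sort` program is on the PATH, the input is plain ASCII, and the helper name `sort_bytewise` is hypothetical. It does not settle the open question of whether an unusual LC_CTYPE can interfere.

```python
import os
import subprocess


def sort_bytewise(lines):
    """Sort lines with the external `sort`, overriding only the collation.

    Hypothetical helper for illustration; assumes `sort` is on the PATH.
    """
    env = dict(os.environ)
    # LC_ALL takes precedence over LC_COLLATE, so it has to be removed.
    # Note that this also removes whatever else LC_ALL was forcing.
    env.pop("LC_ALL", None)
    # Leave LANG, LC_MESSAGES and the other categories untouched.
    env["LC_COLLATE"] = "C"
    result = subprocess.run(
        ["sort"],
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return result.stdout.splitlines()


print(sort_bytewise(["b", "A", "a", "B"]))
```

Only the child process sees the changed environment, so the calling program's own locale settings are left alone.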
--00151747c0c247a7b204cd76aefa-- From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 02 14:10:25 2012 Received: (at 12783) by debbugs.gnu.org; 2 Nov 2012 18:10:25 +0000 Received: from localhost ([127.0.0.1]:45928 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TULhI-0005Lk-Sg for submit@debbugs.gnu.org; Fri, 02 Nov 2012 14:10:25 -0400 Received: from joseki.proulx.com ([216.17.153.58]:52693) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TULhD-0005LZ-MV for 12783@debbugs.gnu.org; Fri, 02 Nov 2012 14:10:23 -0400 Received: from hysteria.proulx.com (hysteria.proulx.com [192.168.230.119]) by joseki.proulx.com (Postfix) with ESMTP id 41845211D5; Fri, 2 Nov 2012 12:07:29 -0600 (MDT) Received: by hysteria.proulx.com (Postfix, from userid 1000) id E17C22DCD7; Fri, 2 Nov 2012 12:07:28 -0600 (MDT) Date: Fri, 2 Nov 2012 12:07:28 -0600 From: Bob Proulx To: Kevin O'Gorman Subject: Re: bug#12783: info for sort has an illogical example Message-ID: <20121102180728.GA2810@hysteria.proulx.com> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.1 (/) X-Debbugs-Envelope-To: 12783 Cc: 12783@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.1 (/)

Kevin O'Gorman wrote:
> (reformatted and numbered)
> A. In that case, set the `LC_ALL' environment variable to `C'.
> B. Note that setting only `LC_COLLATE' has two problems.
> B1. First, it is ineffective if `LC_ALL' is also set.
> B2. Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
> `LC_CTYPE' is unset) is set to an incompatible value.
> B2x. For example, you get undefined behavior if `LC_CTYPE' is `ja_JP.PCK'
> but `LC_COLLATE' is `en_US.UTF-8'.
>
> The example in B2x is illogical since A and B together mean we're setting
> LC_COLLATE to C, not some random value like en_US.UTF-8.
> I want to know if LC_COLLATE=C can be messed up by an LC_CTYPE setting, or
> anything besides LC_ALL. I'm writing software that will use sort
> extensively in unknown environments, and I'd like to keep all adjustments
> as localized as possible. So far, setting the collating sequence to POSIX
> is all that I need; no other locale adjustments.

I also agree that the above is needlessly disjoint. It doesn't flow.

Would you be able to suggest an improvement to the wording that would
make it better than the current prose? Of course a submission as a
patch would be great. Using git patch submissions is the preferred
format. But just saying what you think it should say would also be
appreciated.

Bob

From debbugs-submit-bounces@debbugs.gnu.org Mon Nov 05 12:51:24 2012 Received: (at control) by debbugs.gnu.org; 5 Nov 2012 17:51:24 +0000 Received: from localhost ([127.0.0.1]:50912 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TVQpX-000813-ND for submit@debbugs.gnu.org; Mon, 05 Nov 2012 12:51:24 -0500 Received: from joseki.proulx.com ([216.17.153.58]:41849) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TVQpU-00080u-Bi for control@debbugs.gnu.org; Mon, 05 Nov 2012 12:51:21 -0500 Received: from hysteria.proulx.com (hysteria.proulx.com [192.168.230.119]) by joseki.proulx.com (Postfix) with ESMTP id AC580211D5 for ; Mon, 5 Nov 2012 10:48:11 -0700 (MST) Received: by hysteria.proulx.com (Postfix, from userid 1000) id 6A1082DCD1; Mon, 5 Nov 2012 10:48:11 -0700 (MST) Date: Mon, 5 Nov 2012 10:48:11 -0700 From: Bob Proulx To: control@debbugs.gnu.org Subject: tag as needing moreinfo Message-ID: <20121105174811.GA14000@hysteria.proulx.com> MIME-Version: 1.0 Content-Type: text/plain; 
charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.4 (/) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.4 (/) tag 12783 + moreinfo thanks From debbugs-submit-bounces@debbugs.gnu.org Mon Nov 05 22:35:04 2012 Received: (at 12783) by debbugs.gnu.org; 6 Nov 2012 03:35:04 +0000 Received: from localhost ([127.0.0.1]:51386 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TVZwN-0005yA-Ta for submit@debbugs.gnu.org; Mon, 05 Nov 2012 22:35:04 -0500 Received: from mail-bk0-f44.google.com ([209.85.214.44]:61765) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TVZwK-0005xj-Gk for 12783@debbugs.gnu.org; Mon, 05 Nov 2012 22:35:02 -0500 Received: by mail-bk0-f44.google.com with SMTP id jc3so2473710bkc.3 for <12783@debbugs.gnu.org>; Mon, 05 Nov 2012 19:31:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=u6IW9AL+BeoESS9D9JUoTR3mBjdtisnhBGrdOm+w4hw=; b=BDY3O2vL4cqtr87R4qxKGucElOaQLtd0AjlLzN3BYA6UVlA4J0s+MOxEi2bZetq/Gf 94y7bn7GnIfoI2gH71gBpYkF0Wkdd08GULJ6Sko13w3LJNMqJV8AT1FDp1uIthCBpXFR xewhRqgiIIgQ22mjpINzkh2kdOIvQ0TEnIzFIWQszr3+Nydxj2UcMGQJEJoJNJKtn1vc P33AKiOFYzSDfUqc1tq1d8EdMmqYpOQXfXrNVR88QnWD65AJ/EVUAwiTW3jkMr/GATlB GYkN6pHp3h6WyT5CuMaR23dZXtbgL0UKBB5XB2UXuA3SeNfJc7m0k5Q+rPjrwUIEaX5+ r9dw== MIME-Version: 1.0 Received: by 10.204.131.75 with SMTP id w11mr2879690bks.111.1352172708546; Mon, 05 Nov 2012 19:31:48 -0800 (PST) Received: by 10.204.41.72 with HTTP; Mon, 5 Nov 2012 19:31:48 -0800 (PST) In-Reply-To: <20121102180728.GA2810@hysteria.proulx.com> References: 
<20121102180728.GA2810@hysteria.proulx.com> Date: Mon, 5 Nov 2012 19:31:48 -0800 Message-ID: Subject: Re: bug#12783: info for sort has an illogical example From: "Kevin O'Gorman" To: Bob Proulx Content-Type: multipart/alternative; boundary=00151747c0c2f7c96004cdcb3dc3 X-Spam-Score: 0.1 (/) X-Debbugs-Envelope-To: 12783 Cc: 12783@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.1 (/) --00151747c0c2f7c96004cdcb3dc3 Content-Type: text/plain; charset=UTF-8

Looking at a convenient issue of the POSIX standard, I find
http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html, and in
particular where it mentions a "precedence order". It seems that the
interaction of environment variables is not unspecified at all.

I'm guessing that something else was actually meant: the writer perhaps
found it hard to describe in a general way the interaction of a collation
locale with the contents of a file to be sorted if it happens that the
contents were created in another locale and are to be interpreted in the
way they were created. So, applying LC_COLLATE=C to Chinese Big5 could
well produce a peculiar order of things.

This has already been the case for me. My data is ASCII, comprising
numeric data and type data, both normal and encoded. All codes are
printable ASCII and are specifically designed to be sorted based on the
contents of bytes. This did not work well with LANG=en_US.UTF-8, because
for this data 'a' and 'A' are numbers that differ by 26, but the LANG
setting was treating them as nearly equivalent. It seems that LC_COLLATE=C
is the correct cure.

If I'm right, it seems that it would be better to rewrite that footnote in
the sort info page something like this:

(1) The collation order used by 'sort' is controlled by environment
variables in accordance with the POSIX specification. In particular, the
first of the LC_ALL, LC_COLLATE, and LANG variables that is defined to a
non-null value controls the collation order; in their absence your system
has a default. If the collation order is incompatible with your data, you
are unlikely to get the desired results. Often, but not always, setting
and exporting LC_COLLATE=C in your environment is the right choice, but if
your data contains natural language text or proper names the right choice
will agree with the encoding used for the data. Setting LC_ALL can be
unwise because it can affect many other things as well, such as the
language used for error and help messages. See
http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html or any
later version for more information.

You may also want to change the warning in the output of "sort --help" to
*** WARNING ***
The locale specified by the environment affects sort order. For correct
operation it must be compatible with your data.
Set LC_COLLATE=C to get the traditional sort order that uses native byte
values.


On Fri, Nov 2, 2012 at 11:07 AM, Bob Proulx wrote:

> Kevin O'Gorman wrote:
> > (reformatted and numbered)
> > A. In that case, set the `LC_ALL' environment variable to `C'.
> > B. Note that setting only `LC_COLLATE' has two problems.
> > B1. First, it is ineffective if `LC_ALL' is also set.
> > B2. Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
> > `LC_CTYPE' is unset) is set to an incompatible value.
> > B2x. For example, you get undefined behavior if `LC_CTYPE' is `ja_JP.PCK'
> > but `LC_COLLATE' is `en_US.UTF-8'.
> >
> > The example in B2x is illogical since A and B together mean we're setting
> > LC_COLLATE to C, not some random value like en_US.UTF-8.
> > I want to know if LC_COLLATE=C can be messed up by an LC_CTYPE setting, or
> > anything besides LC_ALL. I'm writing software that will use sort
> > extensively in unknown environments, and I'd like to keep all adjustments
> > as localized as possible. So far, setting the collating sequence to POSIX
> > is all that I need; no other locale adjustments.
>
> I also agree that the above is needlessly disjoint. It doesn't flow.
>
> Would you be able to suggest an improvement to the wording that would
> make it better than the current prose? Of course a submission as a
> patch would be great. Using git patch submissions is the preferred
> format. But just saying what you think it should say would also be
> appreciated.
>
> Bob
>

--
Kevin O'Gorman

programmer, n. an organism that transmutes caffeine into software.
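
The difference described in this message, between ordering by byte value and an ordering that treats 'a' and 'A' as nearly equivalent, can be illustrated with a small sketch. The sample codes are made up, and the case-folding key is only a rough stand-in chosen for illustration; it is an assumption, not the actual collation rules of en_US.UTF-8.

```python
# Made-up codes of printable ASCII, meant to be ordered by raw byte value.
codes = ["b1", "A2", "a1", "B2"]

# Byte-value order: compare the ASCII bytes directly, which is what the
# C/POSIX collation does for this kind of data.
byte_order = sorted(codes, key=lambda s: s.encode("ascii"))

# Rough stand-in (an assumption, not real locale rules) for a collation
# that treats upper and lower case as nearly equivalent: compare the
# case-folded text first and use the original text only as a tie-breaker.
caseless_order = sorted(codes, key=lambda s: (s.casefold(), s))

print(byte_order)      # ['A2', 'B2', 'a1', 'b1']
print(caseless_order)  # ['a1', 'A2', 'b1', 'B2']
```

In the byte-value result every upper-case code sorts before every lower-case one, which is the property the encoded data relies on; the second ordering interleaves them.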
--00151747c0c2f7c96004cdcb3dc3-- From debbugs-submit-bounces@debbugs.gnu.org Mon Nov 05 23:24:00 2012 Received: (at 12783) by debbugs.gnu.org; 6 Nov 2012 04:24:00 +0000 Received: from localhost ([127.0.0.1]:51411 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TVahj-00074J-4i for submit@debbugs.gnu.org; Mon, 05 Nov 2012 23:23:59 -0500 Received: from mail-bk0-f44.google.com ([209.85.214.44]:53597) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TVahg-00074B-83 for 12783@debbugs.gnu.org; Mon, 05 Nov 2012 23:23:57 -0500 Received: by mail-bk0-f44.google.com with SMTP id jc3so5800bkc.3 for <12783@debbugs.gnu.org>; Mon, 05 Nov 2012 20:20:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=Eyy169680hksAb5WIcRrZ1E1PjjZ+L+to5gUEIN4n7U=; b=LHSikN7BQhJXw4YDZ/RUQeRGVQ72Wppr3QPsOFylD7a/ObT6zH0VJviQXlOAMAlyII myKPa3reU13R8O9/HKh4wW3869Y62lmN1HFOPz+A4mKtUPfr2tmtboXT81xwjBPG2THM KLMG4vHbb67TwE6+85AJO9tdYpbJRbQcTP8qRgu8mi/R+rKwq2RtFTA7Bp0e32voS3rL FvX2lHEVAJeXIPXRtcrsqMvDyJz+EX9GfiERr2YdDztolY6KxORCw94giThUwHophLbv 4Am28yZOlEAVB/bPZmWYxE/1n+otpXUJWp5OicAe6qbKcd19aYm/HhHOiEg4dDRqPQ6V ZR7A== MIME-Version: 1.0 Received: by 10.204.3.220 with SMTP id 28mr2818363bko.87.1352175646406; Mon, 05 Nov 2012 20:20:46 -0800 (PST) Received: by 10.204.41.72 with HTTP; Mon, 5 Nov 2012 20:20:46 -0800 (PST) In-Reply-To: References: <20121102180728.GA2810@hysteria.proulx.com> Date: Mon, 5 Nov 2012 20:20:46 -0800 Message-ID: Subject: Re: bug#12783: info for sort has an illogical example From: "Kevin O'Gorman" To: Bob Proulx Content-Type: multipart/alternative; boundary=000e0cd1e22213f7a404cdcbedbf X-Spam-Score: 0.1 (/) X-Debbugs-Envelope-To: 12783 Cc: 12783@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.1 (/) --000e0cd1e22213f7a404cdcbedbf Content-Type: text/plain; charset=UTF-8 One ammendment... I'd say setting LANG=C is also unwise, and for the same reason. make that - It can be unwise to set LC_ALL or LANG to affect sort order because they may affect many other things as well, such as the language used for error and help messages. On Mon, Nov 5, 2012 at 7:31 PM, Kevin O'Gorman wrote: > Looking at a convenient issue of the POSIX standard, I find > http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.htm, and in > particular where it mentions a "precedence order". It seems that the > interaction of environment variables is not unspecified at all. > > I'm guessing that something else was actually meant: the writer perhaps > found it hard to describe in a general way the interaction of a collation > locale with the contents of a file to be sorted if it happens that the > contents were created in another locale and are to be interpreted in the > way they were created. So, applying LC_COLLATE=C to Chinese big-5 could > well produce a peculiar order of things. > > This has already been the case for me. My data is ASCII, comprising > numeric data and type data, both normal and encoded. All codes are > printable ASCII and are specifically designed to be sorted based on the > contents of bytes. This did not work well with LANG=en_US.UTF-8, because > for this data 'a' and 'A' are numbers that differ by 26, but the LANG > setting was treating them as nearly equivalent. It seems that LC_COLLATE=C > is the correct cure. > > If I'm right, it seems that it would be better to rewrite that footnote in > the sort info page something like this: > > (1) The collation order used by 'sort' is controlled by environment > variables in accordance with the POSIX specification. 
In particular, the > first of the LC_ALL, LC_COLLATE, and LANG variables that is defined to a > non-null value controls the collation order; in their absence your system > has a default. If the collation order is incompatible with your data, you > are unlikely to get the desired results. Often, but not always, setting > and exporting LC_COLLATE=C in your environment is the right choice, but if > your data contains natural language text or proper names the right choice > will agree with the encoding used for the data. Setting LC_ALL can be > unwise because it can affect many other things as well, such as the > language used for error and help messages. See > http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html or any > later version for more information. > > You may also want to change the warning in the output of "sort --help" to > *** WARNING *** > The locale specified by the environment affects sort order. For correct > operation it must be compatible with your data. > Set LC_COLLATE=C to get the traditional sort order that uses native byte > values. > > > On Fri, Nov 2, 2012 at 11:07 AM, Bob Proulx wrote: > >> Kevin O'Gorman wrote: >> > (reformatted and numbered) >> > A, In that case, set the `LC_ALL' environment variable to `C'. >> > B. Note that setting only `LC_COLLATE' has two problems. >> > B1. First, it is ineffective if `LC_ALL' is also set. >> > B2. Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if >> > `LC_CTYPE' is unset) is set to an incompatible value. >> > B2x. For example, you get undefined behavior if `LC_CTYPE' is >> `ja_JP.PCK' >> > but `LC_COLLATE' is `en_US.UTF-8'. >> > >> > The example in B2x is illogical since A and B together mean we're >> setting >> > LC_COLLATE to C, not some random value like en_US.UTF-8. >> > I want to know if LC_COLLATE=C can be messed up by an LC_CTYPE setting, >> or >> > anything besides LC_ALL. 
I'm writing software that will use sort >> > extensively in unknown environments, and I'd like to keep all >> adjustments >> > as localized as possible. So far, setting the collating sequence to >> POSIX >> > is all that I need; no other locale adjustments. >> >> I also agree that the above is needlessly disjoint. It doesn't flow. >> >> Would you be able to suggest an improvement to the wording that would >> make it better than the current prose? Of course a submission as a >> patch would be great. Using git patch submissions is the preferred >> format. But just saying what you think it should say would also be >> appreciated. >> >> Bob >> > > > > -- > Kevin O'Gorman > > programmer, n. an organism that transmutes caffeine into software. > -- Kevin O'Gorman programmer, n. an organism that transmutes caffeine into software. --000e0cd1e22213f7a404cdcbedbf Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable One ammendment...=C2=A0 I'd say setting LANG=3DC is also unwise, and fo= r the same reason.=C2=A0 make that
- It can be unwise to set LC_ALL or L= ANG to affect sort order because they may affect many other things as=20 well, such as the language used for error and help messages.

On Mon, Nov 5, 2012 at 7:31 PM, = Kevin O'Gorman <kogorman@gmail.com> wrote:
Looking at a convenient i= ssue of the POSIX standard, I find http://pubs.opengroup.o= rg/onlinepubs/007908799/xbd/envvar.htm, and in particular where it ment= ions a "precedence order".=C2=A0 It seems that the interaction of= environment variables is not unspecified at all.

I'm guessing that something else was actually meant: the writer per= haps found it hard to describe in a general way the interaction of a collat= ion locale with the contents of a file to be sorted if it happens that the = contents were created in another locale and are to be interpreted in the wa= y they were created.=C2=A0 So, applying LC_COLLATE=3DC to Chinese big-5 cou= ld well produce a peculiar order of things.

This has already been the case for me.=C2=A0 My data is ASCII, comprisi= ng numeric data and type data, both normal and encoded.=C2=A0 All codes are= printable ASCII and are specifically designed to be sorted based on the co= ntents of bytes.=C2=A0 This did not work well with LANG=3Den_US.UTF-8, beca= use for this data 'a' and 'A' are numbers that differ by 26= , but the LANG setting was treating them as nearly equivalent.=C2=A0 It see= ms that LC_COLLATE=3DC is the correct cure.

If I'm right, it seems that it would be better to rewrite that foot= note in the sort info page something like this:

(1) The collation or= der used by 'sort' is controlled by environment variables in accord= ance with the POSIX specification.=C2=A0 In particular, the first of the LC= _ALL,=C2=A0 LC_COLLATE, and LANG variables that is defined to a non-null va= lue controls the collation order; in their absence your system has a defaul= t.=C2=A0 If the collation order is incompatible with your data, you are unl= ikely to get the desired results.=C2=A0 Often, but not always, setting and = exporting LC_COLLATE=3DC in your environment is the right choice, but if yo= ur data contains natural language text or proper names the right choice wil= l agree with the encoding used for the data.=C2=A0 Setting LC_ALL can be un= wise because it can affect many other things as well, such as the language = used for error and help messages. See http://pubs.opengro= up.org/onlinepubs/007908799/xbd/envvar.html or any later version for mo= re information.

You may also want to change the warning in the output of "sort --h= elp" to
*** WARNING ***
The locale specified by the environment = affects sort order. For correct operation it must be compatible with your d= ata.
Set LC_COLLATE=3DC to get the traditional sort order that uses native byte = values.


On Fri, Nov 2, 2012 at 11:07 AM, Bob Proulx <bob@proulx= .com> wrote:
Kevin O'Gorman wrote:=
> (reformatted and numbered)
> A, In that case, set the `LC_ALL' environment variable to `C'.=
> B. Note that setting only `LC_COLLATE' has two problems.
> B1. First, it is ineffective if `LC_ALL' is also set.
> B2. Second, it has undefined behavior if `LC_CTYPE' (or `LANG'= , if
> `LC_CTYPE' is unset) is set to an incompatible value.
> B2x. For example, you get undefined behavior if `LC_CTYPE' is `ja_= JP.PCK'
> but `LC_COLLATE' is `en_US.UTF-8'.
>
> The example in B2x is illogical since A and B together mean we're = setting
> LC_COLLATE to C, not some random value like en_US.UTF-8.
> I want to know if LC_COLLATE=3DC can be messed up by an LC_CTYPE setti= ng, or
> anything besides LC_ALL. =C2=A0I'm writing software that will use = sort
> extensively in unknown environments, and I'd like to keep all adju= stments
> as localized as possible. =C2=A0So far, setting the collating sequence= to POSIX
> is all that I need; no other locale adjustments.

I also agree that the above is needlessly disjoint. =C2=A0It doesn't fl= ow.

Would you be able to suggest an improvement to the wording that would
make it better than the current prose? Of course a submission as a patch would be great. Using git patch submissions is the preferred format. But just saying what you think it should say would also be appreciated.

Bob



--
Kevin O'Gorman

programmer, n. an organism that transmutes caffeine into software.



--000e0cd1e22213f7a404cdcbedbf-- From debbugs-submit-bounces@debbugs.gnu.org Tue Nov 06 09:47:27 2012 Received: (at 12783) by debbugs.gnu.org; 6 Nov 2012 14:47:27 +0000 Received: from localhost ([127.0.0.1]:51775 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TVkR4-0004xR-Ug for submit@debbugs.gnu.org; Tue, 06 Nov 2012 09:47:27 -0500 Received: from mx1.redhat.com ([209.132.183.28]:28361) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TVkR1-0004xI-9l for 12783@debbugs.gnu.org; Tue, 06 Nov 2012 09:47:25 -0500 Received: from int-mx02.intmail.prod.int.phx2.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id qA6EiBVC022789 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Tue, 6 Nov 2012 09:44:11 -0500 Received: from [10.36.116.53] (ovpn-116-53.ams2.redhat.com [10.36.116.53]) by int-mx02.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id qA6Ei747001098 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 6 Nov 2012 09:44:09 -0500 Message-ID: <50992237.6070909@draigBrady.com> Date: Tue, 06 Nov 2012 14:44:07 +0000 From: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120615 Thunderbird/13.0.1 MIME-Version: 1.0 To: "Kevin O'Gorman" Subject: Re: bug#12783: info for sort has an illogical example References: <20121102180728.GA2810@hysteria.proulx.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed X-Scanned-By: MIMEDefang 2.67 on 10.5.11.12 Content-Transfer-Encoding: quoted-printable X-MIME-Autoconverted: from 8bit to quoted-printable by mx1.redhat.com id qA6EiBVC022789 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 12783 Cc: 12783@debbugs.gnu.org, Bob Proulx X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: 
debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -6.9 (------)

On 11/06/2012 04:20 AM, Kevin O'Gorman wrote:
> One amendment... I'd say setting LANG=C is also unwise, and for the same
> reason. make that
> - It can be unwise to set LC_ALL or LANG to affect sort order because they
> may affect many other things as well, such as the language used for error
> and help messages.
>
> On Mon, Nov 5, 2012 at 7:31 PM, Kevin O'Gorman wrote:
>
>> Looking at a convenient issue of the POSIX standard, I find
>> http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.htm, and in
>> particular where it mentions a "precedence order". It seems that the
>> interaction of environment variables is not unspecified at all.
>>
>> I'm guessing that something else was actually meant: the writer perhaps
>> found it hard to describe in a general way the interaction of a collation
>> locale with the contents of a file to be sorted if it happens that the
>> contents were created in another locale and are to be interpreted in the
>> way they were created. So, applying LC_COLLATE=C to Chinese big-5 could
>> well produce a peculiar order of things.
>>
>> This has already been the case for me. My data is ASCII, comprising
>> numeric data and type data, both normal and encoded. All codes are
>> printable ASCII and are specifically designed to be sorted based on the
>> contents of bytes. This did not work well with LANG=en_US.UTF-8, because
>> for this data 'a' and 'A' are numbers that differ by 26, but the LANG
>> setting was treating them as nearly equivalent. It seems that LC_COLLATE=C
>> is the correct cure.
>>
>> If I'm right, it seems that it would be better to rewrite that footnote in
>> the sort info page something like this:
>>
>> (1) The collation order used by 'sort' is controlled by environment
>> variables in accordance with the POSIX specification. In particular, the
>> first of the LC_ALL, LC_COLLATE, and LANG variables that is defined to a
>> non-null value controls the collation order; in their absence your system
>> has a default. If the collation order is incompatible with your data, you
>> are unlikely to get the desired results. Often, but not always, setting
>> and exporting LC_COLLATE=C in your environment is the right choice, but if
>> your data contains natural language text or proper names the right choice
>> will agree with the encoding used for the data. Setting LC_ALL can be
>> unwise because it can affect many other things as well, such as the
>> language used for error and help messages. See
>> http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html or any
>> later version for more information.
>>
>> You may also want to change the warning in the output of "sort --help" to
>> *** WARNING ***
>> The locale specified by the environment affects sort order. For correct
>> operation it must be compatible with your data.
>> Set LC_COLLATE=C to get the traditional sort order that uses native byte
>> values.
>>
>>
>> On Fri, Nov 2, 2012 at 11:07 AM, Bob Proulx wrote:
>>
>>> Kevin O'Gorman wrote:
>>>> (reformatted and numbered)
>>>> A, In that case, set the `LC_ALL' environment variable to `C'.
>>>> B. Note that setting only `LC_COLLATE' has two problems.
>>>> B1. First, it is ineffective if `LC_ALL' is also set.
>>>> B2. Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
>>>> `LC_CTYPE' is unset) is set to an incompatible value.
>>>> B2x. For example, you get undefined behavior if `LC_CTYPE' is `ja_JP.PCK'
>>>> but `LC_COLLATE' is `en_US.UTF-8'.
>>>>
>>>> The example in B2x is illogical since A and B together mean we're setting
>>>> LC_COLLATE to C, not some random value like en_US.UTF-8.
>>>> I want to know if LC_COLLATE=C can be messed up by an LC_CTYPE setting, or
>>>> anything besides LC_ALL. I'm writing software that will use sort
>>>> extensively in unknown environments, and I'd like to keep all adjustments
>>>> as localized as possible. So far, setting the collating sequence to POSIX
>>>> is all that I need; no other locale adjustments.
>>>
>>> I also agree that the above is needlessly disjoint. It doesn't flow.
>>>
>>> Would you be able to suggest an improvement to the wording that would
>>> make it better than the current prose? Of course a submission as a
>>> patch would be great. Using git patch submissions is the preferred
>>> format. But just saying what you think it should say would also be
>>> appreciated.

Thanks for cleaning that up Kevin.
Your description is clearer.
I'm not sure we can drop the warning about LC_CTYPE though.

While LC_CTYPE is _not_ significant to sort order on Solaris or GNU/Linux...

$ for e in LANG LC_ALL LC_COLLATE LC_CTYPE; do
    printf "%s\n" B a | echo $(env -i $e=en_US sort)
done
a B
a B
a B
B a

... it may be significant when specifying (multibyte) characters
to skip etc. and thus impacts the sort order in that way.
This is either with common downstream i18n patches or future
multibyte handling in upstream sort.

Unfortunately LC_CTYPE is wound up with LC_MESSAGES too (since glibc-2.3.3):
http://www.gnu.org/software/libc/manual/html_node/Charset-conversion-in-gettext.html

thanks,
Pádraig.
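
The per-variable experiment in the message above can be repeated with labelled output using a sketch like the following. The function name `compare_locale_vars` and the labelled format are hypothetical additions, and the locale name passed in is an assumption: if that locale is not installed, `sort` typically falls back to the C locale without complaint, so all four lines would then read "B a".

```shell
#!/bin/sh
# compare_locale_vars: hypothetical helper that repeats the experiment
# above. For each variable it runs sort in an otherwise empty
# environment with only that variable set to the given locale.
compare_locale_vars() {
    loc=$1
    for var in LANG LC_ALL LC_COLLATE LC_CTYPE; do
        # Join the two sorted lines with a space for compact display.
        result=$(printf '%s\n' B a | env -i "$var=$loc" sort | tr '\n' ' ')
        printf '%s: %s\n' "$var" "${result% }"
    done
}

# Usage: the locale name here is an assumption and may not exist
# on every system.
compare_locale_vars en_US.UTF-8
```

On the systems described in the message, the first three variables would give "a B" and LC_CTYPE alone would give "B a"; other systems may differ.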
From debbugs-submit-bounces@debbugs.gnu.org Tue Nov 06 21:59:38 2012 Received: (at 12783) by debbugs.gnu.org; 7 Nov 2012 02:59:38 +0000 Received: from localhost ([127.0.0.1]:53100 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TVvrd-0008Vu-4E for submit@debbugs.gnu.org; Tue, 06 Nov 2012 21:59:38 -0500 Received: from mail-bk0-f44.google.com ([209.85.214.44]:52311) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TVvrS-0008Vb-8U for 12783@debbugs.gnu.org; Tue, 06 Nov 2012 21:59:30 -0500 Received: by mail-bk0-f44.google.com with SMTP id jc3so496611bkc.3 for <12783@debbugs.gnu.org>; Tue, 06 Nov 2012 18:59:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=8uBCyHa0cIiLw9kLql0UltFiqL/vrjdYHxLEzt43B10=; b=X/xTJcLHq2GLGPapw7bWFq9Mor+H8dH71Bp7C36w+waNvZiwsOK8HMcBINYJqvfL2Y nptBl4HgJwZGmonpDT2Owu0LfN8ThhELadlJcaWNjrbLh/I9efgJNeEMcsrkvPbyHM6d U8RmenTFcwxebEpKz0G1yvOKV+2PVP/BPNNCSYZFJxusqSPY8rUpJ2zbtRu34+RT6Fbg uAlNnzBWIXYSAKRc3NC64qVcaS+gTKwxgbEqSjzTlTomSwHZHudkuu3MjwLgrMHcld+X nb39YEKSeefYrNFwfoOpevkZTXbeGbHMs5K9+bde9pGe5jY1mvLaLYToXKMMLoUwIvSZ vTjg== MIME-Version: 1.0 Received: by 10.204.3.220 with SMTP id 28mr752455bko.87.1352257170934; Tue, 06 Nov 2012 18:59:30 -0800 (PST) Received: by 10.204.41.72 with HTTP; Tue, 6 Nov 2012 18:59:30 -0800 (PST) In-Reply-To: <50992237.6070909@draigBrady.com> References: <20121102180728.GA2810@hysteria.proulx.com> <50992237.6070909@draigBrady.com> Date: Tue, 6 Nov 2012 18:59:30 -0800 Message-ID: Subject: Re: bug#12783: info for sort has an illogical example From: "Kevin O'Gorman" To: =?UTF-8?Q?P=C3=A1draig_Brady?= Content-Type: multipart/alternative; boundary=000e0cd1e222518ce904cddee853 X-Spam-Score: 0.1 (/) X-Debbugs-Envelope-To: 12783 Cc: 12783@debbugs.gnu.org, Bob Proulx X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: 
list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -2.6 (--) --000e0cd1e222518ce904cddee853 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

On Tue, Nov 6, 2012 at 6:44 AM, Pádraig Brady wrote:

> On 11/06/2012 04:20 AM, Kevin O'Gorman wrote:
>
>> One amendment... I'd say setting LANG=C is also unwise, and for the same
>> reason. make that
>> - It can be unwise to set LC_ALL or LANG to affect sort order because they
>> may affect many other things as well, such as the language used for error
>> and help messages.
>>
>> On Mon, Nov 5, 2012 at 7:31 PM, Kevin O'Gorman wrote:
>>
>>> Looking at a convenient issue of the POSIX standard, I find
>>> http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.htm, and in
>>> particular where it mentions a "precedence order". It seems that the
>>> interaction of environment variables is not unspecified at all.
>>>
>>> I'm guessing that something else was actually meant: the writer perhaps
>>> found it hard to describe in a general way the interaction of a collation
>>> locale with the contents of a file to be sorted if it happens that the
>>> contents were created in another locale and are to be interpreted in the
>>> way they were created. So, applying LC_COLLATE=C to Chinese big-5 could
>>> well produce a peculiar order of things.
>>>
>>> This has already been the case for me. My data is ASCII, comprising
>>> numeric data and type data, both normal and encoded. All codes are
>>> printable ASCII and are specifically designed to be sorted based on the
>>> contents of bytes. This did not work well with LANG=en_US.UTF-8, because
>>> for this data 'a' and 'A' are numbers that differ by 26, but the LANG
>>> setting was treating them as nearly equivalent. It seems that
>>> LC_COLLATE=C is the correct cure.
>>>
>>> If I'm right, it seems that it would be better to rewrite that footnote
>>> in the sort info page something like this:
>>>
>>> (1) The collation order used by 'sort' is controlled by environment
>>> variables in accordance with the POSIX specification. In particular, the
>>> first of the LC_ALL, LC_COLLATE, and LANG variables that is defined to a
>>> non-null value controls the collation order; in their absence your system
>>> has a default. If the collation order is incompatible with your data, you
>>> are unlikely to get the desired results. Often, but not always, setting
>>> and exporting LC_COLLATE=C in your environment is the right choice, but
>>> if your data contains natural language text or proper names the right
>>> choice will agree with the encoding used for the data. Setting LC_ALL can
>>> be unwise because it can affect many other things as well, such as the
>>> language used for error and help messages. See
>>> http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html or any
>>> later version for more information.
>>>
>>> You may also want to change the warning in the output of "sort --help" to
>>> *** WARNING ***
>>> The locale specified by the environment affects sort order. For correct
>>> operation it must be compatible with your data.
>>> Set LC_COLLATE=C to get the traditional sort order that uses native byte
>>> values.
>>>
>>>
>>> On Fri, Nov 2, 2012 at 11:07 AM, Bob Proulx wrote:
>>>
>>>> Kevin O'Gorman wrote:
>>>>> (reformatted and numbered)
>>>>> A, In that case, set the `LC_ALL' environment variable to `C'.
>>>>> B. Note that setting only `LC_COLLATE' has two problems.
>>>>> B1. First, it is ineffective if `LC_ALL' is also set.
>>>>> B2. Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
>>>>> `LC_CTYPE' is unset) is set to an incompatible value.
>>>>> B2x. For example, you get undefined behavior if `LC_CTYPE' is `ja_JP.PCK'
>>>>> but `LC_COLLATE' is `en_US.UTF-8'.
>>>>>
>>>>> The example in B2x is illogical since A and B together mean we're setting
>>>>> LC_COLLATE to C, not some random value like en_US.UTF-8.
>>>>> I want to know if LC_COLLATE=C can be messed up by an LC_CTYPE setting, or
>>>>> anything besides LC_ALL. I'm writing software that will use sort
>>>>> extensively in unknown environments, and I'd like to keep all adjustments
>>>>> as localized as possible. So far, setting the collating sequence to POSIX
>>>>> is all that I need; no other locale adjustments.
>>>>
>>>> I also agree that the above is needlessly disjoint. It doesn't flow.
>>>>
>>>> Would you be able to suggest an improvement to the wording that would
>>>> make it better than the current prose? Of course a submission as a
>>>> patch would be great. Using git patch submissions is the preferred
>>>> format. But just saying what you think it should say would also be
>>>> appreciated.
>
> Thanks for cleaning that up Kevin.
> Your description is clearer.
> I'm not sure we can drop the warning about LC_CTYPE though.
>
> While LC_CTYPE is _not_ significant to sort order on Solaris or GNU/Linux...
>
> $ for e in LANG LC_ALL LC_COLLATE LC_CTYPE; do
>     printf "%s\n" B a | echo $(env -i $e=en_US sort)
> done
> a B
> a B
> a B
> B a
>
> ... it may be significant when specifying (multibyte) characters
> to skip etc. and thus impacts the sort order in that way.
> This is either with common downstream i18n patches or future
> multibyte handling in upstream sort.
>
> Unfortunately LC_CTYPE is wound up with LC_MESSAGES too (since glibc-2.3.3):
> http://www.gnu.org/software/libc/manual/html_node/Charset-conversion-in-gettext.html
>
> thanks,
> Pádraig.
>

What I wrote is only a suggestion. As I'm far from expert in these matters, I'll leave the final form to you all. Thanks for your work on coreutils, and your attention to this matter. I think my work is done here.

--
Kevin O'Gorman
programmer, n. an organism that transmutes caffeine into software.

--000e0cd1e222518ce904cddee853 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
--000e0cd1e222518ce904cddee853-- From debbugs-submit-bounces@debbugs.gnu.org Tue Oct 23 18:14:33 2018 Received: (at 12783) by debbugs.gnu.org; 23 Oct 2018 22:14:33 +0000 Received: from localhost ([127.0.0.1]:38731 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gF4wb-0000MV-A2 for submit@debbugs.gnu.org; Tue, 23 Oct 2018 18:14:33 -0400 Received: from mail-io1-f65.google.com ([209.85.166.65]:35472) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gF4wa-0000MH-Bl for 12783@debbugs.gnu.org; Tue, 23 Oct 2018 18:14:32 -0400 Received: by mail-io1-f65.google.com with SMTP id 79-v6so1916388iou.2 for <12783@debbugs.gnu.org>; Tue, 23 Oct 2018 15:14:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=2LoE1nkTNzOt4NTgA76HrGoViFLI1d8F4LZSj82rXPA=; b=R2UiBH9XSmSitbqrbDV//jQJNrssrI+n9fDue4TfNRpiZkAl0lh6EP/qYgiWQNuD8j QsosNrXm6UOjA14SJ0+MUsVAFS0ikrHtBXP4dWi/J6BzwEbjpu7wA5qx90BHN2scSWdk VtZF6V0FVtMvHH2P/W6m37/YVUU5CKG+OOapSBu9un+sdsdKq968W3i7IjffOitpblwN GMhM5o/HwpGWYWoOxtrdf9wn9SYUR/0zhYxWkg1N0g2sc6cxuaWdt56HgdaEq7my2I5K nA4qYlIw/Xd0DY1MkWVGBywSyG4qYXNnWPh31u37XJyPfr6EB1yCxqc6B77PCZBN1Itb lHzw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=2LoE1nkTNzOt4NTgA76HrGoViFLI1d8F4LZSj82rXPA=; b=YwTf3OWVzC5K9GywTqgLRPJfQMBUfuBNosGGWFLoCC4wMTA9djVpxwaJvcjOq61eMM R2txyFWetLA7JY5sEUrXtT+gjKX5coaOzWnY3y8eN90KpZ2oPzqQwkFsabOZODhIqHYJ Gs/0XyQMsKUNh1XfF+2ZdTZwTh5HVlhqr6l64yBv4ICP9UQpgRXyw5WZO8ALX1xE1X2l wvFthLFU5BlSCfj0SxnhqQGMEgwAYrEahkrEt92ohfnvnGtpq/GkSrXGRndPY+N/3zt3 oqrjpErqxGHvQAeD+4hReFYqJ2l6jl3sr5voShzTRrx5cqwrpOa+hgdhhInSQ2fFFYcq iaXQ== X-Gm-Message-State: 
AGRZ1gL+AEkG/BmPj9C02g43EZa1ghN+usezZeDHMjmtbYQWwouzGYUP znoPiXvYTilTw0phuTcgTSo= X-Google-Smtp-Source: AJdET5c00DvPurcBSQ7iU+60CZcaCxJR+JsQWkFyUbFaJJfCq2UIBFsIAN6KMCKUcRUVAM1x70rzzA== X-Received: by 2002:a6b:1406:: with SMTP id 6-v6mr4040631iou.218.1540332866757; Tue, 23 Oct 2018 15:14:26 -0700 (PDT) Received: from tomato.housegordon.com (moose.housegordon.com. [184.68.105.38]) by smtp.googlemail.com with ESMTPSA id 10-v6sm2644437itl.32.2018.10.23.15.14.24 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 23 Oct 2018 15:14:25 -0700 (PDT) Subject: Re: bug#12783: info for sort has an illogical example To: Kevin O'Gorman , =?UTF-8?Q?P=c3=a1draig_Brady?= References: <20121102180728.GA2810@hysteria.proulx.com> <50992237.6070909@draigBrady.com> From: Assaf Gordon Message-ID: Date: Tue, 23 Oct 2018 16:14:23 -0600 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 12783 Cc: 12783@debbugs.gnu.org, Bob Proulx X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-)

close 12783
stop

(triaging old bugs)

On 06/11/12 07:59 PM, Kevin O'Gorman wrote:
> On Tue, Nov 6, 2012 at 6:44 AM, Pádraig Brady wrote:
[....]
>> ... it may be significant when specifying (multibyte) characters
>> to skip etc. and thus impacts the sort order in that way.
>> This is either with common downstream i18n patches or future
>> multibyte handling in upstream sort.
>>
>> Unfortunately LC_CTYPE is wound up with LC_MESSAGES too (since
>> glibc-2.3.3):
>> http://www.gnu.org/software/libc/manual/html_node/Charset-conversion-in-gettext.html
>>
>> thanks,
>> Pádraig.
>>
>
> What I wrote is only a suggestion. As I'm far from expert in these
> matters, I'll leave the final form to you all. Thanks for your work on
> coreutils, and your attention to this matter. I think my work is done here.

With no further comments in 6 years, I'm closing this bug.

-assaf

From debbugs-submit-bounces@debbugs.gnu.org Tue Oct 30 00:28:06 2018 Received: (at control) by debbugs.gnu.org; 30 Oct 2018 04:28:06 +0000 Received: from localhost ([127.0.0.1]:53004 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gHLdO-0006mf-CR for submit@debbugs.gnu.org; Tue, 30 Oct 2018 00:28:06 -0400 Received: from mail-io1-f43.google.com ([209.85.166.43]:42676) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gHLdM-0006lq-7W for control@debbugs.gnu.org; Tue, 30 Oct 2018 00:28:04 -0400 Received: by mail-io1-f43.google.com with SMTP id n18-v6so6444003ioa.9 for ; Mon, 29 Oct 2018 21:28:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=to:from:message-id:date:user-agent:mime-version:content-language :content-transfer-encoding; bh=jQ8aAptLX7RpD/QuFYDTI9IYEV0DohWfmwpdXCPcPrM=; b=K72wUZcHMU7/YLchwaV5zcKfzt5HiyeyG0Dmk/Wgb4aa55RRz2A16NaN233wzZuCpc HHHepoj4jqdWuTxlQ3dAhmB179RO2YjraU3Y4VfPWNKTt6gaSkE3cnGEk3HVgv7X0jYz 6BZ9g+k/MfoDByA5NkFUjTjnixMFgbveZqo8P1d1Fc1rzTE6XBqkerPGmnmmWYrNBYeg RJ2AU5mCWnzUDcs9lKog3g+Zal56H0SbvqVdNew9CXY7QrTboo4LMa4ENO9gXv4RnJ4k d+Syox7hQ90OIt2leGC3x+HkXUyXEmsBsaq7OKn+tStVWICIuTQl5ju/wFW2sBKdA85V BvnQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:to:from:message-id:date:user-agent:mime-version :content-language:content-transfer-encoding; bh=jQ8aAptLX7RpD/QuFYDTI9IYEV0DohWfmwpdXCPcPrM=;
b=cO34Edu9Vm//DBxWSD0UPtPTmkZ/JIqGtBnMTgflacwD6cVuQmZvzpb7tIfszrFbHm if1+3ky/vKmA628/LqSpcmYtoePx8oMYqc5EDrv+YWZGE0MnJIhCcu+CLpWscCTtnuWw renyRI7/2Few3SyvznZCiP0RscKpqIfREv26R97n6fkPrV4roCFdsXu6oV/E7XFGe7Vw IX3KtNCpoF5u3lXIH3szTVD3dii2/HHwdXgRhV+vVHt1OJu9qc6bkpYfjvhGktU5a8yf 0jj/YI8kW6k6jz0rA84qRMPQ77pb0wF+7lC+oDyliQ3NuPBpqwPgfSumDtbXV7YOmQsi hQFA== X-Gm-Message-State: AGRZ1gIFWq6/DQdOLuqysSVVcUaUpgQjXGyAWYIHgqWoL+blFl6l57H+ Lss3gb+5G927QEB6PhEjS/J6yrqZJWU= X-Google-Smtp-Source: AJdET5dTQ+xjuSp6YHxIGhg4ui/5Ng0c+YbFeVCSyyNu+/tTEsxW8jdQ2fCmZwDYCmq9eOiOMYU6Sw== X-Received: by 2002:a6b:1bd5:: with SMTP id b204-v6mr10192242iob.105.1540873678115; Mon, 29 Oct 2018 21:27:58 -0700 (PDT) Received: from tomato.housegordon.com (moose.housegordon.com. [184.68.105.38]) by smtp.googlemail.com with ESMTPSA id q205-v6sm8345369itc.2.2018.10.29.21.27.56 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 29 Oct 2018 21:27:57 -0700 (PDT) To: control@debbugs.gnu.org From: Assaf Gordon Message-ID: <8939db41-c61e-afd3-1c90-80e459e71184@gmail.com> Date: Mon, 29 Oct 2018 22:27:55 -0600 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Spam-Score: 2.0 (++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: close 12783 stop [...] 
Content analysis details: (2.0 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 SPF_PASS SPF: sender matches SPF record 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (assafgordon[at]gmail.com) -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [209.85.166.43 listed in list.dnswl.org] 1.8 MISSING_SUBJECT Missing Subject: header 0.2 NO_SUBJECT Extra score for no subject X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+) close 12783 stop From unknown Wed Jun 25 00:21:46 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Tue, 27 Nov 2018 12:24:13 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator