On Tue, Nov 6, 2012 at 6:44 AM, Pádraig Brady wrote:

> On 11/06/2012 04:20 AM, Kevin O'Gorman wrote:
>
>> One amendment... I'd say setting LANG=C is also unwise, and for the same
>> reason. Make that
>> - It can be unwise to set LC_ALL or LANG to affect sort order because they
>> may affect many other things as well, such as the language used for error
>> and help messages.
>>
>> On Mon, Nov 5, 2012 at 7:31 PM, Kevin O'Gorman wrote:
>>
>>> Looking at a convenient issue of the POSIX standard, I find
>>> http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html, and in
>>> particular where it mentions a "precedence order". It seems that the
>>> interaction of environment variables is not unspecified at all.
>>>
>>> I'm guessing that something else was actually meant: the writer perhaps
>>> found it hard to describe in a general way the interaction of a collation
>>> locale with the contents of a file to be sorted if it happens that the
>>> contents were created in another locale and are to be interpreted in the
>>> way they were created. So, applying LC_COLLATE=C to Chinese Big5 could
>>> well produce a peculiar order of things.
>>>
>>> This has already been the case for me. My data is ASCII, comprising
>>> numeric data and type data, both normal and encoded. All codes are
>>> printable ASCII and are specifically designed to be sorted based on the
>>> contents of bytes. This did not work well with LANG=en_US.UTF-8, because
>>> for this data 'a' and 'A' are numbers that differ by 26, but the LANG
>>> setting was treating them as nearly equivalent. It seems that
>>> LC_COLLATE=C is the correct cure.
>>>
>>> If I'm right, it seems that it would be better to rewrite that footnote
>>> in the sort info page something like this:
>>>
>>> (1) The collation order used by 'sort' is controlled by environment
>>> variables in accordance with the POSIX specification.
>>> In particular, the first of the LC_ALL, LC_COLLATE, and LANG variables
>>> that is defined to a non-null value controls the collation order; in
>>> their absence your system has a default. If the collation order is
>>> incompatible with your data, you are unlikely to get the desired
>>> results. Often, but not always, setting and exporting LC_COLLATE=C in
>>> your environment is the right choice, but if your data contains natural
>>> language text or proper names the right choice will agree with the
>>> encoding used for the data. Setting LC_ALL can be unwise because it can
>>> affect many other things as well, such as the language used for error
>>> and help messages. See
>>> http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html or any
>>> later version for more information.
>>>
>>> You may also want to change the warning in the output of "sort --help" to
>>> *** WARNING ***
>>> The locale specified by the environment affects sort order. For correct
>>> operation it must be compatible with your data.
>>> Set LC_COLLATE=C to get the traditional sort order that uses native byte
>>> values.
>>>
>>> On Fri, Nov 2, 2012 at 11:07 AM, Bob Proulx wrote:
>>>
>>>> Kevin O'Gorman wrote:
>>>>
>>>>> (reformatted and numbered)
>>>>> A. In that case, set the `LC_ALL' environment variable to `C'.
>>>>> B. Note that setting only `LC_COLLATE' has two problems.
>>>>> B1. First, it is ineffective if `LC_ALL' is also set.
>>>>> B2. Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
>>>>> `LC_CTYPE' is unset) is set to an incompatible value.
>>>>> B2x. For example, you get undefined behavior if `LC_CTYPE' is
>>>>> `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.
>>>>>
>>>>> The example in B2x is illogical since A and B together mean we're
>>>>> setting LC_COLLATE to C, not some random value like en_US.UTF-8.
>>>>> I want to know if LC_COLLATE=C can be messed up by an LC_CTYPE setting,
>>>>> or anything besides LC_ALL. I'm writing software that will use sort
>>>>> extensively in unknown environments, and I'd like to keep all
>>>>> adjustments as localized as possible. So far, setting the collating
>>>>> sequence to POSIX is all that I need; no other locale adjustments.
>>>>
>>>> I also agree that the above is needlessly disjoint. It doesn't flow.
>>>>
>>>> Would you be able to suggest an improvement to the wording that would
>>>> make it better than the current prose? Of course a submission as a
>>>> patch would be great. Using git patch submissions is the preferred
>>>> format. But just saying what you think it should say would also be
>>>> appreciated.
>
> Thanks for cleaning that up, Kevin.
> Your description is clearer.
> I'm not sure we can drop the warning about LC_CTYPE though.
>
> While LC_CTYPE is _not_ significant to sort order on Solaris or GNU/Linux...
>
>   $ for e in LANG LC_ALL LC_COLLATE LC_CTYPE; do
>       printf "%s\n" B a | echo $(env -i $e=en_US sort)
>     done
>   a B
>   a B
>   a B
>   B a
>
> ... it may be significant when specifying (multibyte) characters
> to skip etc. and thus impacts the sort order in that way.
> This is either with common downstream i18n patches or future
> multibyte handling in upstream sort.
>
> Unfortunately LC_CTYPE is wound up with LC_MESSAGES too (since
> glibc-2.3.3):
> http://www.gnu.org/software/libc/manual/html_node/Charset-conversion-in-gettext.html
>
> thanks,
> Pádraig.

What I wrote is only a suggestion. As I'm far from expert in these matters,
I'll leave the final form to you all.

Thanks for your work on coreutils, and your attention to this matter. I
think my work is done here.

--
Kevin O'Gorman
programmer, n. an organism that transmutes caffeine into software.
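
For illustration only, the precedence rule from the proposed footnote
(LC_ALL first, then LC_COLLATE, then LANG, with an empty value counting as
unset) can be sketched in POSIX shell. The names collation_source and
bytesort are hypothetical helpers made up for this sketch; they are not
part of coreutils. The wrapper assumes that emptying LC_ALL for a single
command is acceptable in your environment, and it does not address the
LC_CTYPE concern raised above.

```shell
#!/bin/sh
# Hypothetical helper (not part of coreutils): print the name of the
# environment variable that decides the collation order, using the
# precedence LC_ALL, then LC_COLLATE, then LANG. A variable that is
# set to the empty string is treated the same as an unset one.
collation_source() {
    if [ -n "${LC_ALL:-}" ]; then
        echo LC_ALL
    elif [ -n "${LC_COLLATE:-}" ]; then
        echo LC_COLLATE
    elif [ -n "${LANG:-}" ]; then
        echo LANG
    else
        echo default
    fi
}

# Hypothetical wrapper: sort by byte value for this one command.
# LC_ALL is emptied here because a non-empty LC_ALL would override
# LC_COLLATE. Side effect: if LC_ALL was set, the other LC_* variables
# and LANG take effect again for this command (messages included).
bytesort() {
    LC_ALL= LC_COLLATE=C sort "$@"
}

# Show which variable currently controls collation in this shell.
collation_source

# 'B' (0x42) sorts before 'a' (0x61) in byte order: prints B, then a.
printf '%s\n' a B | bytesort
```

Scoping the assignment to the single sort invocation, rather than exporting
it, keeps the adjustment as localized as possible, which is the goal stated
earlier in this thread.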