GNU bug report logs - #22155
Wrong char count with UTF8 in sort -k


Package: coreutils;

Reported by: Holger Klene <h.klene <at> gmx.de>

Date: Sat, 12 Dec 2015 22:55:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.


Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Holger Klene <h.klene <at> gmx.de>
To: bug-coreutils <at> gnu.org
Subject: Wrong char count with UTF8 in sort -k
Date: Sat, 12 Dec 2015 23:53:40 +0100
Hello!

Given a text file "sort.bug.txt" with find output like this:
07. Feb 2015 15:57 ./mess.jpg
05. Mär 2015 13:30 ./mess.jpg

Basically two columns: a date and a filename.
I want sort to discard the duplicate lines for the same file, using -u to keep only the first
line and -k to skip over the date column.
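
To make the expected result concrete, here is a minimal Python sketch of the behaviour I am after. The helper name unique_by_key is hypothetical, and Python compares the keys by code point rather than with de_DE.UTF-8 collation, so this only illustrates the intent and not how sort works internally:

```python
def unique_by_key(lines, start):
    """Stable-sort lines on the text from 1-based character `start`
    and keep only the first line seen for each key."""
    seen = set()
    result = []
    # sorted() is stable, so lines with equal keys keep their input order.
    for line in sorted(lines, key=lambda l: l[start - 1:]):
        key = line[start - 1:]
        if key not in seen:
            seen.add(key)
            result.append(line)
    return result


lines = [
    "07. Feb 2015 15:57 ./mess.jpg",
    "05. Mär 2015 13:30 ./mess.jpg",
]
print(unique_by_key(lines, 20))  # ['07. Feb 2015 15:57 ./mess.jpg']
```

Counted in characters, both keys are "./mess.jpg", so only the first line survives.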

> sort sort.bug.txt -u -s -k 1.20 --debug
sort: using ‘de_DE.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
05. Mär 2015 13:30 ./mess.jpg
                  ___________
07. Feb 2015 15:57 ./mess.jpg
                   __________

As the underlines in debug mode show, the key's start position depends on whether the month
name is pure ASCII or contains the German umlaut ä.
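
The one-column shift would be consistent with the offset in -k 1.20 being counted in bytes rather than characters, since ä occupies two bytes in UTF-8. The following Python sketch only illustrates that assumption; the helper names are hypothetical and this is not sort's actual code:

```python
def key_by_chars(line, start):
    """Key text when `start` (1-based) counts characters."""
    return line[start - 1:]


def key_by_bytes(line, start):
    """Key text when `start` (1-based) counts UTF-8 bytes."""
    raw = line.encode("utf-8")
    # errors="replace" guards against a slice that starts mid-character.
    return raw[start - 1:].decode("utf-8", errors="replace")


ascii_line = "07. Feb 2015 15:57 ./mess.jpg"
umlaut_line = "05. Mär 2015 13:30 ./mess.jpg"

for line in (ascii_line, umlaut_line):
    print(repr(key_by_chars(line, 20)), repr(key_by_bytes(line, 20)))
# './mess.jpg' './mess.jpg'
# './mess.jpg' ' ./mess.jpg'
```

Under byte counting the key of the line with the umlaut gains a leading space, which matches the longer underline in the debug output above.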

The warning hints at adding option -b, which might compensate for this one-character offset
thanks to the whitespace separating the columns.

> sort sort.bug.txt -u -s -k 1.20 -b --debug
sort: using ‘de_DE.UTF-8’ sorting rules
05. Mär 2015 13:30 ./mess.jpg
                   __________
07. Feb 2015 15:57 ./mess.jpg
                   __________

In fact, it does correct the underlines, but -u still outputs both lines, though I want it to
discard the second one. You can add more lines for the same file, but sort insists on keeping
exactly two: one with the umlaut and one without.

This is: sort (GNU coreutils) 8.23

Thanks for the great utilities.
Holger

-- 
|_|/    MfG
| |\    Holger Klene

PGP Key ID: 0x22FFE57E

