GNU bug report logs - #22155
Wrong char count with UTF8 in sort -k
Reported by: Holger Klene <h.klene <at> gmx.de>
Date: Sat, 12 Dec 2015 22:55:02 UTC
Severity: normal
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
[Message part 1 (text/plain, inline)]
Your message dated Sun, 13 Dec 2015 02:32:51 +0000
with message-id <566CD8D3.3030702 <at> draigBrady.com>
and subject line Re: bug#22155: Wrong char count with UTF8 in sort -k
has caused the debbugs.gnu.org bug report #22155,
regarding Wrong char count with UTF8 in sort -k
to be marked as done.
(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)
--
22155: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=22155
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
[Message part 3 (text/plain, inline)]
Hello!
Given a text file "sort.bug.txt" containing find output like this:
07. Feb 2015 15:57 ./mess.jpg
05. Mär 2015 13:30 ./mess.jpg
Basically two columns: a date and a filename.
I want sort to discard the duplicate lines for the same file, using -u to keep only the first line
and -k to skip over the date column:
> sort sort.bug.txt -u -s -k 1.20 --debug
sort: using ‘de_DE.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
05. Mär 2015 13:30 ./mess.jpg
___________
07. Feb 2015 15:57 ./mess.jpg
__________
As the underlines in debug mode show, the key's start position depends on whether the month
name is pure ASCII or contains the German umlaut ä (which occupies two bytes in UTF-8).
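The one-position shift can be reproduced outside of sort. The following is only a minimal sketch, assuming the offset in -k 1.20 is applied to bytes of the UTF-8 encoding rather than to characters; the helper names key_by_chars and key_by_bytes are hypothetical and are not part of coreutils.

```python
# Model of "-k 1.20": the key runs from position 20 to the end of the line.
# The sample lines are copied from the report; the helper names are hypothetical.
LINES = [
    "07. Feb 2015 15:57 ./mess.jpg",
    "05. Mär 2015 13:30 ./mess.jpg",
]


def key_by_chars(line: str, start: int) -> str:
    """Key starting at the 1-based character position `start`."""
    return line[start - 1:]


def key_by_bytes(line: str, start: int) -> str:
    """Key starting at the 1-based byte position `start` of the UTF-8 encoding."""
    raw = line.encode("utf-8")
    # errors="replace" guards against an offset that lands inside a multibyte character.
    return raw[start - 1:].decode("utf-8", errors="replace")


for line in LINES:
    # Character-based and byte-based keys agree only for the pure-ASCII line.
    print(repr(key_by_chars(line, 20)), repr(key_by_bytes(line, 20)))
```

Counting bytes, the key of the "Mär" line starts one position early and picks up the blank in front of the filename, which matches the extra underscore in the first --debug listing.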
sort also prints a hint to add option -b. Since the columns are separated by whitespace, this
might compensate for the one-character offset.
> sort sort.bug.txt -u -s -k 1.20 -b --debug
sort: using ‘de_DE.UTF-8’ sorting rules
05. Mär 2015 13:30 ./mess.jpg
__________
07. Feb 2015 15:57 ./mess.jpg
__________
In fact, it does correct the underlines, but still -u gives both lines, though I want it to discard the
second line. You can add more lines for the same file, but sort insists on keeping exactly two: one
with Umlaut and the other without.
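Why exactly two lines survive can be sketched with a small model. It rests on assumptions that the report does not confirm: the offset is counted in bytes, -b skips blanks only at the start of the field before the offset is applied, and keys are compared by plain equality instead of locale collation. The functions byte_key and unique_stable are hypothetical, and the third sample line is invented for illustration. The sketch does not sort; it only mimics which lines -u would treat as duplicates.

```python
def byte_key(line: str, start: int = 20) -> bytes:
    """Hypothetical model of "-b -k 1.20" with a byte-based offset."""
    raw = line.encode("utf-8")
    # Assumption: -b drops blanks at the start of the field, then the offset is counted.
    stripped = raw.lstrip(b" \t")
    return stripped[start - 1:]


def unique_stable(lines, key):
    """Keep the first line seen for each distinct key, preserving input order."""
    seen = set()
    kept = []
    for line in lines:
        k = key(line)
        if k not in seen:
            seen.add(k)
            kept.append(line)
    return kept


LINES = [
    "07. Feb 2015 15:57 ./mess.jpg",
    "05. Mär 2015 13:30 ./mess.jpg",
    "09. Jan 2015 08:00 ./mess.jpg",  # hypothetical extra line, not from the report
]

kept = unique_stable(LINES, byte_key)
for line in kept:
    print(line)
```

Under these assumptions the ASCII-only lines share one key while the line with the umlaut gets a key with a leading blank, so one line of each kind is kept no matter how many more are added.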
This is: sort (GNU coreutils) 8.23
Thanks for the great utilities.
Holger
--
|_|/ MfG
| |\ Holger Klene
PGP Key ID: 0x22FFE57E
[Message part 4 (text/html, inline)]
[signature.asc (application/pgp-signature, inline)]
[Message part 6 (message/rfc822, inline)]
[Message part 7 (text/plain, inline)]
On 13/12/15 01:32, Pádraig Brady wrote:
> On 12/12/15 22:53, Holger Klene wrote:
>>> sort sort.bug.txt -u -s -k 1.20 -b --debug
>> sort: using ‘de_DE.UTF-8’ sorting rules
>> 05. Mär 2015 13:30 ./mess.jpg
>> __________
>> 07. Feb 2015 15:57 ./mess.jpg
>> __________
>>
>> In fact, it does correct the underlines, but still -u gives both lines, though I want it to discard the second line. You can add more lines for the same file, but sort insists on keeping exactly two: one with Umlaut and the other without.
>
> That's a bug in --debug because the implementation was split
> from the actual processing done during the sort (for performance reasons).
> Therefore we'll need to fix --debug to show what's actually being done.
Patch attached.
thanks,
Pádraig.
[sort-debug-b.patch (text/x-patch, attachment)]
This bug report was last modified 9 years and 161 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.