GNU bug report logs - #22155
Wrong char count with UTF8 in sort -k
Reported by: Holger Klene <h.klene <at> gmx.de>
Date: Sat, 12 Dec 2015 22:55:02 UTC
Severity: normal
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Report forwarded to bug-coreutils <at> gnu.org: bug#22155; Package coreutils. (Sat, 12 Dec 2015 22:55:02 GMT)
Acknowledgement sent to Holger Klene <h.klene <at> gmx.de>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 12 Dec 2015 22:55:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello!
Given a text file "sort.bug.txt" with find output like this:
07. Feb 2015 15:57 ./mess.jpg
05. Mär 2015 13:30 ./mess.jpg
Basically two columns: a date and a filename
I want sort to discard duplicate lines for the same file, using -u to keep only the first
and -k to skip over the date column.
> sort sort.bug.txt -u -s -k 1.20 --debug
sort: using ‘de_DE.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
05. Mär 2015 13:30 ./mess.jpg
                  ___________
07. Feb 2015 15:57 ./mess.jpg
                   __________
As the underlines in debug mode show, the key's start position depends on whether the month
name is pure ASCII or contains the German umlaut ä.
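
The one-column shift is what you would expect if the offset in -k 1.20 were counted in bytes rather than characters, since ä occupies two bytes in UTF-8. A minimal sketch of that arithmetic, assuming a POSIX shell and a cut that supports -b; the variable names are illustrative only, and LC_ALL=C is used purely to force byte semantics:

```shell
# The two sample lines; \303\244 is the UTF-8 encoding of "ä" (two bytes).
umlaut_line=$(printf '05. M\303\244r 2015 13:30 ./mess.jpg')
ascii_line='07. Feb 2015 15:57 ./mess.jpg'

# Keep everything from byte 20 onward, mimicking a byte-based "-k 1.20".
key_umlaut=$(printf '%s\n' "$umlaut_line" | LC_ALL=C cut -b20-)
key_ascii=$(printf '%s\n' "$ascii_line" | LC_ALL=C cut -b20-)

# Brackets make the leading blank visible.
# Prints "[ ./mess.jpg]" and then "[./mess.jpg]".
printf '[%s]\n[%s]\n' "$key_umlaut" "$key_ascii"
```

In the line with the umlaut, byte 20 is the blank before the file name, so that key is one character longer than the key of the ASCII-only line.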
sort prints a hint to add option -b, which could plausibly absorb this one-character offset
thanks to the whitespace separating the columns.
> sort sort.bug.txt -u -s -k 1.20 -b --debug
sort: using ‘de_DE.UTF-8’ sorting rules
05. Mär 2015 13:30 ./mess.jpg
                   __________
07. Feb 2015 15:57 ./mess.jpg
                   __________
It does correct the underlines, but -u still outputs both lines, although I want it to discard the
second one. You can add more lines for the same file, but sort insists on keeping exactly two: one
with the umlaut and one without.
This is: sort (GNU coreutils) 8.23
Thanks for the great utilities.
Holger
--
|_|/ MfG
| |\ Holger Klene
PGP Key ID: 0x22FFE57E
Information forwarded to bug-coreutils <at> gnu.org: bug#22155; Package coreutils. (Sun, 13 Dec 2015 01:33:02 GMT)
Message #8 received at 22155 <at> debbugs.gnu.org (full text, mbox):
On 12/12/15 22:53, Holger Klene wrote:
> Hello!
>
>
>
> Given a text file "sort.bug.txt" with find output like this:
>
> 07. Feb 2015 15:57 ./mess.jpg
> 05. Mär 2015 13:30 ./mess.jpg
>
>
>
> Basically two columns: a date and a filename
>
> I want sort to discard the duplicate lines for the same file using -u to keep only the first and -k to skip over the date column
>
>> sort sort.bug.txt -u -s -k 1.20 --debug
Note that -s is implicit with -u.
Ideally the above should just work, and does
on Fedora/RHEL/Suse with the i18n patch applied.
Details on that patch at
http://www.pixelbeat.org/docs/coreutils_i18n/
> sort: using ‘de_DE.UTF-8’ sorting rules
> sort: leading blanks are significant in key 1; consider also specifying 'b'
> 05. Mär 2015 13:30 ./mess.jpg
>                   ___________
> 07. Feb 2015 15:57 ./mess.jpg
>                    __________
>
> As the underlines in debug mode show, the key's start position depends on whether the month name is pure ASCII or contains the German umlaut ä.
>
> There's a hint coming up, to apply option -b as this one character offset could possibly be overcome thanks to the separating whitespace between the columns.
>
>> sort sort.bug.txt -u -s -k 1.20 -b --debug
>
> sort: using ‘de_DE.UTF-8’ sorting rules
> 05. Mär 2015 13:30 ./mess.jpg
>                    __________
> 07. Feb 2015 15:57 ./mess.jpg
>                    __________
>
> In fact, it does correct the underlines, but still -u gives both lines, though I want it to discard the second line. You can add more lines for the same file, but sort insists on keeping exactly two: one with Umlaut and the other without.
That's a bug in --debug, because its implementation was split
from the actual processing done during the sort (for performance reasons).
Therefore we'll need to fix --debug to show what's actually being done,
which is...
-b is applied _before_ the -k offsets are determined,
and so is ineffective in your case.
That is confirmed with:
$ ltrace -e strcoll sort sort.bug.txt -u -k 1.20b
sort->strcoll("./mess.jpg", " ./mess.jpg") = 15
05. Mär 2015 13:30 ./mess.jpg
sort->strcoll("./mess.jpg", " ./mess.jpg") = 15
07. Feb 2015 15:57 ./mess.jpg
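
The same key mismatch can be reproduced without ltrace. The following is only a sketch: it assumes mktemp is available and forces the C locale, so that offsets are plainly bytes and the result does not depend on which locales are installed. The variable names are illustrative.

```shell
tmp=$(mktemp)
# Recreate the two-line sample; \303\244 is "ä" encoded as UTF-8.
printf '07. Feb 2015 15:57 ./mess.jpg\n05. M\303\244r 2015 13:30 ./mess.jpg\n' > "$tmp"

# The 'b' modifier skips blanks at the start of field 1 (there are none),
# and only then is the offset of 20 applied. The keys therefore still
# differ by one leading blank, and -u treats the lines as distinct.
result=$(LC_ALL=C sort -u -k 1.20b "$tmp")
printf '%s\n' "$result"

line_count=$(printf '%s\n' "$result" | wc -l)
rm -f "$tmp"
```

Both lines survive, with the umlaut line first, because its key begins with a blank, which sorts before "." in the C locale.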
Perhaps it would be better in your case to operate
directly on the fifth field?
$ sort sort.bug.txt -u -k5b,5 --debug
sort: using ‘en_IE.utf8’ sorting rules
07. Feb 2015 15:57 ./mess.jpg
                   __________
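
For comparison, a sketch of the field-based key on the same sample, again assuming mktemp and using the C locale only to keep the run independent of installed locales. Because the key is found by field splitting rather than by counting positions, the two-byte ä no longer affects where the key starts.

```shell
tmp=$(mktemp)
printf '07. Feb 2015 15:57 ./mess.jpg\n05. M\303\244r 2015 13:30 ./mess.jpg\n' > "$tmp"

# Key = field 5 only, with its leading blanks skipped by the 'b' modifier.
# Both lines yield the key "./mess.jpg", so -u keeps just one of them.
deduped=$(LC_ALL=C sort -u -k5b,5 "$tmp")
printf '%s\n' "$deduped"

rm -f "$tmp"
```

Which of the equal lines is kept depends on the sort implementation. In the run quoted above, GNU sort kept the first input line.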
thanks,
Pádraig
Reply sent to Pádraig Brady <P <at> draigBrady.com>: You have taken responsibility. (Sun, 13 Dec 2015 02:33:02 GMT)
Notification sent to Holger Klene <h.klene <at> gmx.de>: bug acknowledged by developer. (Sun, 13 Dec 2015 02:33:02 GMT)
Message #13 received at 22155-done <at> debbugs.gnu.org (full text, mbox):
On 13/12/15 01:32, Pádraig Brady wrote:
> On 12/12/15 22:53, Holger Klene wrote:
>>> sort sort.bug.txt -u -s -k 1.20 -b --debug
>> sort: using ‘de_DE.UTF-8’ sorting rules
>> 05. Mär 2015 13:30 ./mess.jpg
>>                    __________
>> 07. Feb 2015 15:57 ./mess.jpg
>>                    __________
>>
>> In fact, it does correct the underlines, but still -u gives both lines, though I want it to discard the second line. You can add more lines for the same file, but sort insists on keeping exactly two: one with Umlaut and the other without.
>
> That's a bug in --debug because the implementation was split
> from the actual processing done during the sort (for performance reasons).
> Therefore we'll need to fix --debug to show what's being actually done
Patch attached.
thanks,
Pádraig.
[sort-debug-b.patch (text/x-patch, attachment)]
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 10 Jan 2016 12:24:04 GMT)
This bug report was last modified 9 years and 160 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.