From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 12 17:54:48 2015 Received: (at submit) by debbugs.gnu.org; 12 Dec 2015 22:54:48 +0000 Received: from localhost ([127.0.0.1]:49774 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1a7t3v-0005l2-VH for submit@debbugs.gnu.org; Sat, 12 Dec 2015 17:54:48 -0500 Received: from eggs.gnu.org ([208.118.235.92]:34381) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1a7t3t-0005km-JY for submit@debbugs.gnu.org; Sat, 12 Dec 2015 17:54:45 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a7t3m-0003Hi-Uk for submit@debbugs.gnu.org; Sat, 12 Dec 2015 17:54:40 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM, HTML_MESSAGE autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:49668) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a7t3m-0003He-Rh for submit@debbugs.gnu.org; Sat, 12 Dec 2015 17:54:38 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:44380) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a7t3l-0005rC-FL for bug-coreutils@gnu.org; Sat, 12 Dec 2015 17:54:38 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a7t3i-0003HC-90 for bug-coreutils@gnu.org; Sat, 12 Dec 2015 17:54:37 -0500 Received: from mout.gmx.net ([212.227.17.22]:52557) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a7t3h-0003Fr-Sv for bug-coreutils@gnu.org; Sat, 12 Dec 2015 17:54:34 -0500 Received: from meerschweinchen.localnet ([79.248.44.141]) by mail.gmx.com (mrgmx103) with ESMTPSA (Nemesis) id 0MhQju-1ZvKI6084w-00MfBn for ; Sat, 12 Dec 2015 23:54:32 +0100 From: Holger Klene To: bug-coreutils@gnu.org Subject: Wrong char count with UTF8 in sort -k Date: Sat, 12 Dec 2015 23:53:40 +0100 Message-ID: 
<2109306.KdFIEfxH1W@meerschweinchen> Organization: privat User-Agent: KMail/4.14.6 (Linux/3.19.0-39-generic; KDE/4.14.6; x86_64; ; ) MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart2701918.xtKoNoaT5V"; micalg="pgp-sha256"; protocol="application/pgp-signature" X-Provags-ID: V03:K0:X6NTE4RRtuXn3RR903gdf/gjNgJ8q7XK9rBGfiNIsWmm0M4vzWA dl7tcwQK5MUuLMEtUiZB00vj94lhSYyc5AXeR0ufJzDi8a23Ez6veEC8t6l+5gjLTX39T2l ghvNxHhqlQC5ufZAPrHuKj1/VD5FyDjM10ZK1BBrv1iL+2XwtwJ9uTAQrz5IId2TnXn5EX6 WCg75QOGMQwInSLyIWtlw== X-UI-Out-Filterresults: notjunk:1;V01:K0:pN1SP2rBqPc=:IlM1W7MN8eJpOwkTQlHUNn QXOGJhm/QbpVKHlben0EQNiqXyUv1TM5SXNPJE+oP843yQBUtkkfhS8Vx3gaorHna/BPCMPSC e4F8IHaEfiGf64oQEOjSV5izYv+hd5Maa+pLS7NAMHpGhHhYKFjICiO8gdSXsSuBOOSQW6M8m ZNFAcFoCAi2gBsNsDOkAozUGG1y19dNZaXSuKenzebid+4+xM25USanMxdUAyb6/V89cb8xjI XUXtCuVPJ9Bo6bJ3D2ibIUW7tfIXtoPr36Obvw6jNC9TpGsGocGaiDLz8cz9DgVpCj5DcRVjo wbblBD/TbofpG7Y5sEPGP94l2RVLD8pXWncPyfh0hwgFi4/fEB5iXlce0QvIbCD1VGQt0Q/6t 7XF2iQGWM2W1D42Z7zfdhmbdxrJPs1ULaFT4Oz6TaU4yZMfw0UFAyxh1MgryiUTYp59vCWfSn MSa/Cvv9pVtvZghvfKWen0aljLXQ3efvww/T8GiX/shoWCoglDIUNZ9YIvwPbAae0zBdFjpvq Ikd8aBvUIbMizzTehVXbEsKloh7EmRw8tAeOHaZhUJPiUgn9IaE7kyE8VawuGM+vuCeiMRonM eC64389NMDsafyyCnK75t5rJON408rHoqz3L5blZBDWNpfC9f11JNLoVVZitikitOm/uKCTDs qTFAQ7F8p6nyxpSHZCaHPmV6111e0RmVnUhzKid1C9ohdkzlOBFgeeSqNX/88Mbf+gsn6cGlu EFc6G/jX2FizFzT5zJiUEZbqCfeCy7qQyfeJp5kSsOUarPO6oJOmPJ4OInc= X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.1 (----) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.1 (----) --nextPart2701918.xtKoNoaT5V Content-Type: multipart/alternative; 
boundary="nextPart1461987.tC5YOZJ1fP"
Content-Transfer-Encoding: quoted-printable

This is a multi-part message in MIME format.

--nextPart1461987.tC5YOZJ1fP
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Hello!

Given a text-file "sort.bug.txt" with find-output like this:

07. Feb 2015 15:57 ./mess.jpg
05. Mär 2015 13:30 ./mess.jpg

Basically two columns: a date and a filename.
I want sort to discard the duplicate lines for the same file, using -u to
keep only the first and -k to skip over the date column.

> sort sort.bug.txt -u -s -k 1.20 --debug
sort: es werden die Sortierregeln für »de_DE.UTF-8“ verwendet
sort: führende Leerzeichen sind signifikant in Schlüssel 1: Sie sollten daher wahrscheinlich auch „b“ angeben
05. Mär 2015 13:30 ./mess.jpg
                  ___________
07. Feb 2015 15:57 ./mess.jpg
                   __________

As the underlines in debug mode show, the key's start position depends on
whether the month name contains pure ASCII or the German umlaut ä.

There's a hint coming up, to apply option -b as this one character offset
could possibly be overcome thanks to the separating whitespace between the columns.

> sort sort.bug.txt -u -s -k 1.20 -b --debug
sort: es werden die Sortierregeln für »de_DE.UTF-8“ verwendet
05. Mär 2015 13:30 ./mess.jpg
                   __________
07. Feb 2015 15:57 ./mess.jpg
                   __________

In fact, it does correct the underlines, but still -u gives both lines,
though I want it to discard the second line. You can add more lines for the
same file, but sort insists on keeping exactly two: one with umlaut and the
other without.

This is: sort (GNU coreutils) 8.23

Thanks for the great utilities.
Holger

-- 
|_|/ MfG
| |\ Holger Klene
PGP Key ID: 0x22FFE57E

--nextPart1461987.tC5YOZJ1fP
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="utf-8"

--nextPart1461987.tC5YOZJ1fP-- --nextPart2701918.xtKoNoaT5V Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part. Content-Transfer-Encoding: 7Bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWbKWmAAoJEH950t460Swg7B8QAKh/JQvNZAtBgzZP51grXU7y E8JTumSzjsb6OWvI5OgFmscm+wpyo8ww7w/MRaWbBWhGcG10j0yY+gJXJ6LvDenu bKDi0qQ3/PYZ9Q4dT/IwAng4baNIHzJ0KvPrRGMsrHFMKkd2bArPzLfkNStXWaC6 ugXmHSpz1Vg9Ne2q4IkmhHH3t4PWW4hL1iay0QQbNFZDa+BIZi4z5gdtXMb03pXH X1lXhwbz5OUIbVzK1r57H35nV/oPEEv9ynvqb0fPrm97+zJxaIDWjjgVVtmifI6S GMGvyV28UJ8qY6p49TVgbkFYc8w2xWe8MFz7SZuqPUYGpxNz6tQKmmI/CpSR5JXJ gCpdGJNuVGaYVE4+0+naSYMzTgrJLK+WJveVhP6LlkrtgyPc79j41p1IaTiPYw1w IzkxO+gP2ko4oUuPtCT9VHjx3u8c6BnbN8Ov6orO+N36wR2/NtCmkbag4X4RHWuE Ov0CYABpbOdrbydEuxgigCRTXb3m0MT/bmsUr7kawaM0YIN9jwb0bgR5mM1RNtnl sRDEQs+rgOQ/Xl7txO0B9NEQuD1cx6gkxlQPKyAuHsCjS8pDMArBDWCWcOTsCsBt AAkymhFpP2AjsdH/FI13gT4LIpOm5vTqdP1J7rl37+fBLXNpzP8ibF3IeZLM1YSQ DICHGa2b6uiN+XSq1VXd =QXjI -----END PGP SIGNATURE----- --nextPart2701918.xtKoNoaT5V-- From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 12 20:32:59 2015 Received: (at 22155) by debbugs.gnu.org; 13 Dec 2015 01:32:59 +0000 Received: from localhost ([127.0.0.1]:49805 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1a7vX1-0000pp-Bj for submit@debbugs.gnu.org; Sat, 12 Dec 2015 20:32:59 -0500 Received: from mail1.vodafone.ie ([213.233.128.43]:8118) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1a7vWz-0000pc-KV for 22155@debbugs.gnu.org; Sat, 12 Dec 2015 20:32:58 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AuAGAF/KbFZtTcUi/2dsb2JhbABegmlRU2+CYQi8PSGFaQEBAQECgSJMAQEBAQEBgQuENAEBAQMBDAYRBAsBSwsJAg0BCgICBRYLAgIJAwIBAgFFBgEMCAEBBRmIBQwJjj6TC4orhW2MQYEBhFmFeYd3gUkFjTSJQo8fh2GPZIN0Y4QEPoUgAQEB Received: from unknown (HELO localhost.localdomain) ([109.77.197.34]) by mail1.vodafone.ie with ESMTP; 13 Dec 2015 01:32:48 +0000 Subject: Re: 
bug#22155: Wrong char count with UTF8 in sort -k
To: Holger Klene , 22155@debbugs.gnu.org
References: <2109306.KdFIEfxH1W@meerschweinchen>
From: Pádraig Brady
X-Enigmail-Draft-Status: N1110
Message-ID: <566CCABF.9040401@draigBrady.com>
Date: Sun, 13 Dec 2015 01:32:47 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0
MIME-Version: 1.0
In-Reply-To: <2109306.KdFIEfxH1W@meerschweinchen>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 22155
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -0.0 (/)

On 12/12/15 22:53, Holger Klene wrote:
> Hello!
>
> Given a text-file "sort.bug.txt" with find-output like this:
>
> 07. Feb 2015 15:57 ./mess.jpg
> 05. Mär 2015 13:30 ./mess.jpg
>
> Basically two columns: a date and a filename
>
> I want sort to discard the duplicate lines for the same file using -u to keep only the first and -k to skip over the date column
>
>> sort sort.bug.txt -u -s -k 1.20 --debug

Note the -s is implicit with -u.

Ideally the above should just work, and does on Fedora/RHEL/SUSE
with the i18n patch applied. Details on that patch at
http://www.pixelbeat.org/docs/coreutils_i18n/

> sort: es werden die Sortierregeln für »de_DE.UTF-8“ verwendet
> sort: führende Leerzeichen sind signifikant in Schlüssel 1: Sie sollten daher
> wahrscheinlich auch „b“ angeben
> 05. Mär 2015 13:30 ./mess.jpg
>                   ___________
> 07. Feb 2015 15:57 ./mess.jpg
>                    __________
>
> As the underlines in debug mode show, the key's start position depends on whether the month name contains pure ASCII or the German Umlaut ä.
>
> There's a hint coming up, to apply option -b as this one character offset could possibly be overcome thanks to the separating whitespace between the columns.
>
>> sort sort.bug.txt -u -s -k 1.20 -b --debug
>
> sort: es werden die Sortierregeln für »de_DE.UTF-8“ verwendet
> 05. Mär 2015 13:30 ./mess.jpg
>                    __________
> 07. Feb 2015 15:57 ./mess.jpg
>                    __________
>
> In fact, it does correct the underlines, but still -u gives both lines, though I want it to discard the second line. You can add more lines for the same file, but sort insists on keeping exactly two: one with Umlaut and the other without.

That's a bug in --debug because the implementation was split
from the actual processing done during the sort (for performance reasons).
Therefore we'll need to fix --debug to show what's being actually done
which is...

-b is applied _before_ the -k offsets are determined,
and so is ineffective in your case. That is confirmed with:

$ ltrace -e strcoll sort sort.bug.txt -u -k 1.20b
sort->strcoll("./mess.jpg", " ./mess.jpg") = 15
05. Mär 2015 13:30 ./mess.jpg
sort->strcoll("./mess.jpg", " ./mess.jpg") = 15
07. Feb 2015 15:57 ./mess.jpg

Perhaps it would be better in your case
to operate directly on the fifth field?

$ sort sort.bug.txt -u -k5b,5 --debug
sort: using ‘en_IE.utf8’ sorting rules
07. Feb 2015 15:57 ./mess.jpg
                   __________

thanks,
Pádraig

From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 12 21:33:01 2015
Received: (at 22155-done) by debbugs.gnu.org; 13 Dec 2015 02:33:01 +0000
Received: from localhost ([127.0.0.1]:49811 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1a7wT6-0002Dz-Vz for submit@debbugs.gnu.org; Sat, 12 Dec 2015 21:33:01 -0500
Received: from mail1.vodafone.ie ([213.233.128.43]:2098) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1a7wT5-0002Dm-Jz for 22155-done@debbugs.gnu.org; Sat, 12 Dec 2015 21:33:00 -0500
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AmcFAG/YbFZtTcUi/2dsb2JhbABegmlRU2gGgmK8Qh2FbgECAoEiTAEBAQEBAYELhDQBAQEEEhEEYgsNAQMDAQIBCRYLAgIJAwIBAgE9CAYBDAYCAQEWCIgRBKE/iiuFbYwUAQEBAQEBBAEBAQEBAQETCYVahXmEKFOCfIFJBZMDg3OCaYFiaoozhxiMDYdLY4QEPjSDIYFLAQEB
Received: from unknown (HELO localhost.localdomain) ([109.77.197.34]) by mail1.vodafone.ie with ESMTP; 13 Dec 2015 02:32:52 +0000
Subject: Re: bug#22155: Wrong char count with UTF8 in sort -k
To: Holger Klene , 22155-done@debbugs.gnu.org
References: <2109306.KdFIEfxH1W@meerschweinchen> <566CCABF.9040401@draigBrady.com>
From: =?UTF-8?Q?P=c3=a1draig_Brady?=
Message-ID: <566CD8D3.3030702@draigBrady.com>
Date: Sun, 13 Dec 2015 02:32:51 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0
MIME-Version: 1.0
In-Reply-To: <566CCABF.9040401@draigBrady.com>
Content-Type: multipart/mixed; boundary="------------030202020105040909030907"
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 22155-done
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -0.0 (/)

This is a multi-part message in MIME format.
--------------030202020105040909030907
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

On 13/12/15 01:32, Pádraig Brady wrote:
> On 12/12/15 22:53, Holger Klene wrote:
>>> sort sort.bug.txt -u -s -k 1.20 -b --debug
>> sort: es werden die Sortierregeln für »de_DE.UTF-8“ verwendet
>> 05. Mär 2015 13:30 ./mess.jpg
>>                    __________
>> 07. Feb 2015 15:57 ./mess.jpg
>>                    __________
>>
>> In fact, it does correct the underlines, but still -u gives both lines, though I want it to discard the second line. You can add more lines for the same file, but sort insists on keeping exactly two: one with Umlaut and the other without.
>
> That's a bug in --debug because the implementation was split
> from the actual processing done during the sort (for performance reasons).
> Therefore we'll need to fix --debug to show what's being actually done

Patch attached.

thanks,
Pádraig.

--------------030202020105040909030907
Content-Type: text/x-patch; name="sort-debug-b.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="sort-debug-b.patch"

>From e0c1f772d505d40166dc308706baecedc23efdab Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?=
Date: Sun, 13 Dec 2015 02:14:06 +0000
Subject: [PATCH] sort: fix --debug marking for -b -k1.x

We were erroneously skipping blanks in the marked comparison
_after_ the key start offset was applied.

* src/sort.c (debug_keys): Don't skip starting blanks if
already handled by begfield().
* tests/misc/sort-debug-keys.sh: Add a test case.
* NEWS: Mention the bug fix.
Fixes http://bugs.gnu.org/22155
---
 NEWS                          | 4 ++++
 src/sort.c                    | 3 ++-
 tests/misc/sort-debug-keys.sh | 7 +++++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index 2988146..367fb63 100644
--- a/NEWS
+++ b/NEWS
@@ -15,6 +15,10 @@ GNU coreutils NEWS -*- outline -*-
   shred again uses defined patterns for all iteration counts.
   [bug introduced in coreutils-5.93]
 
+  sort --debug -b now correctly marks the matching extents for keys
+  that specify an offset for the first field.
+  [bug introduced with the --debug feature in coreutils-8.6]
+
 ** New commands
 
   base32 is added to complement the existing base64 command,
diff --git a/src/sort.c b/src/sort.c
index 399b964..29a3617 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -2274,7 +2274,8 @@ debug_key (struct line const *line, struct keyfield const *key)
       if (key->eword != SIZE_MAX)
         lim = limfield (line, key);
 
-      if (key->skipsblanks || key->month || key_numeric (key))
+      if ((key->skipsblanks && key->sword == SIZE_MAX)
+          || key->month || key_numeric (key))
         {
           char saved = *lim;
           *lim = '\0';
diff --git a/tests/misc/sort-debug-keys.sh b/tests/misc/sort-debug-keys.sh
index a0a2874..fadd19c 100755
--- a/tests/misc/sort-debug-keys.sh
+++ b/tests/misc/sort-debug-keys.sh
@@ -238,6 +238,10 @@ A>chr10
      ^ no match for key
 B>chr1
      ^ no match for key
+1 2
+ __
+1 3
+ __
 EOF
 
 (
@@ -282,6 +286,9 @@ printf '\0\ta\n' | sort -s -k2b,2 --debug | tr -d '\0'
 
 # Check that key end before key start is not underlined
 printf 'A\tchr10\nB\tchr1\n' | sort -s -k2.4b,2.3n --debug
+
+# Ensure that -b applied before -k offsets
+printf '1 2\n1 3\n' | sort -s -k1.2b --debug
 ) > out
 
 compare exp out || fail=1
-- 
2.5.0

--------------030202020105040909030907--

From unknown Tue Jun 17 22:16:55 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Sun, 10 Jan 2016 12:24:04 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator
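
The byte-versus-character offset discussed in this thread can be reproduced with a short shell sketch. It is not part of the original report or patch. It assumes GNU sort plus the usual printf, sed, cut, wc and mktemp utilities, and it forces LC_ALL=C so that offsets are counted in bytes on any system (the report itself used de_DE.UTF-8 on a sort without the i18n patch). The variable names f, key_ascii, key_umlaut, n_offset, n_offset_b and n_field are illustrative only.

```shell
#!/bin/sh
# Recreate the two-line input from the report; \303\244 is "ä" in UTF-8 (2 bytes).
f=$(mktemp)
printf '07. Feb 2015 15:57 ./mess.jpg\n' > "$f"
printf '05. M\303\244r 2015 13:30 ./mess.jpg\n' >> "$f"

# What "-k 1.20" selects when position 20 is counted in bytes:
# on the umlaut line, byte 20 is the blank in front of the filename.
key_ascii=$(sed -n 1p "$f" | LC_ALL=C cut -b 20-)
key_umlaut=$(sed -n 2p "$f" | LC_ALL=C cut -b 20-)

# Offset-based keys: the two keys differ by a leading blank, so -u keeps both.
# The "b" flag skips blanks at the start of field 1 (there are none) and only
# then applies the offset, so it does not help here.
n_offset=$(LC_ALL=C sort -u -k 1.20 "$f" | wc -l)
n_offset_b=$(LC_ALL=C sort -u -k 1.20b "$f" | wc -l)

# Field-based key, as suggested in the reply: fifth blank-separated field,
# leading blanks skipped. This does not depend on byte positions.
n_field=$(LC_ALL=C sort -u -k 5b,5 "$f" | wc -l)

printf 'ascii key:  [%s]\numlaut key: [%s]\n' "$key_ascii" "$key_umlaut"
echo "lines kept: -k1.20=$n_offset -k1.20b=$n_offset_b -k5b,5=$n_field"
rm -f "$f"
```

With GNU coreutils this should report two lines kept for both offset-based keys and one line for the field-based key, which matches the ltrace output and the -k5b,5 suggestion in the reply. A sort built with the distribution i18n patch may count characters rather than bytes in a UTF-8 locale, which is why the sketch pins the C locale.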