From unknown Fri Sep 05 11:00:40 2025 X-Loop: help-debbugs@gnu.org Subject: bug#49340: small sort takes hours for UTF-8 locale Resent-From: Jon Klaas Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Fri, 02 Jul 2021 20:52:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 49340 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: 49340@debbugs.gnu.org X-Debbugs-Original-To: bug-coreutils@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.16252591035419 (code B ref -1); Fri, 02 Jul 2021 20:52:01 +0000 Received: (at submit) by debbugs.gnu.org; 2 Jul 2021 20:51:43 +0000 Received: from localhost ([127.0.0.1]:37204 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1lzQ8S-0001PJ-JI for submit@debbugs.gnu.org; Fri, 02 Jul 2021 16:51:43 -0400 Received: from lists.gnu.org ([209.51.188.17]:48674) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1lzOtz-0007yC-Iy for submit@debbugs.gnu.org; Fri, 02 Jul 2021 15:32:42 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:54576) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1lzOtz-00008S-9Z for bug-coreutils@gnu.org; Fri, 02 Jul 2021 15:32:39 -0400 Received: from mail-qt1-x82b.google.com ([2607:f8b0:4864:20::82b]:33789) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1lzOtw-0003sl-PX for bug-coreutils@gnu.org; Fri, 02 Jul 2021 15:32:39 -0400 Received: by mail-qt1-x82b.google.com with SMTP id w13so7467319qtc.0 for ; Fri, 02 Jul 2021 12:32:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to; bh=d+huVNyhwCTB4te24tW0/qy2kw3+miGbE5wlwy+nHV8=; b=TMuhuYFQ5W0Bo9BROlQpRrvdMMnFOP1VYm9XZKvMr8sVYKsB/YRVmmlki1fdWBT9x5 6MxEfyFaVFnv58AoXKxtLlUTeJ9ftHVXdAQxPzN/cicZFqgAjDvSdGo62yMXSg/iZC/a 
tASjJZvtkxLZD0tGtZh0ekAMiDGnkZUHd4tt3XHI2vIWdOrSYgJzXBP75pchNNkzNDME fYDPj7B9GIB+euj/CrBm7jZWRLO91fGRbe0uWXIg0lAfNAzArmzeTN2RNIxWYJ5ncG2q YFl0hXQtXvabCVuKTBSMuEhjiEsY0xcpnziH0+b49I5D7y81pBJyxU00IXu+ibFOHaWo PxUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=d+huVNyhwCTB4te24tW0/qy2kw3+miGbE5wlwy+nHV8=; b=rT8jWtUX1dLt+w99eZR2xa26rxmsK9KZYkvGG4QG7FLwPBiPK7BMM1FhfXiGZfH6r6 vVHBe4fryjiDkU1Pktv7yHjGB6FXaOEm5DoLN42uReKxBFZAuZI7C4V5WBz1ZEfieIE5 skWm1StgifwoTjr1ktACTm1d9ZzeplYQNZ6BQxJcSW4DXUFB3U1DiQpWP06FESY5wSAD kNwYq4jxF4rDcdP7TWZPtH2PMD2WZe3e2NNhNZoc3ONs9YMzMyNwXC+aOKUt9U6L5Cl5 M3mqT4QZ/2uL7zt69/7ytZYcVWLIjAQCxNzA7aTxVMXoEznUb++0uN8r0Hk4mnDsMhnx euDg== X-Gm-Message-State: AOAM530iDfX2LoAwbKv4tC9MM3X04s/CKKYYh+IDU+z+6A2pGMI28A2f DZSyvwkUkP5hTSNzI3Z5GTTbUiar6iaCgc25O9uYfUAZz5s= X-Google-Smtp-Source: ABdhPJwi9AuRZT+al7uMvgzXoOyjIhCCR+eE75/D07XJJZv7jhMk//9X4aHbrc7gMZRWS29QuMeebow65bbxQG6E7kM= X-Received: by 2002:ac8:6bc1:: with SMTP id b1mr1263992qtt.217.1625254354647; Fri, 02 Jul 2021 12:32:34 -0700 (PDT) MIME-Version: 1.0 From: Jon Klaas Date: Fri, 2 Jul 2021 14:32:21 -0500 Message-ID: Content-Type: multipart/mixed; boundary="0000000000007acbaa05c6290470" Received-SPF: pass client-ip=2607:f8b0:4864:20::82b; envelope-from=blagothakus@gmail.com; helo=mail-qt1-x82b.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -1.3 (-) X-Mailman-Approved-At: Fri, 02 Jul 2021 16:51:38 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -2.3 (--)

--0000000000007acbaa05c6290470
Content-Type: multipart/alternative; boundary="0000000000007acba905c629046e"

--0000000000007acba905c629046e
Content-Type: text/plain; charset="UTF-8"

Hello,

I encountered a file that was taking hours to sort that was expected
to take negligible time.  This seems to be due to the locale
LANG=en_US.UTF-8.  I've worked around the problem by using LC_ALL=C, but
thought I would report this, as I didn't see a relevant bug report.

This was seen on centos 8 using package
coreutils-8.30-6.el8.x86_64
and the current
coreutils-8.30-8.el8.x86_64

#takes under 1 second.
export LC_ALL=C
sort tst00776.out

#slow sort takes many hours
export LC_ALL=en_US.UTF-8
sort tst00776.out

Looks like most of the time is consumed here:

#0  0x00007f4a65425c4b in strcoll_l () from /lib64/libc.so.6
#1  0x00005600d195d365 in strcoll_loop ()
#2  0x00005600d195bebd in xmemcoll0 ()
#3  0x00005600d1951176 in compare ()
#4  0x00005600d1951224 in sequential_sort ()
#5  0x00005600d19511d5 in sequential_sort ()
#6  0x00005600d195374b in sortlines ()
#7  0x00005600d194d96b in main ()

It's possible the input (attached) has invalid UTF-8.

I also tried on an older RHEL 7 and did NOT reproduce the problem with
coreutils.x86_64 8.22-23.el7

Thanks,

Jon Klaas

--0000000000007acba905c629046e
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--0000000000007acba905c629046e-- --0000000000007acbaa05c6290470 Content-Type: application/x-gzip; name="tst00776.out.gz" Content-Disposition: attachment; filename="tst00776.out.gz" Content-Transfer-Encoding: base64 Content-ID: X-Attachment-Id: f_kqmp3qau0 H4sICA1O32AAA3RzdDAwNzc2Lm91dADt2dFv28iBwOF3/hXzYni3dzKo7K6126IPaSs0AVIDKwmH 9klgpJHFKyXKnFGM9q8/OXTd3cKJ7Vhoc5vvJyAONdRHzkimDTPFHOJi3YZ2tQrF/KFm4+kslMN+ I6dclqPR+Vm6asJ9exfzLu5ilcN1ndeh6i73m7jNKVQpbPdNE95VzT6mYjKeDqazSThWxaDveODR JCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFA IBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAg EAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQ CAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAI BAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgE AoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQC gUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKB QCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgcAvFHzu+Rx9 PpIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIk SZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZKkz69iGLq4aLtlSLGJixyXxcnJ rkop1KtwN1anu+FwXed1qMJ23zQh5a7eXp4V84eajaezUL7oN3LKZTkanZ+lqybct3cx7+IuVrk/ 1qruUg5Vd7nfxO3hP6k/9ruq2cdiMp4OprPJz+c06PuXmf7Hp//NJ00/Hc5hu/zA/NO9C/CsD8S9 i/cc8GgSEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAI BAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgE AoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQC gUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKB QCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFA IBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAg EAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQ 
CAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFA4BcK Pvd8jj4fSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIk SZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSdLnVzEMXVy03TKk2MRF jsvi5GRXpRTqVbgbq9PdcLiu8zpUYbtvmpByV28vz4r5Q83G01kov+03csplORqdn6WrJty3dzHv 4i5W+fZY3eV+E7c5hVXbhUW73+ZQpfD32LXFZDwdTGeTD09v0PeRBfisVui7T1qhfk3GV/uqCbkN g3fxF7cw549dmI+sy389Yl2e2sPr+FTwaBLwP/PyUHRVXlfbDw0/fbT4+PDTR++d4XPoZ035ky8l t2z/LX94cliGXG9ieuxFZfQpF5V//jg6/BDK1dsmHk5gFbu4Xbi8AIFAIBAIBAKBQCAQCAQCgUAg EAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQ CAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAI BAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgE AoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQC gUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKB QCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFA IBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAg EAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBP7yweeeT7FcV92mCot2s9vn2KWwe5dDk5fh bdN20ahRo0Y/n9FnXu8kSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIk SZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZIkSZK+6Iph6OKi 7ZYhxSYuclwWJye7KqVQr8LdWJ3uhsN1nddhua66TRUW7Wa3z7FLYfcuhyYvw9um7eLhZbtY3ez8 bRlyvYmpmD/UbDydhfL7fiOnXJaj0flZumrCfXsX/RH6k6m6y/0mbnMKVQrTH9/M/+fl5PevXk7C y4s/vN+e/unlmzevL2bFZDwdTGeT463eoO944NEkIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKB QCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFA IBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAg EAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQ CAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAI BAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgE 
AoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQC gUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKB QCAQCAQCgUAgEAgEAoFAIBAIBAKBwC8UfO75FBev/vF4piRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJ kiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJ kiRJkiRJkiRJkiRJkiRJ+mwrhqGLi7ZbhhSbuMhxWZyc7KqUQr0Kd2N1uhsO13Veh4tXh6FdrG6e +C7kehNTMX+o2Xg6C+UP/UZOuSxHo/OzdNWE+/Yu+gP0x6u6y/0mbnMKVQrTH9/MZ68v/vL6YlZM xtPBdDY53noM+o4HHk0CAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgE AoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQC gUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKB QCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFA IBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAg EAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQ CAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAI BAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgE AoFAIBAI/ELB555PcfGqfzzTkSRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJ kiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiRJkiR9 xhXD0MVF2y1Dik1c5LgsTk52VUqhXoW7sTrdDYfrOq/DxavD0C5WN098G3K9iamYP9RsPJ2FYdlv 5JTLcjQ6P0tXTbhv76I/QH+8qrvcb+I2p1ClMP3xzfx3r//4+mJWTMbTwXQ2Od5yDPqOBx5NAgKB QCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFA IBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAg EAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQ CAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAI BAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgE AoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQC gUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKB 
QCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCPxCweeeT3Hx6qeP Z2qSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmS JEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSpM+yYhi6uGi7ZUixiYscl8XJ ya5KKdSrcDdWp7vhcF3ndbh4dRjaxermiVHI9SamYv5Qs/F0FobDfiOnXJaj0flZumrCfXsX/QH6 4zXxcEbb/eZt7EK7ClV3ud/EbU5Ff1q3J/PVttrE4dfhdBLTIOXuNKy6dhMW+5TbTezm5Shcr2MX w/v9fnvaVXldbU9/Uzx11X5VxK5ru68GL8py+P3Xvw6ztg2reB3eVc0+HhZrFxf1qv7ZYr5/Rbt6 /4rwkVc8chlfPHYZ5z9dxy6mfZOrbQ6H5am3l+Gyu3kTu3CzEOEPL2cv5396+ef5m/FFMRlPB9PZ 5HiftEHf8cCjSUAgEAgEAoFAIBAIBP6iwBuzqI4cEAgEAoHAR/Vv+3n/aXd3mri9PHxpV+FFWZZh sa666jDaPfoezzefco/n7qZOqFJYtvu3TXQTAggEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAI BAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgE AoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQC gUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKB QCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFA IBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAg EAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQ CAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAIBAKBQCAQCAQCgUAgEAgEAoFAIBAI BAKBQCAQCAQCgUAgEAgEAoFAIPAXDz73fIqLVz9/PNOTJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmS JEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmSJEmS JEmSJEmSJEmSJEmSJH2GFcPQxUXbLUOKTVzkuCxOTnZVSqFehbuxOt0Nh+s6r8PFq8PQLlY3T3wf cr2JqZg/1Gw8nYXht/1GTrksR6Pzs3TVhPv2Lub9EfoDVt3lfhO3OYVV24VFu9/mUKUwLCbj6WA6 m3x4goO+jyxBV+V1tX3uGk1imqfchfHVvmpCbsMt+5O1e+wSfffUJaq3YRvTzZuRFlVTdWG13y5y 3W4fXJ6n9vByPhU8mvRvBj90nAc+Tk8fLT4+/PTRD03o0y4Ft+7d5eD8aZeD8+N91t/P4vYD32/8 48P6kxmfn390nt2/fhf/UD52JqOnzqS/sIXD4ertZWhX4UX5z2UcluXtQvoWBgKBQCAQCAQCgUDg /zfwuedTrI8cEAgEAoHAR/XJdwp+/qfusgyLddVVh/Hu0XcLvj/y39iHt39jf//LxcdvHbwoX5Qf nXi1DbHr2i687LrqbyHVf49h1rZhXV+uHzu/Hz5pfv0/XXuY4bLKVch/2x3m1J/i7Wy/Or17/07/ 
+/R//9qcfh1OJzENDitzGlZduwmLfcrtJnbzchSu17GLYVtt4vC3p/0dntPfPPTL26+K9/P/anBY qm9efP3r8Hr7rmrqZdjuN29jd/sm3Ldi9b07/h89JVJKNW8VAA== --0000000000007acbaa05c6290470-- From unknown Fri Sep 05 11:00:40 2025 X-Loop: help-debbugs@gnu.org Subject: bug#49340: small sort takes hours for UTF-8 locale Resent-From: =?UTF-8?Q?P=C3=A1draig?= Brady Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Fri, 02 Jul 2021 23:20:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 49340 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: Jon Klaas , 49340@debbugs.gnu.org Received: via spool by 49340-submit@debbugs.gnu.org id=B49340.162526796318984 (code B ref 49340); Fri, 02 Jul 2021 23:20:02 +0000 Received: (at 49340) by debbugs.gnu.org; 2 Jul 2021 23:19:23 +0000 Received: from localhost ([127.0.0.1]:37283 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1lzSRP-0004w7-1L for submit@debbugs.gnu.org; Fri, 02 Jul 2021 19:19:23 -0400 Received: from mail-wm1-f47.google.com ([209.85.128.47]:43701) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1lzSRH-0004vp-TL for 49340@debbugs.gnu.org; Fri, 02 Jul 2021 19:19:21 -0400 Received: by mail-wm1-f47.google.com with SMTP id q18-20020a1ce9120000b02901f259f3a250so7336557wmc.2 for <49340@debbugs.gnu.org>; Fri, 02 Jul 2021 16:19:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:subject:to:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=IkY+xcQFU5zONwt2RliH1ErWgRstO/i895wX9oCz70Y=; b=em4eZyCAGjQND0CznlXfwlnYKbLyXyBISqBZcSXQkCFm6xROAL/+82z6A09f9LpqlK poTa19MsZKT2s9MMMuiBDjAFGyZYrJjoD2u0NDC+tJuNtUICpYhACmpbjb43ZRJ+oSg8 u4b7FFo5+/14hh9EuDnfXzMfZd1QspK5jeeKPMuhCkfYnyJ86A36XGnAj6D6WF5cgWLT bkyw8SdjWub5dcZS8A7vEbtzRncrE3gHJ19G69Qd8KY1NhopAwSi7HkQZ06UZb0t2s4G nxYW08SqS5/YzhsCoDRFC8GGGygCwRakIn5kYnpfpgCOeMu3LoFD5Gk3FrVjuLf+6exu 
1udA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:subject:to:references:from:message-id :date:user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=IkY+xcQFU5zONwt2RliH1ErWgRstO/i895wX9oCz70Y=; b=GEdQrrIOLyYmWfd5vWsiSQ8vjk3RAMn+B2VBJ1SiQKd3IKrAOPmylMPVpuGSLpuc7B RB1bYsqxxsDvpLTLi2hRs5t2sFPgOEv5Kuii2F3JPqizPtKkyFxZxcBaG6HeCp+C93Ez D8IJDIX6KbpiiUY9rReDs1iY4lhf7CO2J6QQaK3Rlz3gl+v3iD+GOjt9IHcwK5VyAP1K /ZkmqcGXSmbO8tcvQk01PwKH0C7iGuYkaI+IceMi1i1tq0gU8EbXFrZr2j1OxiDRBioD UUmh/wHEEJJQRN3Zvc/jXSlI94AdsSeXkT5w/kYeA95X84rJGy9Zf8QotFyfUUNjzf7m Csqw== X-Gm-Message-State: AOAM531AI+ML0Iw9yg0qtLa+feI3ot+BvNWnNLo2DvHsQR8DkTDtmqFq EmbD6TNW/axs0pLLLCw753asemVj5Fc= X-Google-Smtp-Source: ABdhPJzLMt8k9t5TjO7cwOSI3SLEP6M43hV9EO9fc15F8QMmBwbhs69Q+p4aprxJdf1fHXxReiGS7g== X-Received: by 2002:a7b:c318:: with SMTP id k24mr1900663wmj.144.1625267949728; Fri, 02 Jul 2021 16:19:09 -0700 (PDT) Received: from localhost.localdomain (86-42-15-3-dynamic.agg2.lod.rsl-rtd.eircom.net. 
[86.42.15.3]) by smtp.googlemail.com with UTF8SMTPSA id e12sm4652656wrw.34.2021.07.02.16.19.08 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 02 Jul 2021 16:19:09 -0700 (PDT) References: From: =?UTF-8?Q?P=C3=A1draig?= Brady Message-ID: <6d10c384-81b6-f454-ba5c-94799caf0b12@draigBrady.com> Date: Sat, 3 Jul 2021 00:19:07 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Thunderbird/84.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit X-Spam-Score: 0.5 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.5 (/)

On 02/07/2021 20:32, Jon Klaas wrote:
> Hello,
>
> I encountered a file that was taking hours to sort that was expected
> to take negligible time.  This seems to be due to the locale
> LANG=en_US.UTF-8.  I've worked around the problem by using LC_ALL=C, but
> thought I would report this, as I didn't see a relevant bug report.
>
> This was seen on centos 8 using package
> coreutils-8.30-6.el8.x86_64
> and the current
> coreutils-8.30-8.el8.x86_64
>
>
> #takes under 1 second.
> export LC_ALL=C
> sort tst00776.out
>
> #slow sort takes many hours
> export LC_ALL=en_US.UTF-8
> sort tst00776.out
>
> Looks like most of the time is consumed here:
>
> #0  0x00007f4a65425c4b in strcoll_l () from /lib64/libc.so.6
> #1  0x00005600d195d365 in strcoll_loop ()
> #2  0x00005600d195bebd in xmemcoll0 ()
> #3  0x00005600d1951176 in compare ()
> #4  0x00005600d1951224 in sequential_sort ()
> #5  0x00005600d19511d5 in sequential_sort ()
> #6  0x00005600d195374b in sortlines ()
> #7  0x00005600d194d96b in main ()
>
> It's possible the input (attached) has invalid UTF-8.
>
> I also tried on an older RHEL 7 and did NOT reproduce the problem with
> coreutils.x86_64 8.22-23.el7

There are 7 lines in that input that are 65500 characters long,
which is triggering the slow behavior.
You can see the length distribution like:

  awk '{print length}' < tst00776.out | uniq -c | less

There are no NUL bytes:

  $ grep -Pa '\x00' tst00776.out | wc -l
  0

Also it's just ASCII data:

  $ iconv -fUTF8 -tASCII < tst00776.out | wc -l
  11743

Since your data is ASCII, using LC_ALL=C is most appropriate
to avoid strcoll():

  $ LC_ALL=C sort < tst00776.out | wc -l
  11743

You could also limit the length of lines compared with:

  $ sort -k1,1.80 -s < tst00776.out | wc -l
  11743

The vast majority of the time is spent in strcoll()
so this is a glibc issue rather than coreutils.
I think this is tracked in glibc at:
https://sourceware.org/bugzilla/show_bug.cgi?id=18441

Now saying that, we might be able to improve things.
For example, using strxfrm() + strcmp() to minimize processing.

cheers,
Pádraig

From unknown Fri Sep 05 11:00:40 2025 X-Loop: help-debbugs@gnu.org Subject: bug#49340: small sort takes hours for UTF-8 locale Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Sat, 03 Jul 2021 00:27:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 49340 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: 49340@debbugs.gnu.org X-Debbugs-Original-To: bug-coreutils@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.162527196225309 (code B ref -1); Sat, 03 Jul 2021 00:27:01 +0000 Received: (at submit) by debbugs.gnu.org; 3 Jul 2021 00:26:02 +0000 Received: from localhost ([127.0.0.1]:37379 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1lzTTu-0006a9-FK for submit@debbugs.gnu.org; Fri, 02 Jul 2021 20:26:02 -0400 Received: from lists.gnu.org ([209.51.188.17]:42498) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1lzTTs-0006Zk-1O
for submit@debbugs.gnu.org; Fri, 02 Jul 2021 20:26:01 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:41122) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1lzTTr-0004Ew-RT for bug-coreutils@gnu.org; Fri, 02 Jul 2021 20:25:59 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:40642) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1lzTTn-0002e3-95 for bug-coreutils@gnu.org; Fri, 02 Jul 2021 20:25:59 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 59CD516008D for ; Fri, 2 Jul 2021 17:25:52 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id CTjOlomUtmaR for ; Fri, 2 Jul 2021 17:25:51 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 97F871600B2 for ; Fri, 2 Jul 2021 17:25:51 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id eJYvqvTxaGGk for ; Fri, 2 Jul 2021 17:25:51 -0700 (PDT) Received: from [192.168.1.9] (cpe-172-91-119-151.socal.res.rr.com [172.91.119.151]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 58EAF16008D for ; Fri, 2 Jul 2021 17:25:51 -0700 (PDT) References: <6d10c384-81b6-f454-ba5c-94799caf0b12@draigBrady.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: Date: Fri, 2 Jul 2021 17:25:51 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0 MIME-Version: 1.0 In-Reply-To: <6d10c384-81b6-f454-ba5c-94799caf0b12@draigBrady.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: quoted-printable Received-SPF: pass client-ip=131.179.128.68; 
envelope-from=eggert@cs.ucla.edu; helo=zimbra.cs.ucla.edu X-Spam_score_int: -41 X-Spam_score: -4.2 X-Spam_bar: ---- X-Spam_report: (-4.2 / 5.0 requ) BAYES_00=-1.9, NICE_REPLY_A=-0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -1.3 (-) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

On 7/2/21 4:19 PM, Pádraig Brady wrote:
> we might be able to improve things.
> For example, using strxfrm() + strcmp() to minimize processing.

I tried that long ago, and it was waaayyy slower than strcoll in the
typical case. glibc strxfrm is not at all optimized.

Which is fine, since strxfrm is a dumb API: its only point is
performance but its portable API is inherently low-performance for
typical uses. I've never seen it useful.

In short, this is a glibc strcoll bug and should be fixed there.
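
For readers who want to try the two strategies discussed above outside of sort, here is a minimal sketch. It is written in Python only because its locale module wraps the C library's strcoll() and strxfrm(); it is not coreutils code, and the helper names sort_with_strcoll, sort_with_strxfrm and set_collation are hypothetical. It assumes the requested locale may not be installed and falls back to "C" in that case.

```python
import functools
import locale


def sort_with_strcoll(lines):
    """Sort by calling strcoll() for every pairwise comparison."""
    return sorted(lines, key=functools.cmp_to_key(locale.strcoll))


def sort_with_strxfrm(lines):
    """Transform each line once with strxfrm(), then compare the keys directly."""
    return sorted(lines, key=locale.strxfrm)


def set_collation(name):
    """Try to select a collation locale; fall back to "C" if it is unavailable."""
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        locale.setlocale(locale.LC_COLLATE, "C")
    # Calling setlocale() without a second argument queries the active setting.
    return locale.setlocale(locale.LC_COLLATE)


if __name__ == "__main__":
    active = set_collation("en_US.UTF-8")
    sample = ["pear", "Apple", "apple", "banana", "Zebra"]
    print("collation locale:", active)
    print("strcoll order:   ", sort_with_strcoll(sample))
    print("strxfrm order:   ", sort_with_strxfrm(sample))
```

Both helpers should produce the same order for a given locale; what differs is how often the collation code runs. Whether the one-time transform actually pays off depends on the C library, and as noted above it was reported to be much slower than strcoll() with glibc in the typical case, so timing both on the real input is the only reliable guide.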