GNU bug report logs - #49340
small sort takes hours for UTF-8 locale

Package: coreutils

Reported by: Jon Klaas <blagothakus <at> gmail.com>

Date: Fri, 2 Jul 2021 20:52:01 UTC

Severity: normal


From: Jon Klaas <blagothakus <at> gmail.com>
To: 49340 <at> debbugs.gnu.org
Subject: bug#49340: small sort takes hours for UTF-8 locale
Date: Fri, 2 Jul 2021 14:32:21 -0500
Hello,

I encountered a small file that takes hours to sort, although sorting it
should take negligible time.  The slowdown appears to be caused by the
locale setting LANG=en_US.UTF-8.  I have worked around the problem by
setting LC_ALL=C, but I am reporting it because I did not find an existing
bug report.

This was seen on CentOS 8 with the package
coreutils-8.30-6.el8.x86_64
and with the current
coreutils-8.30-8.el8.x86_64


# Fast: takes under 1 second.
export LC_ALL=C
sort tst00776.out

# Slow: takes many hours.
export LC_ALL=en_US.UTF-8
sort tst00776.out
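
The comparison above can also be run without exporting LC_ALL into the
calling shell.  The following is only a minimal sketch: time_sort is a
hypothetical helper (not part of coreutils), the time limit is an
assumption added so the slow case does not run for hours, and elapsed
time is measured in whole seconds.

```shell
#!/bin/sh
# time_sort is a hypothetical helper, not part of coreutils.
# Usage: time_sort LOCALE LIMIT_SECONDS FILE
time_sort() {
    locale_name=$1
    limit=$2
    file=$3
    start=$(date +%s)
    # The LC_ALL assignment applies to this one command only, so the
    # calling shell's locale is left untouched.  timeout(1) stops sort
    # after $limit seconds and then exits with status 124.
    LC_ALL=$locale_name timeout "$limit" sort "$file" > /dev/null
    status=$?
    end=$(date +%s)
    if [ "$status" -eq 124 ]; then
        echo "$locale_name: gave up after ${limit}s"
    else
        echo "$locale_name: $((end - start))s (exit status $status)"
    fi
}
```

For example, "time_sort C 60 tst00776.out" followed by
"time_sort en_US.UTF-8 60 tst00776.out" would report both cases while
capping each run at one minute.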

Most of the time appears to be spent in strcoll_l, as this backtrace shows:

#0  0x00007f4a65425c4b in strcoll_l () from /lib64/libc.so.6
#1  0x00005600d195d365 in strcoll_loop ()
#2  0x00005600d195bebd in xmemcoll0 ()
#3  0x00005600d1951176 in compare ()
#4  0x00005600d1951224 in sequential_sort ()
#5  0x00005600d19511d5 in sequential_sort ()
#6  0x00005600d195374b in sortlines ()
#7  0x00005600d194d96b in main ()

The attached input may contain invalid UTF-8.
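
One way to test that suspicion is to ask iconv to convert the file from
UTF-8 to UTF-8, which fails on the first invalid sequence.  This is a
minimal sketch, not something that was run for this report;
is_valid_utf8 is a hypothetical helper name.

```shell
#!/bin/sh
# is_valid_utf8 is a hypothetical helper, not an existing tool.
# It returns 0 if FILE decodes cleanly as UTF-8 and non-zero otherwise.
is_valid_utf8() {
    # A UTF-8 to UTF-8 conversion changes nothing, but iconv exits with
    # a non-zero status when it meets an invalid input sequence.
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

After unpacking the attachment, a check could look like
"is_valid_utf8 tst00776.out && echo valid || echo invalid".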

On an older RHEL 7 system I could NOT reproduce the problem with
coreutils.x86_64                    8.22-23.el7

Thanks,

Jon Klaas
[tst00776.out.gz (application/x-gzip, attachment)]

This bug report was last modified 4 years and 65 days ago.

