GNU bug report logs - #49340
small sort takes hours for UTF-8 locale

Previous Next

Package: coreutils;

Reported by: Jon Klaas <blagothakus <at> gmail.com>

Date: Fri, 2 Jul 2021 20:52:01 UTC

Severity: normal

Full log


View this message in rfc822 format

From: Pádraig Brady <P <at> draigBrady.com>
To: Jon Klaas <blagothakus <at> gmail.com>, 49340 <at> debbugs.gnu.org
Subject: bug#49340: small sort takes hours for UTF-8 locale
Date: Sat, 3 Jul 2021 00:19:07 +0100
On 02/07/2021 20:32, Jon Klaas wrote:
> Hello,
> 
> I encountered a file that was taking hours to sort that was expected
> to take negligible time.  This seems to be due to the locale
> LANG=en_US.UTF-8.  I've worked around the problem by using LC_ALL=C, but
> thought I would report this, as I didn't see a relevant bug report.
> 
> This was seen on centos 8 using package
> coreutils-8.30-6.el8.x86_64
> and the current
> coreutils-8.30-8.el8.x86_64
> 
> 
> #takes under 1 second.
> export LC_ALL=C
> sort tst00776.out
> 
> #slow sort takes many hours
> export LC_ALL=en_US.UTF-8
> sort tst00776.out
> 
> Looks like most of the time is consumed here:
> 
> #0  0x00007f4a65425c4b in strcoll_l () from /lib64/libc.so.6
> #1  0x00005600d195d365 in strcoll_loop ()
> #2  0x00005600d195bebd in xmemcoll0 ()
> #3  0x00005600d1951176 in compare ()
> #4  0x00005600d1951224 in sequential_sort ()
> #5  0x00005600d19511d5 in sequential_sort ()
> #6  0x00005600d195374b in sortlines ()
> #7  0x00005600d194d96b in main ()
> 
> It's possible the input (attached) has invalid UTF-8.
> 
> I also tried on an older RHEL 7 and did NOT reproduce the problem with
> coreutils.x86_64                    8.22-23.el7

There are 7 lines in that input that are 65500 characters long,
which is triggering the slow behavior.
You can see the length distribution like:
  awk '{print length}' < tst00776.out | uniq -c | less

There are no NUL bytes:
  $ grep -Pa '\x00' tst00776.out | wc -l
  0

Also it's just ASCII data:
  $ iconv -fUTF8 -tASCII < tst00776.out | wc -l
  11743

Since your data is ASCII, using LC_ALL=C is most appropriate to avoid strcoll():
  $ LC_ALL=C sort < tst00776.out | wc -l
  11743

You could also limit the length of lines compared with:
  $ sort -k1,1.80 -s < tst00776.out | wc -l
  11743

The vast majority of the time is spent in strcoll()
so this is a glibc issue rather than coreutils.
I think this is tracked in glibc at:
https://sourceware.org/bugzilla/show_bug.cgi?id=18441

Now saying that, we might be able to improve things.
For example, using strxfrm() + strcmp() to minimize processing.

cheers,
Pádraig




This bug report was last modified 4 years and 65 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.