GNU bug report logs - #8824
sort occasionally hangs - appears to be in a merge-sort loop

Previous Next

Package: coreutils;

Reported by: Mina Naguib <mina.naguib <at> bloomdigital.com>

Date: Wed, 8 Jun 2011 20:41:02 UTC

Severity: normal

Tags: moreinfo

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 8824 in the body.
You can then email your comments to 8824 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8824; Package coreutils. (Wed, 08 Jun 2011 20:41:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mina Naguib <mina.naguib <at> bloomdigital.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 08 Jun 2011 20:41:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mina Naguib <mina.naguib <at> bloomdigital.com>
To: bug-coreutils <at> gnu.org
Subject: sort occasionally hangs - appears to be in a merge-sort loop
Date: Wed, 8 Jun 2011 16:16:17 -0400
Hi

I've observed a few instances when `sort` simply "hangs" and never returns the sorted data.

Currently using coreutils 8.12 on gentoo linux 2.6.25, `locale` outputs:
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

An example ongoing right now on one of my servers.  Sort invoked as:
/usr/bin/sort -t '|' -k1,1 -k6,6n

It was fed via STDIN pipe-delimited line-based records, and STDIN was closed.  The parent process is waiting on sort to start receiving the output via STDOUT which never happens.

Input was roughly 350 million records, each record looking something like this:
000WQg16MdC2Puk|direct_count|2049|63581|15090|1306695571

I dug into the sort process with strace and I believe it's doing one of the final merge sorts:
read(12, "986|61518|14528|1301616354\nnHM5Og"..., 4096) = 4096
read(12, "|1967|61925|14584|1301634780\nnHMk"..., 4096) = 4096
read(12, "533\nnHNQDohSrKT3Ffn|direct_count|"..., 4096) = 4096
write(4, "4488|1302016964\nnGq7jG4hfFS9flL|d"..., 4096) = 4096
write(4, "nGq9POv11Jh7OiQ|direct_count|1965"..., 4096) = 4096
write(4, "05|1299763468\nnGqCgAT3jwkrzhc|dir"..., 4096) = 4096
write(4, "460\nnGqFItcr2F1Pz8n|tag|1995|6275"..., 4096) = 4096
write(4, "JuhUfMkOPC7J|direct_count|1914|57"..., 4096) = 4096
write(4, "|1914|57180|13801|1302640204\nnGqO"..., 4096) = 4096

After watching it do this diddy for 5.5 hours I dug in further.  Here's its open temp files:

$ date; lsof -n -p 31370 | grep tmp | nl
Wed Jun  8 15:37:41 EDT 2011
     1  sort    31370  dmt    4u   REG    8,1 3123896320  12705303 /tmp/sort0YtEtm
     2  sort    31370  dmt    5r   REG    8,1 1199907358  12705551 /tmp/sortWNOpxy
     3  sort    31370  dmt    7r   REG    8,1 1181994889  12705627 /tmp/sortIpm44r
     4  sort    31370  dmt    8r   REG    8,1 1199857626  12705636 /tmp/sort2WOPPo
     5  sort    31370  dmt    9r   REG    8,1 1196868088  12705640 /tmp/sortCUkW6a
     6  sort    31370  dmt   10r   REG    8,1 1199488185  12705644 /tmp/sortSDopQX
     7  sort    31370  dmt   11r   REG    8,1 1188267263  12705646 /tmp/sortUdXqke
     8  sort    31370  dmt   12r   REG    8,1 1176087055  12705665 /tmp/sortMiavw5
     9  sort    31370  dmt   13r   REG    8,1 1184572313  12705666 /tmp/sortUf5q03
    10  sort    31370  dmt   14r   REG    8,1 1187433621  12705672 /tmp/sorti3mKO9
    11  sort    31370  dmt   15r   REG    8,1 1184537051  12705680 /tmp/sort0iYrQr
    12  sort    31370  dmt   16r   REG    8,1 1186874578  12705696 /tmp/sortUwVkpQ
    13  sort    31370  dmt   17r   REG    8,1 1173767980  12705703 /tmp/sortov5uyz
    14  sort    31370  dmt   18r   REG    8,1 1172616590  12705711 /tmp/sort2ibXqt
    15  sort    31370  dmt   19r   REG    8,1 1184319006  12705714 /tmp/sortA1DmGE
    16  sort    31370  dmt   20r   REG    8,1 1188372691  12705728 /tmp/sortibGbvd
    17  sort    31370  dmt   21r   REG    8,1 1185407259  12705713 /tmp/sort4Ou1u6

again:
$ date; lsof -n -p 31370 | grep tmp | nl
Wed Jun  8 15:37:58 EDT 2011
     1  sort    31370  dmt    4u   REG    8,1 3432607744  12705303 /tmp/sort0YtEtm
     2  sort    31370  dmt    5r   REG    8,1 1199907358  12705551 /tmp/sortWNOpxy
     3  sort    31370  dmt    7r   REG    8,1 1181994889  12705627 /tmp/sortIpm44r
     4  sort    31370  dmt    8r   REG    8,1 1199857626  12705636 /tmp/sort2WOPPo
     5  sort    31370  dmt    9r   REG    8,1 1196868088  12705640 /tmp/sortCUkW6a
     6  sort    31370  dmt   10r   REG    8,1 1199488185  12705644 /tmp/sortSDopQX
     7  sort    31370  dmt   11r   REG    8,1 1188267263  12705646 /tmp/sortUdXqke
     8  sort    31370  dmt   12r   REG    8,1 1176087055  12705665 /tmp/sortMiavw5
     9  sort    31370  dmt   13r   REG    8,1 1184572313  12705666 /tmp/sortUf5q03
    10  sort    31370  dmt   14r   REG    8,1 1187433621  12705672 /tmp/sorti3mKO9
    11  sort    31370  dmt   15r   REG    8,1 1184537051  12705680 /tmp/sort0iYrQr
    12  sort    31370  dmt   16r   REG    8,1 1186874578  12705696 /tmp/sortUwVkpQ
    13  sort    31370  dmt   17r   REG    8,1 1173767980  12705703 /tmp/sortov5uyz
    14  sort    31370  dmt   18r   REG    8,1 1172616590  12705711 /tmp/sort2ibXqt
    15  sort    31370  dmt   19r   REG    8,1 1184319006  12705714 /tmp/sortA1DmGE
    16  sort    31370  dmt   20r   REG    8,1 1188372691  12705728 /tmp/sortibGbvd
    17  sort    31370  dmt   21r   REG    8,1 1185407259  12705713 /tmp/sort4Ou1u6


again several times, you see as the main writable file approaches the total size sum of the other files combined:
Wed Jun  8 15:55:20 EDT 2011
     1	sort    31370  dmt    4u   REG    8,1 18965975040  12705303 /tmp/sort0YtEtm
     2	sort    31370  dmt    5r   REG    8,1  1199907358  12705551 /tmp/sortWNOpxy
     3	sort    31370  dmt    7r   REG    8,1  1181994889  12705627 /tmp/sortIpm44r
     4	sort    31370  dmt    8r   REG    8,1  1199857626  12705636 /tmp/sort2WOPPo
     5	sort    31370  dmt    9r   REG    8,1  1196868088  12705640 /tmp/sortCUkW6a
     6	sort    31370  dmt   10r   REG    8,1  1199488185  12705644 /tmp/sortSDopQX
     7	sort    31370  dmt   11r   REG    8,1  1188267263  12705646 /tmp/sortUdXqke
     8	sort    31370  dmt   12r   REG    8,1  1176087055  12705665 /tmp/sortMiavw5
     9	sort    31370  dmt   13r   REG    8,1  1184572313  12705666 /tmp/sortUf5q03
    10	sort    31370  dmt   14r   REG    8,1  1187433621  12705672 /tmp/sorti3mKO9
    11	sort    31370  dmt   15r   REG    8,1  1184537051  12705680 /tmp/sort0iYrQr
    12	sort    31370  dmt   16r   REG    8,1  1186874578  12705696 /tmp/sortUwVkpQ
    13	sort    31370  dmt   17r   REG    8,1  1173767980  12705703 /tmp/sortov5uyz
    14	sort    31370  dmt   18r   REG    8,1  1172616590  12705711 /tmp/sort2ibXqt
    15	sort    31370  dmt   19r   REG    8,1  1184319006  12705714 /tmp/sortA1DmGE
    16	sort    31370  dmt   20r   REG    8,1  1188372691  12705728 /tmp/sortibGbvd
    17	sort    31370  dmt   21r   REG    8,1  1185407259  12705713 /tmp/sort4Ou1u6

then what appears to be finalization and cleanup:
Wed Jun  8 15:55:28 EDT 2011
     1	sort    31370  dmt    4u   REG    8,1 18990366720  12705303 /tmp/sort0YtEtm
     2	sort    31370  dmt    5r   REG    8,1  1199907358  12705551 /tmp/sortWNOpxy
     3	sort    31370  dmt    7r   REG    8,1  1181994889  12705627 /tmp/sortIpm44r
     4	sort    31370  dmt    8r   REG    8,1  1199857626  12705636 /tmp/sort2WOPPo
     5	sort    31370  dmt    9r   REG    8,1  1196868088  12705640 /tmp/sortCUkW6a
     6	sort    31370  dmt   12r   REG    8,1  1176087055  12705665 /tmp/sortMiavw5
     7	sort    31370  dmt   13r   REG    8,1  1184572313  12705666 /tmp/sortUf5q03
     8	sort    31370  dmt   14r   REG    8,1  1187433621  12705672 /tmp/sorti3mKO9
     9	sort    31370  dmt   15r   REG    8,1  1184537051  12705680 /tmp/sort0iYrQr
    10	sort    31370  dmt   16r   REG    8,1  1186874578  12705696 /tmp/sortUwVkpQ
    11	sort    31370  dmt   17r   REG    8,1  1173767980  12705703 /tmp/sortov5uyz
    12	sort    31370  dmt   18r   REG    8,1  1172616590  12705711 /tmp/sort2ibXqt

again, almost done:
Wed Jun  8 15:55:34 EDT 2011
     1	sort    31370  dmt    4u   REG    8,1 18990370816  12705303 /tmp/sort0YtEtm
     2	sort    31370  dmt   16r   REG    8,1  1186874578  12705696 /tmp/sortUwVkpQ

again, 1 main file left:
Wed Jun  8 15:55:35 EDT 2011
     1	sort    31370  dmt    4u   REG    8,1 18990370816  12705303 /tmp/sort0YtEtm

But alas, 16 more readable files pop up and 1 writable starts adding up again:
Wed Jun  8 15:55:36 EDT 2011
     1	sort    31370  dmt    4u   REG    8,1          0  12705551 /tmp/sortevbKK1
     2	sort    31370  dmt    5r   REG    8,1 1178392494  12705739 /tmp/sortIIvZhW
     3	sort    31370  dmt    7r   REG    8,1 1180102559  12706259 /tmp/sortwZr2sU
     4	sort    31370  dmt    8r   REG    8,1 1181439270  12706530 /tmp/sort8B2hbx
     5	sort    31370  dmt    9r   REG    8,1 1183275901  12706731 /tmp/sorteGJa35
     6	sort    31370  dmt   10r   REG    8,1 1178538478  12706793 /tmp/sortoGXmOo
     7	sort    31370  dmt   11r   REG    8,1 1191511729  12706795 /tmp/sortG0RNr9
     8	sort    31370  dmt   12r   REG    8,1 1199965207  12706799 /tmp/sort6Q8tAy
     9	sort    31370  dmt   13r   REG    8,1 1192944995  12706817 /tmp/sortCGXzT2
    10	sort    31370  dmt   14r   REG    8,1 1191485559  12706823 /tmp/sortMwsygv
    11	sort    31370  dmt   15r   REG    8,1 1200144993  12706830 /tmp/sortyA2RWw
    12	sort    31370  dmt   16r   REG    8,1 1198251727  12706866 /tmp/sortyZlqHC
    13	sort    31370  dmt   17r   REG    8,1 1184876212  12706871 /tmp/sortCCjKOH
    14	sort    31370  dmt   18r   REG    8,1 1174916186  12706880 /tmp/sortKrUJSJ
    15	sort    31370  dmt   19r   REG    8,1 1184800950  12706824 /tmp/sort2Vjy7s
    16	sort    31370  dmt   20r   REG    8,1 1187817055  12706891 /tmp/sort89xsBm
    17	sort    31370  dmt   21r   REG    8,1 1185686697  12706944 /tmp/sortwjlE5L

Wed Jun  8 15:55:40 EDT 2011
     1	sort    31370  dmt    4u   REG    8,1   62631936  12705551 /tmp/sortevbKK1
     2	sort    31370  dmt    5r   REG    8,1 1178392494  12705739 /tmp/sortIIvZhW
     3	sort    31370  dmt    7r   REG    8,1 1180102559  12706259 /tmp/sortwZr2sU
     4	sort    31370  dmt    8r   REG    8,1 1181439270  12706530 /tmp/sort8B2hbx
     5	sort    31370  dmt    9r   REG    8,1 1183275901  12706731 /tmp/sorteGJa35
     6	sort    31370  dmt   10r   REG    8,1 1178538478  12706793 /tmp/sortoGXmOo
     7	sort    31370  dmt   11r   REG    8,1 1191511729  12706795 /tmp/sortG0RNr9
     8	sort    31370  dmt   12r   REG    8,1 1199965207  12706799 /tmp/sort6Q8tAy
     9	sort    31370  dmt   13r   REG    8,1 1192944995  12706817 /tmp/sortCGXzT2
    10	sort    31370  dmt   14r   REG    8,1 1191485559  12706823 /tmp/sortMwsygv
    11	sort    31370  dmt   15r   REG    8,1 1200144993  12706830 /tmp/sortyA2RWw
    12	sort    31370  dmt   16r   REG    8,1 1198251727  12706866 /tmp/sortyZlqHC
    13	sort    31370  dmt   17r   REG    8,1 1184876212  12706871 /tmp/sortCCjKOH
    14	sort    31370  dmt   18r   REG    8,1 1174916186  12706880 /tmp/sortKrUJSJ
    15	sort    31370  dmt   19r   REG    8,1 1184800950  12706824 /tmp/sort2Vjy7s
    16	sort    31370  dmt   20r   REG    8,1 1187817055  12706891 /tmp/sort89xsBm
    17	sort    31370  dmt   21r   REG    8,1 1185686697  12706944 /tmp/sortwjlE5L

I may be mistaken, but I don't think this is normal behaviour - once the 16-way merge sort is done, I see no reason to observe this whole process happening over and over.

Ironically, I had upgraded from coreutils 8.5 to 8.12 because other tools on the same machine invokes sort with "--compress-program=gzip" and had observed similar-but-different hanging on that tool's invocation.  I dug in and found this commit ( http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=1b31ce6982a9151d9dfe2ea3595ad7595cb9ca86 ) which encouraged me to try the upgrade to fix the hang-while-compressing issue, but I'm now seeing this hanging even in the non-compressed invocation.

In my case, I think I'll downgrade coreutils back down to 8.5 as well as disable all tools from invoking it with "--compress-program" to get predictable non-hanging behaviour across the board.

If I can be of further help please let me know.

Thank you.






Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8824; Package coreutils. (Thu, 09 Jun 2011 06:47:02 GMT) Full text and rfc822 format available.

Message #8 received at 8824 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mina Naguib <mina.naguib <at> bloomdigital.com>
Cc: 8824 <at> debbugs.gnu.org
Subject: Re: bug#8824: sort occasionally hangs - appears to be in a merge-sort
	loop
Date: Wed, 08 Jun 2011 23:46:34 -0700
On 06/08/11 13:16, Mina Naguib wrote:

> I may be mistaken, but I don't think this is normal behaviour - once
> the 16-way merge sort is done, I see no reason to observe this whole
> process happening over and over.

Thanks for your bug report.  If 'sort' is breaking up its input into 4
MiB chunks, sorting them, creating a separate temp file for each
chunk, and then merging the results with a 16-way merge, then the
first level of 16-way merges will produce 64 MiB files, and the second
level will produce 1 GiB temp files, which is about the size you're
observing.  Since your input is about 18 GiB in size (is that right?),
I'd expect to see two third-level merges.  The first would be a 16-way
merge, generating about 16 GiB total.  The second would be a roughly
2-way merge, generating about 2 GiB.  Then there would be a single
fourth-level merge of these two big files into the final 18 GiB of
output.

But you're observing a second third-level merge that's the same
size as the first.  So something is wrong here.

I just now tried to reproduce your bug on my host (Fedora 14 x86-64
with 8 GiB RAM) as follows:

shuf -i 1-350000000 -o col1
shuf -i 1-350000000 -o col2
paste col1 col2 |
  awk '{printf "%s|direct_count|2049|63581|15090|%s\n", $1, $2}' |
  sort -t '|' -k1,1 -k6,6n |
  cat >sortout

Alas, this didn't reproduce the problem; it worked just fine.

Can you modify the above recipe somehow and make the problem
happen on your host?

How much RAM do you have?  Is your host x86 or x86-64 or what?
(That "4 MiB" in my example is an absurdly small number, and
this is a performance bug in 'sort', but that's a different
matter I think.)




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8824; Package coreutils. (Thu, 09 Jun 2011 08:33:01 GMT) Full text and rfc822 format available.

Message #11 received at 8824 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Mina Naguib <mina.naguib <at> bloomdigital.com>
Cc: 8824 <at> debbugs.gnu.org
Subject: Re: bug#8824: sort occasionally hangs - appears to be in a merge-sort
	loop
Date: Thu, 09 Jun 2011 09:29:10 +0100
On 08/06/11 21:16, Mina Naguib wrote:
> 
> I've observed a few instances when `sort` simply "hangs" and never returns the sorted data.

> In my case, I think I'll downgrade coreutils back down to 8.5 as well as disable all tools from invoking it with "--compress-program" to get predictable non-hanging behaviour across the board.

coreutils 8.6 got the threaded sort implementation, so maybe that's triggering an obscure bug.
If you were on a multi processor machine, you could restrict `sort` to a single
thread by adding the --parallel=1 option. It's worth absolving that I think.

thanks for the very detailed bug report,
Pádraig.




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8824; Package coreutils. (Fri, 10 Jun 2011 13:55:02 GMT) Full text and rfc822 format available.

Message #14 received at 8824 <at> debbugs.gnu.org (full text, mbox):

From: Mina Naguib <mina.naguib <at> bloomdigital.com>
To: 8824 <at> debbugs.gnu.org
Subject: Re: bug#8824: sort occasionally hangs - appears to be in a merge-sort
	loop
Date: Fri, 10 Jun 2011 09:54:08 -0400
Hi

I realized that there was a mistake on my part in the initial report.

On 2011-06-09, at 2:46 AM, Paul Eggert wrote:
> Thanks for your bug report.  If 'sort' is breaking up its input into 4
> MiB chunks, sorting them, creating a separate temp file for each
> chunk, and then merging the results with a 16-way merge, then the
> first level of 16-way merges will produce 64 MiB files, and the second
> level will produce 1 GiB temp files, which is about the size you're
> observing.  Since your input is about 18 GiB in size (is that right?),
> I'd expect to see two third-level merges.  The first would be a 16-way
> merge, generating about 16 GiB total.  The second would be a roughly
> 2-way merge, generating about 2 GiB.  Then there would be a single
> fourth-level merge of these two big files into the final 18 GiB of
> output.


The input data is not 350 million records, but significantly higher (to the tune of 2 billion).  I'd estimate that at ~56 bytes per record, the data set's size was roughly 100Gigs

I may have jumped the gun and associated the delay/looping I've observed with the unrelated "hang" bug I've observed earlier with compressed temp files in coreutils 8.5.

Given this new information, do you think the behaviour I observed is reasonable ?  Or is there still the possibility of a bug worth pursuing ?


> How much RAM do you have?  Is your host x86 or x86-64 or what?
> (That "4 MiB" in my example is an absurdly small number, and
> this is a performance bug in 'sort', but that's a different
> matter I think.)

This occurred on an x86_64 box with 12G RAM, 4 x 2.66Ghz CPUs.  The disks are quite slow as it's all on the same local RAID5 volume for both the read-data and the written temp files and eventual output. 


Thank you.



Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8824; Package coreutils. (Fri, 10 Jun 2011 16:35:02 GMT) Full text and rfc822 format available.

Message #17 received at 8824 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mina Naguib <mina.naguib <at> bloomdigital.com>
Cc: 8824 <at> debbugs.gnu.org
Subject: Re: bug#8824: sort occasionally hangs - appears to be in a merge-sort
	loop
Date: Fri, 10 Jun 2011 09:33:36 -0700
On 06/10/11 06:54, Mina Naguib wrote:

> Given this new information, do you think the behaviour I observed is reasonable?
> Or is there still the possibility of a bug worth pursuing ?

I suspect there is a *performance* bug, but not a correctness bug.

Does the performance improve if you use "sort -S 6G"
for the big sort?




Added tag(s) moreinfo. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 22:35:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 8824 <at> debbugs.gnu.org and Mina Naguib <mina.naguib <at> bloomdigital.com> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 22:35:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 09 Nov 2018 12:24:06 GMT) Full text and rfc822 format available.

This bug report was last modified 6 years and 228 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.