GNU bug report logs -
#8824
sort occasionally hangs - appears to be in a merge-sort loop
Previous Next
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 8824 in the body.
You can then email your comments to 8824 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#8824
; Package
coreutils
.
(Wed, 08 Jun 2011 20:41:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Mina Naguib <mina.naguib <at> bloomdigital.com>
:
New bug report received and forwarded. Copy sent to
bug-coreutils <at> gnu.org
.
(Wed, 08 Jun 2011 20:41:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi
I've observed a few instances when `sort` simply "hangs" and never returns the sorted data.
Currently using coreutils 8.12 on gentoo linux 2.6.25, `locale` outputs:
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
An example ongoing right now on one of my servers. Sort invoked as:
/usr/bin/sort -t '|' -k1,1 -k6,6n
It was fed via STDIN pipe-delimited line-based records, and STDIN was closed. The parent process is waiting on sort to start receiving the output via STDOUT which never happens.
Input was roughly 350 million records, each record looking something like this:
000WQg16MdC2Puk|direct_count|2049|63581|15090|1306695571
I dug into the sort process with strace and I believe it's doing one of the final merge sorts:
read(12, "986|61518|14528|1301616354\nnHM5Og"..., 4096) = 4096
read(12, "|1967|61925|14584|1301634780\nnHMk"..., 4096) = 4096
read(12, "533\nnHNQDohSrKT3Ffn|direct_count|"..., 4096) = 4096
write(4, "4488|1302016964\nnGq7jG4hfFS9flL|d"..., 4096) = 4096
write(4, "nGq9POv11Jh7OiQ|direct_count|1965"..., 4096) = 4096
write(4, "05|1299763468\nnGqCgAT3jwkrzhc|dir"..., 4096) = 4096
write(4, "460\nnGqFItcr2F1Pz8n|tag|1995|6275"..., 4096) = 4096
write(4, "JuhUfMkOPC7J|direct_count|1914|57"..., 4096) = 4096
write(4, "|1914|57180|13801|1302640204\nnGqO"..., 4096) = 4096
After watching it do this diddy for 5.5 hours I dug in further. Here's its open temp files:
$ date; lsof -n -p 31370 | grep tmp | nl
Wed Jun 8 15:37:41 EDT 2011
1 sort 31370 dmt 4u REG 8,1 3123896320 12705303 /tmp/sort0YtEtm
2 sort 31370 dmt 5r REG 8,1 1199907358 12705551 /tmp/sortWNOpxy
3 sort 31370 dmt 7r REG 8,1 1181994889 12705627 /tmp/sortIpm44r
4 sort 31370 dmt 8r REG 8,1 1199857626 12705636 /tmp/sort2WOPPo
5 sort 31370 dmt 9r REG 8,1 1196868088 12705640 /tmp/sortCUkW6a
6 sort 31370 dmt 10r REG 8,1 1199488185 12705644 /tmp/sortSDopQX
7 sort 31370 dmt 11r REG 8,1 1188267263 12705646 /tmp/sortUdXqke
8 sort 31370 dmt 12r REG 8,1 1176087055 12705665 /tmp/sortMiavw5
9 sort 31370 dmt 13r REG 8,1 1184572313 12705666 /tmp/sortUf5q03
10 sort 31370 dmt 14r REG 8,1 1187433621 12705672 /tmp/sorti3mKO9
11 sort 31370 dmt 15r REG 8,1 1184537051 12705680 /tmp/sort0iYrQr
12 sort 31370 dmt 16r REG 8,1 1186874578 12705696 /tmp/sortUwVkpQ
13 sort 31370 dmt 17r REG 8,1 1173767980 12705703 /tmp/sortov5uyz
14 sort 31370 dmt 18r REG 8,1 1172616590 12705711 /tmp/sort2ibXqt
15 sort 31370 dmt 19r REG 8,1 1184319006 12705714 /tmp/sortA1DmGE
16 sort 31370 dmt 20r REG 8,1 1188372691 12705728 /tmp/sortibGbvd
17 sort 31370 dmt 21r REG 8,1 1185407259 12705713 /tmp/sort4Ou1u6
again:
$ date; lsof -n -p 31370 | grep tmp | nl
Wed Jun 8 15:37:58 EDT 2011
1 sort 31370 dmt 4u REG 8,1 3432607744 12705303 /tmp/sort0YtEtm
2 sort 31370 dmt 5r REG 8,1 1199907358 12705551 /tmp/sortWNOpxy
3 sort 31370 dmt 7r REG 8,1 1181994889 12705627 /tmp/sortIpm44r
4 sort 31370 dmt 8r REG 8,1 1199857626 12705636 /tmp/sort2WOPPo
5 sort 31370 dmt 9r REG 8,1 1196868088 12705640 /tmp/sortCUkW6a
6 sort 31370 dmt 10r REG 8,1 1199488185 12705644 /tmp/sortSDopQX
7 sort 31370 dmt 11r REG 8,1 1188267263 12705646 /tmp/sortUdXqke
8 sort 31370 dmt 12r REG 8,1 1176087055 12705665 /tmp/sortMiavw5
9 sort 31370 dmt 13r REG 8,1 1184572313 12705666 /tmp/sortUf5q03
10 sort 31370 dmt 14r REG 8,1 1187433621 12705672 /tmp/sorti3mKO9
11 sort 31370 dmt 15r REG 8,1 1184537051 12705680 /tmp/sort0iYrQr
12 sort 31370 dmt 16r REG 8,1 1186874578 12705696 /tmp/sortUwVkpQ
13 sort 31370 dmt 17r REG 8,1 1173767980 12705703 /tmp/sortov5uyz
14 sort 31370 dmt 18r REG 8,1 1172616590 12705711 /tmp/sort2ibXqt
15 sort 31370 dmt 19r REG 8,1 1184319006 12705714 /tmp/sortA1DmGE
16 sort 31370 dmt 20r REG 8,1 1188372691 12705728 /tmp/sortibGbvd
17 sort 31370 dmt 21r REG 8,1 1185407259 12705713 /tmp/sort4Ou1u6
again several times, you see as the main writable file approaches the total size sum of the other files combined:
Wed Jun 8 15:55:20 EDT 2011
1 sort 31370 dmt 4u REG 8,1 18965975040 12705303 /tmp/sort0YtEtm
2 sort 31370 dmt 5r REG 8,1 1199907358 12705551 /tmp/sortWNOpxy
3 sort 31370 dmt 7r REG 8,1 1181994889 12705627 /tmp/sortIpm44r
4 sort 31370 dmt 8r REG 8,1 1199857626 12705636 /tmp/sort2WOPPo
5 sort 31370 dmt 9r REG 8,1 1196868088 12705640 /tmp/sortCUkW6a
6 sort 31370 dmt 10r REG 8,1 1199488185 12705644 /tmp/sortSDopQX
7 sort 31370 dmt 11r REG 8,1 1188267263 12705646 /tmp/sortUdXqke
8 sort 31370 dmt 12r REG 8,1 1176087055 12705665 /tmp/sortMiavw5
9 sort 31370 dmt 13r REG 8,1 1184572313 12705666 /tmp/sortUf5q03
10 sort 31370 dmt 14r REG 8,1 1187433621 12705672 /tmp/sorti3mKO9
11 sort 31370 dmt 15r REG 8,1 1184537051 12705680 /tmp/sort0iYrQr
12 sort 31370 dmt 16r REG 8,1 1186874578 12705696 /tmp/sortUwVkpQ
13 sort 31370 dmt 17r REG 8,1 1173767980 12705703 /tmp/sortov5uyz
14 sort 31370 dmt 18r REG 8,1 1172616590 12705711 /tmp/sort2ibXqt
15 sort 31370 dmt 19r REG 8,1 1184319006 12705714 /tmp/sortA1DmGE
16 sort 31370 dmt 20r REG 8,1 1188372691 12705728 /tmp/sortibGbvd
17 sort 31370 dmt 21r REG 8,1 1185407259 12705713 /tmp/sort4Ou1u6
then what appears to be finalization and cleanup:
Wed Jun 8 15:55:28 EDT 2011
1 sort 31370 dmt 4u REG 8,1 18990366720 12705303 /tmp/sort0YtEtm
2 sort 31370 dmt 5r REG 8,1 1199907358 12705551 /tmp/sortWNOpxy
3 sort 31370 dmt 7r REG 8,1 1181994889 12705627 /tmp/sortIpm44r
4 sort 31370 dmt 8r REG 8,1 1199857626 12705636 /tmp/sort2WOPPo
5 sort 31370 dmt 9r REG 8,1 1196868088 12705640 /tmp/sortCUkW6a
6 sort 31370 dmt 12r REG 8,1 1176087055 12705665 /tmp/sortMiavw5
7 sort 31370 dmt 13r REG 8,1 1184572313 12705666 /tmp/sortUf5q03
8 sort 31370 dmt 14r REG 8,1 1187433621 12705672 /tmp/sorti3mKO9
9 sort 31370 dmt 15r REG 8,1 1184537051 12705680 /tmp/sort0iYrQr
10 sort 31370 dmt 16r REG 8,1 1186874578 12705696 /tmp/sortUwVkpQ
11 sort 31370 dmt 17r REG 8,1 1173767980 12705703 /tmp/sortov5uyz
12 sort 31370 dmt 18r REG 8,1 1172616590 12705711 /tmp/sort2ibXqt
again, almost done:
Wed Jun 8 15:55:34 EDT 2011
1 sort 31370 dmt 4u REG 8,1 18990370816 12705303 /tmp/sort0YtEtm
2 sort 31370 dmt 16r REG 8,1 1186874578 12705696 /tmp/sortUwVkpQ
again, 1 main file left:
Wed Jun 8 15:55:35 EDT 2011
1 sort 31370 dmt 4u REG 8,1 18990370816 12705303 /tmp/sort0YtEtm
But alas, 16 more readable files pop up and 1 writable starts adding up again:
Wed Jun 8 15:55:36 EDT 2011
1 sort 31370 dmt 4u REG 8,1 0 12705551 /tmp/sortevbKK1
2 sort 31370 dmt 5r REG 8,1 1178392494 12705739 /tmp/sortIIvZhW
3 sort 31370 dmt 7r REG 8,1 1180102559 12706259 /tmp/sortwZr2sU
4 sort 31370 dmt 8r REG 8,1 1181439270 12706530 /tmp/sort8B2hbx
5 sort 31370 dmt 9r REG 8,1 1183275901 12706731 /tmp/sorteGJa35
6 sort 31370 dmt 10r REG 8,1 1178538478 12706793 /tmp/sortoGXmOo
7 sort 31370 dmt 11r REG 8,1 1191511729 12706795 /tmp/sortG0RNr9
8 sort 31370 dmt 12r REG 8,1 1199965207 12706799 /tmp/sort6Q8tAy
9 sort 31370 dmt 13r REG 8,1 1192944995 12706817 /tmp/sortCGXzT2
10 sort 31370 dmt 14r REG 8,1 1191485559 12706823 /tmp/sortMwsygv
11 sort 31370 dmt 15r REG 8,1 1200144993 12706830 /tmp/sortyA2RWw
12 sort 31370 dmt 16r REG 8,1 1198251727 12706866 /tmp/sortyZlqHC
13 sort 31370 dmt 17r REG 8,1 1184876212 12706871 /tmp/sortCCjKOH
14 sort 31370 dmt 18r REG 8,1 1174916186 12706880 /tmp/sortKrUJSJ
15 sort 31370 dmt 19r REG 8,1 1184800950 12706824 /tmp/sort2Vjy7s
16 sort 31370 dmt 20r REG 8,1 1187817055 12706891 /tmp/sort89xsBm
17 sort 31370 dmt 21r REG 8,1 1185686697 12706944 /tmp/sortwjlE5L
Wed Jun 8 15:55:40 EDT 2011
1 sort 31370 dmt 4u REG 8,1 62631936 12705551 /tmp/sortevbKK1
2 sort 31370 dmt 5r REG 8,1 1178392494 12705739 /tmp/sortIIvZhW
3 sort 31370 dmt 7r REG 8,1 1180102559 12706259 /tmp/sortwZr2sU
4 sort 31370 dmt 8r REG 8,1 1181439270 12706530 /tmp/sort8B2hbx
5 sort 31370 dmt 9r REG 8,1 1183275901 12706731 /tmp/sorteGJa35
6 sort 31370 dmt 10r REG 8,1 1178538478 12706793 /tmp/sortoGXmOo
7 sort 31370 dmt 11r REG 8,1 1191511729 12706795 /tmp/sortG0RNr9
8 sort 31370 dmt 12r REG 8,1 1199965207 12706799 /tmp/sort6Q8tAy
9 sort 31370 dmt 13r REG 8,1 1192944995 12706817 /tmp/sortCGXzT2
10 sort 31370 dmt 14r REG 8,1 1191485559 12706823 /tmp/sortMwsygv
11 sort 31370 dmt 15r REG 8,1 1200144993 12706830 /tmp/sortyA2RWw
12 sort 31370 dmt 16r REG 8,1 1198251727 12706866 /tmp/sortyZlqHC
13 sort 31370 dmt 17r REG 8,1 1184876212 12706871 /tmp/sortCCjKOH
14 sort 31370 dmt 18r REG 8,1 1174916186 12706880 /tmp/sortKrUJSJ
15 sort 31370 dmt 19r REG 8,1 1184800950 12706824 /tmp/sort2Vjy7s
16 sort 31370 dmt 20r REG 8,1 1187817055 12706891 /tmp/sort89xsBm
17 sort 31370 dmt 21r REG 8,1 1185686697 12706944 /tmp/sortwjlE5L
I may be mistaken, but I don't think this is normal behaviour - once the 16-way merge sort is done, I see no reason to observe this whole process happening over and over.
Ironically, I had upgraded from coreutils 8.5 to 8.12 because other tools on the same machine invokes sort with "--compress-program=gzip" and had observed similar-but-different hanging on that tool's invocation. I dug in and found this commit ( http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=1b31ce6982a9151d9dfe2ea3595ad7595cb9ca86 ) which encouraged me to try the upgrade to fix the hang-while-compressing issue, but I'm now seeing this hanging even in the non-compressed invocation.
In my case, I think I'll downgrade coreutils back down to 8.5 as well as disable all tools from invoking it with "--compress-program" to get predictable non-hanging behaviour across the board.
If I can be of further help please let me know.
Thank you.
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#8824
; Package
coreutils
.
(Thu, 09 Jun 2011 06:47:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 8824 <at> debbugs.gnu.org (full text, mbox):
On 06/08/11 13:16, Mina Naguib wrote:
> I may be mistaken, but I don't think this is normal behaviour - once
> the 16-way merge sort is done, I see no reason to observe this whole
> process happening over and over.
Thanks for your bug report. If 'sort' is breaking up its input into 4
MiB chunks, sorting them, creating a separate temp file for each
chunk, and then merging the results with a 16-way merge, then the
first level of 16-way merges will produce 64 MiB files, and the second
level will produce 1 GiB temp files, which is about the size you're
observing. Since your input is about 18 GiB in size (is that right?),
I'd expect to see two third-level merges. The first would be a 16-way
merge, generating about 16 GiB total. The second would be a roughly
2-way merge, generating about 2 GiB. Then there would be a single
fourth-level merge of these two big files into the final 18 GiB of
output.
But you're observing a second third-level merge that's the same
size as the first. So something is wrong here.
I just now tried to reproduce your bug on my host (Fedora 14 x86-64
with 8 GiB RAM) as follows:
shuf -i 1-350000000 -o col1
shuf -i 1-350000000 -o col2
paste col1 col2 |
awk '{printf "%s|direct_count|2049|63581|15090|%s\n", $1, $2}' |
sort -t '|' -k1,1 -k6,6n |
cat >sortout
Alas, this didn't reproduce the problem; it worked just fine.
Can you modify the above recipe somehow and make the problem
happen on your host?
How much RAM do you have? Is your host x86 or x86-64 or what?
(That "4 MiB" in my example is an absurdly small number, and
this is a performance bug in 'sort', but that's a different
matter I think.)
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#8824
; Package
coreutils
.
(Thu, 09 Jun 2011 08:33:01 GMT)
Full text and
rfc822 format available.
Message #11 received at 8824 <at> debbugs.gnu.org (full text, mbox):
On 08/06/11 21:16, Mina Naguib wrote:
>
> I've observed a few instances when `sort` simply "hangs" and never returns the sorted data.
> In my case, I think I'll downgrade coreutils back down to 8.5 as well as disable all tools from invoking it with "--compress-program" to get predictable non-hanging behaviour across the board.
coreutils 8.6 got the threaded sort implementation, so maybe that's triggering an obscure bug.
If you were on a multi processor machine, you could restrict `sort` to a single
thread by adding the --parallel=1 option. It's worth absolving that I think.
thanks for the very detailed bug report,
Pádraig.
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#8824
; Package
coreutils
.
(Fri, 10 Jun 2011 13:55:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 8824 <at> debbugs.gnu.org (full text, mbox):
Hi
I realized that there was a mistake on my part in the initial report.
On 2011-06-09, at 2:46 AM, Paul Eggert wrote:
> Thanks for your bug report. If 'sort' is breaking up its input into 4
> MiB chunks, sorting them, creating a separate temp file for each
> chunk, and then merging the results with a 16-way merge, then the
> first level of 16-way merges will produce 64 MiB files, and the second
> level will produce 1 GiB temp files, which is about the size you're
> observing. Since your input is about 18 GiB in size (is that right?),
> I'd expect to see two third-level merges. The first would be a 16-way
> merge, generating about 16 GiB total. The second would be a roughly
> 2-way merge, generating about 2 GiB. Then there would be a single
> fourth-level merge of these two big files into the final 18 GiB of
> output.
The input data is not 350 million records, but significantly higher (to the tune of 2 billion). I'd estimate that at ~56 bytes per record, the data set's size was roughly 100Gigs
I may have jumped the gun and associated the delay/looping I've observed with the unrelated "hang" bug I've observed earlier with compressed temp files in coreutils 8.5.
Given this new information, do you think the behaviour I observed is reasonable ? Or is there still the possibility of a bug worth pursuing ?
> How much RAM do you have? Is your host x86 or x86-64 or what?
> (That "4 MiB" in my example is an absurdly small number, and
> this is a performance bug in 'sort', but that's a different
> matter I think.)
This occurred on an x86_64 box with 12G RAM, 4 x 2.66Ghz CPUs. The disks are quite slow as it's all on the same local RAID5 volume for both the read-data and the written temp files and eventual output.
Thank you.
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#8824
; Package
coreutils
.
(Fri, 10 Jun 2011 16:35:02 GMT)
Full text and
rfc822 format available.
Message #17 received at 8824 <at> debbugs.gnu.org (full text, mbox):
On 06/10/11 06:54, Mina Naguib wrote:
> Given this new information, do you think the behaviour I observed is reasonable?
> Or is there still the possibility of a bug worth pursuing ?
I suspect there is a *performance* bug, but not a correctness bug.
Does the performance improve if you use "sort -S 6G"
for the big sort?
Added tag(s) moreinfo.
Request was from
Assaf Gordon <assafgordon <at> gmail.com>
to
control <at> debbugs.gnu.org
.
(Thu, 11 Oct 2018 22:35:02 GMT)
Full text and
rfc822 format available.
bug closed, send any further explanations to
8824 <at> debbugs.gnu.org and Mina Naguib <mina.naguib <at> bloomdigital.com>
Request was from
Assaf Gordon <assafgordon <at> gmail.com>
to
control <at> debbugs.gnu.org
.
(Thu, 11 Oct 2018 22:35:02 GMT)
Full text and
rfc822 format available.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Fri, 09 Nov 2018 12:24:06 GMT)
Full text and
rfc822 format available.
This bug report was last modified 6 years and 228 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.