Package: coreutils
Reported by: Andras Salamon <andras <at> dns.net>
Date: Thu, 18 Aug 2011 16:11:01 UTC
Severity: normal
Tags: notabug
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
From: Andras Salamon <andras <at> dns.net>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 9321 <at> debbugs.gnu.org
Subject: bug#9321: repeated segfaults sorting large files in 8.12
Date: Sat, 20 Aug 2011 21:58:57 +0100
On Fri, Aug 19, 2011 at 11:54:46PM +0100, Pádraig Brady wrote:
>On 08/18/2011 03:30 PM, Andras Salamon wrote:
>> I am seeing repeated (but not reliably repeatable) segmentation faults
>> sorting datasets in the 100MB-100GB range on a 64-bit Debian system
>> using GNU sort 8.12 (and also 8.9). Stack traces seem to indicate
>> problems during the merge phase, usually when the temporary files
>> are being combined.

>Andras, could you give the exact command line your having issue with,
>and perhaps make sort inputs available too?

The sort inputs are several-gigabyte-range files containing strings,
each typically 60 to 140 bytes long, one per line. There are many
duplicates, and the first reason to sort is to establish the
distribution of duplicates.

I would be happy to make available data if I could find a reasonably
sized file that causes a reproducible segfault. The problem seems
easier to reproduce with larger files, unfortunately.

>Do the --batch-size=NMERGE or --compress-program=PROG options change anything?

Thanks for the suggestion, I will try forcing smaller batches.
Compressing batches was something I tried early on with no apparent
change in likelihood of failure, but it led to much slower runtimes.

>Also there were temp file handling changes made in 7.2 so could you try:
>ftp://ftp.gnu.org/gnu/coreutils/coreutils-7.1.tar.gz

Here are some of the relevant-seeming parts of a gdb session for
coreutils-7.1. Here ?.xz is a compressed file which has already been
sorted, around 35MB in size.

Built with:
  configure CFLAGS=-g --disable-nls

Commandline:
  % nohup xzcat 1.xz 2.xz 3.xz 4.xz | sort -S 100M -T /home/a/tmp | xz > o.xz &
  Segmentation fault ../bin/sort -T /home/a/tmp -S 100M | (core dumped)

During the run there were 435 temp files active at one point. There
may have been more at a later stage, but these were reduced to a final
32 which remained after the crash. There is around 600GB free disk
space on this volume.
% du -smc sort* | tail -1
29556 total
% ls -sktr sort*
  62776 sortR07gPu
  62056 sortS3H1Mu
  10848 sortECN8Nx
 951020 sortlk9Xd1
1001668 sortrDhnFQ
1001420 sortItDvPu
1001216 sortIBlIVY
1001500 sortDWg5Vj
1012504 sortOulxqu
 916424 sortOTNgnn
 907976 sortRlRPsA
 997840 sortuQbWXj
1001328 sortoWTS4K
1001436 sort3GpGf2
1001544 sortVudEk7
1009412 sortJou3Y3
 926628 sortL2SeVF
 950584 sortSTuAkJ
1001376 sortX9rCaf
1000928 sortAjXZkz
1001120 sortQzXcgK
1001412 sortLwoe9K
1012704 sortM4WHnD
 955044 sort1c8ja8
 981680 sortJhX3rd
1001040 sortqGq4yV
1000596 sort7obBHs
1000540 sortW4fLHR
1000800 sortSzB3s6
 999624 sortMD7K0b
 305892 sortqSxpe4
3183480 sortcOqzkh

(gdb) bt
#0  0x000000000040e6bc in memcoll (
    s1=0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>,
    s1len=15564440312192434243,
    s2=0x2b2a1a0 "<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066."..., s2len=68) at memcoll.c:50
#1  0x000000000040af4c in xmemcoll (
    s1=0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>,
    s1len=15564440312192434243,
    s2=0x2b2a1a0 "<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066."..., s2len=68) at xmemcoll.c:43
#2  0x00000000004059ee in compare (a=0x5b4a7f0, b=0x301dfc0) at sort.c:2059
#3  0x0000000000406815 in mergefps (files=0x24063e0, ntemps=15, nfiles=15,
    ofp=0x23ff8e0, output_file=0x24062ec "/home/a/tmp/sortcOqzkh")
    at sort.c:2326
#4  0x000000000040708f in merge (files=0x24063e0, ntemps=16, nfiles=32,
    output_file=0x0) at sort.c:2567
#5  0x000000000040766a in sort (files=0x61c660, nfiles=0, output_file=0x0)
    at sort.c:2699
#6  0x000000000040908c in main (argc=5, argv=0x7fff149247a8) at sort.c:3425

In context, line 2326 marked with ***:

      {
        size_t lo = 1;
        size_t hi = nfiles;
        size_t probe = lo;
        size_t ord0 = ord[0];
        size_t count_of_smaller_lines;

        while (lo < hi)
          {
***         int cmp = compare (cur[ord0], cur[ord[probe]]); ***
            if (cmp < 0 || (cmp == 0 && ord0 < ord[probe]))
              hi = probe;
            else
              lo = probe + 1;
            probe = (lo + hi) / 2;
          }

        count_of_smaller_lines = lo - 1;
        for (j = 0; j < count_of_smaller_lines; j++)
          ord[j] = ord[j + 1];
        ord[count_of_smaller_lines] = ord0;
      }

In stack frame 3:

(gdb) p ord[0]@15
$51 = {7, 0, 14, 8, 1, 2, 9, 3, 10, 4, 11, 12, 5, 13, 6}
(gdb) print *cur[7]
$52 = {text = 0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>,
  length = 15564440312192434244, keybeg = 0x0, keylim = 0x0}
(gdb) print *(cur[7]-1)
$54 = {
  text = 0x5824d9c "<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-"..., length = 68,
  keybeg = 0xa500000000000000 <Address 0xa500000000000000 out of bounds>,
  keylim = 0x8900000000000000 <Address 0x8900000000000000 out of bounds>}
(gdb) print *(cur[7]+1)
$55 = {
  text = 0x5824d14 "<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-"..., length = 68, keybeg = 0x0, keylim = 0x0}
(gdb) p (char *) 0x5824d58
$70 = 0x5824d58 "<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-"...

I printed that last one because cur[7].text=0x7800000005824d58 differs
by one byte from this location, and 0x58-0x14=0x9c-0x58=68, so it might
be relevant.
For interest, here is some gdb output on a core I saved with 8.12:

#5  0x00000000004073f5 in compare (a=0x228b5e0, b=0x68ce2d0) at sort.c:2668
2668      diff = xmemcoll0 (a->text, alen + 1, b->text, blen + 1);
#6  0x000000000040837b in mergefps (files=0x119e230, ntemps=11, nfiles=11,
    ofp=0x11978b0, output_file=0x119787d "/home/a/tmp/sort1mESrU",
    fps=0x1197af0) at sort.c:2995
2995      int cmp = compare (cur[ord0], cur[ord[probe]]);

In frame 6:

(gdb) p cur[0]@11
$6 = {0x228b5e0, 0x2a9ff30, 0x30dff60, 0x35293b0, 0x4913940, 0x5020050,
  0x5660080, 0x5bd0290, 0x68ce2d0, 0x6f60140, 0x75a0170}
(gdb) p ord[0]@11
$8 = {0, 8, 4, 9, 1, 5, 10, 2, 6, 3, 7}
(gdb) p ord0
$9 = 0
(gdb) p probe
$10 = 1
(gdb) p *(const struct line *)0x2a9ff30
$15 = {
  text = 0x245ff30 "_:httpx3Ax2Fx2Fapix2Ehi5x2Ecomx2Frestx2Fprofilex2Ffoafx2F350598182xxbnode337", length = 77, keybeg = 0x0, keylim = 0x0}
(gdb) p *(const struct line *)0x228b5e0
$16 = {text = 0x600000000226d720 <Address 0x600000000226d720 out of bounds>,
  length = 14843864371813154892,
  keybeg = 0x756566736f4e2f72 <Address 0x756566736f4e2f72 out of bounds>,
  keylim = 0x66626f5f6f746c61 <Address 0x66626f5f6f746c61 out of bounds>}
(gdb) p *(const struct line *)0x75a0170
$18 = {
  text = 0x6f60170 "_:httpx3Ax2Fx2Fapix2Ehi5x2Ecomx2Frestx2Fprofilex2Ffoafx2F492419832xxbnode215", length = 77, keybeg = 0x0, keylim = 0x0}
(gdb) p *buffer
$33 = {
  buf = 0x1e1ff00 "_:httpx3Ax2Fx2Fapix2Ehi5x2Ecomx2Frestx2Fprofilex2Ffoafx2F104700830xxbnode271", used = 4596991, nlines = 61144, alloc = 6553632, left = 62,
  line_bytes = 32, eof = false}

--
Andras Salamon                   andras <at> dns.net