GNU bug report logs - #9321
repeated segfaults sorting large files in 8.12


Package: coreutils

Reported by: Andras Salamon <andras <at> dns.net>

Date: Thu, 18 Aug 2011 16:11:01 UTC

Severity: normal

Tags: notabug

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 9321 in the body.
You can then email your comments to 9321 AT debbugs.gnu.org in the normal way.



Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9321; Package coreutils. (Thu, 18 Aug 2011 16:11:01 GMT)

Acknowledgement sent to Andras Salamon <andras <at> dns.net>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 18 Aug 2011 16:11:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Andras Salamon <andras <at> dns.net>
To: bug-coreutils <at> gnu.org
Subject: repeated segfaults sorting large files in 8.12
Date: Thu, 18 Aug 2011 15:30:05 +0100
I am seeing repeated (but not reliably repeatable) segmentation faults
sorting datasets in the 100MB-100GB range on a 64-bit Debian system
using GNU sort 8.12 (and also 8.9).  Stack traces seem to indicate
problems during the merge phase, usually when the temporary files
are being combined.

This may or may not be related to the recent discussion about
#9307, but I am definitely using 8.12, rebuilt with CFLAGS=-g since
several indicative values were otherwise optimised out, configured
with --disable-nls --disable-threads, and am running with a fixed
buffer -S 100M and also --parallel=1 to try to isolate problems from
possible threading issues.  I was seeing these crashes with a vanilla
build also.

At least one crash occurred when comparing the very last entry in
the memory buffer to a non-existent entry, when merging large files.

There was also a crash with total_lines=851122 in mergelines_node,
which leads to node->hi containing what appears to be garbage, with
length=2882303761517117516.

The repository changelog seems to indicate that the current development
release of sort has not changed since 8.12.  Will attempting to track
the problem down with 8.12 be useful?  If so I can post stack traces
and values of relevant variables from the core dump, or post a new
issue in the tracker, or reopen #9307.  If not, please suggest some
specific actions I should take to generate useful information.

Thanks,

-- Andras Salamon                   andras <at> dns.net




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9321; Package coreutils. (Thu, 18 Aug 2011 17:17:02 GMT)

Message #8 received at 9321 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Andras Salamon <andras <at> dns.net>
Cc: 9321 <at> debbugs.gnu.org
Subject: Re: bug#9321: repeated segfaults sorting large files in 8.12
Date: Thu, 18 Aug 2011 10:14:20 -0700
On 08/18/2011 07:30 AM, Andras Salamon wrote:
> The repository changelog seems to indicate that the current development
> release of sort has not changed since 8.12.  Will attempting to track
> the problem down with 8.12 be useful?

Yes, I think so; thanks.




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9321; Package coreutils. (Fri, 19 Aug 2011 22:58:01 GMT)

Message #11 received at 9321 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Andras Salamon <andras <at> dns.net>
Cc: 9321 <at> debbugs.gnu.org
Subject: Re: bug#9321: repeated segfaults sorting large files in 8.12
Date: Fri, 19 Aug 2011 23:54:46 +0100
On 08/18/2011 03:30 PM, Andras Salamon wrote:
> I am seeing repeated (but not reliably repeatable) segmentation faults
> sorting datasets in the 100MB-100GB range on a 64-bit Debian system
> using GNU sort 8.12 (and also 8.9).  Stack traces seem to indicate
> problems during the merge phase, usually when the temporary files
> are being combined.
> 
> This may or may not be related to the recent discussion about
> #9307, but I am definitely using 8.12, rebuilt with CFLAGS=-g since
> several indicative values were otherwise optimised out, configured
> with --disable-nls --disable-threads, and am running with a fixed
> buffer -S 100M and also --parallel=1 to try to isolate problems from
> possible threading issues.  I was seeing these crashes with a vanilla
> build also.
> 
> At least one crash occurred when comparing the very last entry in
> the memory buffer to a non-existent entry, when merging large files.
> 
> There was also a crash with total_lines=851122 in mergelines_node,
> which leads to node->hi containing what appears to be garbage, with
> length=2882303761517117516.
> 
> The repository changelog seems to indicate that the current development
> release of sort has not changed since 8.12.  Will attempting to track
> the problem down with 8.12 be useful?  If so I can post stack traces
> and values of relevant variables from the core dump, or post a new
> issue in the tracker, or reopen #9307.  If not, please suggest some
> specific actions I should take to generate useful information.
> 
> Thanks,
> 
> -- Andras Salamon                   andras <at> dns.net

Andras, could you give the exact command line you're having issues with,
and perhaps make the sort inputs available too?
Also, could you try to bisect the issue?
You say it happens even with --parallel=1, but could you try to
reproduce without the threading changes at all, i.e. with:
ftp://ftp.gnu.org/gnu/coreutils/coreutils-8.5.tar.gz
Also there were temp file handling changes made in 7.2 so could you try:
ftp://ftp.gnu.org/gnu/coreutils/coreutils-7.1.tar.gz
Do the --batch-size=NMERGE or --compress-program=PROG options change anything?

cheers,
Pádraig.




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9321; Package coreutils. (Sat, 20 Aug 2011 06:34:02 GMT)

Message #14 received at 9321 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Andras Salamon <andras <at> dns.net>
Cc: 9321 <at> debbugs.gnu.org, Pádraig Brady <P <at> draigBrady.com>
Subject: Re: bug#9321: repeated segfaults sorting large files in 8.12
Date: Sat, 20 Aug 2011 08:31:46 +0200
Andras Salamon wrote:

> I am seeing repeated (but not reliably repeatable) segmentation faults
> sorting datasets in the 100MB-100GB range on a 64-bit Debian system
> using GNU sort 8.12 (and also 8.9).  Stack traces seem to indicate
> problems during the merge phase, usually when the temporary files
> are being combined.
>
> This may or may not be related to the recent discussion about
> #9307, but I am definitely using 8.12, rebuilt with CFLAGS=-g since
> several indicative values were otherwise optimised out, configured
> with --disable-nls --disable-threads, and am running with a fixed
> buffer -S 100M and also --parallel=1 to try to isolate problems from
> possible threading issues.  I was seeing these crashes with a vanilla
> build also.
>
> At least one crash occurred when comparing the very last entry in
> the memory buffer to a non-existent entry, when merging large files.
>
> There was also a crash with total_lines=851122 in mergelines_node,
> which leads to node->hi containing what appears to be garbage, with
> length=2882303761517117516.
>
> The repository changelog seems to indicate that the current development
> release of sort has not changed since 8.12.  Will attempting to track
> the problem down with 8.12 be useful?

Yes, most definitely.
As Pádraig already mentioned, most useful would be instructions
showing how to reproduce the failure, even if part of that is something
like "run this command 30 times" to provoke the rare failure.

> If so I can post stack traces
> and values of relevant variables from the core dump, or post a new
> issue in the tracker, or reopen #9307.  If not, please suggest some
> specific actions I should take to generate useful information.

Thanks for the detailed report and investigation.
Have you reproduced the problem on more than one system?
If not, have you recently run any tests of your system's hardware?
It would be a shame to invest a lot of debugging effort,
if it ends up being a hardware problem with one specific system.




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9321; Package coreutils. (Sat, 20 Aug 2011 21:02:01 GMT)

Message #17 received at 9321 <at> debbugs.gnu.org:

From: Andras Salamon <andras <at> dns.net>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 9321 <at> debbugs.gnu.org
Subject: Re: bug#9321: repeated segfaults sorting large files in 8.12
Date: Sat, 20 Aug 2011 21:58:57 +0100
On Fri, Aug 19, 2011 at 11:54:46PM +0100, Pádraig Brady wrote:
>On 08/18/2011 03:30 PM, Andras Salamon wrote:
>> I am seeing repeated (but not reliably repeatable) segmentation faults
>> sorting datasets in the 100MB-100GB range on a 64-bit Debian system
>> using GNU sort 8.12 (and also 8.9).  Stack traces seem to indicate
>> problems during the merge phase, usually when the temporary files
>> are being combined.

>Andras, could you give the exact command line you're having issues with,
>and perhaps make sort inputs available too?

The sort inputs are several-gigabyte-range files containing strings,
each typically 60 to 140 bytes long, one per line.  There are
many duplicates, and the first reason to sort is to establish the
distribution of duplicates.  I would be happy to make available data
if I could find a reasonably sized file that causes a reproducible
segfault.  The problem seems easier to reproduce with larger files,
unfortunately.

>Do the --batch-size=NMERGE or --compress-program=PROG options change anything?

Thanks for the suggestion, I will try forcing smaller batches.

Compressing batches was something I tried early on with no apparent
change in likelihood of failure, but it led to much slower runtimes.

>Also there were temp file handling changes made in 7.2 so could you try:
>ftp://ftp.gnu.org/gnu/coreutils/coreutils-7.1.tar.gz

Here are some of the relevant-seeming parts of a gdb session for
coreutils-7.1.  Here ?.xz is a compressed file which has already been
sorted, around 35MB in size.

Built with: configure CFLAGS=-g --disable-nls

Commandline:
% nohup xzcat 1.xz 2.xz 3.xz 4.xz | sort -S 100M -T /home/a/tmp | xz > o.xz &
Segmentation fault  ../bin/sort -T /home/a/tmp -S 100M | (core dumped)

During the run there were 435 temp files active at one point.
There may have been more at a later stage, but these were reduced
to a final 32 which remained after the crash.  There is around 600GB
free disk space on this volume.

% du -smc sort* | tail -1
29556   total

% ls -sktr sort*
  62776 sortR07gPu
  62056 sortS3H1Mu
  10848 sortECN8Nx
 951020 sortlk9Xd1
1001668 sortrDhnFQ
1001420 sortItDvPu
1001216 sortIBlIVY
1001500 sortDWg5Vj
1012504 sortOulxqu
 916424 sortOTNgnn
 907976 sortRlRPsA
 997840 sortuQbWXj
1001328 sortoWTS4K
1001436 sort3GpGf2
1001544 sortVudEk7
1009412 sortJou3Y3
 926628 sortL2SeVF
 950584 sortSTuAkJ
1001376 sortX9rCaf
1000928 sortAjXZkz
1001120 sortQzXcgK
1001412 sortLwoe9K
1012704 sortM4WHnD
 955044 sort1c8ja8
 981680 sortJhX3rd
1001040 sortqGq4yV
1000596 sort7obBHs
1000540 sortW4fLHR
1000800 sortSzB3s6
 999624 sortMD7K0b
 305892 sortqSxpe4
3183480 sortcOqzkh

(gdb) bt
#0  0x000000000040e6bc in memcoll (
    s1=0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>, 
    s1len=15564440312192434243, 
    s2=0x2b2a1a0 "<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066."..., s2len=68)
    at memcoll.c:50
#1  0x000000000040af4c in xmemcoll (
    s1=0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>, 
    s1len=15564440312192434243, 
    s2=0x2b2a1a0 "<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066."..., s2len=68)
    at xmemcoll.c:43
#2  0x00000000004059ee in compare (a=0x5b4a7f0, b=0x301dfc0) at sort.c:2059
#3  0x0000000000406815 in mergefps (files=0x24063e0, ntemps=15, nfiles=15, 
    ofp=0x23ff8e0, output_file=0x24062ec "/home/a/tmp/sortcOqzkh")
    at sort.c:2326
#4  0x000000000040708f in merge (files=0x24063e0, ntemps=16, nfiles=32, 
    output_file=0x0) at sort.c:2567
#5  0x000000000040766a in sort (files=0x61c660, nfiles=0, output_file=0x0)
    at sort.c:2699
#6  0x000000000040908c in main (argc=5, argv=0x7fff149247a8) at sort.c:3425

In context, line 2326 marked with ***:

      {
        size_t lo = 1;
        size_t hi = nfiles;
        size_t probe = lo;
        size_t ord0 = ord[0];
        size_t count_of_smaller_lines;

        while (lo < hi)
          {
***         int cmp = compare (cur[ord0], cur[ord[probe]]);  ***
            if (cmp < 0 || (cmp == 0 && ord0 < ord[probe]))
              hi = probe;
            else
              lo = probe + 1;
            probe = (lo + hi) / 2;
          }

        count_of_smaller_lines = lo - 1;
        for (j = 0; j < count_of_smaller_lines; j++)
          ord[j] = ord[j + 1];
        ord[count_of_smaller_lines] = ord0;
      }
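The loop above re-inserts the head of `ord` into its sorted position using a binary search. A minimal standalone sketch of that step follows. It assumes plain integer keys in place of `struct line`, and the names `compare_keys` and `reinsert_head` are hypothetical (they do not appear in sort.c):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sort.c's compare(): orders plain ints.  */
int
compare_keys (int a, int b)
{
  return (a > b) - (a < b);
}

/* ord[1..n-1] is already ordered by keys[ord[i]]; ord[0] may be out of
   place.  Binary-search for its slot, then shift the smaller entries
   down by one.  Equal keys are ordered by the smaller index.  */
void
reinsert_head (size_t *ord, size_t n, const int *keys)
{
  assert (n > 0);
  size_t lo = 1;
  size_t hi = n;
  size_t probe = lo;
  size_t ord0 = ord[0];

  while (lo < hi)
    {
      int cmp = compare_keys (keys[ord0], keys[ord[probe]]);
      if (cmp < 0 || (cmp == 0 && ord0 < ord[probe]))
        hi = probe;
      else
        lo = probe + 1;
      probe = (lo + hi) / 2;
    }

  size_t count_of_smaller = lo - 1;
  for (size_t j = 0; j < count_of_smaller; j++)
    ord[j] = ord[j + 1];
  ord[count_of_smaller] = ord0;
}
```

With keys {50, 20, 30, 40} and ord {0, 1, 2, 3}, the call moves index 0 to the end, leaving ord as {1, 2, 3, 0}. Note that the comparison reads `keys[ord0]` on every probe, which matches the backtrace: a corrupted entry at `cur[ord0]` is dereferenced on the first iteration.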

In stack frame 3:

(gdb) p ord[0]@15
$51 = {7, 0, 14, 8, 1, 2, 9, 3, 10, 4, 11, 12, 5, 13, 6}
(gdb) print *cur[7]
$52 = {text = 0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>, 
  length = 15564440312192434244, keybeg = 0x0, keylim = 0x0}
(gdb) print *(cur[7]-1)
$54 = {
 text = 0x5824d9c "<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-"..., 
 length = 68, 
 keybeg = 0xa500000000000000 <Address 0xa500000000000000 out of bounds>, 
 keylim = 0x8900000000000000 <Address 0x8900000000000000 out of bounds>}
(gdb) print *(cur[7]+1)
$55 = {
 text = 0x5824d14 "<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-"..., 
 length = 68, keybeg = 0x0, keylim = 0x0}
(gdb) p (char *) 0x5824d58
$70 = 0x5824d58 "<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-05>\n<http://scinets.org/item/cria214s2ria214u225704i?update=2010-10-"...

I printed that last one because cur[7].text=0x7800000005824d58
differs from this location only in its top byte, and
0x58-0x14=0x9c-0x58=68, so it might be relevant.
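The arithmetic above can be checked mechanically. The following is a small sketch, assuming the only damage to the pointer is in its most significant byte; the helper names `strip_top_byte` and `evenly_spaced` are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Clear the most significant byte of a 64-bit value.  */
uint64_t
strip_top_byte (uint64_t v)
{
  return v & UINT64_C (0x00FFFFFFFFFFFFFF);
}

/* Return 1 if lo, mid and hi are each exactly `stride` bytes apart.  */
int
evenly_spaced (uint64_t lo, uint64_t mid, uint64_t hi, uint64_t stride)
{
  assert (lo <= mid && mid <= hi);
  return mid - lo == stride && hi - mid == stride;
}
```

With the values from the session, `strip_top_byte (0x7800000005824d58)` gives `0x5824d58`, and `evenly_spaced (0x5824d14, 0x5824d58, 0x5824d9c, 68)` holds, so the masked pointer sits exactly one 68-byte line away from each of its two neighbours.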

For interest, here is some gdb output on a core I saved with 8.12:

#5  0x00000000004073f5 in compare (a=0x228b5e0, b=0x68ce2d0) at sort.c:2668
2668      diff = xmemcoll0 (a->text, alen + 1, b->text, blen + 1);
#6  0x000000000040837b in mergefps (files=0x119e230, ntemps=11, nfiles=11, 
    ofp=0x11978b0, output_file=0x119787d "/home/a/tmp/sort1mESrU", 
    fps=0x1197af0) at sort.c:2995
2995            int cmp = compare (cur[ord0], cur[ord[probe]]);

In frame 6:

(gdb) p cur[0]@11
$6 = {0x228b5e0, 0x2a9ff30, 0x30dff60, 0x35293b0, 0x4913940, 0x5020050, 
  0x5660080, 0x5bd0290, 0x68ce2d0, 0x6f60140, 0x75a0170}
(gdb) p ord[0]@11
$8 = {0, 8, 4, 9, 1, 5, 10, 2, 6, 3, 7}
(gdb) p ord0
$9 = 0
(gdb) p probe
$10 = 1
(gdb) p *(const struct line *)0x2a9ff30
$15 = {
  text = 0x245ff30 "_:httpx3Ax2Fx2Fapix2Ehi5x2Ecomx2Frestx2Fprofilex2Ffoafx2F350598182xxbnode337", length = 77, keybeg = 0x0, keylim = 0x0}
(gdb) p *(const struct line *)0x228b5e0
$16 = {text = 0x600000000226d720 <Address 0x600000000226d720 out of bounds>, 
 length = 14843864371813154892, 
 keybeg = 0x756566736f4e2f72 <Address 0x756566736f4e2f72 out of bounds>, 
 keylim = 0x66626f5f6f746c61 <Address 0x66626f5f6f746c61 out of bounds>}
(gdb) p *(const struct line *)0x75a0170
$18 = {
 text = 0x6f60170 "_:httpx3Ax2Fx2Fapix2Ehi5x2Ecomx2Frestx2Fprofilex2Ffoafx2F492419832xxbnode215", length = 77, keybeg = 0x0, keylim = 0x0}
(gdb) p *buffer
$33 = {
 buf = 0x1e1ff00 "_:httpx3Ax2Fx2Fapix2Ehi5x2Ecomx2Frestx2Fprofilex2Ffoafx2F104700830xxbnode271", used = 4596991, nlines = 61144, alloc = 6553632, left = 62, 
 line_bytes = 32, eof = false}

-- Andras Salamon                   andras <at> dns.net




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9321; Package coreutils. (Sat, 20 Aug 2011 21:06:01 GMT)

Message #20 received at 9321 <at> debbugs.gnu.org:

From: Andras Salamon <andras <at> dns.net>
To: Jim Meyering <jim <at> meyering.net>
Cc: 9321 <at> debbugs.gnu.org, Pádraig Brady <P <at> draigBrady.com>
Subject: Re: bug#9321: repeated segfaults sorting large files in 8.12
Date: Sat, 20 Aug 2011 22:03:10 +0100
On Sat, Aug 20, 2011 at 08:31:46AM +0200, Jim Meyering wrote:
>As Pádraig already mentioned, most useful would be instructions
>showing how to reproduce the failure, even if part of that is something
>like "run this command 30 times" to provoke the rare failure.

I'm seeing roughly 1 in 5 failures with the larger runs.

>Have you reproduced the problem on more than one system?
>If not, have you recently run any tests of your system's hardware?
>It would be a shame to invest a lot of debugging effort,
>if it ends up being a hardware problem with one specific system.

Good point, thanks for the suggestion.  I hope to have access next
week to a different system with enough free space to try to reproduce.
Will run some hardware tests in the meantime.

-- Andras Salamon                   andras <at> dns.net




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9321; Package coreutils. (Sat, 20 Aug 2011 23:32:01 GMT)

Message #23 received at 9321 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Andras Salamon <andras <at> dns.net>
Cc: 9321 <at> debbugs.gnu.org
Subject: Re: bug#9321: repeated segfaults sorting large files in 8.12
Date: Sun, 21 Aug 2011 00:28:43 +0100
On 08/20/2011 09:58 PM, Andras Salamon wrote:
> On Fri, Aug 19, 2011 at 11:54:46PM +0100, Pádraig Brady wrote:
>> On 08/18/2011 03:30 PM, Andras Salamon wrote:
>>> I am seeing repeated (but not reliably repeatable) segmentation faults
>>> sorting datasets in the 100MB-100GB range on a 64-bit Debian system
>>> using GNU sort 8.12 (and also 8.9).  Stack traces seem to indicate
>>> problems during the merge phase, usually when the temporary files
>>> are being combined.
> 
>> Andras, could you give the exact command line you're having issues with,
>> and perhaps make sort inputs available too?
> 
> The sort inputs are several-gigabyte-range files containing strings,
> each typically 60 to 140 bytes long, one per line.  There are
> many duplicates, and the first reason to sort is to establish the
> distribution of duplicates.  I would be happy to make available data
> if I could find a reasonably sized file that causes a reproducible
> segfault.  The problem seems easier to reproduce with larger files,
> unfortunately.
> 
>> Do the --batch-size=NMERGE or --compress-program=PROG options change anything?
> 
> Thanks for the suggestion, I will try forcing smaller batches.
> 
> Compressing batches was something I tried early on with no apparent
> change in likelihood of failure, but it led to much slower runtimes.
> 
>> Also there were temp file handling changes made in 7.2 so could you try:
>> ftp://ftp.gnu.org/gnu/coreutils/coreutils-7.1.tar.gz
> 
> Here are some of the relevant-seeming parts of a gdb session for
> coreutils-7.1.

If this happens with a 2.5-year-old sort, I'd be leaning
towards a local issue.

> (gdb) bt
> #0  0x000000000040e6bc in memcoll (
>     s1=0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>,
>     s1len=15564440312192434243,
>     s2=0x2b2a1a0 "<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066."..., s2len=68)
>     at memcoll.c:50
> #1  0x000000000040af4c in xmemcoll (
>     s1=0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>,
>     s1len=15564440312192434243,
>     s2=0x2b2a1a0 "<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066."..., s2len=68)
>     at xmemcoll.c:43
> #2  0x00000000004059ee in compare (a=0x5b4a7f0, b=0x301dfc0) at sort.c:2059
> #3  0x0000000000406815 in mergefps (files=0x24063e0, ntemps=15, nfiles=15,
>     ofp=0x23ff8e0, output_file=0x24062ec "/home/a/tmp/sortcOqzkh")
>     at sort.c:2326
> #4  0x000000000040708f in merge (files=0x24063e0, ntemps=16, nfiles=32,
>     output_file=0x0) at sort.c:2567
> #5  0x000000000040766a in sort (files=0x61c660, nfiles=0, output_file=0x0)
>     at sort.c:2699
> #6  0x000000000040908c in main (argc=5, argv=0x7fff149247a8) at sort.c:3425

So the 'a' line struct is corrupted.
a->text =   7800000005824D58
a->length = D800000000000043

Notice the 0x78 and 0xD8.
They should be 0x00.
Now, is this software or hardware?
It looks like hardware, to be honest, as there are 4
bits incorrectly set in each of those bytes
(which ECC couldn't correct, if you have that),
and each incorrect bit is also beside another.

cheers,
Pádraig.
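The bit pattern described in this message can be verified with a few lines of C. This is only a sketch: it assumes the intended top bytes were 0x00, as stated above, and the helper names `top_byte`, `popcount8` and `every_set_bit_has_neighbor` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Most significant byte of a 64-bit value.  */
uint8_t
top_byte (uint64_t v)
{
  return (uint8_t) (v >> 56);
}

/* Number of set bits in a byte.  */
int
popcount8 (uint8_t b)
{
  unsigned int v = b;
  int n = 0;
  for (; v != 0; v >>= 1)
    n += (int) (v & 1u);
  return n;
}

/* Return 1 if every set bit in b has a set bit directly next to it.  */
int
every_set_bit_has_neighbor (uint8_t b)
{
  unsigned int v = b;
  unsigned int neighbors = ((v << 1) | (v >> 1)) & 0xFFu;
  return (v & ~neighbors) == 0;
}

/* Self-check against the two values quoted in the message.  */
void
check_reported_values (void)
{
  uint8_t text_hi = top_byte (UINT64_C (0x7800000005824D58));
  uint8_t len_hi = top_byte (UINT64_C (0xD800000000000043));

  assert (text_hi == 0x78 && len_hi == 0xD8);
  assert (popcount8 (text_hi) == 4 && popcount8 (len_hi) == 4);
  assert (every_set_bit_has_neighbor (text_hi));
  assert (every_set_bit_has_neighbor (len_hi));
}
```

Calling `check_reported_values ()` passes silently: 0x78 is binary 0111 1000 and 0xD8 is binary 1101 1000, so each byte has four stray bits set and none of them is isolated.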




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9321; Package coreutils. (Sat, 27 Aug 2011 23:52:03 GMT)

Message #26 received at 9321 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Andras Salamon <andras <at> dns.net>
Cc: 9321 <at> debbugs.gnu.org, Pádraig Brady <P <at> draigBrady.com>
Subject: Re: bug#9321: repeated segfaults sorting large files in 8.12
Date: Sun, 28 Aug 2011 01:48:31 +0200
tags 9321 + notabug
close 9321
thanks

Jim Meyering wrote:
> Andras Salamon wrote:
>
>> I am seeing repeated (but not reliably repeatable) segmentation faults
>> sorting datasets in the 100MB-100GB range on a 64-bit Debian system
>> using GNU sort 8.12 (and also 8.9).  Stack traces seem to indicate
>> problems during the merge phase, usually when the temporary files
>> are being combined.
>>
>> This may or may not be related to the recent discussion about
>> #9307, but I am definitely using 8.12, rebuilt with CFLAGS=-g since
>> several indicative values were otherwise optimised out, configured
>> with --disable-nls --disable-threads, and am running with a fixed
>> buffer -S 100M and also --parallel=1 to try to isolate problems from
>> possible threading issues.  I was seeing these crashes with a vanilla
>> build also.
>>
>> At least one crash occurred when comparing the very last entry in
>> the memory buffer to a non-existent entry, when merging large files.
>>
>> There was also a crash with total_lines=851122 in mergelines_node,
>> which leads to node->hi containing what appears to be garbage, with
>> length=2882303761517117516.
>>
>> The repository changelog seems to indicate that the current development
>> release of sort has not changed since 8.12.  Will attempting to track
>> the problem down with 8.12 be useful?
>
> Yes, most definitely.
> As Pádraig already mentioned, most useful would be instructions
> showing how to reproduce the failure, even if part of that is something
> like "run this command 30 times" to provoke the rare failure.
>
>> If so I can post stack traces
>> and values of relevant variables from the core dump, or post a new
>> issue in the tracker, or reopen #9307.  If not, please suggest some
>> specific actions I should take to generate useful information.
>
> Thanks for the detailed report and investigation.
> Have you reproduced the problem on more than one system?
> If not, have you recently run any tests of your system's hardware?
> It would be a shame to invest a lot of debugging effort,
> if it ends up being a hardware problem with one specific system.

Per http://thread.gmane.org/gmane.comp.gnu.coreutils.general/1527/focus=1551
I'm closing this.




Added tag(s) notabug. Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Sat, 27 Aug 2011 23:52:04 GMT)

Bug closed; send any further explanations to 9321 <at> debbugs.gnu.org and Andras Salamon <andras <at> dns.net>. Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Sat, 27 Aug 2011 23:52:04 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 25 Sep 2011 11:24:03 GMT)




GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.