GNU bug report logs - #9737
misc/timeout-group: spurious test failure on SLES 10.3 (coreutils 8.14)

Package: coreutils;

Reported by: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>

Date: Wed, 12 Oct 2011 14:06:01 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.


From: Pádraig Brady <P <at> draigBrady.com>
To: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
Cc: "9737 <at> debbugs.gnu.org" <9737 <at> debbugs.gnu.org>
Subject: bug#9737: misc/timeout-group: spurious test failure on SLES 10.3 (coreutils 8.14)
Date: Thu, 03 Nov 2011 02:11:27 +0000
On 10/13/2011 11:27 PM, Voelker, Bernhard wrote:
> Pádraig Brady wrote:
> 
>> On 10/13/2011 04:58 PM, Voelker, Bernhard wrote:
>>> reopen 9737
>>> thanks
>>>
>>> Pádraig Brady wrote:
>>>
>>>> Bah, this is just a racy test I think.
>>>> Hopefully the attached fixes it.
>>>
>>> Thank you for the patch.
>>>
>>> I tried it 16 times:
>>>
>>> * 14x PASS, execution time real < 0.4s
>>>
>>> * 1x test failure (in the 5th run)
>>
>> So the command exited without receiving SIGINT.
>> Or perhaps the touch of the 'received.int' file
>> is being done asynchronously. Is there anything special
>> about your file system?
> 
> It's a virtual host on an ESX server farm in our data center.
> 
> ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> uname -a
> Linux mchp320a 2.6.16.60-0.74.7-smp #1 SMP Fri Nov 26 09:16:10 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
> 
> ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> df -h .
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/vg01-lvol0
>                        50G   15G   33G  31% /user
> 
> ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> mount | grep /user
> /dev/mapper/vg01-lvol0 on /user type ext3 (rw,acl,user_xattr)
> 
>>> * 1x the test lasted 20s (in the 16th run)
>>
>> But this one passed, which means the command
>> did receive the SIGINT, but then didn't exit?
> 
> Sounds like one error is shadowing another.
> 
>> I'm confused, sorry,
>> Pádraig.
> 
> That's strange, indeed.
> 
> I repeated the test 100 times with a load below 0.2:
> runs #5, #18, #28, #53, #58 and #71 FAILed as above,
> runs #24 and #25 PASSed but took 20 seconds,
> and all other runs PASSed within 0.3s.
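The two symptoms discussed above (a FAIL where the command apparently exited without its SIGINT handler having run, and a PASS that took about 20 seconds) can be told apart by a small harness. The following is a minimal Python sketch written for illustration only; it is not the coreutils test, which is a shell script exercising `timeout`. It assumes a Unix system, and the names `CHILD` and `run_trial` are hypothetical. A child process installs a SIGINT handler that creates a marker file (named after the test's 'received.int'), the parent sends SIGINT to the child's process group, and the outcome is classified as PASS, FAIL (no marker) or DELAY (no exit within the timeout). The child reports readiness before being signalled, because a SIGINT that arrives before the handler is installed takes the default action and kills the child without a marker, which is one way a racy test can produce the "exited without receiving SIGINT" symptom.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical child program: install a SIGINT handler that creates a marker
# file and exits, announce readiness on stdout, then wait for the signal.
CHILD = r"""
import os, signal, sys
marker = sys.argv[1]

def on_int(signum, frame):
    # Create the marker file, analogous to the test's 'received.int'.
    with open(marker, "w"):
        pass
    os._exit(0)

signal.signal(signal.SIGINT, on_int)
print("ready", flush=True)
while True:
    signal.pause()
"""


def run_trial(timeout_s: float = 5.0) -> str:
    """Return 'PASS', 'FAIL' (no marker) or 'DELAY' (no exit within timeout_s)."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "received.int")
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker],
            stdout=subprocess.PIPE,
            text=True,
            start_new_session=True,  # child leads its own session and process group
        )
        try:
            # Block until the handler is installed; signalling earlier would
            # hit the default SIGINT action and kill the child without a marker.
            child.stdout.readline()
            os.killpg(child.pid, signal.SIGINT)
            try:
                child.wait(timeout=timeout_s)
            except subprocess.TimeoutExpired:
                # Signal was sent but the process did not exit in time.
                child.kill()
                child.wait()
                return "DELAY"
        finally:
            child.stdout.close()
        return "PASS" if os.path.exists(marker) else "FAIL"
```

Running it repeatedly, for example `collections.Counter(run_trial() for _ in range(100))`, gives tallies in the same shape as the 100-run and 2750-run counts reported in this thread.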

I reproduced this weirdness on openSUSE 10.3 in a VM,
though much less frequently:
Delays in 10 out of 2750 runs
Signal handler call failure in 1 out of 2750 runs
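To put "much less frequently" in numbers, the counts from both reports can be turned into per-run rates. This is a small arithmetic sketch; `rate_percent` is a hypothetical helper, and the only inputs are the counts stated in this thread (6 FAILs and 2 slow PASSes in 100 runs on SLES 10.3; 1 signal handler failure and 10 delays in 2750 runs on openSUSE 10.3).

```python
def rate_percent(events: int, runs: int) -> float:
    """Percentage of runs in which an event occurred."""
    return 100.0 * events / runs


# 100 runs on SLES 10.3 (counts from the quoted report)
sles_fail = rate_percent(6, 100)    # runs #5, #18, #28, #53, #58, #71
sles_delay = rate_percent(2, 100)   # runs #24, #25 (passed after ~20 s)

# 2750 runs on openSUSE 10.3 in a VM
suse_fail = rate_percent(1, 2750)   # signal handler call failure
suse_delay = rate_percent(10, 2750)  # delays

for label, a, b in [("failure", sles_fail, suse_fail),
                    ("delay", sles_delay, suse_delay)]:
    print(f"{label}: SLES {a:.2f}% vs openSUSE {b:.3f}% (~{a / b:.1f}x)")
# failure: SLES 6.00% vs openSUSE 0.036% (~165.0x)
# delay: SLES 2.00% vs openSUSE 0.364% (~5.5x)
```

With only a handful of events in each sample, these ratios are rough point estimates rather than precise measurements.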

The delays might be due to bash, but I updated
bash to 4.2 and the issue still persists.
I suspect kernel issues too.

Anyway I've attached 2 patches to replace the previous one.
The first hopefully addresses any races in the test.
I don't think you hit any of these TBH.

The second should detect the signal issues and skip the test.
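The idea of skipping rather than failing can be sketched as mapping the observed outcome to an exit status. Under the Automake test harness, exit status 77 is reported as SKIP, and coreutils tests skip this way. The outcome labels and the function `exit_status_for` below are hypothetical and do not reproduce the attached patch: PASS means the handler ran and the process exited promptly, FAIL means the handler apparently never ran, and DELAY means the process did not exit in time.

```python
import sys

# Exit statuses understood by the Automake test harness.
EXIT_PASS = 0
EXIT_SKIP = 77

# Hypothetical outcome labels attributed to unreliable signal handling on
# the platform rather than to the command under test.
SIGNAL_ANOMALIES = {"FAIL", "DELAY"}


def exit_status_for(outcome: str) -> int:
    """Map a trial outcome to a harness exit status, skipping on anomalies."""
    if outcome == "PASS":
        return EXIT_PASS
    if outcome in SIGNAL_ANOMALIES:
        print(f"skipping: signal anomaly detected ({outcome})", file=sys.stderr)
        return EXIT_SKIP
    raise ValueError(f"unknown outcome: {outcome!r}")
```

The trade-off of this design is that a genuine regression with the same signature would also be reported as SKIP rather than FAIL, so it only makes sense when the anomaly has been traced to the platform, as the kernel suspicion above suggests.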

cheers,
Pádraig.
[1-timeout-races.diff (text/plain, attachment)]
[2-timeout-skips.diff (text/plain, attachment)]


