GNU bug report logs - #9737
misc/timeout-group: spurious test failure on SLES 10.3 (coreutils 8.14)

Previous Next

Package: coreutils;

Reported by: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>

Date: Wed, 12 Oct 2011 14:06:01 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 9737 in the body.
You can then email your comments to 9737 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-coreutils <at> gnu.org:
bug#9737; Package coreutils. (Wed, 12 Oct 2011 14:06:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 12 Oct 2011 14:06:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
To: "bug-coreutils <at> gnu.org" <bug-coreutils <at> gnu.org>
Subject: misc/timeout-group: spurious test failure on SLES 10.3  (coreutils
	8.14)
Date: Wed, 12 Oct 2011 16:05:10 +0200
[Message part 1 (text/plain, inline)]
I don't wanna ruin your day (by reporting on 0-day old 8.14), but ...

On a virtual SLES 10.3 system with 2 Xeons (E5420  @ 2.50GHz),
the above test failed - but just once.

I cannot reproduce the error. I tried ~30 times, once with low system
load and once with a load of about 10.

Fortunately, I saved the logfile before retrying.
Ideas?

Berny
[timeout-group.log (application/octet-stream, attachment)]

Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Thu, 13 Oct 2011 14:10:02 GMT) Full text and rfc822 format available.

Notification sent to "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>:
bug acknowledged by developer. (Thu, 13 Oct 2011 14:10:02 GMT) Full text and rfc822 format available.

Message #10 received at 9737-done <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
Cc: 9737-done <at> debbugs.gnu.org
Subject: Re: bug#9737: misc/timeout-group: spurious test failure on SLES 10.3
	(coreutils 8.14)
Date: Thu, 13 Oct 2011 15:08:43 +0100
[Message part 1 (text/plain, inline)]
On 10/12/2011 03:05 PM, Voelker, Bernhard wrote:
> I don't wanna ruin your day (by reporting on 0-day old 8.14), but ...
> 
> On a virtual SLES 10.3 system with 2 Xeons (E5420  @ 2.50GHz),
> the above test failed - but just once.
> 
> I cannot reproduce the error. I tried ~30 times, once with low system
> load and once with a load of about 10.
> 
> Fortunately, I saved the logfile before retrying.
> Ideas?
> 
> Berny

Bah, this is just a racy test I think.
Hopefully the attached fixes it.

cheers,
Pádraig.
[timeout-group-race.diff (text/plain, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#9737; Package coreutils. (Thu, 13 Oct 2011 16:00:01 GMT) Full text and rfc822 format available.

Message #13 received at 9737 <at> debbugs.gnu.org (full text, mbox):

From: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: "9737 <at> debbugs.gnu.org" <9737 <at> debbugs.gnu.org>
Subject: RE: bug#9737: misc/timeout-group: spurious test failure on SLES
	10.3 (coreutils 8.14)
Date: Thu, 13 Oct 2011 17:58:58 +0200
[Message part 1 (text/plain, inline)]
reopen 9737
thanks

Pádraig Brady wrote:

> Bah, this is just a racy test I think.
> Hopefully the attached fixes it.

Thank you for the patch.

I tried it 16 times:

* 14x PASS, execution time real < 0.4s

* 1x test failure (in the 5th run)

* 1x the test lasted 20s (in the 16th run)
  ... which means at least this is wrong:

  > * tests/misc/timeout-group: Increase the timeouts
  > passed to the timeout command, so that they're
  > effectively not used. <...>

Have a nice day,
Berny
[timeout-group-race.log (application/octet-stream, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#9737; Package coreutils. (Thu, 13 Oct 2011 18:45:02 GMT) Full text and rfc822 format available.

Message #16 received at 9737 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
Cc: "9737 <at> debbugs.gnu.org" <9737 <at> debbugs.gnu.org>
Subject: Re: bug#9737: misc/timeout-group: spurious test failure on SLES 10.3
	(coreutils 8.14)
Date: Thu, 13 Oct 2011 19:44:02 +0100
On 10/13/2011 04:58 PM, Voelker, Bernhard wrote:
> reopen 9737
> thanks
> 
> Pádraig Brady wrote:
> 
>> Bah, this is just a racy test I think.
>> Hopefully the attached fixes it.
> 
> Thank you for the patch.
> 
> I tried it 16 times:
> 
> * 14x PASS, execution time real < 0.4s
> 
> * 1x test failure (in the 5th run)

So the command exited without receiving SIGINT.
Or perhaps the touch of the 'received.int' file
is being done asynch. Anything special about your
file system?

> * 1x the test lasted 20s (in the 16th run)

But this one passed, which means the command
did receive the SIGINT, but then didn't exit?

I'm confused, sorry,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#9737; Package coreutils. (Thu, 13 Oct 2011 22:29:02 GMT) Full text and rfc822 format available.

Message #19 received at 9737 <at> debbugs.gnu.org (full text, mbox):

From: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: "9737 <at> debbugs.gnu.org" <9737 <at> debbugs.gnu.org>
Subject: RE: bug#9737: misc/timeout-group: spurious test failure on SLES
	10.3 (coreutils 8.14)
Date: Fri, 14 Oct 2011 00:27:39 +0200
Pádraig Brady wrote:

> On 10/13/2011 04:58 PM, Voelker, Bernhard wrote:
>> reopen 9737
>> thanks
>> 
>> Pádraig Brady wrote:
>> 
>>> Bah, this is just a racy test I think.
>>> Hopefully the attached fixes it.
>> 
>> Thank you for the patch.
>> 
>> I tried it 16 times:
>> 
>* 14x PASS, execution time real < 0.4s
>> 
>> * 1x test failure (in the 5th run)
>
> So the command exited without receiving SIGINT.
> Or perhaps the touch of the 'received.int' file
> is being done asynch. Anything special about your
> file system?

It's a virtual host on a ESX server farm in our data center.

ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> uname -a
Linux mchp320a 2.6.16.60-0.74.7-smp #1 SMP Fri Nov 26 09:16:10 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg01-lvol0
                       50G   15G   33G  31% /user

ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> mount | grep /user
/dev/mapper/vg01-lvol0 on /user type ext3 (rw,acl,user_xattr)

>> * 1x the test lasted 20s (in the 16th run)
>
> But this one passed, which means the command
> did receive the SIGINT, but then didn't exit?

Sounds like one error is shadowing another.

> I'm confused, sorry,
> Pádraig.

That's strange, indeed.

I repeated the test with < 0.2 load 100 times:
the run #5, #18, #28, #53, #58 and #71 resulted in FAIL as above,
and the run #24 and #25 PASSed but took 20 seconds,
all other PASSed within <=0.3s.

Have a nice day,
Berny



Information forwarded to bug-coreutils <at> gnu.org:
bug#9737; Package coreutils. (Thu, 03 Nov 2011 02:16:02 GMT) Full text and rfc822 format available.

Message #22 received at 9737 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
Cc: "9737 <at> debbugs.gnu.org" <9737 <at> debbugs.gnu.org>
Subject: Re: bug#9737: misc/timeout-group: spurious test failure on SLES 10.3
	(coreutils 8.14)
Date: Thu, 03 Nov 2011 02:11:27 +0000
[Message part 1 (text/plain, inline)]
On 10/13/2011 11:27 PM, Voelker, Bernhard wrote:
> Pádraig Brady wrote:
> 
>> On 10/13/2011 04:58 PM, Voelker, Bernhard wrote:
>>> reopen 9737
>>> thanks
>>>
>>> Pádraig Brady wrote:
>>>
>>>> Bah, this is just a racy test I think.
>>>> Hopefully the attached fixes it.
>>>
>>> Thank you for the patch.
>>>
>>> I tried it 16 times:
>>>
>> * 14x PASS, execution time real < 0.4s
>>>
>>> * 1x test failure (in the 5th run)
>>
>> So the command exited without receiving SIGINT.
>> Or perhaps the touch of the 'received.int' file
>> is being done asynch. Anything special about your
>> file system?
> 
> It's a virtual host on a ESX server farm in our data center.
> 
> ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> uname -a
> Linux mchp320a 2.6.16.60-0.74.7-smp #1 SMP Fri Nov 26 09:16:10 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
> 
> ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> df -h .
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/vg01-lvol0
>                        50G   15G   33G  31% /user
> 
> ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> mount | grep /user
> /dev/mapper/vg01-lvol0 on /user type ext3 (rw,acl,user_xattr)
> 
>>> * 1x the test lasted 20s (in the 16th run)
>>
>> But this one passed, which means the command
>> did receive the SIGINT, but then didn't exit?
> 
> Sounds like one error is shadowing another.
> 
>> I'm confused, sorry,
>> Pádraig.
> 
> That's strange, indeed.
> 
> I repeated the test with < 0.2 load 100 times:
> the run #5, #18, #28, #53, #58 and #71 resulted in FAIL as above,
> and the run #24 and #25 PASSed but took 20 seconds,
> all other PASSed within <=0.3s.

I reproduced this weirdness in OpenSuse 10.3 in a VM.
Much less frequently though.
Delays in 10 out of 2750
Signal handler call failure in 1 out of 2750

The delays might be due to bash, but I updated
to 4.2 and the issue still persists.
I suspect kernel issues too.

Anyway I've attached 2 patches to replace the previous one.
The first hopefully addresses any races in the test.
I don't think you hit any of these TBH.

The second should detect the signal issues and skip the test.

cheers,
Pádraig.
[1-timeout-races.diff (text/plain, attachment)]
[2-timeout-skips.diff (text/plain, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#9737; Package coreutils. (Thu, 03 Nov 2011 09:49:01 GMT) Full text and rfc822 format available.

Message #25 received at 9737 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: "9737 <at> debbugs.gnu.org" <9737 <at> debbugs.gnu.org>, "Voelker,
	Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
Subject: Re: bug#9737: misc/timeout-group: spurious test failure on SLES 10.3
	(coreutils 8.14)
Date: Thu, 03 Nov 2011 10:46:25 +0100
Pádraig Brady wrote:
...
>> I repeated the test with < 0.2 load 100 times:
>> the run #5, #18, #28, #53, #58 and #71 resulted in FAIL as above,
>> and the run #24 and #25 PASSed but took 20 seconds,
>> all other PASSed within <=0.3s.
>
> I reproduced this weirdness in OpenSuse 10.3 in a VM.
> Much less frequently though.
> Delays in 10 out of 2750
> Signal handler call failure in 1 out of 2750

Wow.  Congratulations on managing to reproduce those.

> The delays might be due to bash, but I updated
> to 4.2 and the issue still persists.
> I suspect kernel issues too.
>
> Anyway I've attached 2 patches to replace the previous one.
> The first hopefully addresses any races in the test.
> I don't think you hit any of these TBH.
>
> The second should detect the signal issues and skip the test.

Those look good to me.

Thanks.




Information forwarded to bug-coreutils <at> gnu.org:
bug#9737; Package coreutils. (Thu, 03 Nov 2011 17:14:01 GMT) Full text and rfc822 format available.

Message #28 received at 9737 <at> debbugs.gnu.org (full text, mbox):

From: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: "9737 <at> debbugs.gnu.org" <9737 <at> debbugs.gnu.org>
Subject: RE: bug#9737: misc/timeout-group: spurious test failure on SLES
	10.3 (coreutils 8.14)
Date: Thu, 3 Nov 2011 18:11:10 +0100
Pádraig Brady wrote:

> I reproduced this weirdness in OpenSuse 10.3 in a VM.
> Much less frequently though.
> Delays in 10 out of 2750
> Signal handler call failure in 1 out of 2750

Sorry for my late reply. Thanks.

> The delays might be due to bash, but I updated
> to 4.2 and the issue still persists.

I have:  bash --version
GNU bash, version 3.1.17(1)-release (x86_64-suse-linux)

> I suspect kernel issues too.
> 
> Anyway I've attached 2 patches to replace the previous one.
> 
> The first hopefully addresses any races in the test.
> I don't think you hit any of these TBH.
> The second should detect the signal issues and skip the test.


I was using this:

  for i in $( seq 1000 ) ; do
    rm   misc/timeout-group.log
    make misc/timeout-group.log
  done | grep -v PASS

Without the patches, I got:

* 153 FAILures.

With the patches, I get:

* 6x:
  timeout-group: skipped test: timeout.cmd didn't receive a signal until after sleep?
  SKIP: misc/timeout-group

* 62x:
  timeout-group: skipped test: timeout returned 142. SIGALRM not handled?
  SKIP: misc/timeout-group

Does this help?
If you need access to that machine, it's possible but not easy.

Have a nice day,
Berny



Message #29 received at 9737-done <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: "Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
Cc: 9737-done <at> debbugs.gnu.org
Subject: Re: bug#9737: misc/timeout-group: spurious test failure on SLES 10.3
	(coreutils 8.14)
Date: Thu, 03 Nov 2011 18:13:41 +0000
On 11/03/2011 05:11 PM, Voelker, Bernhard wrote:
> Without the patches, I got:
> 
> * 153 FAILures.
> 
> With the patches, I get:
> 
> * 6x:
>   timeout-group: skipped test: timeout.cmd didn't receive a signal until after sleep?
>   SKIP: misc/timeout-group
> 
> * 62x:
>   timeout-group: skipped test: timeout returned 142. SIGALRM not handled?
>   SKIP: misc/timeout-group
> 
> Does this help?
> If you need access to that machine, it's possible but not easy.

That confirms things. Thanks for testing.
So I'm seeing the same only less frequent.
I'm attributing to system signal handling bugs for the moment.
By adding the skips, we can see if this is hitting modern systems.

cheers,
Pádraig.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 02 Dec 2011 12:24:02 GMT) Full text and rfc822 format available.

This bug report was last modified 13 years and 206 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.