GNU bug report logs -
#9737
misc/timeout-group: spurious test failure on SLES 10.3 (coreutils 8.14)
Previous Next
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 9737 in the body.
You can then email your comments to 9737 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-coreutils <at> gnu.org
:
bug#9737
; Package
coreutils
.
(Wed, 12 Oct 2011 14:06:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
"Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
:
New bug report received and forwarded. Copy sent to
bug-coreutils <at> gnu.org
.
(Wed, 12 Oct 2011 14:06:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
I don't wanna ruin your day (by reporting on 0-day old 8.14), but ...
On a virtual SLES 10.3 system with 2 Xeons (E5420 @ 2.50GHz),
the above test failed - but just once.
I cannot reproduce the error. I tried ~30 times, once with low system
load and once with a load of about 10.
Fortunately, I saved the logfile before retrying.
Ideas?
Berny
[timeout-group.log (application/octet-stream, attachment)]
Reply sent
to
Pádraig Brady <P <at> draigBrady.com>
:
You have taken responsibility.
(Thu, 13 Oct 2011 14:10:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
"Voelker, Bernhard" <bernhard.voelker <at> siemens-enterprise.com>
:
bug acknowledged by developer.
(Thu, 13 Oct 2011 14:10:02 GMT)
Full text and
rfc822 format available.
Message #10 received at 9737-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 10/12/2011 03:05 PM, Voelker, Bernhard wrote:
> I don't wanna ruin your day (by reporting on 0-day old 8.14), but ...
>
> On a virtual SLES 10.3 system with 2 Xeons (E5420 @ 2.50GHz),
> the above test failed - but just once.
>
> I cannot reproduce the error. I tried ~30 times, once with low system
> load and once with a load of about 10.
>
> Fortunately, I saved the logfile before retrying.
> Ideas?
>
> Berny
Bah, this is just a racy test I think.
Hopefully the attached fixes it.
cheers,
Pádraig.
[timeout-group-race.diff (text/plain, attachment)]
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#9737
; Package
coreutils
.
(Thu, 13 Oct 2011 16:00:01 GMT)
Full text and
rfc822 format available.
Message #13 received at 9737 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
reopen 9737
thanks
Pádraig Brady wrote:
> Bah, this is just a racy test I think.
> Hopefully the attached fixes it.
Thank you for the patch.
I tried it 16 times:
* 14x PASS, execution time real < 0.4s
* 1x test failure (in the 5th run)
* 1x the test lasted 20s (in the 16th run)
... which means at least this is wrong:
> * tests/misc/timeout-group: Increase the timeouts
> passed to the timeout command, so that they're
> effectively not used. <...>
Have a nice day,
Berny
[timeout-group-race.log (application/octet-stream, attachment)]
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#9737
; Package
coreutils
.
(Thu, 13 Oct 2011 18:45:02 GMT)
Full text and
rfc822 format available.
Message #16 received at 9737 <at> debbugs.gnu.org (full text, mbox):
On 10/13/2011 04:58 PM, Voelker, Bernhard wrote:
> reopen 9737
> thanks
>
> Pádraig Brady wrote:
>
>> Bah, this is just a racy test I think.
>> Hopefully the attached fixes it.
>
> Thank you for the patch.
>
> I tried it 16 times:
>
> * 14x PASS, execution time real < 0.4s
>
> * 1x test failure (in the 5th run)
So the command exited without receiving SIGINT.
Or perhaps the touch of the 'received.int' file
is being done asynch. Anything special about your
file system?
> * 1x the test lasted 20s (in the 16th run)
But this one passed, which means the command
did receive the SIGINT, but then didn't exit?
I'm confused, sorry,
Pádraig.
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#9737
; Package
coreutils
.
(Thu, 13 Oct 2011 22:29:02 GMT)
Full text and
rfc822 format available.
Message #19 received at 9737 <at> debbugs.gnu.org (full text, mbox):
Pádraig Brady wrote:
> On 10/13/2011 04:58 PM, Voelker, Bernhard wrote:
>> reopen 9737
>> thanks
>>
>> Pádraig Brady wrote:
>>
>>> Bah, this is just a racy test I think.
>>> Hopefully the attached fixes it.
>>
>> Thank you for the patch.
>>
>> I tried it 16 times:
>>
>* 14x PASS, execution time real < 0.4s
>>
>> * 1x test failure (in the 5th run)
>
> So the command exited without receiving SIGINT.
> Or perhaps the touch of the 'received.int' file
> is being done asynch. Anything special about your
> file system?
It's a virtual host on a ESX server farm in our data center.
ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> uname -a
Linux mchp320a 2.6.16.60-0.74.7-smp #1 SMP Fri Nov 26 09:16:10 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg01-lvol0
50G 15G 33G 31% /user
ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> mount | grep /user
/dev/mapper/vg01-lvol0 on /user type ext3 (rw,acl,user_xattr)
>> * 1x the test lasted 20s (in the 16th run)
>
> But this one passed, which means the command
> did receive the SIGINT, but then didn't exit?
Sounds like one error is shadowing another.
> I'm confused, sorry,
> Pádraig.
That's strange, indeed.
I repeated the test with < 0.2 load 100 times:
the run #5, #18, #28, #53, #58 and #71 resulted in FAIL as above,
and the run #24 and #25 PASSed but took 20 seconds,
all other PASSed within <=0.3s.
Have a nice day,
Berny
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#9737
; Package
coreutils
.
(Thu, 03 Nov 2011 02:16:02 GMT)
Full text and
rfc822 format available.
Message #22 received at 9737 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 10/13/2011 11:27 PM, Voelker, Bernhard wrote:
> Pádraig Brady wrote:
>
>> On 10/13/2011 04:58 PM, Voelker, Bernhard wrote:
>>> reopen 9737
>>> thanks
>>>
>>> Pádraig Brady wrote:
>>>
>>>> Bah, this is just a racy test I think.
>>>> Hopefully the attached fixes it.
>>>
>>> Thank you for the patch.
>>>
>>> I tried it 16 times:
>>>
>> * 14x PASS, execution time real < 0.4s
>>>
>>> * 1x test failure (in the 5th run)
>>
>> So the command exited without receiving SIGINT.
>> Or perhaps the touch of the 'received.int' file
>> is being done asynch. Anything special about your
>> file system?
>
> It's a virtual host on a ESX server farm in our data center.
>
> ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> uname -a
> Linux mchp320a 2.6.16.60-0.74.7-smp #1 SMP Fri Nov 26 09:16:10 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> df -h .
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/vg01-lvol0
> 50G 15G 33G 31% /user
>
> ecs <at> mchp320a:~/berny/depot/coreutils-8.14/tests> mount | grep /user
> /dev/mapper/vg01-lvol0 on /user type ext3 (rw,acl,user_xattr)
>
>>> * 1x the test lasted 20s (in the 16th run)
>>
>> But this one passed, which means the command
>> did receive the SIGINT, but then didn't exit?
>
> Sounds like one error is shadowing another.
>
>> I'm confused, sorry,
>> Pádraig.
>
> That's strange, indeed.
>
> I repeated the test with < 0.2 load 100 times:
> the run #5, #18, #28, #53, #58 and #71 resulted in FAIL as above,
> and the run #24 and #25 PASSed but took 20 seconds,
> all other PASSed within <=0.3s.
I reproduced this weirdness in OpenSuse 10.3 in a VM.
Much less frequently though.
Delays in 10 out of 2750
Signal handler call failure in 1 out of 2750
The delays might be due to bash, but I updated
to 4.2 and the issue still persists.
I suspect kernel issues too.
Anyway I've attached 2 patches to replace the previous one.
The first hopefully addresses any races in the test.
I don't think you hit any of these TBH.
The second should detect the signal issues and skip the test.
cheers,
Pádraig.
[1-timeout-races.diff (text/plain, attachment)]
[2-timeout-skips.diff (text/plain, attachment)]
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#9737
; Package
coreutils
.
(Thu, 03 Nov 2011 09:49:01 GMT)
Full text and
rfc822 format available.
Message #25 received at 9737 <at> debbugs.gnu.org (full text, mbox):
Pádraig Brady wrote:
...
>> I repeated the test with < 0.2 load 100 times:
>> the run #5, #18, #28, #53, #58 and #71 resulted in FAIL as above,
>> and the run #24 and #25 PASSed but took 20 seconds,
>> all other PASSed within <=0.3s.
>
> I reproduced this weirdness in OpenSuse 10.3 in a VM.
> Much less frequently though.
> Delays in 10 out of 2750
> Signal handler call failure in 1 out of 2750
Wow. Congratulations on managing to reproduce those.
> The delays might be due to bash, but I updated
> to 4.2 and the issue still persists.
> I suspect kernel issues too.
>
> Anyway I've attached 2 patches to replace the previous one.
> The first hopefully addresses any races in the test.
> I don't think you hit any of these TBH.
>
> The second should detect the signal issues and skip the test.
Those look good to me.
Thanks.
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#9737
; Package
coreutils
.
(Thu, 03 Nov 2011 17:14:01 GMT)
Full text and
rfc822 format available.
Message #28 received at 9737 <at> debbugs.gnu.org (full text, mbox):
Pádraig Brady wrote:
> I reproduced this weirdness in OpenSuse 10.3 in a VM.
> Much less frequently though.
> Delays in 10 out of 2750
> Signal handler call failure in 1 out of 2750
Sorry for my late reply. Thanks.
> The delays might be due to bash, but I updated
> to 4.2 and the issue still persists.
I have: bash --version
GNU bash, version 3.1.17(1)-release (x86_64-suse-linux)
> I suspect kernel issues too.
>
> Anyway I've attached 2 patches to replace the previous one.
>
> The first hopefully addresses any races in the test.
> I don't think you hit any of these TBH.
> The second should detect the signal issues and skip the test.
I was using this:
for i in $( seq 1000 ) ; do
rm misc/timeout-group.log
make misc/timeout-group.log
done | grep -v PASS
Without the patches, I got:
* 153 FAILures.
With the patches, I get:
* 6x:
timeout-group: skipped test: timeout.cmd didn't receive a signal until after sleep?
SKIP: misc/timeout-group
* 62x:
timeout-group: skipped test: timeout returned 142. SIGALRM not handled?
SKIP: misc/timeout-group
Does this help?
If you need access to that machine, it's possible but not easy.
Have a nice day,
Berny
Message #29 received at 9737-done <at> debbugs.gnu.org (full text, mbox):
On 11/03/2011 05:11 PM, Voelker, Bernhard wrote:
> Without the patches, I got:
>
> * 153 FAILures.
>
> With the patches, I get:
>
> * 6x:
> timeout-group: skipped test: timeout.cmd didn't receive a signal until after sleep?
> SKIP: misc/timeout-group
>
> * 62x:
> timeout-group: skipped test: timeout returned 142. SIGALRM not handled?
> SKIP: misc/timeout-group
>
> Does this help?
> If you need access to that machine, it's possible but not easy.
That confirms things. Thanks for testing.
So I'm seeing the same only less frequent.
I'm attributing to system signal handling bugs for the moment.
By adding the skips, we can see if this is hitting modern systems.
cheers,
Pádraig.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Fri, 02 Dec 2011 12:24:02 GMT)
Full text and
rfc822 format available.
This bug report was last modified 13 years and 206 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.