GNU bug report logs - #21061
coreutils-8.24 - Partially reproducible failures of tests/misc/timeout-parameters.sh


Package: coreutils;

Reported by: Peter Bray <pdb_ml <at> yahoo.com.au>

Date: Wed, 15 Jul 2015 09:24:01 UTC

Severity: normal

To reply to this bug, email your comments to 21061 AT debbugs.gnu.org.




Report forwarded to bug-coreutils <at> gnu.org:
bug#21061; Package coreutils. (Wed, 15 Jul 2015 09:24:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Peter Bray <pdb_ml <at> yahoo.com.au>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 15 Jul 2015 09:24:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Peter Bray <pdb_ml <at> yahoo.com.au>
To: bug-coreutils <at> gnu.org
Subject: coreutils-8.24 - Partially reproducible failures of
 tests/misc/timeout-parameters.sh
Date: Wed, 15 Jul 2015 19:22:52 +1000
[Message part 1 (text/plain, inline)]
Greetings,

N.B. This bug report is for reference only, and documents only a
     partially reproducible check failure. No Action Requested.

On Solaris 10 (Update 8 and Update 11) and Solaris 11.2 X86 VMs, and
one Solaris 10 Update 10 (Non-VM) system, I see random "gmake check"
failures for "tests/misc/timeout-parameters.sh".

Running the test by itself (with the command line below) on the same
VMs / real system will sometimes succeed and sometimes fail.

  gmake check TESTS=tests/misc/timeout-parameters.sh VERBOSE=yes SUBDIRS=.

Looking through the attached "failure.log" file, I extracted the
following command line test, which may exhibit the failure without all
the make(1) and test infrastructure code:

  failures=0
  for i in `./src/seq 1 100`
  do
    ./src/timeout 2.34e+5d sleep 0 \
      || { echo fail; failures=`expr ${failures} + 1`; }
  done
  echo "Total Failures: ${failures}"

On a real hardware system (Xeon E3-1245v2) with a 64-bit kernel,
failures are very rare (only 1 test harness failure seen, no failures
of the sample code above even with 1..1000 runs).

On virtual machines (also using Xeon E3-1245v2 running VMware ESXi
5.5d (latest patches) - two identical ESXi systems running similarly
configured VMs), test harness failures and failures in the above
command line check are rare for the 64-bit Solaris kernels.

Failures on Solaris 10 32-bit kernels (on both of these ESXi servers),
are easily reproduced and vary between 5% (common) and 45% (rare).

Regards,

Peter Bray
Sydney, Australia

PS: Again, no action is requested; this bug report is intended simply
as a record for later reference.
[failure.log (text/plain, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#21061; Package coreutils. (Wed, 15 Jul 2015 10:32:01 GMT) Full text and rfc822 format available.

Message #8 received at 21061 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Peter Bray <pdb_ml <at> yahoo.com.au>, 21061 <at> debbugs.gnu.org
Subject: Re: bug#21061: coreutils-8.24 - Partially reproducible failures of
 tests/misc/timeout-parameters.sh
Date: Wed, 15 Jul 2015 11:30:59 +0100
On 15/07/15 10:22, Peter Bray wrote:
> Greetings,
> 
> N.B. This bug report is for reference only, and documents only a
>       partially reproducible check failure. No Action Requested.
> 
> On Solaris 10 (Update 8 and Update 11) and Solaris 11.2 X86 VMs, and
> one Solaris 10 Update 10 (Non-VM) system, I see random "gmake check"
> failures for "tests/misc/timeout-parameters.sh".
> 
> Running the test by itself (with the command line below) on the same
> VMs / real system will sometimes succeed and sometimes fail.
> 
>    gmake check TESTS=tests/misc/timeout-parameters.sh VERBOSE=yes SUBDIRS=.
> 
> Looking through the attached "failure.log" file, I extracted the
> following command line test, which may exhibit the failure without all
> the make(1) and test infrastructure code:
> 
>    failures=0
>    for i in `./src/seq 1 100`
>    do
>      ./src/timeout 2.34e+5d sleep 0 \
>        || { echo fail; failures=`expr ${failures} + 1`; }
>    done
>    echo "Total Failures: ${failures}"
> 
> On a real hardware system (Xeon E3-1245v2) with a 64-bit kernel,
> failures are very rare (only 1 test harness failure seen, no failures
> of the sample code above even with 1..1000 runs).
> 
> On virtual machines (also using Xeon E3-1245v2 running VMware ESXi
> 5.5d (latest patches) - two identical ESXi systems running similarly
> configured VMs), test harness failures and failures in the above
> command line check are rare for the 64-bit Solaris kernels.
> 
> Failures on Solaris 10 32-bit kernels (on both of these ESXi servers),
> are easily reproduced and vary between 5% (common) and 45% (rare).

Interesting. I'm not reproducing that in 5000 loops in the above test script
on 32 bit baremetal solaris 10 update 10.

I presume the large timeout value is causing early timer firing
on your systems for some reason? What does this return?

  time src/timeout 2.34e+5d sleep inf

Note on 32 bit, the 234000 days will be truncated to an itimerspec of:
  { {0,0}, {2147483647,999999999} }
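
As an editor's illustration of the truncation described above (a sketch of the clamping, an assumption on my part rather than a quote of timeout.c): 2.34e+5 days in seconds overflows a 32-bit time_t, so the timer value saturates at INT32_MAX seconds.

```shell
# Sketch: why 234000 days becomes { 2147483647, 999999999 } on 32 bit.
requested=$((234000 * 86400))   # 20217600000 seconds requested
max32=2147483647                # INT32_MAX, the 32-bit time_t limit
if [ "$requested" -gt "$max32" ]; then
    # saturate at the largest representable itimerspec
    tv_sec=$max32
    tv_nsec=999999999
else
    tv_sec=$requested
    tv_nsec=0
fi
echo "it_value = { $tv_sec, $tv_nsec }"
```

On a 64-bit shell this prints `it_value = { 2147483647, 999999999 }`, matching the itimerspec shown above.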

A wild guess is that perhaps ntp is adjusting the system time
which causes the above timer to be adjusted in the kernel
and roll over to 0, thus triggering early?

thanks,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#21061; Package coreutils. (Thu, 16 Jul 2015 06:11:02 GMT) Full text and rfc822 format available.

Message #11 received at 21061 <at> debbugs.gnu.org (full text, mbox):

From: Peter Bray <pdb_ml <at> yahoo.com.au>
To: Pádraig Brady <P <at> draigBrady.com>, 
 21061 <at> debbugs.gnu.org
Subject: Re: bug#21061: coreutils-8.24 - Partially reproducible failures of
 tests/misc/timeout-parameters.sh
Date: Thu, 16 Jul 2015 16:09:46 +1000
[Message part 1 (text/plain, inline)]
On 15/07/15 08:30 PM, Pádraig Brady wrote:
> Interesting. I'm not reproducing that in 5000 loops in the above test script
> on 32 bit baremetal solaris 10 update 10.
>
> I presume the large timeout value is causing early timer firing
> on your systems for some reason? What does this return?
>
>    time src/timeout 2.34e+5d sleep inf
>
> Note on 32 bit, the 234000 days will be truncated to an itimerspec of:
>    { {0,0}, {2147483647,999999999} }
>
> A wild guess is that perhaps ntp is adjusting the system time
> which causes the above timer to be adjusted in the kernel
> and roll over to 0, thus triggering early?
>
> thanks,
> Pádraig.
>

Pádraig,

Here is the additional information you requested. Unfortunately I have
yet to install gdb(1), so I am using system tools for this response.
The installation of coreutils-8.24 has been completed on all compile
server VMs, so the commands now have a 'g' prefix.

% gtimeout 2.34e+5d gsleep inf

No output and exit status of 124 [$?=124] (32-bit kernel S10U11 / GCC 4.9.3)

% truss gtimeout 2.34e+5d gsleep inf 2>&1 | tee truss.out

File Attached "truss.out" (also exits with 124)

Note: Adding the -v option (below) on a separate run did not yield a
great deal of information on the data provided to the timer*() calls.

% truss -tall -v timer_create,timer_settime gtimeout 2.34e+5d gsleep inf 2>&1

timer_create(3, 0x00000000, 0x080471AC)         = 0
timer_settime(0, 0, 0x080471B0, 0x00000000)     = 0
    Received signal #14, SIGALRM, in waitid() [caught]
      siginfo: SIGALRM pid=12118 uid=100 code=-3
waitid(P_PID, 12119, 0x08047130, WEXITED|WTRAPPED) Err#91 ERESTART

Also captured truss -l output via:

% truss -l -tall -v timer_create,timer_settime gtimeout 2.34e+5d gsleep inf \
  2>&1 | tee truss-l.out

A normal apptrace shows nothing of great value (it does not show actual
data, just addresses):

% apptrace gtimeout 2.34e+5d gsleep inf

but it is attached as "apptrace.out".

Note that the following apptrace command coredumps on each invocation:

% apptrace -v timer_settime gtimeout 2.34e+5d gsleep inf
-> gtimeout -> librt.so.1:int timer_settime(timer_t = 0x0, int = 0x0, const struct itimerspec * = 0x8047130, struct itimerspec * = 0x0)
        arg0 = (timer_t) 0x0
        arg1 = (int) 0x0
        arg2 = (const struct itimerspec *) 0x8047130 (struct itimerspec) {
        it_interval: (struct timespec) {
                tv_sec: (time_t) 0
                tv_nsec: (long) 0
        it_value: (struct timespec) {
                tv_sec: (time_t) 0x7fffffff
                tv_nsec: (long) 0x3b9ac9ff
        }
        arg3 = (struct itimerspec *) 0x0 (struct itimerspec) {
        it_interval: (struct timespec) {
                tv_sec: (time_t)
apptrace: gtimeout: Segmentation Fault(Core dump)

This coredump occurs even on 64-bit systems, where the gtimeout command
otherwise waits for the sleep command (to finish - which it never will).

The timer_settime(3RT) manual page states that the last argument is
permitted to be NULL, so that does not seem to be a problem.

And regarding the NTP question, all compile server VMs have NTP disabled.

% svcs -a | grep -i ntp

disabled       Jul_13   svc:/network/ntp:default
disabled       Jul_13   svc:/network/ntp4:default

That is, NTP has been disabled not just since the last boot, but since
installation. NTP is, however, running on both ESXi 5.5 hosts.

Regards,
Peter

PS: Investigating with my limited mdb(1) skills shows that it is
    apptrace(1) coredumping, not gtimeout(1).

% mdb =gtimeout core
> $C
08046a98 LMc9bfeea8`apptrace.so.1`print_int+0xbd(7, 0, 8046cc0)
08046bc4 LMc9bfeea8`apptrace.so.1`elt_print+0x137(c91e47b6, 2f, 0, 2, 
8046cc0)
08046bf4 LMc9bfeea8`libctf.so.1`ctf_type_rvisit+0x56(c91f3930, 2f, 
c9b63588, 8046cc0, c91e47b6, 0)
08046c2c LMc9bfeea8`libctf.so.1`ctf_type_rvisit+0x15e(c91f3930, 30, 
c9b63588, 8046cc0, c91e489b, 0)
08046c64 LMc9bfeea8`libctf.so.1`ctf_type_rvisit+0x15e(c91f3930, 44, 
c9b63588, 8046cc0, c9ad9048, 0)
08046c8c LMc9bfeea8`libctf.so.1`ctf_type_visit+0x2c(c91f3930, 44, 
c9b63588, 8046cc0)
08046ce0 LMc9bfeea8`apptrace.so.1`print_value+0x127(c91f3930, 45, 0)
08046fac LMc9bfeea8`apptrace.so.1`la_i86_pltenter+0x3d1(8047030, 38, 
c9af06f0, c9af0e48, 8047084, c9940304)
08047000 ld.so.1`_audit_pltenter+0x11e(c9af06c0, c9bfea18, c9af0ac0, 
8047030, 38, 8047084)
08047050 ld.so.1`audit_pltenter+0x98(c9bfea18, c9af0ac0, c9940308, 38, 
8047084, c9940304)
080470d8 ld.so.1`elf_plt_trace+0x4d(0, 0, 8047130, 0)
08047158 settimeout+0x11d(60000000, 4212d440, 8047218, 8047218)
080471e8 main+0x2c6(8052e50, 4, 8047218)
0804720c _start+0x80(4, 80473d0, 80473d9, 80473e2, 80473e9, 0)
[apptrace.out (text/plain, attachment)]
[truss.out (text/plain, attachment)]
[truss-l.out (text/plain, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#21061; Package coreutils. (Thu, 16 Jul 2015 07:57:01 GMT) Full text and rfc822 format available.

Message #14 received at 21061 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Peter Bray <pdb_ml <at> yahoo.com.au>, 21061 <at> debbugs.gnu.org
Subject: Re: bug#21061: coreutils-8.24 - Partially reproducible failures of
 tests/misc/timeout-parameters.sh
Date: Thu, 16 Jul 2015 08:56:37 +0100
On 16/07/15 07:09, Peter Bray wrote:
> Pádraig,
> 
> Here is the additional information you requested. Unfortunately I have
> yet to install gdb(1), so I am using system tools for this response.
> The installation of coreutils-8.24 has been completed on all compile
> server VMs, so the commands now have a 'g' prefix.
> 
> % gtimeout 2.34e+5d gsleep inf
> 
> No output and exit status of 124 [$?=124] (32-bit kernel S10U11 / GCC 4.9.3)

I was looking to get an indication of how long it waits
(I have a fair idea about what is sent to timer_settime()).
To that end, the command above has a leading 'time' to
report how long it actually took, i.e. something like
this would be informative:

  for i in $(seq 20); do time gtimeout 2.34e+5d sleep inf; done

thanks,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#21061; Package coreutils. (Fri, 17 Jul 2015 03:23:02 GMT) Full text and rfc822 format available.

Message #17 received at 21061 <at> debbugs.gnu.org (full text, mbox):

From: Peter Bray <pdb_ml <at> yahoo.com.au>
To: Pádraig Brady <P <at> draigBrady.com>, 
 21061 <at> debbugs.gnu.org
Subject: Re: bug#21061: coreutils-8.24 - Partially reproducible failures of
 tests/misc/timeout-parameters.sh
Date: Fri, 17 Jul 2015 13:21:38 +1000
On 16/07/15 05:56 PM, Pádraig Brady wrote:
> for i in $(seq 20); do time gtimeout 2.34e+5d sleep inf; done

gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.015 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.009 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total
gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.010 total

Sorry, I didn't make that clear: when the command works (as it mostly 
does with 64-bit binaries on 64-bit systems, which I conveyed poorly in 
the last post) it never returns; on failure it returns "immediately", 
which (as the above shows) happened in 20 out of 20 test cases.

> This coredump occurs even on 64-bit systems where the gtimeout command
> waits for the sleep command (to finish - which it won't).

Sorry for the confusion,

Peter




Information forwarded to bug-coreutils <at> gnu.org:
bug#21061; Package coreutils. (Fri, 17 Jul 2015 08:57:02 GMT) Full text and rfc822 format available.

Message #20 received at 21061 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Peter Bray <pdb_ml <at> yahoo.com.au>, 21061 <at> debbugs.gnu.org
Subject: Re: bug#21061: coreutils-8.24 - Partially reproducible failures of
 tests/misc/timeout-parameters.sh
Date: Fri, 17 Jul 2015 09:56:07 +0100
On 17/07/15 04:21, Peter Bray wrote:
> On 16/07/15 05:56 PM, Pádraig Brady wrote:
>> for i in $(seq 20); do time gtimeout 2.34e+5d sleep inf; done
> 
> gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.015 total
> gtimeout 2.34e+5d gsleep inf  0.00s user 0.00s system 0% cpu 0.009 total
> [... 18 more identical timing lines elided ...]
> 
> Sorry, I didn't make that clear: when the command works (as it mostly 
> does with 64-bit binaries on 64-bit systems, which I conveyed poorly in 
> the last post) it never returns; on failure it returns "immediately", 
> which (as the above shows) happened in 20 out of 20 test cases.

Is there ever a failure with non maximal values?
For example is there ever an early return with:

  while true; do time gtimeout 0.1 sleep inf; test $? != 124 && break; done

I'm guessing not, and the kernel is having rounding issues?
Do you still get the problem if you change the clock in timeout.c
from CLOCK_REALTIME to say CLOCK_MONOTONIC?

cheers,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#21061; Package coreutils. (Sat, 18 Jul 2015 06:36:02 GMT) Full text and rfc822 format available.

Message #23 received at 21061 <at> debbugs.gnu.org (full text, mbox):

From: Peter Bray <pdb_ml <at> yahoo.com.au>
To: Pádraig Brady <P <at> draigBrady.com>, 
 21061 <at> debbugs.gnu.org
Subject: Re: bug#21061: coreutils-8.24 - Partially reproducible failures of
 tests/misc/timeout-parameters.sh
Date: Sat, 18 Jul 2015 13:09:18 +1000
[Message part 1 (text/plain, inline)]
On 17/07/15 06:56 PM, Pádraig Brady wrote:
> while true; do time gtimeout 0.1 sleep inf; test $? != 124 && break; done

I ran this code on S10U8/32 for almost 12 hours without it stopping 
itself, and for almost 10 hours on S10U11/32 with the same result (it 
did not stop itself).

Made the requested change to src/timeout.c

Backup  : src/timeout.c.orig
Edited  : src/timeout.c

Command : gdiff

139c139
<   if (timer_create (CLOCK_REALTIME, NULL, &timerid) == 0)
---
>   if (timer_create (CLOCK_MONOTONIC, NULL, &timerid) == 0)
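
For what it's worth, the one-line change above can also be applied mechanically with sed. A self-contained sketch (editor's illustration: it uses a stand-in file here; in the real tree the target is src/timeout.c at the line shown in the diff):

```shell
# Stand-in for the relevant line of src/timeout.c (hypothetical copy).
printf '  if (timer_create (CLOCK_REALTIME, NULL, &timerid) == 0)\n' \
    > timeout.c.orig
# Switch the timer clock from CLOCK_REALTIME to CLOCK_MONOTONIC.
sed 's/CLOCK_REALTIME/CLOCK_MONOTONIC/' timeout.c.orig > timeout.c
cat timeout.c
```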

Run as an unprivileged user, this gives the following error summary:

% egrep '(FAIL|ERROR) ' tests/test-suite.log | sed -e 's/FAIL /FAIL  /'
FAIL  tests/misc/tee.sh (exit status: 1)
FAIL  tests/misc/timeout-blocked.pl (exit status: 1)
FAIL  tests/tail-2/wait.sh (exit status: 1)
FAIL  tests/tail-2/retry.sh (exit status: 1)
FAIL  tests/tail-2/symlink.sh (exit status: 1)
FAIL  tests/ls/infloop.sh (exit status: 1)
FAIL  tests/tail-2/follow-name.sh (exit status: 1)
FAIL  tests/tail-2/follow-stdin.sh (exit status: 1)

This was on Solaris 10 Update 8; Solaris 10 Update 11 and Solaris 11.2 
also have the additional tmpfs build problems mentioned in bug 21062.

With the following executive summary:

% grep "timer_create: Not owner" ./tests/test-suite.log | sort | uniq -c
   1 + timeout: warning: timer_create: Not owner
   7 +timeout: warning: timer_create: Not owner
  29 timeout: warning: timer_create: Not owner

When run as "root" on Solaris 10 Update 8, the attached 
"test-suite.log-monotonic-root" shows that two tests FAIL:

% egrep '(FAIL|ERROR) ' test-suite.log-*-root | sed -e 's/FAIL /FAIL  /'
FAIL  tests/misc/tee.sh (exit status: 1)
FAIL  tests/misc/timeout-parameters.sh (exit status: 1)

With the following section of the attached log, showing a sequence of 
failures (exit status = 124) on the root execution of the checks:

+++ date +%s
++ expr 2147483647 - 1437184440 + 100
+ KERNEL_OVERFLOW_LIMIT=710299307
+ timeout 710299307 sleep 0
+ test 0 '!=' 124
+ timeout 4294967296 sleep 0
+ test 124 = 0
+ fail=1
++ expr 4294967295 / 86400 + 1
+ timeout 49711d sleep 0
+ test 124 = 0
+ fail=1
+ timeout 999999999999999999999999999999999999999999999999999999999999d sleep 0
+ test 124 = 0
+ fail=1
+ timeout 2.34e+5d sleep 0
+ test 124 = 0
+ fail=1
+ timeout 2.34 sleep 0
+ test 0 = 0
+ timeout .999999999 sleep 0
+ timeout --signal=invalid 1 sleep 0
timeout: invalid: invalid signal
Try 'timeout --help' for more information.
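
For reference, the KERNEL_OVERFLOW_LIMIT computed at the top of the log above picks a timeout that, added to the current wall-clock time, lands about 100 seconds past the 32-bit time_t rollover. A sketch using the timestamp captured in the log:

```shell
# Reproduce the test's limit computation with the log's timestamp.
# INT32_MAX - now + 100: relative to "now", this expires ~100 s past
# the 2038 rollover of a 32-bit time_t.
now=1437184440                        # the `date +%s` value in the log
limit=$((2147483647 - now + 100))
echo "KERNEL_OVERFLOW_LIMIT=$limit"   # 710299307, as in the log
```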

On Solaris 10 Update 11 as root, the attached 
"test-suite.log-monotonic-root-s10u11" shows a more promising result but 
still a failure:

+++ date +%s
++ expr 2147483647 - 1437187249 + 100
+ KERNEL_OVERFLOW_LIMIT=710296498
+ timeout 710296498 sleep 0
+ test 0 '!=' 124
+ timeout 4294967296 sleep 0
+ test 124 = 0
+ fail=1
++ expr 4294967295 / 86400 + 1
+ timeout 49711d sleep 0
+ test 124 = 0
+ fail=1
+ timeout 999999999999999999999999999999999999999999999999999999999999d sleep 0
+ test 124 = 0
+ fail=1
+ timeout 2.34e+5d sleep 0
+ test 124 = 0
+ fail=1
+ timeout 2.34 sleep 0
+ test 0 = 0
+ timeout .999999999 sleep 0
+ timeout --signal=invalid 1 sleep 0
timeout: invalid: invalid signal
Try 'timeout --help' for more information.
+ test 125 = 125

The requirement for "root" privileges to use CLOCK_MONOTONIC is a 
non-starter in my book, as the utility cannot lower privileges for the 
command it runs - I hope you would agree. It seems that the tests fail 
even with the CLOCK_MONOTONIC option.

The following test, run in the monotonic-change builds on S10U8 and 
S10U11 as root for only a minute or so, seems to be a repetition of 
previously seen results on both systems:

 # while true; do time ./src/timeout 0.1 ./src/sleep inf; test $? != 124 && break; done
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 0.105 total
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 0.107 total

But the same code (monotonic build tree) run as a non-privileged user 
gives approximately a one-second result from "time", in addition to the 
error message.

% while true; do time ./src/timeout 0.1 ./src/sleep inf; test $? != 124 && break; done
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 0.996 total
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 1.000 total
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 1.000 total
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 0.999 total
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 1.000 total
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 1.000 total
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 1.000 total
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 1.000 total
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 1.000 total
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 1.000 total
./src/timeout: warning: timer_create: Not owner
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 0.999 total
./src/timeout: warning: timer_create: Not owner
^C
./src/timeout 0.1 ./src/sleep inf  0.00s user 0.00s system 0% cpu 0.971 total
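
A possible explanation for the consistent ~1 s results above (an assumption on my part about timeout's fallback path, not confirmed from the source): when timer_create() fails, a fallback to alarm(2) can only count whole seconds, so a 0.1 s timeout would have to be rounded up to 1 s to avoid firing early. Sketched:

```shell
# Sketch of whole-second rounding (assumed alarm(2)-style fallback):
# alarm(2) takes an unsigned number of seconds, so any fractional
# timeout is rounded up to the next whole second.
timeout_secs="0.1"
rounded=$(awk -v t="$timeout_secs" 'BEGIN { r = int(t); if (t > r) r++; print r }')
echo "effective timeout: ${rounded}s"
```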

I hope I have provided the requested / required information this time :-)

Regards,
Peter

PS: The tmpfs bug (21062) affects commands as simple as 
"./src/cp -a tests .." (a command I issued before running "gmake check" 
as root on S10U11)

[test-suite.log-monotonic-normal_user (text/plain, attachment)]
[test-suite.log-monotonic-root (text/plain, attachment)]
[test-suite.log-monotonic-root-s10u11 (text/plain, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#21061; Package coreutils. (Sat, 18 Jul 2015 11:11:01 GMT) Full text and rfc822 format available.

Message #26 received at 21061 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Peter Bray <pdb_ml <at> yahoo.com.au>, 21061 <at> debbugs.gnu.org
Subject: Re: bug#21061: coreutils-8.24 - Partially reproducible failures of
 tests/misc/timeout-parameters.sh
Date: Sat, 18 Jul 2015 12:10:43 +0100
On 18/07/15 04:09, Peter Bray wrote:
> On 17/07/15 06:56 PM, Pádraig Brady wrote:
>> while true; do time gtimeout 0.1 sleep inf; test $? != 124 && break; done
> 
> Run this code on S10U8/32 for almost 12 hours without it stopping 
> itself. An almost 10 hours on S10U11/32 with the same result (not 
> stopping itself).

Cool. That suggests it's a kernel issue with large timeout values.

> 
> Made the requested change to src/timeout.c
> 
> Backup  : src/timeout.c.orig
> Edited  : src/timeout.c
> 
> Command : gdiff
> 
> 139c139
> <   if (timer_create (CLOCK_REALTIME, NULL, &timerid) == 0)
> ---
>  >   if (timer_create (CLOCK_MONOTONIC, NULL, &timerid) == 0)
> 
> Run as an unprivileged user, gives the following error summary:
> 
> % egrep '(FAIL|ERROR) ' tests/test-suite.log | sed -e 's/FAIL /FAIL  /'
> FAIL  tests/misc/tee.sh (exit status: 1)
> FAIL  tests/misc/timeout-blocked.pl (exit status: 1)
> FAIL  tests/tail-2/wait.sh (exit status: 1)
> FAIL  tests/tail-2/retry.sh (exit status: 1)
> FAIL  tests/tail-2/symlink.sh (exit status: 1)
> FAIL  tests/ls/infloop.sh (exit status: 1)
> FAIL  tests/tail-2/follow-name.sh (exit status: 1)
> FAIL  tests/tail-2/follow-stdin.sh (exit status: 1)

That's a strange restriction.
Linux doesn't have that at least.

> 
> On Solaris 10 Update 8 with Solaris 10 Update 11 and Solaris 11.2 have 
> the additional tmpfs build problems mentioned in BUG-21062
> 
> With the following executive summary:
> 
> % grep "timer_create: Not owner" ./tests/test-suite.log | sort | uniq -c
>     1 + timeout: warning: timer_create: Not owner
>     7 +timeout: warning: timer_create: Not owner
>    29 timeout: warning: timer_create: Not owner
> 
> When run as "root" on Solaris 10 Update 8, the attached 
> "test-suite.log-monotonic-root" shows that two tests FAIL:
> 
> % egrep '(FAIL|ERROR) ' test-suite.log-*-root | sed -e 's/FAIL /FAIL  /'
> FAIL  tests/misc/tee.sh (exit status: 1)
> FAIL  tests/misc/timeout-parameters.sh (exit status: 1)
> 
> With the following section of the attached log, showing a sequence of 
> failures (exit status = 124) on the root execution of the checks:
> 
> +++ date +%s
> ++ expr 2147483647 - 1437184440 + 100
> + KERNEL_OVERFLOW_LIMIT=710299307
> + timeout 710299307 sleep 0
> + test 0 '!=' 124
> + timeout 4294967296 sleep 0
> + test 124 = 0
> + fail=1
> ++ expr 4294967295 / 86400 + 1
> + timeout 49711d sleep 0
> + test 124 = 0
> + fail=1
> + timeout 999999999999999999999999999999999999999999999999999999999999d sleep 0
> + test 124 = 0
> + fail=1
> + timeout 2.34e+5d sleep 0
> + test 124 = 0
> + fail=1
> + timeout 2.34 sleep 0
> + test 0 = 0
> + timeout .999999999 sleep 0
> + timeout --signal=invalid 1 sleep 0
> timeout: invalid: invalid signal
> Try 'timeout --help' for more information.
> 
> On Solaris 10 Update 11 as root, attached as 
> "test-suite.log-monotonic-root-s10u11" shows a more promising result but 
> still failure:
> 
> +++ date +%s
> ++ expr 2147483647 - 1437187249 + 100
> + KERNEL_OVERFLOW_LIMIT=710296498

Note we already check for similar 32-bit Linux/HPPA kernel issues
in this test. We could extend that if we found a maximal value that was
reliably problematic for Solaris, and ran the gating test in a loop,
say 10 times. The non-determinism of this is awkward though.

> The requirement for "root" privileges to use MONOTONIC is a non-starter 
> in my books as the utility can not lower privileges for the command it 
> runs - I hope you would agree. It seems that the tests fail even with 
> the CLOCK_MONOTONIC option.

+1

> I hope I am provided the requested / required information this time :-)

Yes thanks for the extensive testing.

Pádraig.




