GNU bug report logs - #41824
Dejagnu's unknown proc aborts testsuite run when triggered in test-case

Previous Next

Package: dejagnu;

Reported by: Tom de Vries <tdevries <at> suse.de>

Date: Fri, 12 Jun 2020 08:36:01 UTC

Owned by: jcb62281 <at> gmail.com

Severity: normal

Done: Jacob Bachmeyer <jcb62281 <at> gmail.com>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Jacob Bachmeyer <jcb62281 <at> gmail.com>
To: Pedro Alves <palves <at> redhat.com>
Cc: Tom de Vries <tdevries <at> suse.de>, 41824 <at> debbugs.gnu.org
Subject: bug#41824: Dejagnu's unknown proc aborts testsuite run when triggered in test-case
Date: Wed, 24 Jun 2020 21:43:56 -0500
Pedro Alves wrote:
> On 6/24/20 12:36 AM, Jacob Bachmeyer wrote:
>   
>> Pedro Alves wrote:
>>     
>>> On 6/17/20 4:26 AM, Jacob Bachmeyer wrote:
>>>       
>>> I'd think that tracking ERROR and WARNING in the summary would sort
>>> out this "test run blew up but no one noticed" issue.
>>>       
>> The problem is that the DejaGnu internals often generate ERROR and WARNING "directly" without using the procedures that update counters.  
>>     
>
> I understand that that's what DejaGnu does currently, but couldn't it be changed?
> Would it not be desirable?
>   

It could be changed, but those are much larger changes to the code and 
overall behavior than simply adding an UNRESOLVED result to report the 
failure, and if I understand correctly, the presence of UNRESOLVED 
results marks the test results as invalid or partially-valid at best, 
which is correct if a test script failed with a Tcl error.

Also, the current warning and error counters are not global, but are 
reset after each test result.  Exceeding thresholds for warnings or 
errors causes the next test result to be changed to UNRESOLVED before 
the counters are reset.

>> I am currently considering also adding a '--no-keep-going'/'--no_keep_going' option and making '--keep-going' the default for the rest of the 1.6 series.  Aborting on uncaught Tcl errors will be the default in 1.7, but I am trying to decide whether changing it in 1.6 counts more as fixing a bug (DejaGnu quietly sweeping Tcl errors under the proverbial rug) or introducing an environment change (needing '--keep-going' to reach the end of some testsuites that "worked" before).  This is further complicated by the fact that any testsuite needing that option to reach the end means that, technically, that testsuite is invalid -- it *did* throw a Tcl error during testing.
>>     
> It seems your designing this under the assumption that if a testsuite
> run "blows up", it's not a big deal to just fix it and rerun the testsuite.
>   

For most packages, this is correct.  For GDB, perhaps it is a bigger 
problem.

On the other hand, "blowing up" at least indicates clearly that 
something is wrong and prevents "whistling past the graveyard" where 
tests are not actually running, but everyone thinks the situation is fine.

> What that misses is that in some environments, particularly with slow embedded
> system boards, running the full testsuite can take _days_.   This is all
> compounded by the fact that the testsuite must be run in a large set of
> different modes and boards to cover everything for a single target.  Even on
> a modern GNU/Linux system, running all combinations of modes for full coverage
> can take hours.
>
> Having the testrun stop midway because of some specific testcase
> blowing up can waste a lot of time.
>
> The fact that the testsuite can take a while to fully run results in automated
> testing systems, like GDB's buildbot, struggling to keep up with testing
> all the commits that land in git.  On some buildbot setups, it's just not
> possible to test all commits.  So if some testcase blows up the whole
> test run, the next time you run the testsuite, it'll likely be that a
> large number of unrelated commits went into the sources.  Losing a large
> chunk of the testsuite results because of some small typo in one
> testcase is thus undesirable, because noticing the typo in the first
> place can take a while, and also you may be stuck with a "last good run"
> for many of the testcases that gets older and older until
> the typo is finally fixed.
>   

I presume buildbot can be easily configured to pass a '--keep_going' 
option, perhaps as 'RUNTESTFLAGS=--keep_going' if needed?

Still, needing a new option to get old behavior is changing the 
environment interface, so I will make --keep_going the default, at least 
for the rest of 1.6.  That default is likely to change for 1.7.  For 
automated systems like buildbot, passing '--keep_going' should be 
perfectly reasonable, which is why the option exists.  (While the 
documented options will be changed to use hyphens to align with the rest 
of the GNU system in 1.7, the current option names will continue to be 
accepted, since the aliases are obvious.)

This has been pushed on the temporary PR41918 branch because the fix for 
PR41824 has been rolled into the fix for PR41918, as the bugs are 
closely related.

> Then there's the issue about testsuite order.  DejaGnu runs testcases
> in alphabetical filename order, in sequence.  If the testrun blows up in
> the first testcase run, say, a.exp, the whole testrun stops before b.exp,
> etc. are run.  If the testcase that blows up is the last one, say named
> zzz.exp, most of the testsuite runs.  This is not "fair".  If I rename
> a.exp to zzzzz.exp, then suddenly that testcase gains more "importance" in
> the sense that it can no longer blow up the whole test run.  A
> testcase's name should not be significant like that.
>
> On the same vein, consider parallel mode.  This is not native to DejaGnu,
> but both GDB and GCC have parallel modes, where their makefiles set up a
> make check such that they spawn a number of runtest processes running in
> parallel, each exercising a subset the whole testsuite.  Imagine a testsuite
> composed of exactly 32 testcases.  And further imagine that you run it
> under "make check -j32".  That is, you'll end up with 32 runtest instances,
> each running exactly one testcase.  Now, if one of the testcases blows up, all
> the other 31 will still run.  Again, this shows that testcase name, and
> testrun order should not be important.  There is no real reason that a
> sequential run in that scenario ends up with different results (testcase a.exp
> blew up, so no result for the other tests) compared to a parallel
> run (all tests ran successfully but test a.exp).  Instead, I advocate
> that the testresults summary should indicate that a problem happened,
> so that "blow up event" isn't lost.  But do let the whole testsuite run.
>   

There are plans to eventually implement a native parallel testing mode.  
You are correct that the order in which testcases are run should be 
insignificant, and for valid testsuites that do not crash, it is 
insignificant -- all tests are run.  The problems start when a testcase 
*crashes* and aborts with a Tcl error.

The entire reason for stopping the testsuite immediately when a Tcl 
error occurs is that it is undeniable that something is *very* *wrong* 
when the testsuite stops early.  Software testing is hard enough when 
you know your testsuite is good, but it is far worse when you have bugs 
in the testsuite masking bugs in the actual program because a test 
script is crashing and part of the program that is supposed to be tested 
is not being exercised.


-- Jacob




This bug report was last modified 4 years and 314 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.