GNU bug report logs -
#25386
guile-2.0 and 2.2 thread leakage+crash; very simple test attached.
Previous Next
Reported by: linasvepstas <at> gmail.com
Date: Sat, 7 Jan 2017 21:49:01 UTC
Severity: normal
Tags: fixed
Done: Andy Wingo <wingo <at> pobox.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 25386 in the body.
You can then email your comments to 25386 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-guile <at> gnu.org
:
bug#25386
; Package
guile
.
(Sat, 07 Jan 2017 21:49:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
linasvepstas <at> gmail.com
:
New bug report received and forwarded. Copy sent to
bug-guile <at> gnu.org
.
(Sat, 07 Jan 2017 21:49:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
The (very simple) program below leaks ... something, very rapidly, and
then crashes after about 15-30 seconds. Last thing printed before
crash:
rate=194.80519560944032
num threads=2
((gc-time-taken . 2791348254) (heap-size . 7532883968) (heap-free-size
. 2449408) (heap-total-allocated . 23912882640)
(heap-allocated-since-gc . 1073995264) (protected-objects . 90)
(gc-times . 87))
Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS
Aborted
Similar issue in guile-2.2 except it takes longer (8 minutes) and
crashes in gc somewhere. I assume that some sort of
continuation is left lying about, even though the thread has
exited.
(define junk 0)
(define halt #f)
(define (wtf-thr)
(define start (- (current-time) 0.1))
; Create thread that does junk and exits. Yes, the increment
; of `junk` is not protected, and its racey, but so what.
(define (mkthr v) (call-with-new-thread (lambda () (set! junk (+ junk
1)))))
; thread arguments
(define thrarg (make-list 10 0))
(define cnt 0)
(define (mke)
; Create a limited number of threads
(define thr-list (map mkthr thrarg))
; (display (length (all-threads)))
(map join-thread thr-list)
; Some handy debug printing.
(set! cnt (+ cnt 1))
(if (eq? 0 (modulo cnt 500))
(begin
(display "rate=")
(display (/ cnt (- (current-time) start))) (newline)
(display "num threads=")
(display (length (all-threads))) (newline)
(display (gc-stats)) (newline) (newline)
)))
; tail recursive infinite loop.
(define (aloop) (mke) (if (not halt) (aloop)))
; while forever.
(aloop)
)
; Run elsewhere, so that we have a shell prompt
; (not required for the bug)
(call-with-new-thread wtf-thr)
; halt if desired.
; (set! halt #t)
Information forwarded
to
bug-guile <at> gnu.org
:
bug#25386
; Package
guile
.
(Sun, 08 Jan 2017 02:58:01 GMT)
Full text and
rfc822 format available.
Message #8 received at 25386 <at> debbugs.gnu.org (full text, mbox):
I did a fairly through review of the thread-creation and thread-join
code in the git master branch, and it looks to be just fine. Thus,
some experimentation is in order:
Going back to guile-2.0, I see this behavior:
guile -v
guile (GNU Guile) 2.0.11
Packaged by Debian (2.0.11-deb+1-10)
If I add a manual gc to the exit of the thread, like so:
(define (mkthr v) (call-with-new-thread (lambda ()
(set! junk (+ junk 1)) (gc) )))
then the heap blows up, in minutes, to about 180MB but then stops
growing, even after hours and millions of thread creates:
(heap-size . 183734272) (gc-times . 1957954)
If I gc only every third thread create, it quickly blows up to about
400MB, and then stablizes, for hours:
(heap-size . 428638208) (gc-times . 1292663)
If I gc every 17th thread, it blows up to about 1.8GB and then is stable:
(heap-size . 1875902464) (gc-times . 327462)
This last one after about 5.5 million thread creates and joins.
The counting is done like so:
(define (mkthr v) (call-with-new-thread (lambda ()
(lock-mutex mtx)
(if (eq? 0 (modulo junk 17)) (gc))
(set! junk (+ junk 1))
(unlock-mutex mtx)
)))
In each case, it seems to hit a plateau at about (n+1)*100MB when gc
is done on one out of every n threads. This seems quite bizarre to
me: why does this inverted relation on number of gcs vs number of
thread creates? What's magic about 100MB? Clearly 100MB is wayyy too
large for this very simple program. I mean, even if I gc at *every*
thread-exit ...
(I have not yet explored above in guile-2.2)
Since I cannot find any 'obvious' bugs in guile, this suggests some
strange stochastic behavior in bdw-gc?
Information forwarded
to
bug-guile <at> gnu.org
:
bug#25386
; Package
guile
.
(Sun, 08 Jan 2017 14:58:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 25386 <at> debbugs.gnu.org (full text, mbox):
Ran the three test case overnight; saw mostly no increase in mem usage.
-- gc before every thread exit: up to 9M thread exits, no change in heap-size
-- gc before every third thread exit: 25M thread exits, no change.
-- gc before every 17th thread exit: 44M thread exits, relatively
small increase, from (heap-size . 1875902464) to (heap-size .
2068840448)
Information forwarded
to
bug-guile <at> gnu.org
:
bug#25386
; Package
guile
.
(Sun, 08 Jan 2017 21:57:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 25386 <at> debbugs.gnu.org (full text, mbox):
Since above works so swimmingly in the bove example, I tried it in a
production system. Calling gc shortly before thread exit results in a
crash, always the same crash, always in under 20 minutes:
guile: hashtab.c:137: vacuum_weak_hash_table: Assertion `removed <= len'
failed.
Aborted
again, this is for guile -v
guile (GNU Guile) 2.0.11
Packaged by Debian (2.0.11-deb+1-10)
Perhaps this is fixed in 2.0.13 ???
guile: hashtab.c:137: vacuum_weak_hash_table: Assertion `removed <= len' failed.
Thread 1416 "guile" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffe7b7fe700 (LWP 29883)]
0x00007ffff749e428 in __GI_raise (sig=sig <at> entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff749e428 in __GI_raise (sig=sig <at> entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff74a002a in __GI_abort () at abort.c:89
#2 0x00007ffff7496bd7 in __assert_fail_base (fmt=<optimized out>,
assertion=assertion <at> entry=0x7ffff7b5f7a2 "removed <= len",
file=file <at> entry=0x7ffff7b5f798 "hashtab.c", line=line <at> entry=137,
function=function <at> entry=0x7ffff7b5ff60 "vacuum_weak_hash_table")
at assert.c:92
#3 0x00007ffff7496c82 in __GI___assert_fail (
assertion=0x7ffff7b5f7a2 "removed <= len",
file=0x7ffff7b5f798 "hashtab.c", line=137,
function=0x7ffff7b5ff60 "vacuum_weak_hash_table") at assert.c:101
#4 0x00007ffff7ac3108 in ?? ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#5 0x00007ffff7ac31af in ?? ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#6 0x00007ffff7ac5b1c in scm_c_hook_run ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#7 0x00007ffff7207ff5 in GC_try_to_collect_inner ()
from /usr/lib/x86_64-linux-gnu/libgc.so.1
#8 0x00007ffff72082aa in GC_try_to_collect_general ()
from /usr/lib/x86_64-linux-gnu/libgc.so.1
#9 0x00007ffff720838d in GC_gcollect ()
from /usr/lib/x86_64-linux-gnu/libgc.so.1
#10 0x00007ffff7ab9109 in scm_gc ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#11 0x00007ffff7b3402b in ?? ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#12 0x00007ffff7aab107 in scm_call_1 ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#13 0x00007ffff7b34093 in ?? ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#14 0x00007ffff7aab21e in scm_call_3 ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#15 0x00007ffff7b34093 in ?? ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#16 0x00007ffff7aab283 in scm_call_4 ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#17 0x00007fffefb90d79 in
opencog::SchemeEval::do_eval(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&) (
this=0x7ffe74000980, expr=...)
at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:564
#18 0x00007fffefb90e2a in opencog::SchemeEval::c_wrap_eval(void*) (
p=0x7ffe74000980)
at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:493
#19 0x00007ffff7aa158a in ?? ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#20 0x00007ffff7b34093 in ?? ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#21 0x00007ffff7aab283 in scm_call_4 ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#22 0x00007ffff7aa1d21 in ?? ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#23 0x00007ffff7aa1e05 in scm_c_with_continuation_barrier ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#24 0x00007ffff72198e7 in GC_call_with_gc_active ()
from /usr/lib/x86_64-linux-gnu/libgc.so.1
#25 0x00007ffff7b21c01 in ?? ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#26 0x00007ffff7213952 in GC_call_with_stack_base ()
from /usr/lib/x86_64-linux-gnu/libgc.so.1
#27 0x00007ffff7b21fe8 in scm_with_guile ()
from /usr/lib/x86_64-linux-gnu/libguile-2.0.so.22
#28 0x00007fffefb90eae in
opencog::SchemeEval::eval_expr(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&) (
this=0x7ffe74000980, expr=...)
at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:465
#29 0x00007fffe386cc86 in opencog::GenericShell::eval_loop (
this=0x7ffe4c001380)
at /home/ubuntu/src/opencog/opencog/cogserver/shell/GenericShell.cc:446
#30 0x00007fffee768c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#31 0x00007ffff78396ba in start_thread (arg=0x7ffe7b7fe700)
at pthread_create.c:333
#32 0x00007ffff756f82d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Information forwarded
to
bug-guile <at> gnu.org
:
bug#25386
; Package
guile
.
(Mon, 09 Jan 2017 01:29:02 GMT)
Full text and
rfc822 format available.
Message #17 received at 25386 <at> debbugs.gnu.org (full text, mbox):
Sorry, please ignore the previous stack trace. I updated to version
2.0.13 (manually compiled) and it crashed with a zillion messages:
guile: warning: weak hash table corruption (https://bugs.gnu.org/19180)
so that's a done deal, then.
Information forwarded
to
bug-guile <at> gnu.org
:
bug#25386
; Package
guile
.
(Wed, 11 Jan 2017 06:15:02 GMT)
Full text and
rfc822 format available.
Message #20 received at 25386 <at> debbugs.gnu.org (full text, mbox):
This can be closed as 'fixed'; I tested on today's git
guile (GNU Guile) 2.1.5.19-7e9395
and the worst of it seems to be over. mem usage growth on the
original test case:
(heap-size . 7921664) (gc-times . 40)
(heap-size . 14344192) (gc-times . 953)
(heap-size . 14344192) (gc-times . 5219) ; after 4 minutes CPU
(heap-size . 26419200) (gc-times . 64975) ; after 77 minutes CPU
(heap-size . 26419200) (gc-times . 133346) ; after 154 mins CPU
(heap-size . 26419200) (gc-times . 170083) ; after 192 mins CPU
(heap-size . 26419200) (gc-times . 249102) ; after 283 mins cpu
(heap-size . 26419200) (gc-times . 420031) ; after 468 min cpu
(heap-size . 26419200) (gc-times . 557039) ; after 804 mins CPU
i.e. 26MBytes - larger than it needs to be, but acceptable.
The last entry was, in full,
((gc-time-taken . 355210357) (heap-size . 26419200) (heap-free-size .
20336640) (heap-total-allocated . 2522568563312)
(heap-allocated-since-gc . 57648) (protected-objects . 0) (gc-times .
557619))
i.e. of the 26MB, only 6MB is in use, the rest is free. The 6MB is
close to what it starts with. 2522 GB were chewed through in the
process, so this is OK, I guess.
A variant test case, create 510 threads before calling join: (change
10 to 510 in above test)
(heap-size . 10604544) (gc-times . 32)
(heap-size . 19505152) (gc-times . 484)
(heap-size . 35926016) (gc-times . 1761)
(heap-size . 48238592) (gc-times . 4217) ; after 8 minutes cpu time
(heap-size . 48238592) (gc-times . 47902) ; after 76 mins CPU
(heap-size . 48238592) (gc-times . 73063) ; after 114 mins CPU
(heap-size . 65540096) (gc-times . 128094) ; after 209 mins cpu
(heap-size . 65540096) (gc-times . 248321) ; after 399 mins
(heap-size . 65540096) (gc-times . 344197) ; after 546 min
i.e. 65MBytes .. acceptable, I guess.
The last one was:
((gc-time-taken . 218714374) (heap-size . 65540096) (heap-free-size .
54419456) (heap-total-allocated . 2057186203744)
(heap-allocated-since-gc . 4553872) (protected-objects . 0) (gc-times
. 344799))
so of the 65MB, only 11MB is in-use.
My production server is doing this:
(heap-size . 652918784) (gc-times . 233) ; about 8 mins CPU
(heap-size . 737722368) (gc-times . 339) ; 12 mins CPU
(heap-size . 1332973568) (gc-times . 1797) ; 120 mins CPU
(heap-size . 1441443840) (gc-times . 2221) ; 151 min CPU
(heap-size . 1521213440) (gc-times . 2441) ; 168 min cpu
(heap-size . 1595101184) (gc-times . 3061) ; 218 min cpu
(heap-size . 1726119936) (gc-times . 3292) ; 237 min
(heap-size . 1960865792) (gc-times . 6698) ; 510 minn
(heap-size . 1960865792) (gc-times . 10383) ; 805 min
(heap-size . 2931556352) (gc-times . 14211) ; 1199 min
about 3GB --
Last one is, in full:
(gc-stats)
((gc-time-taken . 19818394581722) (heap-size . 2931556352)
(heap-free-size . 1767579648) (heap-total-allocated . 731393991040)
(heap-allocated-since-gc . 4063680) (protected-objects . 318)
(gc-times . 14211))
so of the 3GB, 1.8GB is free, and 1.2GB in use which is surprisingly
high for my app, but I can live with that.
Thanks!
--linas
Added tag(s) fixed.
Request was from
ludo <at> gnu.org (Ludovic Courtès)
to
control <at> debbugs.gnu.org
.
(Sun, 12 Feb 2017 21:20:01 GMT)
Full text and
rfc822 format available.
Reply sent
to
Andy Wingo <wingo <at> pobox.com>
:
You have taken responsibility.
(Wed, 01 Mar 2017 15:33:01 GMT)
Full text and
rfc822 format available.
Notification sent
to
linasvepstas <at> gmail.com
:
bug acknowledged by developer.
(Wed, 01 Mar 2017 15:33:02 GMT)
Full text and
rfc822 format available.
Message #27 received at 25386-done <at> debbugs.gnu.org (full text, mbox):
On Wed 11 Jan 2017 07:14, Linas Vepstas <linasvepstas <at> gmail.com> writes:
> This can be closed as 'fixed'; I tested on today's git
>
> guile (GNU Guile) 2.1.5.19-7e9395
>
> and the worst of it seems to be over. mem usage growth on the
> original test case:
OK, closing :)
Andy
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Thu, 30 Mar 2017 11:24:07 GMT)
Full text and
rfc822 format available.
This bug report was last modified 8 years and 79 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.