GNU bug report logs - #27234
Hang in GC, inf loop while walking frame pointers

Previous Next

Package: guile;

Reported by: linasvepstas <at> gmail.com

Date: Sun, 4 Jun 2017 14:59:01 UTC

Severity: normal

To reply to this bug, email your comments to 27234 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guile <at> gnu.org:
bug#27234; Package guile. (Sun, 04 Jun 2017 14:59:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to linasvepstas <at> gmail.com:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Sun, 04 Jun 2017 14:59:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Linas Vepstas <linasvepstas <at> gmail.com>
To: bug-guile <at> gnu.org
Subject: Hang in GC, inf loop while walking frame pointers
Date: Sun, 4 Jun 2017 09:57:17 -0500
[Message part 1 (text/plain, inline)]
what: guile-2.2-stable, from git.

I've got a large, complex, heavily multi-threaded guile program that
hangs during garbage collection; usually after running for half a day.
It hangs in a tight loop in scm_i_vm_mark_stack, spinning at 100% of CPU.

This is due to the for-loop line fp = SCM_FRAME_DYNAMIC_LINK (fp))
at libguile/vm.c line 975 failing to advance the frame pointer.
There's no "obvious" corruption in the stack; it simply looks like
the frame was incompletely set up, and so incrementing to the next
fp does not go anywhere.

I have recompiled guile with VM_ENABLE_ASSERTIONS and am trying to
reproduce the bug now.  The rest of this email is a record of a long
debug session isolating the problem, and showing that, overall, the
thread and stack data look more-or-less correct and uncorrupoted,
except for the inability to walk forward in the frame.

-- linas

(gdb) bt
#0  scm_i_vm_mark_stack (vp=0x755c1bd0, mark_stack_ptr=0x7f3e9b783f40,
    mark_stack_limit=0x7f3e9b793eb0) at ../../libguile/vm.c:1011
#1  0x00007f3e9db8835e in GC_mark_from (mark_stack_top=0x7f3e9b783ee0,
    mark_stack_top <at> entry=0x7f3e9b783f00,
    mark_stack=mark_stack <at> entry=0x7f3e9b783eb0,
    mark_stack_limit=mark_stack_limit <at> entry=0x7f3e9b793eb0) at ../mark.c:772
#2  0x00007f3e9db8897e in GC_do_local_mark (local_mark_stack=0x7f3e9b783eb0,
    local_top=0x7f3e9b783f00) at ../mark.c:1037
#3  0x00007f3e9db88b98 in GC_mark_local (
    local_mark_stack=local_mark_stack <at> entry=0x7f3e9b783eb0, id=id <at> entry=4)
    at ../mark.c:1170
#4  0x00007f3e9db88eaa in GC_help_marker (my_mark_no=my_mark_no <at> entry=80003)
    at ../mark.c:1238
#5  0x00007f3e9db92e3c in GC_mark_thread (id=<optimized out>)
    at ../pthread_support.c:380
#6  0x00007f3e9e2ae6ba in start_thread (arg=0x7f3e9b794700)
    at pthread_create.c:333
#7  0x00007f3e9dfdd82d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

(gdb) step

#0  find_slot_map (cache=0x7f3e9b783ab0, ip=0x1) at
../../libguile/vm.c:935
935       if (cache->entries[slot].ip == ip)
(gdb) print slot
$1 = 0
(gdb) print cache->entries[slot].ip
$2 = (scm_t_uint32 *) 0x1
(gdb) print cache->entries[slot].map
$4 = (const scm_t_uint8 *) 0x0
(gdb) step
scm_i_vm_mark_stack (vp=0x755c1bd0, mark_stack_ptr=0x7f3e9b783f40,
    mark_stack_limit=0x7f3e9b793eb0) at ../../libguile/vm.c:1011
1011        slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp),
&cache);
(gdb) print fp
$5 = (union scm_vm_stack_element *) 0x7f3e94cdee38

#define SCM_FRAME_RETURN_ADDRESS(fp)    ((fp)[0].as_ip)
(gdb) print (fp)[0].as_ip
$6 = (scm_t_uint32 *) 0x1

OK that looks weird ... is this corrupted ?? but whatever,
because the returned slot_map is never used ... because ...

(gdb) step
scm_i_vm_mark_stack (vp=0x755c1bd0, mark_stack_ptr=0x7f3e9b783f40,
    mark_stack_limit=0x7f3e9b793eb0) at ../../libguile/vm.c:975
975            fp = SCM_FRAME_DYNAMIC_LINK (fp))
(gdb) print fp
$7 = (union scm_vm_stack_element *) 0x7f3e94cdee38

frames.h:#define SCM_FRAME_DYNAMIC_LINK(fp)      ((fp) + (fp)[1].as_uint)
(gdb) print (fp)[1].as_uint
$8 = 0

OK, that seems bad, because now fp never advances, it just repeats
over and over with this same value.

(gdb)
979           for (slot = nlocals - 1; sp < fp; sp++, slot--)
(gdb) print nlocals
$10 = -2
(gdb) print sp
$11 = (union scm_vm_stack_element *) 0x7f3e94cdee48
#define SCM_FRAME_NUM_LOCALS(fp, sp)    ((fp) - (sp))
(gdb) print fp
$12 = (union scm_vm_stack_element *) 0x7f3e94cdee38
(gdb) print  ((fp) - (sp))
$13 = -2
Ohh .. its not -16 because its -2 * sizeof (union scm_vm_stack_element *)
so that's OK.
So for loops is skipped, it should go to.
      sp = SCM_FRAME_PREVIOUS_SP (fp);
frames.h:#define SCM_FRAME_PREVIOUS_SP(fp)    ((fp) + 2)

and so now it loops around and repeats.
(gdb) print cache
$19 = {entries = {{ip = 0x1, map = 0x0}, {ip = 0x0,
      map = 0x0} <repeats 23 times>, {ip = 0x1a1b5e0, map = 0x0}, {ip =
0x0,
      map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0,
      map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0,
      map = 0x0}}}
(gdb) print &cache
$20 = (struct slot_map_cache *) 0x7f3e9b783ab0

and so the loop repeates forever, because
fp = SCM_FRAME_DYNAMIC_LINK (fp) never advances fp, because
(fp)[1].as_uint  is zero.

So where is fp pointing to?  recall fp == 0x7f3e94cdee38

(gdb) x/20x 0x7f3e94cdee00
0x7f3e94cdee00: 0x0000000000000904      0x000000000000200c
0x7f3e94cdee10: 0x0000000001a1b7e0      0x0000000001a1b5e0
0x7f3e94cdee20: 0x0000000000000004      0x000000000000200c
0x7f3e94cdee30: 0x0000000000169bd6      0x0000000000000001
0x7f3e94cdee40: 0x0000000000000000      0x000000000192acd0
0x7f3e94cdee50: 0x0000000001fa4bc0      0x0000000001a1c6d0
0x7f3e94cdee60: 0x0000000044507950      0x0000000000000002
0x7f3e94cdee70: 0x000000000005a6f5      0x00000000018febd0
0x7f3e94cdee80: 0x000000006d490d00      0x00007f3e9e97241c
0x7f3e94cdee90: 0x0000000000000002      0x00007f3e9e97241c


(gdb) x/s 0x7f3e94cdee60
0x7f3e94cdee60: "PyPD"      <<< ?? is this a meaningul string?
(gdb) x/s 0x7f3e94cdee81
0x7f3e94cdee81: "\rIm"      << how about this ???

So this does not look healthy.  How did we get here?

(gdb) print vp
$21 = (struct scm_vm *) 0x755c1bd0
(gdb) print mark_stack_ptr
$22 = (struct GC_ms_entry *) 0x7f3e9b783f40
(gdb) print mark_stack_limit
$23 = (struct GC_ms_entry *) 0x7f3e9b793eb0
(gdb) print upper
$24 = (void *) 0x7ffdd986
(gdb) print GC_greatest_plausible_heap_addr
$25 = (void *) 0x7ffdd986
(gdb) print lower
$26 = (void *) 0x17a1ff8

(gdb) print vp->fp
$27 = (union scm_vm_stack_element *) 0x7f3e94cdee18
(gdb) print vp->sp
$28 = (union scm_vm_stack_element *) 0x7f3e94cdee00
(gdb) print  vp->stack_top
$29 = (union scm_vm_stack_element *) 0x7f3e94cdf000

So lets walk manually.
frames.h:#define SCM_FRAME_DYNAMIC_LINK(fp)      ((fp) + (fp)[1].as_uint)

(gdb) print (vp->fp) + (vp->fp)[1].as_uint
$31 = (union scm_vm_stack_element *) 0x7f3e94cdee38
So we are instanly in trouble! once through the loop, and we've got the
broken fp

So how did we get here?

(gdb) print vp
$32 = (struct scm_vm *) 0x755c1bd0
(gdb) print *vp
$33 = {ip = 0x7f3e9e5a6378 <subr_stub_code+56>, sp = 0x7f3e94cdee00,
  fp = 0x7f3e94cdee18, stack_limit = 0x7f3e94cde000, trace_level = 0,
  sp_min_since_gc = 0x7f3e94cdee00, stack_size = 512,
  stack_bottom = 0x7f3e94cde000, stack_top = 0x7f3e94cdf000,
  overflow_handler_stack = 0x304, hooks = {0x4, 0x4, 0x4, 0x4, 0x4},
  resumable_prompt_cookie = 0x7f3d397f9440, engine = 1}

Looks plausible, not corrupted. Why is engine=1 ??
#define SCM_VM_REGULAR_ENGINE 0
#define SCM_VM_DEBUG_ENGINE 1
#define SCM_VM_NUM_ENGINES 2
#  define SCM_I_CURRENT_THREAD (scm_i_current_thread)

(gdb) print scm_i_current_thread
$34 = (scm_i_thread *) 0x0

OK, that's not excellent.
(gdb) print thread_count
$35 = 161

(gdb) bt
#0  scm_i_vm_mark_stack (vp=0x755c1bd0, mark_stack_ptr=0x7f3e9b783f40,
    mark_stack_limit=0x7f3e9b793eb0) at ../../libguile/vm.c:1011
#1  0x00007f3e9db8835e in GC_mark_from (mark_stack_top=0x7f3e9b783ee0,
    mark_stack_top <at> entry=0x7f3e9b783f00,
    mark_stack=mark_stack <at> entry=0x7f3e9b783eb0,
    mark_stack_limit=mark_stack_limit <at> entry=0x7f3e9b793eb0) at ../mark.c:772
#2  0x00007f3e9db8897e in GC_do_local_mark
(local_mark_stack=0x7f3e9b783eb0,
    local_top=0x7f3e9b783f00) at ../mark.c:1037
#3  0x00007f3e9db88b98 in GC_mark_local (
    local_mark_stack=local_mark_stack <at> entry=0x7f3e9b783eb0, id=id <at> entry=4)
    at ../mark.c:1170
#4  0x00007f3e9db88eaa in GC_help_marker (my_mark_no=my_mark_no <at> entry=80003)
    at ../mark.c:1238
#5  0x00007f3e9db92e3c in GC_mark_thread (id=<optimized out>)
    at ../pthread_support.c:380
#6  0x00007f3e9e2ae6ba in start_thread (arg=0x7f3e9b794700)
    at pthread_create.c:333
#7  0x00007f3e9dfdd82d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

so bdwgc is version gc7_6_0

So ../mark.c:772 is GET_HDR and GET_HDR is in
include/private/gc_hdrs.h

define GET_HDR(p, hhdr)  ... ((hhdr) = HDR_INNER(p)
# define HDR_INNER(p) HDR_FROM_BI(BI(p),p)
#define HDR_FROM_BI(bi, p) \
                ((bi)->index[((word)(p) >> LOG_HBLKSIZE) & (BOTTOM_SZ - 1)])

This seems all wrong.. wtf ....

is it called by thread_mark ??
in libguile/thread.c:
  thread_gc_kind =
    GC_new_kind (GC_new_free_list (),
       GC_MAKE_PROC (GC_new_proc (thread_mark), 0),
       0, 1)
in bdwgc/misc.c  GC_new_proc
    GC_mark_procs[result] = proc
#define GC_mark_procs GC_arrays._mark_procs
#define PROC(descr) \
      (GC_mark_procs[((descr) >> GC_DS_TAG_BITS) & (GC_MAX_MARK_PROCS-1)])

      GC_obj_kinds[result].ok_descriptor = descr;  <<< this is where
MAKE_PROC goes.

anyway ... in mark.c line 733
        case GC_DS_PROC:
          mark_stack_top--;
          credit -= GC_PROC_BYTES;
          mark_stack_top = (*PROC(descr))((word *)current_p, mark_stack_top,
                                          mark_stack_limit, ENV(descr));
          continue;
But that's the wrong line number.  line 772 is in
        case GC_DS_PER_OBJECT:
          if ((signed_word)descr >= 0) {
            /* Descriptor is in the object.     */
            descr = *(word *)(current_p + descr - GC_DS_PER_OBJECT);

Oh, it goes to retry, which loops around and calls PROC which is thread_mark
which is called with args
          mark_stack_top = (*PROC(descr))((word *)current_p, mark_stack_top,
                                          mark_stack_limit, ENV(descr));

(gdb) print current_p
$39 = (ptr_t) 0x7225d380 "@\325%r"

so back in libguile/threads.c line 84:
  const struct scm_i_thread *t = (struct scm_i_thread *) addr;

(gdb) print *(struct scm_i_thread *) current_p
$40 = {next_thread = 0x7225d540, handle = 0x5763f3b0, pthread =
139900934399744,
  result = 0x4, exited = 0, guile_mode = 1, needs_unregister = 1, wake =
0x0,
  sleep_cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
      __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0,
      __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align =
0},
  sleep_pipe = {151, 197}, freelists = 0x71cfe800,
  pointerless_freelists = 0x71cfe780, dynamic_state = 0x71de7900, dynstack
= {
    base = 0x698c3800, top = 0x698c3940, limit = 0x698c3a00},
  pending_asyncs = 0x304, block_asyncs = 0, continuation_root = 0x5763f340,
  continuation_base = 0x7f3d397f9c90, vp = 0x755c1bd0, base =
0x7f3d397f9da0,
  regs = {{__jmpbuf = {139905204221696, 139906923931021, 1,
139894281647232,
        139906913276512, 0, 139906918327551, 8388608},
      __mask_was_saved = 964664752, __saved_mask = {__val = {0,
139900934399744,
          139900934399744, 139900934397344, 139906923929400,
139893936840720,
          1804170656, 139906913276512, 139900934399744, 139900934397344,
          139900934397504, 139893936840720, 139906910859134,
139906913276484,
          139906910857112, 27401856}}}}}

(gdb) print ((struct scm_i_thread *) current_p)->vp
$41 = (struct scm_vm *) 0x755c1bd0

which is exactly the vp we had before, so this is the correct thread.

So  ... all the values in *(struct scm_i_thread *) current_p look more or
less OK;
its not corrupted. And all the values in *vp look more or less OK, its not
corrrupted
except that engine=1 looks wrong.  So what's up with the weird stack
looping?
Again:
From before, we had:

(gdb) print *vp
$42 = {ip = 0x7f3e9e5a6378 <subr_stub_code+56>, sp = 0x7f3e94cdee00,
  fp = 0x7f3e94cdee18, stack_limit = 0x7f3e94cde000, trace_level = 0,
  sp_min_since_gc = 0x7f3e94cdee00, stack_size = 512,
  stack_bottom = 0x7f3e94cde000, stack_top = 0x7f3e94cdf000,
  overflow_handler_stack = 0x304, hooks = {0x4, 0x4, 0x4, 0x4, 0x4},
  resumable_prompt_cookie = 0x7f3d397f9440, engine = 1}

so fp, sp, stack_bottom, stack_top all are self-consistent with each other.
The problem is that
    fp = SCM_FRAME_DYNAMIC_LINK (fp)
goes nowhere, after the first iteration.

Is the frame corrupted? How? what does the rest of the stack look like?

(gdb) x/32x 0x7f3e94cdee00
0x7f3e94cdee00: 0x0000000000000904      0x000000000000200c
0x7f3e94cdee10: 0x0000000001a1b7e0      0x0000000001a1b5e0
0x7f3e94cdee20: 0x0000000000000004      0x000000000000200c  <<< theres the 4
0x7f3e94cdee30: 0x0000000000169bd6      0x0000000000000001 <<< there's the
new fp
0x7f3e94cdee40: 0x0000000000000000      0x000000000192acd0 <<< there's the
zero.
0x7f3e94cdee50: 0x0000000001fa4bc0      0x0000000001a1c6d0
0x7f3e94cdee60: 0x0000000044507950      0x0000000000000002
0x7f3e94cdee70: 0x000000000005a6f5      0x00000000018febd0
0x7f3e94cdee80: 0x000000006d490d00      0x00007f3e9e97241c
0x7f3e94cdee90: 0x0000000000000002      0x00007f3e9e97241c
0x7f3e94cdeea0: 0x0000000000000002      0x00007f3e8d797630
0x7f3e94cdeeb0: 0x0000000000000002      0x00007f3e86effacc
0x7f3e94cdeec0: 0x0000000000000003      0x000000006d490d00
0x7f3e94cdeed0: 0x00007f3e86eb8698      0x0000000000000002
0x7f3e94cdeee0: 0x00007f3e8e51f450      0x0000000000000008
0x7f3e94cdeef0: 0x0000000001a37c10      0x000000000192acd0


So... who sets up thread->vp ?
/* #define VM_ENABLE_ASSERTIONS */
However, defining this breaks the build.

diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
index 6c88ebf..f0f8164 100644
--- a/libguile/vm-engine.c
+++ b/libguile/vm-engine.c
@@ -2050,11 +2050,11 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,

           if (!scm_module_system_booted_p)
             {
-              ASSERT (scm_is_true
+              ASSERT (scm_is_true(
                       scm_equal_p (modname,
                                    scm_list_2
                                    (SCM_BOOL_T,
-                                    scm_from_utf8_symbol ("guile"))));
+                                    scm_from_utf8_symbol ("guile")))));
               var = scm_lookup (sym);
             }
           else if (scm_is_true (SCM_CAR (modname)))
diff --git a/libguile/vm.c b/libguile/vm.c
index 18f2192..c26066e 100644
--- a/libguile/vm.c
+++ b/libguile/vm.c
@@ -63,7 +63,7 @@ static size_t page_size;

 /* The VM has a number of internal assertions that shouldn't normally be
    necessary, but might be if you think you found a bug in the VM. */
-/* #define VM_ENABLE_ASSERTIONS */
+#define VM_ENABLE_ASSERTIONS

 static void vm_expand_stack (struct scm_vm *vp,
                              union scm_vm_stack_element *new_sp)
SCM_NOINLINE;


 GEN      guile-procedures.texi
/bin/bash: line 1: 24078 Broken pipe

 ./meta/guile
Aborted
./libguile/guile
Aborted
gdb libguile/.libs/lt-guile

#0  0x00007ffff74f7428 in __GI_raise (sig=sig <at> entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff74f902a in __GI_abort () at abort.c:89
#2  0x00007ffff7b707c4 in vm_regular_engine (thread=0x7282, vp=0x806f30,
    registers=0x6, resume=-145787864) at ../../libguile/vm-engine.c:2053
#3  0x00007ffff7b711d2 in scm_call_n (proc=0x832b30, argv=argv <at> entry=0x0,
    nargs=nargs <at> entry=0) at ../../libguile/vm.c:1257
#4  0x00007ffff7af14e9 in scm_call_0 (proc=<optimized out>)
    at ../../libguile/eval.c:481
#5  0x00007ffff7b10568 in scm_primitive_load_path (args=<optimized out>)
    at ../../libguile/load.c:1248
#6  0x00007ffff7b6e976 in vm_regular_engine (thread=0x7282, vp=0x806f30,
    registers=0x6, resume=-145787864) at ../../libguile/vm-engine.c:784

Oh. wtf.

(define last-gc (gc-stats))
(define start-time (get-internal-real-time))
(define run-time (get-internal-run-time))
(define (avg-time-taken)
    (define now (get-internal-real-time))
    (define run (get-internal-run-time))
    (define cur (gc-stats))
    (define gc-time-taken (* 1.0e-9 (- (cdar cur) (cdar last-gc))))
    (define elapsed-time (* 1.0e-9 (- now start-time)))
    (define cpu-time (* 1.0e-9 (- run run-time)))
    (format #t "Elapsed time: ~5f secs. GC-time-taken: ~5f s or ~5f%
cpu-usage: ~5f%\n"
        elapsed-time gc-time-taken  (* 100 (/ gc-time-taken elapsed-time))
        (* 100 (/ cpu-time elapsed-time))
    )
    (set! last-gc cur)
    (set! start-time now)
    (set! run-time run)
)


linas <at> fanny: ~/src/novamente/src/opencog-git/opencog/nlp/learn $ vi x
linas <at> fanny: ~/src/novamente/src/opencog-git/opencog/nlp/learn $ cat x

#0  find_slot_map (cache=0x7f3e9b783ab0, ip=0x1) at
../../libguile/vm.c:935
935       if (cache->entries[slot].ip == ip)
(gdb) print slot
$1 = 0
(gdb) print cache->entries[slot].ip
$2 = (scm_t_uint32 *) 0x1
(gdb) print cache->entries[slot].map
$4 = (const scm_t_uint8 *) 0x0
(gdb) step
scm_i_vm_mark_stack (vp=0x755c1bd0, mark_stack_ptr=0x7f3e9b783f40,
    mark_stack_limit=0x7f3e9b793eb0) at ../../libguile/vm.c:1011
1011        slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp),
&cache);
(gdb) print fp
$5 = (union scm_vm_stack_element *) 0x7f3e94cdee38

#define SCM_FRAME_RETURN_ADDRESS(fp)    ((fp)[0].as_ip)
(gdb) print (fp)[0].as_ip
$6 = (scm_t_uint32 *) 0x1

OK that looks weird ... is this corrupted ?? but whatever,
because the returned slot_map is never used ... because ...

(gdb) step
scm_i_vm_mark_stack (vp=0x755c1bd0, mark_stack_ptr=0x7f3e9b783f40,
    mark_stack_limit=0x7f3e9b793eb0) at ../../libguile/vm.c:975
975            fp = SCM_FRAME_DYNAMIC_LINK (fp))
(gdb) print fp
$7 = (union scm_vm_stack_element *) 0x7f3e94cdee38

frames.h:#define SCM_FRAME_DYNAMIC_LINK(fp)      ((fp) + (fp)[1].as_uint)
(gdb) print (fp)[1].as_uint
$8 = 0

OK, that seems bad, because now fp never advances, it just repeats
over and over with this same value.

(gdb)
979           for (slot = nlocals - 1; sp < fp; sp++, slot--)
(gdb) print nlocals
$10 = -2
(gdb) print sp
$11 = (union scm_vm_stack_element *) 0x7f3e94cdee48
#define SCM_FRAME_NUM_LOCALS(fp, sp)    ((fp) - (sp))
(gdb) print fp
$12 = (union scm_vm_stack_element *) 0x7f3e94cdee38
(gdb) print  ((fp) - (sp))
$13 = -2
Ohh .. its not -16 because its -2 * sizeof (union scm_vm_stack_element *)
so that's OK.
So for loops is skipped, it should go to.
      sp = SCM_FRAME_PREVIOUS_SP (fp);
frames.h:#define SCM_FRAME_PREVIOUS_SP(fp)    ((fp) + 2)

and so now it loops around and repeats.
(gdb) print cache
$19 = {entries = {{ip = 0x1, map = 0x0}, {ip = 0x0,
      map = 0x0} <repeats 23 times>, {ip = 0x1a1b5e0, map = 0x0}, {ip =
0x0,
      map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0,
      map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0, map = 0x0}, {ip = 0x0,
      map = 0x0}}}
(gdb) print &cache
$20 = (struct slot_map_cache *) 0x7f3e9b783ab0

and so the loop repeates forever, because
fp = SCM_FRAME_DYNAMIC_LINK (fp) never advances fp, because
(fp)[1].as_uint  is zero.

So where is fp pointing to?  recall fp == 0x7f3e94cdee38

(gdb) x/20x 0x7f3e94cdee00
0x7f3e94cdee00: 0x0000000000000904      0x000000000000200c
0x7f3e94cdee10: 0x0000000001a1b7e0      0x0000000001a1b5e0
0x7f3e94cdee20: 0x0000000000000004      0x000000000000200c
0x7f3e94cdee30: 0x0000000000169bd6      0x0000000000000001
0x7f3e94cdee40: 0x0000000000000000      0x000000000192acd0
0x7f3e94cdee50: 0x0000000001fa4bc0      0x0000000001a1c6d0
0x7f3e94cdee60: 0x0000000044507950      0x0000000000000002
0x7f3e94cdee70: 0x000000000005a6f5      0x00000000018febd0
0x7f3e94cdee80: 0x000000006d490d00      0x00007f3e9e97241c
0x7f3e94cdee90: 0x0000000000000002      0x00007f3e9e97241c


(gdb) x/s 0x7f3e94cdee60
0x7f3e94cdee60: "PyPD"      <<< ?? is this a meaningul string?
(gdb) x/s 0x7f3e94cdee81
0x7f3e94cdee81: "\rIm"      << how about this ???

So this does not look healthy.  How did we get here?

(gdb) print vp
$21 = (struct scm_vm *) 0x755c1bd0
(gdb) print mark_stack_ptr
$22 = (struct GC_ms_entry *) 0x7f3e9b783f40
(gdb) print mark_stack_limit
$23 = (struct GC_ms_entry *) 0x7f3e9b793eb0
(gdb) print upper
$24 = (void *) 0x7ffdd986
(gdb) print GC_greatest_plausible_heap_addr
$25 = (void *) 0x7ffdd986
(gdb) print lower
$26 = (void *) 0x17a1ff8

(gdb) print vp->fp
$27 = (union scm_vm_stack_element *) 0x7f3e94cdee18
(gdb) print vp->sp
$28 = (union scm_vm_stack_element *) 0x7f3e94cdee00
(gdb) print  vp->stack_top
$29 = (union scm_vm_stack_element *) 0x7f3e94cdf000

So lets walk manually.
frames.h:#define SCM_FRAME_DYNAMIC_LINK(fp)      ((fp) + (fp)[1].as_uint)

(gdb) print (vp->fp) + (vp->fp)[1].as_uint
$31 = (union scm_vm_stack_element *) 0x7f3e94cdee38
So we are instanly in trouble! once through the loop, and we've got the
broken fp

So how did we get here?

(gdb) print vp
$32 = (struct scm_vm *) 0x755c1bd0
(gdb) print *vp
$33 = {ip = 0x7f3e9e5a6378 <subr_stub_code+56>, sp = 0x7f3e94cdee00,
  fp = 0x7f3e94cdee18, stack_limit = 0x7f3e94cde000, trace_level = 0,
  sp_min_since_gc = 0x7f3e94cdee00, stack_size = 512,
  stack_bottom = 0x7f3e94cde000, stack_top = 0x7f3e94cdf000,
  overflow_handler_stack = 0x304, hooks = {0x4, 0x4, 0x4, 0x4, 0x4},
  resumable_prompt_cookie = 0x7f3d397f9440, engine = 1}

Looks plausible, not corrupted. Why is engine=1 ??
#define SCM_VM_REGULAR_ENGINE 0
#define SCM_VM_DEBUG_ENGINE 1
#define SCM_VM_NUM_ENGINES 2
#  define SCM_I_CURRENT_THREAD (scm_i_current_thread)

(gdb) print scm_i_current_thread
$34 = (scm_i_thread *) 0x0

OK, that's not excellent.
(gdb) print thread_count
$35 = 161

(gdb) bt
#0  scm_i_vm_mark_stack (vp=0x755c1bd0, mark_stack_ptr=0x7f3e9b783f40,
    mark_stack_limit=0x7f3e9b793eb0) at ../../libguile/vm.c:1011
#1  0x00007f3e9db8835e in GC_mark_from (mark_stack_top=0x7f3e9b783ee0,
    mark_stack_top <at> entry=0x7f3e9b783f00,
    mark_stack=mark_stack <at> entry=0x7f3e9b783eb0,
    mark_stack_limit=mark_stack_limit <at> entry=0x7f3e9b793eb0) at ../mark.c:772
#2  0x00007f3e9db8897e in GC_do_local_mark
(local_mark_stack=0x7f3e9b783eb0,
    local_top=0x7f3e9b783f00) at ../mark.c:1037
#3  0x00007f3e9db88b98 in GC_mark_local (
    local_mark_stack=local_mark_stack <at> entry=0x7f3e9b783eb0, id=id <at> entry=4)
    at ../mark.c:1170
#4  0x00007f3e9db88eaa in GC_help_marker (my_mark_no=my_mark_no <at> entry=80003)
    at ../mark.c:1238
#5  0x00007f3e9db92e3c in GC_mark_thread (id=<optimized out>)
    at ../pthread_support.c:380
#6  0x00007f3e9e2ae6ba in start_thread (arg=0x7f3e9b794700)
    at pthread_create.c:333
#7  0x00007f3e9dfdd82d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

so bdwgc is version gc7_6_0

So ../mark.c:772 is GET_HDR and GET_HDR is in
include/private/gc_hdrs.h

define GET_HDR(p, hhdr)  ... ((hhdr) = HDR_INNER(p)
# define HDR_INNER(p) HDR_FROM_BI(BI(p),p)
#define HDR_FROM_BI(bi, p) \
                ((bi)->index[((word)(p) >> LOG_HBLKSIZE) & (BOTTOM_SZ - 1)])

This seems all wrong.. wtf ....

is it called by thread_mark ??
in libguile/thread.c:
  thread_gc_kind =
    GC_new_kind (GC_new_free_list (),
       GC_MAKE_PROC (GC_new_proc (thread_mark), 0),
       0, 1)
in bdwgc/misc.c  GC_new_proc
    GC_mark_procs[result] = proc
#define GC_mark_procs GC_arrays._mark_procs
#define PROC(descr) \
      (GC_mark_procs[((descr) >> GC_DS_TAG_BITS) & (GC_MAX_MARK_PROCS-1)])

      GC_obj_kinds[result].ok_descriptor = descr;  <<< this is where
MAKE_PROC goes.

anyway ... in mark.c line 733
        case GC_DS_PROC:
          mark_stack_top--;
          credit -= GC_PROC_BYTES;
          mark_stack_top = (*PROC(descr))((word *)current_p, mark_stack_top,
                                          mark_stack_limit, ENV(descr));
          continue;
But that's the wrong line number.  line 772 is in
        case GC_DS_PER_OBJECT:
          if ((signed_word)descr >= 0) {
            /* Descriptor is in the object.     */
            descr = *(word *)(current_p + descr - GC_DS_PER_OBJECT);

Oh, it goes to retry, which loops around and calls PROC which is thread_mark
which is called with args
          mark_stack_top = (*PROC(descr))((word *)current_p, mark_stack_top,
                                          mark_stack_limit, ENV(descr));

(gdb) print current_p
$39 = (ptr_t) 0x7225d380 "@\325%r"

so back in libguile/threads.c line 84:
  const struct scm_i_thread *t = (struct scm_i_thread *) addr;

(gdb) print *(struct scm_i_thread *) current_p
$40 = {next_thread = 0x7225d540, handle = 0x5763f3b0, pthread =
139900934399744,
  result = 0x4, exited = 0, guile_mode = 1, needs_unregister = 1, wake =
0x0,
  sleep_cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
      __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0,
      __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align =
0},
  sleep_pipe = {151, 197}, freelists = 0x71cfe800,
  pointerless_freelists = 0x71cfe780, dynamic_state = 0x71de7900, dynstack
= {
    base = 0x698c3800, top = 0x698c3940, limit = 0x698c3a00},
  pending_asyncs = 0x304, block_asyncs = 0, continuation_root = 0x5763f340,
  continuation_base = 0x7f3d397f9c90, vp = 0x755c1bd0, base =
0x7f3d397f9da0,
  regs = {{__jmpbuf = {139905204221696, 139906923931021, 1,
139894281647232,
        139906913276512, 0, 139906918327551, 8388608},
      __mask_was_saved = 964664752, __saved_mask = {__val = {0,
139900934399744,
          139900934399744, 139900934397344, 139906923929400,
139893936840720,
          1804170656, 139906913276512, 139900934399744, 139900934397344,
          139900934397504, 139893936840720, 139906910859134,
139906913276484,
          139906910857112, 27401856}}}}}

(gdb) print ((struct scm_i_thread *) current_p)->vp
$41 = (struct scm_vm *) 0x755c1bd0

which is exactly the vp we had before, so this is the correct thread.

So  ... all the values in *(struct scm_i_thread *) current_p look more or
less OK;
its not corrupted. And all the values in *vp look more or less OK, its not
corrrupted
except that engine=1 looks wrong.  So what's up with the weird stack
looping?
Again:
From before, we had:

(gdb) print *vp
$42 = {ip = 0x7f3e9e5a6378 <subr_stub_code+56>, sp = 0x7f3e94cdee00,
  fp = 0x7f3e94cdee18, stack_limit = 0x7f3e94cde000, trace_level = 0,
  sp_min_since_gc = 0x7f3e94cdee00, stack_size = 512,
  stack_bottom = 0x7f3e94cde000, stack_top = 0x7f3e94cdf000,
  overflow_handler_stack = 0x304, hooks = {0x4, 0x4, 0x4, 0x4, 0x4},
  resumable_prompt_cookie = 0x7f3d397f9440, engine = 1}

so fp, sp, stack_bottom, stack_top all are self-consistent with each other.
The problem is that
    fp = SCM_FRAME_DYNAMIC_LINK (fp)
goes nowhere, after the first iteration.

Is the frame corrupted? How? what does the rest of the stack look like?

(gdb) x/32x 0x7f3e94cdee00
0x7f3e94cdee00: 0x0000000000000904      0x000000000000200c
0x7f3e94cdee10: 0x0000000001a1b7e0      0x0000000001a1b5e0
0x7f3e94cdee20: 0x0000000000000004      0x000000000000200c  <<< theres the 4
0x7f3e94cdee30: 0x0000000000169bd6      0x0000000000000001 <<< there's the
new fp
0x7f3e94cdee40: 0x0000000000000000      0x000000000192acd0 <<< there's the
zero.
0x7f3e94cdee50: 0x0000000001fa4bc0      0x0000000001a1c6d0
0x7f3e94cdee60: 0x0000000044507950      0x0000000000000002
0x7f3e94cdee70: 0x000000000005a6f5      0x00000000018febd0
0x7f3e94cdee80: 0x000000006d490d00      0x00007f3e9e97241c
0x7f3e94cdee90: 0x0000000000000002      0x00007f3e9e97241c
0x7f3e94cdeea0: 0x0000000000000002      0x00007f3e8d797630
0x7f3e94cdeeb0: 0x0000000000000002      0x00007f3e86effacc
0x7f3e94cdeec0: 0x0000000000000003      0x000000006d490d00
0x7f3e94cdeed0: 0x00007f3e86eb8698      0x0000000000000002
0x7f3e94cdeee0: 0x00007f3e8e51f450      0x0000000000000008
0x7f3e94cdeef0: 0x0000000001a37c10      0x000000000192acd0


So... who sets up thread->vp ?
/* #define VM_ENABLE_ASSERTIONS */
However, defining this breaks the build.

diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
index 6c88ebf..f0f8164 100644
--- a/libguile/vm-engine.c
+++ b/libguile/vm-engine.c
@@ -2050,11 +2050,11 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp,

           if (!scm_module_system_booted_p)
             {
-              ASSERT (scm_is_true
+              ASSERT (scm_is_true(
                       scm_equal_p (modname,
                                    scm_list_2
                                    (SCM_BOOL_T,
-                                    scm_from_utf8_symbol ("guile"))));
+                                    scm_from_utf8_symbol ("guile")))));
               var = scm_lookup (sym);
             }
           else if (scm_is_true (SCM_CAR (modname)))
diff --git a/libguile/vm.c b/libguile/vm.c
index 18f2192..c26066e 100644
--- a/libguile/vm.c
+++ b/libguile/vm.c
@@ -63,7 +63,7 @@ static size_t page_size;

 /* The VM has a number of internal assertions that shouldn't normally be
    necessary, but might be if you think you found a bug in the VM. */
-/* #define VM_ENABLE_ASSERTIONS */
+#define VM_ENABLE_ASSERTIONS

 static void vm_expand_stack (struct scm_vm *vp,
                              union scm_vm_stack_element *new_sp)
SCM_NOINLINE;


 GEN      guile-procedures.texi
/bin/bash: line 1: 24078 Broken pipe

 ./meta/guile
Aborted
./libguile/guile
Aborted
gdb libguile/.libs/lt-guile

#0  0x00007ffff74f7428 in __GI_raise (sig=sig <at> entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff74f902a in __GI_abort () at abort.c:89
#2  0x00007ffff7b707c4 in vm_regular_engine (thread=0x7282, vp=0x806f30,
    registers=0x6, resume=-145787864) at ../../libguile/vm-engine.c:2053
#3  0x00007ffff7b711d2 in scm_call_n (proc=0x832b30, argv=argv <at> entry=0x0,
    nargs=nargs <at> entry=0) at ../../libguile/vm.c:1257
#4  0x00007ffff7af14e9 in scm_call_0 (proc=<optimized out>)
    at ../../libguile/eval.c:481
#5  0x00007ffff7b10568 in scm_primitive_load_path (args=<optimized out>)
    at ../../libguile/load.c:1248
#6  0x00007ffff7b6e976 in vm_regular_engine (thread=0x7282, vp=0x806f30,
    registers=0x6, resume=-145787864) at ../../libguile/vm-engine.c:784
[Message part 2 (text/html, inline)]

Information forwarded to bug-guile <at> gnu.org:
bug#27234; Package guile. (Sat, 09 Sep 2017 20:45:02 GMT) Full text and rfc822 format available.

Message #8 received at 27234 <at> debbugs.gnu.org (full text, mbox):

From: Linas Vepstas <linasvepstas <at> gmail.com>
To: 27234 <at> debbugs.gnu.org
Subject: Followup.
Date: Sat, 9 Sep 2017 15:44:11 -0500
[Message part 1 (text/plain, inline)]
The original bug report failed to explain how to trigger the bug.  I think
I know of a simple way to trigger the bug, but have not yet created the
simple test-case.  I believe that the bug can be triggered like so:

Create some medium-sized "random" (arbitrary) snippet of guile code. Send
it to the guile repl server.  Do it again and again, as fast as possible.
The bug will trigger.

By "arbitary/random" I mean some code that is simple but non-repeating:
say, concatenate some random strings, count the number of letters in them,
take the square root -- just do assorted, non-repeating computations that
force the REPL to evaluate some new blob of scheme code every time.

Now, once out of every 5 or 50 times, introduce a bug in the scheme code,
causing the REPL to throw an exception.  Using ut8 strings might also be a
required ingredient.

I believe that this will trigger the bug. I believe that the bug is some
unprotected section in the guile interpreter, and a race condition clobbers
the guile stack(s) and it all goes downhill from here.
[Message part 2 (text/html, inline)]

This bug report was last modified 7 years and 278 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.