GNU bug report logs - #70474
Possible bug with `atomic-box-swap!` on OSX/M3 (?!?!)

Package: guile;

Reported by: Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>

Date: Fri, 19 Apr 2024 12:11:04 UTC

Severity: normal

To reply to this bug, email your comments to 70474 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-guile <at> gnu.org:
bug#70474; Package guile. (Fri, 19 Apr 2024 12:11:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Fri, 19 Apr 2024 12:11:04 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>
To: bug-guile <at> gnu.org
Subject: Possible bug with `atomic-box-swap!` on OSX/M3 (?!?!)
Date: Fri, 19 Apr 2024 12:54:32 +0200

Hello all,

I'm seeing some very strange behaviour from `atomic-box-swap!` (but not 
`atomic-box-compare-and-swap!`) on Guile 3.0.9 from Homebrew on OSX 
Sonoma using an M3 Pro cpu. The issue does not seem to manifest on 
x86_64. Could it be some interaction between Guile and M3 CPUs?

Or am I just doing something very silly that shouldn't work at all and 
just happens to look like it works on x86_64?

Here's the program that fails. It will run for a few hundred million 
rounds and then yield "q null in get". Note that using CAS seems to 
work, but plain old swap doesn't.

;;--

;; Eventually this fails with "q null in get" if `atomic-box-swap!` is
;; used where marked (*) below. It takes usually between hundreds of
;; millions and a few billion increments to fail.
;;
;; It does NOT fail if the line marked (*) is commented out and the line
;; below it mentioning `atomic-box-compare-and-swap!` is uncommented and
;; used instead.
;;
;; The failure happens on OSX Sonoma 14.4.1 on a MacBook Pro running an
;; M3 Pro CPU using Guile version 3.0.9 from Homebrew as of 2024-04-17.
;;
;; It does NOT happen on AMD x86_64 Debian linux with Guile 3.0.9 from
;; Debian packaging.

(use-modules (ice-9 atomic))

(define r (make-atomic-box '(0)))

(let loop ()
  (let ((v (let ((q (atomic-box-ref r)))
             (when (null? q) (error "q null in get"))
             (unless (eq? (atomic-box-compare-and-swap! r q (cdr q)) q)
               (error "CAS failed in get"))
             (car q))))

    (when (zero? (remainder v 10000000)) (write v) (newline))

    (unless (null?
             (atomic-box-swap! r (list (+ v 1))) ;; (*)
             ;; (atomic-box-compare-and-swap! r '() (list (+ v 1)))
             )
      (error "swap failed in put"))

    (loop)))

Information forwarded to bug-guile <at> gnu.org:
bug#70474; Package guile. (Fri, 19 Apr 2024 13:21:04 GMT) Full text and rfc822 format available.

Message #8 received at 70474 <at> debbugs.gnu.org (full text, mbox):

From: Christopher Baines <mail <at> cbaines.net>
To: Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>
Cc: 70474 <at> debbugs.gnu.org
Subject: Re: bug#70474: Possible bug with `atomic-box-swap!` on OSX/M3 (?!?!)
Date: Fri, 19 Apr 2024 14:19:52 +0100

[Message part 1 (text/plain, inline)]

Tony Garnock-Jones <tonyg <at> leastfixedpoint.com> writes:

> I'm seeing some very strange behaviour from `atomic-box-swap!` (but
> not `atomic-box-compare-and-swap!`) on Guile 3.0.9 from Homebrew on
> OSX Sonoma using an M3 Pro cpu. The issue does not seem to manifest on
> x86_64. Could it be some interaction between Guile and M3 CPUs?
>
> Or am I just doing something very silly that shouldn't work at all and
> just happens to look like it works on x86_64?
>
> Here's the program that fails. It will run for a few hundred million
> rounds and then yield "q null in get". Note that using CAS seems to
> work, but plain old swap doesn't.

There are known issue(s) with Guile JIT and atomics on ARM
(e.g. [1]). If the problem doesn't appear when disabling JIT, then
you're probably seeing the same issue.

1: https://github.com/wingo/fibers/issues/83

[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guile <at> gnu.org:
bug#70474; Package guile. (Fri, 19 Apr 2024 13:37:03 GMT) Full text and rfc822 format available.

Message #11 received at 70474 <at> debbugs.gnu.org (full text, mbox):

From: Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>
To: 70474 <at> debbugs.gnu.org
Subject: Also manifests on an M1 running 14.1.1 and with newer Guile versions
Date: Fri, 19 Apr 2024 14:14:31 +0200

A small bit of extra information: it's not just one machine; the problem 
also manifests on an M1 running OSX 14.1.1. In addition, it happens with 
newer Guile versions including `guile-next` from `aconchillo`'s Homebrew 
tap and including a version I built from Guile git main just now.

Regards,
  Tony

Information forwarded to bug-guile <at> gnu.org:
bug#70474; Package guile. (Fri, 19 Apr 2024 20:48:02 GMT) Full text and rfc822 format available.

Message #14 received at 70474 <at> debbugs.gnu.org (full text, mbox):

From: Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>
To: 70474 <at> debbugs.gnu.org
Subject: [PATCH 1/2] Including the cast makes Apple clang 15.0.0 happy;
 without it, clang is sad
Date: Fri, 19 Apr 2024 22:46:48 +0200

I'm not sure why, exactly, but I needed this to get builds to work on 
OSX Sonoma at all.

---
 libguile/scmsigs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libguile/scmsigs.c b/libguile/scmsigs.c
index 7fd3fd8f1..be96dbd5c 100644
--- a/libguile/scmsigs.c
+++ b/libguile/scmsigs.c
@@ -302,7 +302,7 @@ scm_i_signals_post_fork ()
     }
  #if SCM_USE_PTHREAD_THREADS
-  once = SCM_I_PTHREAD_ONCE_INIT;
+  once = (scm_i_pthread_once_t) SCM_I_PTHREAD_ONCE_INIT;
 #endif
   if (active)
     scm_i_ensure_signal_delivery_thread ();
-- 
2.44.0

Information forwarded to bug-guile <at> gnu.org:
bug#70474; Package guile. (Fri, 19 Apr 2024 20:50:04 GMT) Full text and rfc822 format available.

Message #17 received at 70474 <at> debbugs.gnu.org (full text, mbox):

From: Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>
To: 70474 <at> debbugs.gnu.org
Subject: [PATCH 2/2] Replace aarch64 CAS and atomic-swap generated JIT code
 with CASAL and SWPAL instructions
Date: Fri, 19 Apr 2024 22:48:53 +0200

This appears to make the problem go away. I'm new at working with 
`lightening` so I'm not confident I've covered everything that needs 
covering, particularly wrt the implementation of cas_atomic. But perhaps 
this can be a foundation for someone who knows more than I do to work from.

---
 libguile/lightening/lightening/aarch64-cpu.c | 41 ++++++--------------
 1 file changed, 11 insertions(+), 30 deletions(-)

diff --git a/libguile/lightening/lightening/aarch64-cpu.c 
b/libguile/lightening/lightening/aarch64-cpu.c
index 13aa351e9..30766652f 100644
--- a/libguile/lightening/lightening/aarch64-cpu.c
+++ b/libguile/lightening/lightening/aarch64-cpu.c
@@ -223,8 +223,8 @@ oxxrs(jit_state_t *_jit, int32_t Op,
 #define A64_UMULH                     0x9bc07c00
 #define A64_LDAR                      0xc8dffc00
 #define A64_STLR                      0xc89ffc00
-#define A64_LDAXR                     0xc85ffc00
-#define A64_STLXR                     0xc800fc00
+#define A64_SWPAL                     0xf8e08000
+#define A64_CASAL                     0xc8e0fc00
 #define A64_STRBI                     0x39000000
 #define A64_LDRBI                     0x39400000
 #define A64_LDRSBI                    0x39800000
@@ -664,15 +664,15 @@ STLR(jit_state_t *_jit, int32_t Rt, int32_t Rn)
 }
  static void
-LDAXR(jit_state_t *_jit, int32_t Rt, int32_t Rn) +SWPAL(jit_state_t 
*_jit, int32_t Rt, int32_t Rn, int32_t Rs)
 {
-  return o_xx(_jit, A64_LDAXR, Rt, Rn);
+  return oxxx(_jit, A64_SWPAL, Rt, Rn, Rs);
 }
  static void
-STLXR(jit_state_t *_jit, int32_t Rt, int32_t Rn, int32_t Rm)
+CASAL(jit_state_t *_jit, int32_t Rt, int32_t Rn, int32_t Rs)
 {
-  return oxxx(_jit, A64_STLXR, Rt, Rn, Rm);
+  return oxxx(_jit, A64_CASAL, Rt, Rn, Rs);
 }
  static void
@@ -2532,36 +2532,17 @@ str_atomic(jit_state_t *_jit, int32_t loc, 
int32_t val)
 static void
 swap_atomic(jit_state_t *_jit, int32_t dst, int32_t loc, int32_t val)
 {
-  void *retry = jit_address(_jit);
-  int32_t result = jit_gpr_regno(get_temp_gpr(_jit));
-  int32_t val_or_tmp = dst == val ? jit_gpr_regno(get_temp_gpr(_jit)) : 
val;
-  movr(_jit, val_or_tmp, val);
-  LDAXR(_jit, dst, loc);
-  STLXR(_jit, val_or_tmp, loc, result);
-  jit_patch_there(_jit, bnei(_jit, result, 0), retry);
-  if (dst == val) unget_temp_gpr(_jit);
-  unget_temp_gpr(_jit);
+  SWPAL(_jit, dst, loc, val);
 }
  static void
 cas_atomic(jit_state_t *_jit, int32_t dst, int32_t loc, int32_t expected,
            int32_t desired)
 {
-  int32_t dst_or_tmp;
-  if (dst == loc || dst == expected || dst == expected)
-    dst_or_tmp = jit_gpr_regno(get_temp_gpr(_jit));
-  else
-    dst_or_tmp = dst;
-  void *retry = jit_address(_jit);
-  LDAXR(_jit, dst_or_tmp, loc);
-  jit_reloc_t bad = bner(_jit, dst_or_tmp, expected);
-  int result = jit_gpr_regno(get_temp_gpr(_jit));
-  STLXR(_jit, desired, loc, result);
-  jit_patch_there(_jit, bnei(_jit, result, 0), retry);
-  unget_temp_gpr(_jit);
-  jit_patch_here(_jit, bad);
-  movr(_jit, dst, dst_or_tmp);
-  unget_temp_gpr(_jit);
+  if (dst != expected) {
+    movr(_jit, dst, expected);
+  }
+  CASAL(_jit, desired, loc, dst);
 }
  static void
-- 
2.44.0

Information forwarded to bug-guile <at> gnu.org:
bug#70474; Package guile. (Mon, 22 Apr 2024 07:54:01 GMT) Full text and rfc822 format available.

Message #20 received at 70474 <at> debbugs.gnu.org (full text, mbox):

From: Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>
To: 70474 <at> debbugs.gnu.org
Subject: Just adding DMB doesn't help
Date: Mon, 22 Apr 2024 09:52:47 +0200

[Message part 1 (text/plain, inline)]

As an alternative to changing the JIT to produce SWPAL/CASAL, because I 
wasn't sure if *all* aarch64 targets support these, I tried adding DMB 
ISH or DMB SY to the end of the generated code sequences. Surprisingly, 
this did not fix the issue! So there's perhaps something fishy about the 
LDAXR-STLXR sequences themselves?

So for now I'll stick on my own machine with SWPAL/CASAL, since this 
does seem to work well enough to let both my own code and fibers run.

Tony

[add-dmb-does-not-help.patch (text/plain, attachment)]

Information forwarded to bug-guile <at> gnu.org:
bug#70474; Package guile. (Mon, 22 Apr 2024 08:20:02 GMT) Full text and rfc822 format available.

Message #23 received at 70474 <at> debbugs.gnu.org (full text, mbox):

From: Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>
To: 70474 <at> debbugs.gnu.org
Subject: [PATCH] Move the spin loop target to the LDAXR instruction.
Date: Mon, 22 Apr 2024 10:18:46 +0200

Oh man. This little patch all by itself makes the problem behaviour go 
away. No switching to SWPAL/CASAL, just tightening the spinloop. (And no 
changes at all to the CAS code, so nothing to do with the fibers bug I 
guess.)

With the patch, the spinloop goes LDAXR-STLXR-CBNZ (which is what GCC 
does when SWPAL isn't there) instead of potentially MOV-LDAXR-STLXR-CBNZ 
(which isn't).

Could the machine really be so sensitive to the target of the CBNZ?

---
 libguile/lightening/lightening/aarch64-cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libguile/lightening/lightening/aarch64-cpu.c 
b/libguile/lightening/lightening/aarch64-cpu.c
index 13aa351e9..4df712a0e 100644
--- a/libguile/lightening/lightening/aarch64-cpu.c
+++ b/libguile/lightening/lightening/aarch64-cpu.c
@@ -2532,10 +2532,10 @@ str_atomic(jit_state_t *_jit, int32_t loc, 
int32_t val)
 static void
 swap_atomic(jit_state_t *_jit, int32_t dst, int32_t loc, int32_t val)
 {
-  void *retry = jit_address(_jit);
   int32_t result = jit_gpr_regno(get_temp_gpr(_jit));
   int32_t val_or_tmp = dst == val ? jit_gpr_regno(get_temp_gpr(_jit)) 
: val;
   movr(_jit, val_or_tmp, val);
+  void *retry = jit_address(_jit);
   LDAXR(_jit, dst, loc);
   STLXR(_jit, val_or_tmp, loc, result);
   jit_patch_there(_jit, bnei(_jit, result, 0), retry);
-- 
2.44.0

Information forwarded to bug-guile <at> gnu.org:
bug#70474; Package guile. (Mon, 22 Apr 2024 11:24:02 GMT) Full text and rfc822 format available.

Message #26 received at 70474 <at> debbugs.gnu.org (full text, mbox):

From: Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>
To: 70474 <at> debbugs.gnu.org
Subject: Re: bug#70474: [PATCH] Move the spin loop target to the LDAXR
 instruction.
Date: Mon, 22 Apr 2024 13:23:13 +0200

Andy Wingo in IRC pointed out that the reason the patch appears to work 
is that the `movr` isn't idempotent! By the time it comes round again, 
`val` has already been overwritten by LDAXR in the case that `dst == val`.

On 22/04/2024 10:18, Tony Garnock-Jones wrote:
> Oh man. This little patch all by itself makes the problem behaviour go 
> away. No switching to SWPAL/CASAL, just tightening the spinloop. (And no 
> changes at all to the CAS code, so nothing to do with the fibers bug I 
> guess.)
> 
> With the patch, the spinloop goes LDAXR-STLXR-CBNZ (which is what GCC 
> does when SWPAL isn't there) instead of potentially MOV-LDAXR-STLXR-CBNZ 
> (which isn't).
> 
> Could the machine really be so sensitive to the target of the CBNZ?
> 
> ---
>   libguile/lightening/lightening/aarch64-cpu.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/libguile/lightening/lightening/aarch64-cpu.c 
> b/libguile/lightening/lightening/aarch64-cpu.c
> index 13aa351e9..4df712a0e 100644
> --- a/libguile/lightening/lightening/aarch64-cpu.c
> +++ b/libguile/lightening/lightening/aarch64-cpu.c
> @@ -2532,10 +2532,10 @@ str_atomic(jit_state_t *_jit, int32_t loc, 
> int32_t val)
>   static void
>   swap_atomic(jit_state_t *_jit, int32_t dst, int32_t loc, int32_t val)
>   {
> -  void *retry = jit_address(_jit);
>     int32_t result = jit_gpr_regno(get_temp_gpr(_jit));
>     int32_t val_or_tmp = dst == val ? jit_gpr_regno(get_temp_gpr(_jit)) 
> : val;
>     movr(_jit, val_or_tmp, val);
> +  void *retry = jit_address(_jit);
>     LDAXR(_jit, dst, loc);
>     STLXR(_jit, val_or_tmp, loc, result);
>     jit_patch_there(_jit, bnei(_jit, result, 0), retry);

This bug report was last modified 1 year and 57 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #70474 Possible bug with `atomic-box-swap!` on OSX/M3 (?!?!)

GNU bug report logs - #70474
Possible bug with `atomic-box-swap!` on OSX/M3 (?!?!)