GNU bug report logs -
#70474
Possible bug with `atomic-box-swap!` on OSX/M3 (?!?!)
Previous Next
To reply to this bug, email your comments to 70474 AT debbugs.gnu.org.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-guile <at> gnu.org
:
bug#70474
; Package
guile
.
(Fri, 19 Apr 2024 12:11:04 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Tony Garnock-Jones <tonyg <at> leastfixedpoint.com>
:
New bug report received and forwarded. Copy sent to
bug-guile <at> gnu.org
.
(Fri, 19 Apr 2024 12:11:04 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello all,
I'm seeing some very strange behaviour from `atomic-box-swap!` (but not
`atomic-box-compare-and-swap!`) on Guile 3.0.9 from Homebrew on OSX
Sonoma using an M3 Pro cpu. The issue does not seem to manifest on
x86_64. Could it be some interaction between Guile and M3 CPUs?
Or am I just doing something very silly that shouldn't work at all and
just happens to look like it works on x86_64?
Here's the program that fails. It will run for a few hundred million
rounds and then yield "q null in get". Note that using CAS seems to
work, but plain old swap doesn't.
;;--
;; Eventually this fails with "q null in get" if `atomic-box-swap!` is
;; used where marked (*) below. It takes usually between hundreds of
;; millions and a few billion increments to fail.
;;
;; It does NOT fail if the line marked (*) is commented out and the line
;; below it mentioning `atomic-box-compare-and-swap!` is uncommented and
;; used instead.
;;
;; The failure happens on OSX Sonoma 14.4.1 on a MacBook Pro running an
;; M3 Pro CPU using Guile version 3.0.9 from Homebrew as of 2024-04-17.
;;
;; It does NOT happen on AMD x86_64 Debian linux with Guile 3.0.9 from
;; Debian packaging.
(use-modules (ice-9 atomic))
(define r (make-atomic-box '(0)))
(let loop ()
(let ((v (let ((q (atomic-box-ref r)))
(when (null? q) (error "q null in get"))
(unless (eq? (atomic-box-compare-and-swap! r q (cdr q)) q)
(error "CAS failed in get"))
(car q))))
(when (zero? (remainder v 10000000)) (write v) (newline))
(unless (null?
(atomic-box-swap! r (list (+ v 1))) ;; (*)
;; (atomic-box-compare-and-swap! r '() (list (+ v 1)))
)
(error "swap failed in put"))
(loop)))
Information forwarded
to
bug-guile <at> gnu.org
:
bug#70474
; Package
guile
.
(Fri, 19 Apr 2024 13:21:04 GMT)
Full text and
rfc822 format available.
Message #8 received at 70474 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Tony Garnock-Jones <tonyg <at> leastfixedpoint.com> writes:
> I'm seeing some very strange behaviour from `atomic-box-swap!` (but
> not `atomic-box-compare-and-swap!`) on Guile 3.0.9 from Homebrew on
> OSX Sonoma using an M3 Pro cpu. The issue does not seem to manifest on
> x86_64. Could it be some interaction between Guile and M3 CPUs?
>
> Or am I just doing something very silly that shouldn't work at all and
> just happens to look like it works on x86_64?
>
> Here's the program that fails. It will run for a few hundred million
> rounds and then yield "q null in get". Note that using CAS seems to
> work, but plain old swap doesn't.
There are known issue(s) with Guile JIT and atomics on ARM
(e.g. [1]). If the problem doesn't appear when disabling JIT, then
you're probably seeing the same issue.
1: https://github.com/wingo/fibers/issues/83
[signature.asc (application/pgp-signature, inline)]
Information forwarded
to
bug-guile <at> gnu.org
:
bug#70474
; Package
guile
.
(Fri, 19 Apr 2024 13:37:03 GMT)
Full text and
rfc822 format available.
Message #11 received at 70474 <at> debbugs.gnu.org (full text, mbox):
A small bit of extra information: it's not just one machine; the problem
also manifests on an M1 running OSX 14.1.1. In addition, it happens with
newer Guile versions including `guile-next` from `aconchillo`'s Homebrew
tap and including a version I built from Guile git main just now.
Regards,
Tony
Information forwarded
to
bug-guile <at> gnu.org
:
bug#70474
; Package
guile
.
(Fri, 19 Apr 2024 20:48:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 70474 <at> debbugs.gnu.org (full text, mbox):
I'm not sure why, exactly, but I needed this to get builds to work on
OSX Sonoma at all.
---
libguile/scmsigs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libguile/scmsigs.c b/libguile/scmsigs.c
index 7fd3fd8f1..be96dbd5c 100644
--- a/libguile/scmsigs.c
+++ b/libguile/scmsigs.c
@@ -302,7 +302,7 @@ scm_i_signals_post_fork ()
}
#if SCM_USE_PTHREAD_THREADS
- once = SCM_I_PTHREAD_ONCE_INIT;
+ once = (scm_i_pthread_once_t) SCM_I_PTHREAD_ONCE_INIT;
#endif
if (active)
scm_i_ensure_signal_delivery_thread ();
--
2.44.0
Information forwarded
to
bug-guile <at> gnu.org
:
bug#70474
; Package
guile
.
(Fri, 19 Apr 2024 20:50:04 GMT)
Full text and
rfc822 format available.
Message #17 received at 70474 <at> debbugs.gnu.org (full text, mbox):
This appears to make the problem go away. I'm new at working with
`lightening` so I'm not confident I've covered everything that needs
covering, particularly wrt the implementation of cas_atomic. But perhaps
this can be a foundation for someone who knows more than I do to work from.
---
libguile/lightening/lightening/aarch64-cpu.c | 41 ++++++--------------
1 file changed, 11 insertions(+), 30 deletions(-)
diff --git a/libguile/lightening/lightening/aarch64-cpu.c
b/libguile/lightening/lightening/aarch64-cpu.c
index 13aa351e9..30766652f 100644
--- a/libguile/lightening/lightening/aarch64-cpu.c
+++ b/libguile/lightening/lightening/aarch64-cpu.c
@@ -223,8 +223,8 @@ oxxrs(jit_state_t *_jit, int32_t Op,
#define A64_UMULH 0x9bc07c00
#define A64_LDAR 0xc8dffc00
#define A64_STLR 0xc89ffc00
-#define A64_LDAXR 0xc85ffc00
-#define A64_STLXR 0xc800fc00
+#define A64_SWPAL 0xf8e08000
+#define A64_CASAL 0xc8e0fc00
#define A64_STRBI 0x39000000
#define A64_LDRBI 0x39400000
#define A64_LDRSBI 0x39800000
@@ -664,15 +664,15 @@ STLR(jit_state_t *_jit, int32_t Rt, int32_t Rn)
}
static void
-LDAXR(jit_state_t *_jit, int32_t Rt, int32_t Rn) +SWPAL(jit_state_t
*_jit, int32_t Rt, int32_t Rn, int32_t Rs)
{
- return o_xx(_jit, A64_LDAXR, Rt, Rn);
+ return oxxx(_jit, A64_SWPAL, Rt, Rn, Rs);
}
static void
-STLXR(jit_state_t *_jit, int32_t Rt, int32_t Rn, int32_t Rm)
+CASAL(jit_state_t *_jit, int32_t Rt, int32_t Rn, int32_t Rs)
{
- return oxxx(_jit, A64_STLXR, Rt, Rn, Rm);
+ return oxxx(_jit, A64_CASAL, Rt, Rn, Rs);
}
static void
@@ -2532,36 +2532,17 @@ str_atomic(jit_state_t *_jit, int32_t loc,
int32_t val)
static void
swap_atomic(jit_state_t *_jit, int32_t dst, int32_t loc, int32_t val)
{
- void *retry = jit_address(_jit);
- int32_t result = jit_gpr_regno(get_temp_gpr(_jit));
- int32_t val_or_tmp = dst == val ? jit_gpr_regno(get_temp_gpr(_jit)) :
val;
- movr(_jit, val_or_tmp, val);
- LDAXR(_jit, dst, loc);
- STLXR(_jit, val_or_tmp, loc, result);
- jit_patch_there(_jit, bnei(_jit, result, 0), retry);
- if (dst == val) unget_temp_gpr(_jit);
- unget_temp_gpr(_jit);
+ SWPAL(_jit, dst, loc, val);
}
static void
cas_atomic(jit_state_t *_jit, int32_t dst, int32_t loc, int32_t expected,
int32_t desired)
{
- int32_t dst_or_tmp;
- if (dst == loc || dst == expected || dst == expected)
- dst_or_tmp = jit_gpr_regno(get_temp_gpr(_jit));
- else
- dst_or_tmp = dst;
- void *retry = jit_address(_jit);
- LDAXR(_jit, dst_or_tmp, loc);
- jit_reloc_t bad = bner(_jit, dst_or_tmp, expected);
- int result = jit_gpr_regno(get_temp_gpr(_jit));
- STLXR(_jit, desired, loc, result);
- jit_patch_there(_jit, bnei(_jit, result, 0), retry);
- unget_temp_gpr(_jit);
- jit_patch_here(_jit, bad);
- movr(_jit, dst, dst_or_tmp);
- unget_temp_gpr(_jit);
+ if (dst != expected) {
+ movr(_jit, dst, expected);
+ }
+ CASAL(_jit, desired, loc, dst);
}
static void
--
2.44.0
Information forwarded
to
bug-guile <at> gnu.org
:
bug#70474
; Package
guile
.
(Mon, 22 Apr 2024 07:54:01 GMT)
Full text and
rfc822 format available.
Message #20 received at 70474 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
As an alternative to changing the JIT to produce SWPAL/CASAL, because I
wasn't sure if *all* aarch64 targets support these, I tried adding DMB
ISH or DMB SY to the end of the generated code sequences. Surprisingly,
this did not fix the issue! So there's perhaps something fishy about the
LDAXR-STLXR sequences themselves?
So for now I'll stick on my own machine with SWPAL/CASAL, since this
does seem to work well enough to let both my own code and fibers run.
Tony
[add-dmb-does-not-help.patch (text/plain, attachment)]
Information forwarded
to
bug-guile <at> gnu.org
:
bug#70474
; Package
guile
.
(Mon, 22 Apr 2024 08:20:02 GMT)
Full text and
rfc822 format available.
Message #23 received at 70474 <at> debbugs.gnu.org (full text, mbox):
Oh man. This little patch all by itself makes the problem behaviour go
away. No switching to SWPAL/CASAL, just tightening the spinloop. (And no
changes at all to the CAS code, so nothing to do with the fibers bug I
guess.)
With the patch, the spinloop goes LDAXR-STLXR-CBNZ (which is what GCC
does when SWPAL isn't there) instead of potentially MOV-LDAXR-STLXR-CBNZ
(which isn't).
Could the machine really be so sensitive to the target of the CBNZ?
---
libguile/lightening/lightening/aarch64-cpu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libguile/lightening/lightening/aarch64-cpu.c
b/libguile/lightening/lightening/aarch64-cpu.c
index 13aa351e9..4df712a0e 100644
--- a/libguile/lightening/lightening/aarch64-cpu.c
+++ b/libguile/lightening/lightening/aarch64-cpu.c
@@ -2532,10 +2532,10 @@ str_atomic(jit_state_t *_jit, int32_t loc,
int32_t val)
static void
swap_atomic(jit_state_t *_jit, int32_t dst, int32_t loc, int32_t val)
{
- void *retry = jit_address(_jit);
int32_t result = jit_gpr_regno(get_temp_gpr(_jit));
int32_t val_or_tmp = dst == val ? jit_gpr_regno(get_temp_gpr(_jit))
: val;
movr(_jit, val_or_tmp, val);
+ void *retry = jit_address(_jit);
LDAXR(_jit, dst, loc);
STLXR(_jit, val_or_tmp, loc, result);
jit_patch_there(_jit, bnei(_jit, result, 0), retry);
--
2.44.0
Information forwarded
to
bug-guile <at> gnu.org
:
bug#70474
; Package
guile
.
(Mon, 22 Apr 2024 11:24:02 GMT)
Full text and
rfc822 format available.
Message #26 received at 70474 <at> debbugs.gnu.org (full text, mbox):
Andy Wingo in IRC pointed out that the reason the patch appears to work
is that the `movr` isn't idempotent! By the time it comes round again,
`val` has already been overwritten by LDAXR in the case that `dst == val`.
On 22/04/2024 10:18, Tony Garnock-Jones wrote:
> Oh man. This little patch all by itself makes the problem behaviour go
> away. No switching to SWPAL/CASAL, just tightening the spinloop. (And no
> changes at all to the CAS code, so nothing to do with the fibers bug I
> guess.)
>
> With the patch, the spinloop goes LDAXR-STLXR-CBNZ (which is what GCC
> does when SWPAL isn't there) instead of potentially MOV-LDAXR-STLXR-CBNZ
> (which isn't).
>
> Could the machine really be so sensitive to the target of the CBNZ?
>
> ---
> libguile/lightening/lightening/aarch64-cpu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libguile/lightening/lightening/aarch64-cpu.c
> b/libguile/lightening/lightening/aarch64-cpu.c
> index 13aa351e9..4df712a0e 100644
> --- a/libguile/lightening/lightening/aarch64-cpu.c
> +++ b/libguile/lightening/lightening/aarch64-cpu.c
> @@ -2532,10 +2532,10 @@ str_atomic(jit_state_t *_jit, int32_t loc,
> int32_t val)
> static void
> swap_atomic(jit_state_t *_jit, int32_t dst, int32_t loc, int32_t val)
> {
> - void *retry = jit_address(_jit);
> int32_t result = jit_gpr_regno(get_temp_gpr(_jit));
> int32_t val_or_tmp = dst == val ? jit_gpr_regno(get_temp_gpr(_jit))
> : val;
> movr(_jit, val_or_tmp, val);
> + void *retry = jit_address(_jit);
> LDAXR(_jit, dst, loc);
> STLXR(_jit, val_or_tmp, loc, result);
> jit_patch_there(_jit, bnei(_jit, result, 0), retry);
This bug report was last modified 1 year and 57 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.