GNU bug report logs - #56487
xgselect race condition leading to abort when USE_GTK not defined

Previous Next

Package: emacs;

Reported by: Tom Gillespie <tgbugs <at> gmail.com>

Date: Sun, 10 Jul 2022 21:07:02 UTC

Severity: normal

Tags: moreinfo, patch

To reply to this bug, email your comments to 56487 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Sun, 10 Jul 2022 21:07:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Tom Gillespie <tgbugs <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 10 Jul 2022 21:07:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Tom Gillespie <tgbugs <at> gmail.com>
To: Emacs Bug Report <bug-gnu-emacs <at> gnu.org>
Subject: xgselect race condition leading to abort when USE_GTK not defined
Date: Sun, 10 Jul 2022 14:05:58 -0700
[Message part 1 (text/plain, inline)]
There is a race condition in xgselect.c leading to an abort being
raised in glib code. I have attached a patch that fixes the issue.

It took an absurd amount of time to track this down and debug. The
patch and repro file contain more information.

This is not the easiest bug to test for because the race condition is
non-deterministic, I have attached the elisp file that I use to
explore and reproduce the issue.

Ensure that you have glib installed and that you are building --with-x
so that the code in xgselect is enabled. Run configure as ./configure
--with-x-toolkit=lucid (or anything except gtk), and then run make.

Assuming process-thread-bugs.el is in the top level of the emacs repo
define the following function

function loop-abort () { while true; do src/emacs -Q -batch -l \
./process-thread-bugs.el will abort 1 || return $?; done }

You can then run loop-abort; echo $? and when the patch is not applied
Emacs will eventually abort. You can also add a call to fputs in
thread.c at the location of the comment and you will almost
immediately see the abort behavior, even when building with gtk.
[0001-xgselect.c-avoid-race-condition-leading-to-abort.patch (application/x-patch, attachment)]
[process-thread-bugs.el (application/x-lisp, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Mon, 11 Jul 2022 01:54:02 GMT) Full text and rfc822 format available.

Message #8 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Po Lu <luangruo <at> yahoo.com>
To: Tom Gillespie <tgbugs <at> gmail.com>
Cc: 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when
 USE_GTK not defined
Date: Mon, 11 Jul 2022 09:53:22 +0800
Tom Gillespie <tgbugs <at> gmail.com> writes:

> From e120751b9de79be521f490f8c2f99597868b3208 Mon Sep 17 00:00:00 2001
> From: Tom Gillespie <tgbugs <at> gmail.com>
> Date: Tue, 5 Jul 2022 13:43:25 -0700
> Subject: [PATCH] xgselect.c: avoid race condition leading to abort
>
> xgselect.c (xg_select): Remove ifdefs that were previously USE_GTK
> specific to be unconditional. This prevents a race condition caused by
> a call to release_select_lock from triggering when configuring Emacs
> with any toolkit other than gtk.
>
> xgselect.c (release_select_lock): Add a branch to check whether
> threads_holding_glib_lock has gone negative, and if so, restore to zero.
> In the case where there are multiple threads, threads_holding_glib_lock
> could become and stay negative, prevending g_main_context_release from
> ever being called, even when it should have been.
>
> As far as I can tell the way that the thread.c and xgselect.c code was
> written was with the intention of avoiding additional locks. This means
> that this code is exquisitely senstivie to slight changes in timing. A
> comment in thread.c has been added at one location where this happens.
>
> It is entirely possible that the removal of the ifdefs branching on
> USE_GTK resolves this issue by slightly changing the timings when
> using other/no toolkits so that the abort does not trigger. In all
> cases aborts can be induced by adding something like fputs in thread.c
> at the location of the new comment.
>
> For the record, the abort behavior is not present in Emacs 27, and was
> introduced by 9c62ffb08262c82b7e38e6eb5767f2087424aa47 (the bisect was
> quite a pain, so hopefully no one ever needs to do it again).

Thanks.  The commit message does not really comply with our style,
however.  The ChangeLog entries should look like this instead:


* src/xgselect.c (xg_select): Remove ifdefs that were previously
USE_GTK specific to be unconditional. This prevents a race
condition caused by a call to release_select_lock from
triggering when configuring Emacs with any toolkit other than
GTK.
(release_select_lock): Add a branch to check whether
threads_holding_glib_lock has gone negative, and if so, restore
to zero.  In the case where there are multiple threads,
threads_holding_glib_lock could become and stay negative,
prevending g_main_context_release from ever being called, even
when it should have been.


You can use C-c C-w inside a *vc-log* buffer to generate an
appropriately formatted commit message.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Mon, 11 Jul 2022 02:51:02 GMT) Full text and rfc822 format available.

Message #11 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Tom Gillespie <tgbugs <at> gmail.com>
To: 56487 <at> debbugs.gnu.org
Cc: Po Lu <luangruo <at> yahoo.com>
Subject: Re: bug#56487: xgselect race condition leading to abort when USE_GTK
 not defined
Date: Sun, 10 Jul 2022 19:50:08 -0700
[Message part 1 (text/plain, inline)]
Here is (I think) the patch with correctly formatted change log entries.
[0001-xgselect.c-avoid-race-condition-leading-to-abort.patch (text/x-patch, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Mon, 11 Jul 2022 03:14:01 GMT) Full text and rfc822 format available.

Message #14 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Po Lu <luangruo <at> yahoo.com>
To: Tom Gillespie <tgbugs <at> gmail.com>
Cc: 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when
 USE_GTK not defined
Date: Mon, 11 Jul 2022 11:13:04 +0800
Tom Gillespie <tgbugs <at> gmail.com> writes:

> -#ifdef USE_GTK
>    bool already_has_events;
> -#endif
>  
>    if (xg_select_suppress_count)
>      return pselect (fds_lim, rfds, wfds, efds, timeout, sigmask);
> @@ -126,8 +132,8 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
>    context = g_main_context_default ();
>    acquire_select_lock (context);
>  
> -#ifdef USE_GTK
>    already_has_events = g_main_context_pending (context);
> +#ifdef USE_GTK
>  #ifndef HAVE_PGTK
>    already_has_events = already_has_events && x_gtk_use_native_input;
>  #endif
> @@ -179,12 +185,6 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
>  	tmop = &tmo;
>      }
>  
> -#ifndef USE_GTK
> -  fds_lim = max_fds + 1;
> -  nfds = thread_select (pselect, fds_lim,
> -			&all_rfds, have_wfds ? &all_wfds : NULL, efds,
> -			tmop, sigmask);
> -#else
>    /* On PGTK, when you type a key, the key press event are received,
>       and one more key press event seems to be received internally.
>  
> @@ -217,7 +217,6 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
>  	FD_ZERO (efds);
>        our_fds++;
>      }
> -#endif
>  
>    if (nfds < 0)
>      retval = nfds;
> @@ -248,11 +247,7 @@ xg_select (int fds_lim, fd_set *rfds, fd_set *wfds, fd_set *efds,
>  
>    /* If Gtk+ is in use eventually gtk_main_iteration will be called,
>       unless retval is zero.  */
> -#ifdef USE_GTK
>    need_to_dispatch = retval == 0;
> -#else
> -  need_to_dispatch = true;
> -#endif
>  
>    /* xwidgets make heavy use of GLib subprocesses, which add their own
>       SIGCHLD handler at arbitrary locations.  That doesn't play well

Thanks.  Why did the code previously under !USE_GTK have to be removed?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Mon, 11 Jul 2022 03:42:02 GMT) Full text and rfc822 format available.

Message #17 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Tom Gillespie <tgbugs <at> gmail.com>
To: Po Lu <luangruo <at> yahoo.com>
Cc: 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when USE_GTK
 not defined
Date: Sun, 10 Jul 2022 20:40:43 -0700
> Thanks.  Why did the code previously under !USE_GTK have to be removed?

When the !USE_GTK code is used an abort in glib will happen
stochastically due to an out-of-sync call to release_select_lock
in thread.c. This happens on my system somewhere between
approximately 1 in 10 and 1 in 10000 times that the test file
is run.

As far as I can tell from testing there is no difference in behavior
between the USE_GTK and !USE_GTK code. Also, as far as
I can tell from reading, the behavior should be almost identical.
The only addition is to check for already_has_events before
calling thread_select, which may be enough to shift the timing
to prevent a race.

I have not been able to figure out what the actual underlying
cause is (I tried). All I can say for sure is that there is
something that calls into g_main_context_release and
context->owner_count has a negative overflow to 4294967295.

I do not think that it is because something somehow sneaks
in between the calls to the atomics in acquire_select_lock
and relese_select_lock. If you would like I can send along
a couple of patches that include changes I made to try to
see what is going on.

The real underlying issue would seem to be that there is a
missing lock somewhere and that the use of atomics is not
sufficient, but I could be wrong about that.




Added tag(s) patch. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Mon, 11 Jul 2022 10:04:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Mon, 11 Jul 2022 10:17:01 GMT) Full text and rfc822 format available.

Message #22 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Po Lu <luangruo <at> yahoo.com>
To: Tom Gillespie <tgbugs <at> gmail.com>
Cc: 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when
 USE_GTK not defined
Date: Mon, 11 Jul 2022 18:16:36 +0800
Tom Gillespie <tgbugs <at> gmail.com> writes:

>> Thanks.  Why did the code previously under !USE_GTK have to be removed?
>
> When the !USE_GTK code is used an abort in glib will happen
> stochastically due to an out-of-sync call to release_select_lock
> in thread.c. This happens on my system somewhere between
> approximately 1 in 10 and 1 in 10000 times that the test file
> is run.
>
> As far as I can tell from testing there is no difference in behavior
> between the USE_GTK and !USE_GTK code. Also, as far as
> I can tell from reading, the behavior should be almost identical.
> The only addition is to check for already_has_events before
> calling thread_select, which may be enough to shift the timing
> to prevent a race.
>
> I have not been able to figure out what the actual underlying
> cause is (I tried). All I can say for sure is that there is
> something that calls into g_main_context_release and
> context->owner_count has a negative overflow to 4294967295.
>
> I do not think that it is because something somehow sneaks
> in between the calls to the atomics in acquire_select_lock
> and relese_select_lock. If you would like I can send along
> a couple of patches that include changes I made to try to
> see what is going on.
>
> The real underlying issue would seem to be that there is a
> missing lock somewhere and that the use of atomics is not
> sufficient, but I could be wrong about that.

So I suggest that someone find that problem instead.  Any attempt to
"fix" a race condition by moving code around so that the timings are
slightly different is simply deluding oneself.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Mon, 11 Jul 2022 16:10:02 GMT) Full text and rfc822 format available.

Message #25 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Tom Gillespie <tgbugs <at> gmail.com>
To: Po Lu <luangruo <at> yahoo.com>
Cc: 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when USE_GTK
 not defined
Date: Mon, 11 Jul 2022 09:09:36 -0700
> So I suggest that someone find that problem instead.  Any attempt to
> "fix" a race condition by moving code around so that the timings are
> slightly different is simply deluding oneself.

I agree. However that means that the current GTK implementation is
itself fundamentally broken/inadequate. Bringing the other toolkits up
to its level of brokenness seems reasonable in this case?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Tue, 12 Jul 2022 02:05:02 GMT) Full text and rfc822 format available.

Message #28 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Po Lu <luangruo <at> yahoo.com>
To: Tom Gillespie <tgbugs <at> gmail.com>
Cc: 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when
 USE_GTK not defined
Date: Tue, 12 Jul 2022 10:03:43 +0800
Tom Gillespie <tgbugs <at> gmail.com> writes:

> I agree. However that means that the current GTK implementation is
> itself fundamentally broken/inadequate. Bringing the other toolkits up
> to its level of brokenness seems reasonable in this case?

No, because we don't want the different GLib event dispatch behavior
outside of GTK with native input enabled.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Tue, 12 Jul 2022 02:21:02 GMT) Full text and rfc822 format available.

Message #31 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Tom Gillespie <tgbugs <at> gmail.com>
To: Po Lu <luangruo <at> yahoo.com>
Cc: 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when USE_GTK
 not defined
Date: Mon, 11 Jul 2022 19:20:02 -0700
> No, because we don't want the different GLib event dispatch behavior
> outside of GTK with native input enabled.

Ok, got it. I'll see if I can figure out an alternate solution with that
constraint in mind.

Should I split out the release_select_lock change into a separate
patch so that we can get that merged independent of a fix for the
abort behavior?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Tue, 12 Jul 2022 12:45:02 GMT) Full text and rfc822 format available.

Message #34 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Tom Gillespie <tgbugs <at> gmail.com>
Cc: luangruo <at> yahoo.com, 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when USE_GTK
 not defined
Date: Tue, 12 Jul 2022 15:44:01 +0300
> Cc: 56487 <at> debbugs.gnu.org
> From: Tom Gillespie <tgbugs <at> gmail.com>
> Date: Mon, 11 Jul 2022 19:20:02 -0700
> 
> > No, because we don't want the different GLib event dispatch behavior
> > outside of GTK with native input enabled.
> 
> Ok, got it. I'll see if I can figure out an alternate solution with that
> constraint in mind.

TIA.

> Should I split out the release_select_lock change into a separate
> patch so that we can get that merged independent of a fix for the
> abort behavior?

Yes, I think that's a good idea.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Fri, 15 Jul 2022 05:11:01 GMT) Full text and rfc822 format available.

Message #37 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Tom Gillespie <tgbugs <at> gmail.com>
To: 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when USE_GTK
 not defined
Date: Thu, 14 Jul 2022 22:09:59 -0700
[Message part 1 (text/plain, inline)]
For the record, while I was hunting a better repro for
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=56488
I had the following abort happen when built with gtk3,
so it seems that there are other conditions that can
trigger the abort in the gtk3 code path as well. The
sequence of events is attached.
[for-the-record.txt (text/plain, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Thu, 07 Sep 2023 18:31:01 GMT) Full text and rfc822 format available.

Message #40 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: luangruo <at> yahoo.com, Tom Gillespie <tgbugs <at> gmail.com>, 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when USE_GTK
 not defined
Date: Thu, 7 Sep 2023 11:30:25 -0700
Eli Zaretskii <eliz <at> gnu.org> writes:

>> Cc: 56487 <at> debbugs.gnu.org
>> From: Tom Gillespie <tgbugs <at> gmail.com>
>> Date: Mon, 11 Jul 2022 19:20:02 -0700
>>
>> > No, because we don't want the different GLib event dispatch behavior
>> > outside of GTK with native input enabled.
>>
>> Ok, got it. I'll see if I can figure out an alternate solution with that
>> constraint in mind.
>
> TIA.
>
>> Should I split out the release_select_lock change into a separate
>> patch so that we can get that merged independent of a fix for the
>> abort behavior?
>
> Yes, I think that's a good idea.

(That was one year ago.)

Tom, did you get any further here?  For example, the second part (a
separate patch for release_select_lock) sounds useful independently of
the rest.

Thanks in advance.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Sun, 23 Feb 2025 01:36:01 GMT) Full text and rfc822 format available.

Message #43 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: luangruo <at> yahoo.com, Tom Gillespie <tgbugs <at> gmail.com>, 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when USE_GTK
 not defined
Date: Sun, 23 Feb 2025 01:35:45 +0000
Stefan Kangas <stefankangas <at> gmail.com> writes:

> Eli Zaretskii <eliz <at> gnu.org> writes:
>
>>> Cc: 56487 <at> debbugs.gnu.org
>>> From: Tom Gillespie <tgbugs <at> gmail.com>
>>> Date: Mon, 11 Jul 2022 19:20:02 -0700
>>>
>>> > No, because we don't want the different GLib event dispatch behavior
>>> > outside of GTK with native input enabled.
>>>
>>> Ok, got it. I'll see if I can figure out an alternate solution with that
>>> constraint in mind.
>>
>> TIA.
>>
>>> Should I split out the release_select_lock change into a separate
>>> patch so that we can get that merged independent of a fix for the
>>> abort behavior?
>>
>> Yes, I think that's a good idea.
>
> (That was one year ago.)
>
> Tom, did you get any further here?  For example, the second part (a
> separate patch for release_select_lock) sounds useful independently of
> the rest.
>
> Thanks in advance.

Ping.  Any news here?




Added tag(s) moreinfo. Request was from Stefan Kangas <stefankangas <at> gmail.com> to control <at> debbugs.gnu.org. (Sun, 23 Feb 2025 01:37:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Sun, 23 Feb 2025 08:10:01 GMT) Full text and rfc822 format available.

Message #48 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Tom Gillespie <tgbugs <at> gmail.com>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: luangruo <at> yahoo.com, Eli Zaretskii <eliz <at> gnu.org>, 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when USE_GTK
 not defined
Date: Sun, 23 Feb 2025 00:09:01 -0800
> Ping.  Any news here?

I can confirm that the issue is still present in 30.0.93
and that the original patch still works against 30.0.93.

Unfortunately I have not had a chance to dig deeper
or to split out a smaller patch for release_select_lock.

Looking at the original patch again I assume the
portion that would be useful is the part dealing with
__atomic_sub_fetch and threads_holding_glib_lock?

I will have to refresh my memory about the expected
behavior for the release_select_lock changes because
I don't think I had a test case that was able to differentiate
that change independent of the various GTK related changes.

Best,
Tom




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#56487; Package emacs. (Tue, 25 Feb 2025 23:49:02 GMT) Full text and rfc822 format available.

Message #51 received at 56487 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Tom Gillespie <tgbugs <at> gmail.com>
Cc: luangruo <at> yahoo.com, Eli Zaretskii <eliz <at> gnu.org>, 56487 <at> debbugs.gnu.org
Subject: Re: bug#56487: xgselect race condition leading to abort when USE_GTK
 not defined
Date: Tue, 25 Feb 2025 23:47:41 +0000
Tom Gillespie <tgbugs <at> gmail.com> writes:

>> Ping.  Any news here?
>
> I can confirm that the issue is still present in 30.0.93
> and that the original patch still works against 30.0.93.
>
> Unfortunately I have not had a chance to dig deeper
> or to split out a smaller patch for release_select_lock.
>
> Looking at the original patch again I assume the
> portion that would be useful is the part dealing with
> __atomic_sub_fetch and threads_holding_glib_lock?

AFAIU, yes.  IOW, the changes in release_select_lock.

> I will have to refresh my memory about the expected
> behavior for the release_select_lock changes because
> I don't think I had a test case that was able to differentiate
> that change independent of the various GTK related changes.

OK, thanks.  Please let us know when you make any progress.




This bug report was last modified 115 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.