GNU bug report logs -
#22152
fat_mutex owner corruption (fmoc) inside fat_mutex_unlock (guile-v2.0.11)
Previous Next
Reported by: Iwan Aucamp <aucampia <at> gmail.com>
Date: Sat, 12 Dec 2015 19:04:03 UTC
Severity: normal
Done: Mark H Weaver <mhw <at> netris.org>
Bug is archived. No further changes may be made.
Full log
View this message in rfc822 format
[Message part 1 (text/plain, inline)]
Your bug report
#22152: fat_mutex owner corruption (fmoc) inside fat_mutex_unlock (guile-v2.0.11)
which was filed against the guile package, has been closed.
The explanation is attached below, along with your original report.
If you require more details, please reply to 22152 <at> debbugs.gnu.org.
--
22152: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=22152
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
Mark H Weaver <mhw <at> netris.org> writes:
> Iwan Aucamp <aucampia <at> gmail.com> writes:
>> We sporadically get "mutex not locked" and "mutex not locked by current thread"
>> exceptions on Solaris 10u10 with guile-2.0.11.
>
> Thanks very much for your detailed analysis and proposed fix.
>
> I've attached a patch that hopefully fixes this bug and also refactors
> the code to hopefully be somewhat more clear. Can you please test it on
> Solaris and verify that it works for your use cases?
I went ahead and pushed commit 1e86dc32a42af549fc9e4721ad48cdd7d296c042
to stable-2.0, which will soon become guile-2.0.12. I hope it fixes the
issue, although unfortunately my patch was never tested on Solaris. I'm
going to close this bug, but feel free to reopen it if there are still
issues.
Thanks,
Mark
[Message part 3 (message/rfc822, inline)]
[Message part 4 (text/plain, inline)]
Hi
We sporadically get "mutex not locked" and "mutex not locked by current thread"
exceptions on Solaris 10u10 with guile-2.0.11.
This problem can be reproduced with following scheme scripts:
guile-fmoc-test-mnl.scm (for "mutex not locked")
Two threads, one of them (reader-000) waits on a condition variable that
nothing will trigger and the other thread (writer-000) locks and unlocks
the mutex used with the condition variable. This code causes "mutex not
locked" exception with some consistency.
Output for this is in guile-fmoc-test-mnl-problem_output.txt (referenced
as mnl-problem_output.txt)
* owner id for reader-000 is 12593872
* owner id for writer-000 is 12595040
guile-fmoc-test-mnlbct.scm (for "mutex not locked by current thread")
Same as guile-fmoc-test-mnl.scm except here writer-000 signals condition
variable. This code causes "mutex not locked" and "mutex not locked by
current thread" errors.
Output for this, showing "mutex not locked by current thread", is in
guile-fmoc-test-mnlbct-problem_output.txt (referenced as mnlbct-
problem_output.txt)
* owner id for reader-000 is 14535648
* owner id for writer-000 is 14536400
To track down this issue we have added some debug printfs (see guile-2.0.11-
with-debug.patch). Given that this changes the line numbers I have referenced
original line numbers as o:file:line and line numbers with patch as d[:file]:
line. Also, these printfs have resulted in some irrelevant output (for internal
and verbose logging mutexes) which has been filtered out.
There is various scenarios leading to these errors that we have found but all
caused by same problem. The a detailed analysis for "mutex not locked by
current thread" scenario that can be seen in mnlbct-problem_output.txt is
included below and detailed analysis for other scenarios will be shared if
required.
Scenario from mnlbct-problem_output.txt:
1. [writer-000:14536400] unlocks fat_mutex[14512880] and queues reader-000:
14535648
at o:threads.c:1664 - d:1681
before mnlbct-problem_output.txt:4079
2. [reader-000:14535648] locks fat_mutex[14512880] with fat_mutex_lock
at o:threads.c:1394 - d:1401
before mnlbct-problem_output.txt:4080
3. [reader-000:14535648] enters wait-condition-variable and changes fat_mutex
[14512880].owner to writer-000:14536400
at o:threads.c:1616 - d:1631
before mnlbct-problem_output.txt:4083
4. [reader-000:14535648] goes into block_self and starts waiting on ptheead
condition variable inside fat_mutex_unlock > block_self. This unlocks
fat_mutex[14512880].mutex, allowing some other thread to lock fat_mutex
[14512880]
at o:threads.c:452 - d:456
before mnlbct-problem_output.txt:4083
5. [writer-000:14536400] locks fat_mutex[14512880] with fat_mutex_lock
at o:threads.c:1394 - d:1401
before mnlbct-problem_output.txt:4082
6. [reader-000:14535648] spurious wake-up occurs for condition variable which
causes block_self to return EINTR to fat_mutex_unlock
at o:threads.c:1621 - d:1636
before mnlbct-problem_output.txt:4084
7. [reader-000:14535648] loops and sets fat_mutex[14512880].owner=4 (i.e. not
locked) while writer-000:14536400 should still be owner of fat_mutex
[14512880]. Since it was spurious wake-up reader-000:14535648 continues to
wait for condition to be notified again.
at o:threads.c:1616 - d:1631
before mnlbct-problem_output.txt:4086
8. [writer-000:14536400] completes signal-condition-variable
before mnlbct-problem_output.txt:4088
9. [reader-000:14535648] now gets actual notification and block_self returns
0. This causes fat_mutex:14512880 to be locked again - which works cos
fat_mutex:14512880.owner is 4. This changes fat_mutex:14512880.owner from
4 to 14535648.
at o:threads.c:1643 - d:1660
before mnlbct-problem_output.txt:4086
10. [writer-000:14536400] tries to unlock the mutex, this fails though as
reader-000:14535648 now owns the mutex - resulting in "mutex not locked by
current thread" exception.
at o:threads.c:1599 - d:1614
before mnlbct-problem_output.txt:4089
Briefly, for mnl-problem_output.txt:
1. reader-000 locks fat_mutex and unlocks it again as it starts waiting for
condition to be notified.
2. writer-000 locks mutex
3. Spurious wake-up occurs for reader-000 which causes reader to change
fat_mutex.owner from writer-000 id to 4 and then resumes waiting on
condition variable
4. Writer tries to unlock fat_mutex but now owner is 4 and this results in
"mutex not locked" exception
The cause of these problems seems to be related to fat_mutex_unlock changing
fat_mutex.owner inside the while loop that is intended for checking condition
variable predicate which is problematic if spurious wake-ups from
pthread_cond_wait occur. Spurious wake-ups from pthread_cond_wait seems less
common on Linux, which is why we have only been observing the issue on Solaris.
It does however look like this problem will occur on any platform when a
spurious wake-up does occur.
As far as we can tell there is no reason for the fat_mutex.owner assignment to
happen inside the loop. It seems more appropriate that this happens only once
before the loop and not again. To this extent we moved owner reassignment out
of the loop and this seems to have resolved our issues. The patch for this is
in guile-2.0.11-with-fmoc_fix.patch.
We have ran the test suite with this on Linux and everything passes. There is
however other issues with test suite on Solaris that prevents it from
completing (both with and without the patch) which needs further investigation.
* All files related to this can be found at
https://gitlab.com/concurrent-systems/osp-issues-1512/tree/master/guile-fmoc
* Source with guile-2.0.11-with-debug.patch can be found at
https://gitlab.com/concurrent-systems/guile/tree/v2.0.11-with-debug
* Source with guile-2.0.11-with-fmoc_fix.patch can be found at
https://gitlab.com/concurrent-systems/guile/tree/v2.0.11-with-fmoc_fix
Regards
--
Iwan Aucamp
[guile-2.0.11-with-debug.patch (text/x-patch, attachment)]
[guile-2.0.11-with-fmoc_fix.patch (text/x-patch, attachment)]
[guile-fmoc-test-mnl.scm (text/x-scheme, attachment)]
[guile-fmoc-test-mnlbct.scm (text/x-scheme, attachment)]
[guile-fmoc-test-mnlbct-problem_output.txt.bz2 (application/x-bzip2, attachment)]
[guile-fmoc-test-mnl-problem_output.txt.bz2 (application/x-bzip2, attachment)]
This bug report was last modified 8 years and 336 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.