Package: emacs
Reported by: Stephen Berman <stephen.berman <at> gmx.net>
Date: Thu, 7 Mar 2024 13:44:02 UTC
Severity: normal
Found in version 30.0.50
Done: Stephen Berman <stephen.berman <at> gmx.net>
Bug is archived. No further changes may be made.
bug-gnu-emacs <at> gnu.org: bug#69611; Package emacs.
Message #5 received at submit <at> debbugs.gnu.org:
From: Stephen Berman <stephen.berman <at> gmx.net>
To: bug-gnu-emacs <at> gnu.org
Subject: 30.0.50; Long bidi line with control characters freezes Emacs
Date: Thu, 07 Mar 2024 14:42:37 +0100

This report is spun off from bug#69385 at the request of Eli Zaretskii,
because it concerns a problem that seems to be independent of that bug
report, though like it involves long lines of bidirectional text.

When I visited a certain elisp file generated by a program of mine and
typed `M-v', it took some time (see below for details) for the display
to scroll to 4% from the top (according to the mode line) and then there
was no further change and Emacs froze, using 100% of a CPU core.  I
found no way to unfreeze it within Emacs and after about 15 minutes
terminated the emacs process from the shell.  This is reliably
reproducible with this file.

The file in question is only about 50k bytes long, but it contains one
line of more than 37k characters, consisting of a mix of ASCII and
non-ASCII characters, including properly shaped Arabic script.  The file
itself has base paragraph direction LTR.

Most of the Arabic words in this file are enclosed in the bidirectional
control characters POP DIRECTIONAL FORMATTING (#x202c) and RIGHT-TO-LEFT
EMBEDDING (#x202b).  I did not add these characters, but I had
copy-&-pasted most of the Arabic from a PDF file I did not create.  I
don't know if PDFs of Arabic text normally contain these control
characters, but the consequences for Emacs were dramatic.  When I simply
visited this file in Emacs (started with -Q) there was an immediate
slowdown, and in top I could see Emacs using 100% of a CPU thread.  I
ran `M-: (benchmark-run nil (end-of-buffer))' on this file, and the
result was:

  (27.962602113 2 0.0226042269999999977)

This timing is from a build from master including the patch Eli posted
in bug#69385 (see
https://lists.gnu.org/archive/html/bug-gnu-emacs/2024-03/msg00101.html).
On a build without that patch, the benchmark timing is very much longer.

The display of the benchmark result only appeared in the echo area after
more than a minute (I timed it with a stopwatch).  At that point the
mode line showed the buffer at 4% from the top, and the display remained
frozen afterwards.  After several minutes during which Emacs consumed
100% CPU, and I had switched the focus away from the Emacs frame, the
CPU consumption stopped, but as soon as I switched focus back to that
frame, it went back to 100%.  The display never changed from showing the
buffer at 4%, apparently being in some kind of infinite loop.  After
about 15 minutes I started gdb, attached it to the Emacs process and
produced a backtrace, which I've attached, in the hope it helps to
diagnose the problem.

The problem seems clearly related to the bidirectional control
characters, because I made a copy of the file and removed all
occurrences of these control characters from it, and then ran the
end-of-buffer benchmark, getting this result (with Eli's patch):

  (0.716104165 4 0.04223660400000001)

And the display updated normally and CPU consumption was normal.

Nevertheless, there seems to be something else besides the control
characters involved in this issue, because as a further test, I created
a buffer consisting of more than 1000 copies of the test string
concatenating the Arabic example in etc/HELLO and "Hello" (see bug#69385
for more on such test buffers), and manually enclosed each Arabic word
in the above control characters, but the benchmark result in this buffer
was not significantly different from the result without the control
characters (and similar to the above result for the copy of the
problematic file without the control characters), and the display did
not freeze.

(I have emailed a copy of the problematic file to Eli, at his request.
I do not want to post it publicly, because it contains hundreds of text
snippets from a PDF of a copyrighted book.  Each snippet is certainly
within the bounds of fair use for distribution, but taken together they
probably are not.)

In GNU Emacs 30.0.50 (build 2, x86_64-pc-linux-gnu, GTK+ Version
 3.24.38, cairo version 1.18.0) of 2024-03-04 built on strobelfs2
Repository revision: b3eb49a4661e31306555e82bdf24db6c36d67ad2
Repository branch: master
Windowing system distributor 'The X.Org Foundation', version 11.0.12101009
System Description: Linux From Scratch r12.0-112

Configured using:
 'configure -C --with-xwidgets 'CFLAGS=-Og -g3'
 PKG_CONFIG_PATH=/opt/qt5/lib/pkgconfig'

Configured features:
ACL CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GPM GSETTINGS HARFBUZZ JPEG
JSON LCMS2 LIBSYSTEMD LIBXML2 MODULES NATIVE_COMP NOTIFY INOTIFY PDUMPER
PNG RSVG SECCOMP SOUND SQLITE3 THREADS TIFF TOOLKIT_SCROLL_BARS
TREE_SITTER WEBP X11 XDBE XIM XINPUT2 XPM XWIDGETS GTK3 ZLIB

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix
[Message part 2 (text/plain, attachment)]
Message #8 received at 69611 <at> debbugs.gnu.org:
From: Eli Zaretskii <eliz <at> gnu.org>
To: Stephen Berman <stephen.berman <at> gmx.net>
Cc: 69611 <at> debbugs.gnu.org
Subject: Re: bug#69611: 30.0.50; Long bidi line with control characters freezes Emacs
Date: Thu, 07 Mar 2024 17:42:44 +0200

> Date: Thu, 07 Mar 2024 14:42:37 +0100
> From: Stephen Berman via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
>
> When I visited a certain elisp file generated by a program of mine and
> type `M-v', it took some time (see below for details) for the display to
> scroll to 4% from the top (according to the mode line) and then there
> was no further change and Emacs froze, using 100% of a CPU core.  I
> found no way to unfreeze it within Emacs and after about 15 minutes
> terminated the emacs process from the shell.  This is reliably
> reproducible with this file.
>
> The file in question is only about 50k bytes long, but it contains one
> line of more than 37k characters, consisting of a mix of ASCII and
> non-ASCII characters, including properly shaped Arabic script.  The file
> itself has base paragraph direction LTR.
>
> Most of the Arabic words in this file are enclosed in the bidirectional
> control characters POP DIRECTIONAL FORMATTING (#x202c) and RIGHT-TO-LEFT
> EMBEDDING (#x202b).  I did not add these characters, but I had
> copy-&-pasted most of the Arabic from a PDF file I did not create.  I
> don't know if PDFs of Arabic text normally contain these control
> characters, but the consequences for Emacs were dramatic.  When I simply
> visited this file in Emacs (started with -Q) there was an immediate
> slowdown, and in top I could see Emacs using 100% of a CPU thread.  I
> ran `M-: (benchmark-run nil (end-of-buffer))' on this file, and the
> result was:
>
> (27.962602113 2 0.0226042269999999977)

This is a crazy file.  UBA, the Unicode Bidirectional Algorithm,
allows the RLE..PDF embeddings to nest.  The nesting is allowed to be
up to 125 deep(!), but I have never seen a text file using more than a
couple of nested embeddings.  This file goes up to 111 nested
embedding levels!  Moreover, quite a few embeddings are invalid: there
are 1021 RLE control characters in this file, but only 971 PDF
controls, so they don't pair as they should.  This causes the
reordering algorithm to examine extremely long stretches of characters
each time we need to redisplay even a small portion of the window,
because reordering must always find where each nested level ends to do
its job.

My suggestion is to remove all the RLE and PDF controls from the file.
They are not needed, not in Emacs anyway.  I'm guessing the program
which created this file uses bidi controls because it wants to be
compatible with incomplete implementations of the UBA, which don't
support implicit embedding levels (those caused by bidirectional
properties of characters, as opposed to explicit bidi controls like
RLE and PDF).  With full UBA implementations, the bidi controls are
needed only when the reordering using implicit levels produces wrong
results, which is quite rare.

> The display of the benchmark result only appeared in the echo area after
> more than a minute (I timed it with a stopwatch).  At that point the
> mode line showed the buffer at 4% from the top, and the display remained
> frozen afterwards.  After several minutes during which Emacs consumed
> 100% CPU, and I had switched the focus away from the Emacs frame, the
> CPU consumption stopped, but as soon as I switch focus back to that
> frame, it went back to 100%.  The display never changed from showing the
> buffer at 4%, apparently being in some kind of infinite loop.  After
> about 15 minutes I started gdb, attached the Emacs process and produced
> a backtrace, which I've attached, in the hope it helps to diagnose the
> problem.

The extremely deep nesting of embeddings in the file, coupled with the
fact that the first embedding starts near the beginning of the file,
but ends very near its end, causes the algorithm that finds where to
position the cursor to fail, because it cannot cope with the situation
where, after C-f or C-b, the position of point is very far outside of
the window.  I guess this causes some infloop (even though I don't see
it here, I just see that the cursor doesn't move although point does
move).  It could also be just a very long calculation, not an infloop,
because finding where to place the window-start point in this case is
also very expensive.

> Nevertheless, there seems to be something else besides the control
> characters involved in this issue, because as a further test, I created
> a buffer consisting of more than 1000 copies of the test string
> concatenating the Arabic example in etc/HELLO and "Hello" (see bug#69385
> for more on such test buffers), and manually enclosed each Arabic word
> in the above control characters, but the benchmark result in this buffer
> was not significantly different from the result without the control
> characters (and similar to the above result for the copy of the
> problematic file without the control characters), and the display did
> not freeze.

Yes, because you never tried such deeply-nested embeddings, and didn't
make your embeddings span as many characters as those in this file do.

This file is an interesting curiosity, as far as I'm concerned, but I
doubt whether I will find enough time and motivation to try to speed
up Emacs when such crazy files are visited.

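As a rough illustration of the figures quoted in the message above (RLE and PDF totals, nesting depth), the following Python sketch counts the two control characters and tracks how deeply RLE embeddings nest. It is not taken from the thread or from bidi.c: the name `embedding_stats` is hypothetical, only U+202B and U+202C are handled (no overrides or isolates), and the reset at each newline follows the statement later in this thread that embedding levels are reset at the newline. The level arithmetic (each RLE moves to the next odd level, capped at 125) is a simplified reading of the UBA's explicit-embedding rules.

```python
RLE = "\u202b"   # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"   # POP DIRECTIONAL FORMATTING
MAX_DEPTH = 125  # highest explicit embedding level the UBA allows


def embedding_stats(text):
    """Return (rle_count, pdf_count, max_open, max_level) for TEXT.

    max_open is the largest number of simultaneously open embeddings;
    max_level is the highest explicit embedding level reached.
    """
    rle_count = pdf_count = max_open = max_level = 0
    levels = []   # stack of explicit levels that are currently open
    overflow = 0  # RLEs ignored because they would exceed MAX_DEPTH
    for ch in text:
        if ch == "\n":
            # Assumption from the thread: levels are reset at the newline.
            levels.clear()
            overflow = 0
        elif ch == RLE:
            rle_count += 1
            current = levels[-1] if levels else 0  # base direction LTR
            # Least odd level greater than the current one.
            new_level = current + 1 if current % 2 == 0 else current + 2
            if new_level <= MAX_DEPTH and overflow == 0:
                levels.append(new_level)
                max_open = max(max_open, len(levels))
                max_level = max(max_level, new_level)
            else:
                overflow += 1
        elif ch == PDF:
            pdf_count += 1
            if overflow:
                overflow -= 1
            elif levels:
                levels.pop()
            # A PDF with nothing to close is simply ignored.
    return rle_count, pdf_count, max_open, max_level
```

Calling `embedding_stats` on the decoded contents of a file gives a quick check of whether RLE and PDF counts match and how deep the nesting goes. The problematic file was not posted, so this sketch has not been compared against the numbers reported above.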
Message #11 received at 69611 <at> debbugs.gnu.org:
From: Stephen Berman <stephen.berman <at> gmx.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 69611 <at> debbugs.gnu.org
Subject: Re: bug#69611: 30.0.50; Long bidi line with control characters freezes Emacs
Date: Thu, 07 Mar 2024 18:52:20 +0100

On Thu, 07 Mar 2024 17:42:44 +0200 Eli Zaretskii <eliz <at> gnu.org> wrote:

>> Date: Thu, 07 Mar 2024 14:42:37 +0100
>> From: Stephen Berman via "Bug reports for GNU Emacs,
>>  the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
>>
>> When I visited a certain elisp file generated by a program of mine and
>> type `M-v', it took some time (see below for details) for the display to
>> scroll to 4% from the top (according to the mode line) and then there
>> was no further change and Emacs froze, using 100% of a CPU core.  I
>> found no way to unfreeze it within Emacs and after about 15 minutes
>> terminated the emacs process from the shell.  This is reliably
>> reproducible with this file.
>>
>> The file in question is only about 50k bytes long, but it contains one
>> line of more than 37k characters, consisting of a mix of ASCII and
>> non-ASCII characters, including properly shaped Arabic script.  The file
>> itself has base paragraph direction LTR.
>>
>> Most of the Arabic words in this file are enclosed in the bidirectional
>> control characters POP DIRECTIONAL FORMATTING (#x202c) and RIGHT-TO-LEFT
>> EMBEDDING (#x202b).  I did not add these characters, but I had
>> copy-&-pasted most of the Arabic from a PDF file I did not create.  I
>> don't know if PDFs of Arabic text normally contain these control
>> characters, but the consequences for Emacs were dramatic.  When I simply
>> visited this file in Emacs (started with -Q) there was an immediate
>> slowdown, and in top I could see Emacs using 100% of a CPU thread.  I
>> ran `M-: (benchmark-run nil (end-of-buffer))' on this file, and the
>> result was:
>>
>> (27.962602113 2 0.0226042269999999977)
>
> This is a crazy file.  UBA, the Unicode Bidirectional Algorithm,
> allows the RLE..PDF embeddings to nest.  The nesting is allowed to be
> up to 125 deep(!), but I have never seen a text file using more than a
> couple of nested embeddings.  This file goes up to 111 nested
> embedding levels!  Moreover, quite a few embeddings are invalid: there
> are 1021 RLE control characters in this file, but only 971 PDF
> controls, so they don't pair as they should.  This causes the
> reordering algorithm to examine extremely long stretches of characters
> each time we need to redisplay even a small portion of the window,
> because reordering must always find where each nested level ends to do
> its job.
>
> My suggestion is to remove all the RLE and PDF controls from the file.
> They are not needed, not in Emacs anyway.  I'm guessing the program
> which created this file uses bidi controls because it wants to be
> compatible with incomplete implementations of the UBA, which don't
> support implicit embedding levels (those cause by bidirectional
> properties of characters, as opposed to explicit bidi controls like
> RLE and PDF).  With full UBA implementations, the bidi controls are
> needed only when the reordering using implicit levels produces wrong
> results, which is quite rare.

Indeed, I had already come to the conclusion that I don't need those
controls before I decided to raise the problem I encountered with them.
I've now checked a number of PDFs I have that contain Arabic script, and
in all of those from which I was able to yank Arabic script from the PDF
as Arabic script into Emacs (with some PDFs that wasn't possible), each
Arabic word was enclosed in the control characters.  So that appears to
be standard or at least common with PDF.  Being now aware of this, I can
take care to remove any control characters from yanked text in future.

In the case of the file I sent you, I may be to blame for the unbalanced
control characters: after yanking the Arabic into Emacs, I did some
editing of it and may well have unintentionally deleted some of the
control characters.  At the time I wasn't even aware of these; only
after (re)reading the section on bidirectional display in the Elisp
manual did I enable glyphless-display-mode and see the characters, but I
didn't bother to check if they paired up properly.

>> The display of the benchmark result only appeared in the echo area after
>> more than a minute (I timed it with a stopwatch).  At that point the
>> mode line showed the buffer at 4% from the top, and the display remained
>> frozen afterwards.  After several minutes during which Emacs consumed
>> 100% CPU, and I had switched the focus away from the Emacs frame, the
>> CPU consumption stopped, but as soon as I switch focus back to that
>> frame, it went back to 100%.  The display never changed from showing the
>> buffer at 4%, apparently being in some kind of infinite loop.  After
>> about 15 minutes I started gdb, attached the Emacs process and produced
>> a backtrace, which I've attached, in the hope it helps to diagnose the
>> problem.
>
> The extremely deep nesting of embeddings in the file, coupled with the
> fact that the first embedding starts near the beginning of the file,
> but ends very near its end, causes the algorithm that finds where to
> position the cursor to fail, because it cannot cope with the situation
> where, after C-f or C-b, the position of point is very far outside of
> the window.  I guess this causes some infloop (even though I don't see
> it here, I just see that the cursor doesn't move although point does
> move).  It could also be just a very long calculation, not an infloop,
> because finding where to place the window-start point in this case is
> also very expensive.

Ok.  But this is only an issue in conjunction with long lines, right?
Because there is no slowdown or display issue with the file from which
this elisp file was generated: that is the file into which I yanked the
Arabic script from the PDF and subsequently edited, so it contains
unpaired control characters, but only a few of its lines are longer than
80 characters, and I think none longer than 150 or so.

>> Nevertheless, there seems to be something else besides the control
>> characters involved in this issue, because as a further test, I created
>> a buffer consisting of more than 1000 copies of the test string
>> concatenating the Arabic example in etc/HELLO and "Hello" (see bug#69385
>> for more on such test buffers), and manually enclosed each Arabic word
>> in the above control characters, but the benchmark result in this buffer
>> was not significantly different from the result without the control
>> characters (and similar to the above result for the copy of the
>> problematic file without the control characters), and the display did
>> not freeze.
>
> Yes, because you never tried such deeply-nested embeddings, and didn't
> make your embedding levels include so many characters long as this
> file does.

Indeed, I simply wrapped each Arabic word in the paired control
characters, so there's no nesting at all.  Now the difference makes
sense.

> This file is an interesting curiosity, as far as I'm concerned, but I
> doubt whether I will find enough time and motivation to try to speed
> up Emacs when such crazy files are visited.

Given the special circumstances of this file's creation I think there's
no need to spend any more time on it, so unless you decide you do want
to, as far as I'm concerned this bug can be closed.  It might be
beneficial to others to document the issue briefly, either in the Elisp
manual under Bidirectional Display or just in etc/PROBLEMS, but maybe
this is such an unusual case that even that isn't worth the effort.

Thanks for looking into this and explaining it.

Steve Berman

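The cleanup described in this message, removing the embedding controls from text copied out of a PDF, can be sketched in Python as follows. This is an illustration only, not the procedure actually used in the thread; the function name `strip_embedding_controls` is hypothetical, and it removes just the two characters named in the report (U+202B and U+202C), leaving any other bidi controls untouched.

```python
# Map the two code points to None so that str.translate deletes them.
_EMBEDDING_CONTROLS = {
    0x202B: None,  # RIGHT-TO-LEFT EMBEDDING (RLE)
    0x202C: None,  # POP DIRECTIONAL FORMATTING (PDF)
}


def strip_embedding_controls(text):
    """Return TEXT with every RLE and PDF control character removed."""
    return text.translate(_EMBEDDING_CONTROLS)
```

For example, `strip_embedding_controls("\u202bمرحبا\u202c Hello")` leaves only the Arabic word, the space, and "Hello". Whether removing the controls is safe depends on the text: as noted earlier in the thread, they matter only where reordering by implicit levels would give the wrong result.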
Message #14 received at 69611 <at> debbugs.gnu.org:
From: Eli Zaretskii <eliz <at> gnu.org>
To: Stephen Berman <stephen.berman <at> gmx.net>
Cc: 69611 <at> debbugs.gnu.org
Subject: Re: bug#69611: 30.0.50; Long bidi line with control characters freezes Emacs
Date: Thu, 07 Mar 2024 21:19:31 +0200

> From: Stephen Berman <stephen.berman <at> gmx.net>
> Cc: 69611 <at> debbugs.gnu.org
> Date: Thu, 07 Mar 2024 18:52:20 +0100
>
> In the case of the file I sent you, I may be to blame for the unbalanced
> control characters: after yanking the Arabic into Emacs, I did some
> editing of it and may well have unintentionally deleted some of the
> control characters.  At the time I wasn't even aware of these; only
> after (re)reading the section on bidirectional display in the Elisp
> manual did I enable glyphless-display-mode and saw the characters, but I
> didn't bother to check if they paired up properly.

In that case, it could be that the original file wouldn't be so
expensive, if each short Arabic string was included in a right-to-left
embedding, i.e. RLE before it and PDF after it.  Then we wouldn't have
nested embeddings, and each embedding would be quite short.  This
should produce quite "normal" display speed, not different from when
displaying bidirectional text without the control characters at all.
IOW, it could be that by deleting some of the controls you created the
nested embeddings that were not there in the first place.

> > The extremely deep nesting of embeddings in the file, coupled with the
> > fact that the first embedding starts near the beginning of the file,
> > but ends very near its end, causes the algorithm that finds where to
> > position the cursor to fail, because it cannot cope with the situation
> > where, after C-f or C-b, the position of point is very far outside of
> > the window.  I guess this causes some infloop (even though I don't see
> > it here, I just see that the cursor doesn't move although point does
> > move).  It could also be just a very long calculation, not an infloop,
> > because finding where to place the window-start point in this case is
> > also very expensive.
>
> Ok.  But this is only an issue in conjunction with long lines, right?

Yes, because all the embedding levels are reset at the newline.  So
having a newline not far away guarantees that the reordering code
doesn't need to look too far for where the embedding ends.

> > This file is an interesting curiosity, as far as I'm concerned, but I
> > doubt whether I will find enough time and motivation to try to speed
> > up Emacs when such crazy files are visited.
>
> Given the special circumstances of this file's creation I think there's
> no need to spend any more time it, so unless you decide you do want to,
> as far as I'm concerned this bug can be closed.  It might be beneficial
> to others to document the issue briefly, either in the Elisp manual
> under Bidirectional Display or just in etc/PROBLEMS, but maybe this is
> such an unusual case that even that isn't worth the effort.

I doubt it's worth describing, since even explaining what happens
needs a long treatise on what are bidirectional embedding levels, how
they affect reordering, and how Emacs implements that reordering in a
way that fits into the general structure of the Emacs display code
(which examines characters in their visual order).  We'd basically
need to copy into PROBLEMS or the manual large portions of the
commentary at the beginning of bidi.c, which includes a lot of
technical details.

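The two points made in this message, that deleting a PDF turns short sibling embeddings into nested ones and that a nearby newline bounds how far an embedding can extend, can be seen in a small Python sketch. It is a toy model under stated assumptions, not how bidi.c works: the name `embedding_spans` is hypothetical, only U+202B and U+202C are considered, the UBA depth limit is ignored, and every open embedding is taken to end at the next newline.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embedding_spans(text):
    """Return sorted (start, length, depth) tuples, one per RLE in TEXT.

    length is the distance from the RLE to the character that ends the
    embedding: its matching PDF, the next newline, or the end of TEXT.
    depth is 1 for an outermost embedding, 2 for one nested inside it,
    and so on.
    """
    spans = []
    open_starts = []  # positions of RLEs that are still open

    def close_all(end):
        # Assumption: a newline (or end of text) ends every open embedding.
        while open_starts:
            start = open_starts.pop()
            spans.append((start, end - start, len(open_starts) + 1))

    for pos, ch in enumerate(text):
        if ch == "\n":
            close_all(pos)
        elif ch == RLE:
            open_starts.append(pos)
        elif ch == PDF and open_starts:
            start = open_starts.pop()
            spans.append((start, pos - start, len(open_starts) + 1))
    close_all(len(text))
    return sorted(spans)


# Three short, properly paired embeddings: no nesting.
balanced = (RLE + "abc" + PDF + " x ") * 3
# The same text after the first PDF has been deleted.
damaged = balanced.replace(PDF, "", 1)
```

With these inputs, `embedding_spans(balanced)` reports three spans of length 4 at depth 1, while `embedding_spans(damaged)` reports a first span that runs to the end of the text and two later spans at depth 2. Inserting a newline into `damaged` cuts the long span short, which is the behavior the message attributes to short lines.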
Message #19 received at 69611-done <at> debbugs.gnu.org:
From: Stephen Berman <stephen.berman <at> gmx.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 69611-done <at> debbugs.gnu.org
Subject: Re: bug#69611: 30.0.50; Long bidi line with control characters freezes Emacs
Date: Thu, 07 Mar 2024 22:21:54 +0100

On Thu, 07 Mar 2024 21:19:31 +0200 Eli Zaretskii <eliz <at> gnu.org> wrote:

>> From: Stephen Berman <stephen.berman <at> gmx.net>
>> Cc: 69611 <at> debbugs.gnu.org
>> Date: Thu, 07 Mar 2024 18:52:20 +0100
>>
>> In the case of the file I sent you, I may be to blame for the unbalanced
>> control characters: after yanking the Arabic into Emacs, I did some
>> editing of it and may well have unintentionally deleted some of the
>> control characters.  At the time I wasn't even aware of these; only
>> after (re)reading the section on bidirectional display in the Elisp
>> manual did I enable glyphless-display-mode and saw the characters, but I
>> didn't bother to check if they paired up properly.
>
> In that case, it could be that the original file wouldn't be so
> expensive, if each short Arabic string was included in a right-to-left
> embedding, i.e. RLE before it and PDF after it.  Then we wouldn't have
> nested embeddings, and each embedding would be quite short.  This
> should produce quite "normal" display speed, not different from when
> displaying bidirectional text without the control characters at all.
> IOW, it could be that by deleting some of the controls you created the
> nested embeddings that were not there in the first place.

It would be interesting to test this; perhaps I can without too much
effort add the missing controls to make them balanced.  I'll take a
look.

>> > The extremely deep nesting of embeddings in the file, coupled with the
>> > fact that the first embedding starts near the beginning of the file,
>> > but ends very near its end, causes the algorithm that finds where to
>> > position the cursor to fail, because it cannot cope with the situation
>> > where, after C-f or C-b, the position of point is very far outside of
>> > the window.  I guess this causes some infloop (even though I don't see
>> > it here, I just see that the cursor doesn't move although point does
>> > move).  It could also be just a very long calculation, not an infloop,
>> > because finding where to place the window-start point in this case is
>> > also very expensive.
>>
>> Ok.  But this is only an issue in conjunction with long lines, right?
>
> Yes, because all the embedding levels are reset at the newline.  So
> having a newline not far away guarantees that the reordering code
> doesn't need to look too far for where the embedding ends.

Ah, ok, then that explains it.

>> > This file is an interesting curiosity, as far as I'm concerned, but I
>> > doubt whether I will find enough time and motivation to try to speed
>> > up Emacs when such crazy files are visited.
>>
>> Given the special circumstances of this file's creation I think there's
>> no need to spend any more time it, so unless you decide you do want to,
>> as far as I'm concerned this bug can be closed.  It might be beneficial
>> to others to document the issue briefly, either in the Elisp manual
>> under Bidirectional Display or just in etc/PROBLEMS, but maybe this is
>> such an unusual case that even that isn't worth the effort.
>
> I doubt it's worth describing, since even explaining what happens
> needs a long treatise on what are bidirectional embedding levels, how
> they affect reordering, and how Emacs implements that reordering in a
> way that fits into the general structure of the Emacs display code
> (which examines characters in their visual order).  We'd basically
> need to copy into PROBLEMS or the manual large portions of the
> commentary at the beginning of bidi.c, which includes a lot of
> technical details.

Yeah, that would be going overboard.  I'm more than satisfied with the
explanations you've provided, so I'm going ahead and closing this bug.
Thanks again.

Steve Berman