Package: emacs
Reported by: Tom Tromey <tom <at> tromey.com>
Date: Sun, 5 Mar 2017 21:49:01 UTC
Severity: wishlist
Merged with 29004
Found in versions 25.2, 27.0.50
From: David Malcolm <dmalcolm <at> redhat.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 25987 <at> debbugs.gnu.org
Subject: bug#25987: 25.2; support gcc fixit notes
Date: Fri, 13 Nov 2020 11:47:18 -0500
On Thu, 2020-11-12 at 15:54 +0200, Eli Zaretskii wrote:
> > From: David Malcolm <dmalcolm <at> redhat.com>
> > Cc: 25987 <at> debbugs.gnu.org
> > Date: Wed, 11 Nov 2020 14:36:49 -0500
> >
> > On Tue, 2020-10-20 at 18:54 +0300, Eli Zaretskii wrote:
> > > > From: David Malcolm <dmalcolm <at> redhat.com>
> > > > Cc: 25987 <at> debbugs.gnu.org
> > > > Date: Tue, 20 Oct 2020 10:52:05 -0400
> > > >
> > > > One possible issue: in the final diagnostic, there's a fix-it
> > > > hint with non-ASCII replacement text, replacing "two_pi" with
> > > > "two_π" (where the final char in the latter is GREEK SMALL
> > > > LETTER PI, U+03C0)
> > > >
> > > > This replacement is currently expressed as encoded bytes i.e:
> > > >
> > > > fix-it:"demo.c":{51:10-51:16}:"two_\317\200"
> > > >
> > > > where \317\200 is the octal-escaped representation of the two
> > > > bytes of the UTF-8 encoding of the character.
> > > >
> > > > Is this going to work for Emacs?
> > >
> > > You mean, GCC doesn't actually emit the UTF-8 encoding of π, it
> > > emits its ASCII-fied representation? We'd need to decode that,
> > > but is that really justified? Why not emit UTF-8?
> >
> > I have an implementation that simply emits UTF-8 in quotes,
> > escaping backslash, tab, newline, and doublequotes as before. (we
> > have to escape at least newline, given that fix-it hint replacement
> > text can contain them, and we're using newline to terminate the
> > parseable hint).
>
> Sorry, I've lost the context: where did those non-ASCII names come
> from? are they names of variables in the user's program?

The names are identifiers from the user's program (names of variables,
types, macros, etc), where an error has been issued, typically due to a
misspelling of an identifier.
For example, somewhere there's a declaration of a constant named
"two_π", and later the code erroneously references it as "two_pi"; we
want to emit a diagnostic saying:
  did you mean "two_π"?
and provide a machine-readable fix-it hint suggesting the replacement
of the pertinent source range with "two_π".

GCC converts the source code from any encoding specified by
-finput-charset= to use UTF-8 internally...
  https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html

> If so, in what encoding does GCC quote portions of the source code in
> its warning/error messages? Does it use the exact byte stream it
> found in the source, or does it perform any conversions of the
> encoding?

...however there's a bug in GCC in how we print the source code itself,
where we blithely emit the undecoded bytes directly to stderr when
quoting the lines of source. This GCC bug is
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067
(aka PR other/93067). We ought to encode the source code into UTF-8
when printing it (which may be a no-op for the common case). The
annotation lines we print under the source lines for fix-it hints and
labels are already printed in UTF-8, however.

That said, the above bug is orthogonal to the fix-it hint issue, which
prints the names in a different way (using UTF-8 encoded strings in
GCC's symbol table, rather than scraping them from the filesystem,
which is how the buggy source-quoting routines work).

> > However, the filename also needs to be escaped. Currently I'm
> > applying the same escaping rules to both filename and replacement
> > text.
> > What is the encoding of the filename? What if the bytes in a
> > filename aren't UTF-8 encoded? How does emacs handle this case?
>
> Emacs has a separate variable for the encoding of file names, which
> gets set from the locale settings.
> But this is not necessarily relevant to the issue at hand, because we
> are talking about processing output from a sub-process (GCC) which
> includes both file names and other stuff, such as fragments of the
> source code. When Emacs processes sub-process output, it generally
> assumes all of it is encoded in the same encoding. So if, for
> example, you encode non-ASCII variables in UTF-8 while the file names
> are emitted in some other encoding (perhaps because the locale's
> codeset is not UTF-8), then there will be complications: we will have
> to read the output from GCC in its raw form, and then decode "by hand"
> (in Lisp) each part of it as appropriate (which means we will need to
> be able to identify each such part).
>
> So it's important to understand the situation and its limitations for
> proposing the best solution.

As far as I can tell GCC handles filenames as raw bytes, and doesn't
make any attempt to decode them, and emits them as bytes again in
diagnostic messages.

> > I tried creating a file with the name "byte 0xff" .txt, and with
> > valid UTF-8 non-ascii names and emacs reported them as \377.txt and
> > with the UTF-8 names respectively, so perhaps I should simply emit
> > the bytes and pretend they are UTF-8?
>
> What do you mean by "pretend" in this context?

By "pretend" I mean simply re-emitting the bytes of the filename to
stderr and ignoring encoding issues in them, despite the fact that the
rest of the stream is supposed to be UTF-8-encoded.

Currently the parseable-fixits option uses IS_PRINT on each "char"
(i.e. byte) so that any non-printable bytes get octal-escaped. Is that
acceptable for filenames? The other approach, to "pretend they're
UTF-8", would mean to not escape such bytes, so that if they are UTF-8
they are faithfully re-emitted.

I think I like the approach where the filename part of the fixit line
is octal-escaped, and the replacement text is UTF-8, but I don't know
what's going to be best for you.
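
To make that last option concrete, here is a rough Python sketch of both ends of such a line: a producer that octal-escapes the filename bytes and leaves the replacement text as UTF-8, and a consumer that undoes both. It is only an illustration, not GCC's or Emacs's actual code. The function names are hypothetical, the line layout is copied from the fix-it example quoted earlier in this thread, and the filename escaper is simplified (it octal-escapes tab and newline too, which may not match what GCC does).

```python
import re

# Hypothetical helpers. Only the line shape
#   fix-it:"FILE":{L1:C1-L2:C2}:"TEXT"
# is taken from the example quoted earlier in this thread.

def escape_octal(raw: bytes) -> bytes:
    """Filename policy (assumed): octal-escape bytes outside printable ASCII."""
    out = bytearray()
    for b in raw:
        if b in (0x22, 0x5C):          # double quote and backslash
            out += b"\\" + bytes([b])
        elif 0x20 <= b < 0x7F:         # printable ASCII passes through
            out.append(b)
        else:                          # everything else becomes \ooo
            out += b"\\%03o" % b
    return bytes(out)

def escape_utf8(text: str) -> bytes:
    """Replacement-text policy: escape only backslash, quote, tab, newline."""
    table = {"\\": "\\\\", '"': '\\"', "\t": "\\t", "\n": "\\n"}
    return "".join(table.get(ch, ch) for ch in text).encode("utf-8")

def format_fixit(filename: bytes, start, end, text: str) -> bytes:
    span = b"{%d:%d-%d:%d}" % (*start, *end)
    return (b'fix-it:"' + escape_octal(filename) + b'":' + span
            + b':"' + escape_utf8(text) + b'"')

FIXIT_RE = re.compile(
    rb'^fix-it:"((?:[^"\\]|\\.)*)":'
    rb'\{(\d+):(\d+)-(\d+):(\d+)\}:'
    rb'"((?:[^"\\]|\\.)*)"$'
)
ESCAPE_RE = re.compile(rb"\\([0-7]{1,3}|.)", re.DOTALL)
SIMPLE = {b"n": b"\n", b"t": b"\t"}

def unescape(raw: bytes) -> bytes:
    """Undo \\ooo, \\n, \\t; any other escaped byte stands for itself."""
    def repl(m):
        body = m.group(1)
        if body[:1] in b"01234567":
            return bytes([int(body, 8) & 0xFF])
        return SIMPLE.get(body, body)
    return ESCAPE_RE.sub(repl, raw)

def parse_fixit(line: bytes):
    """Return a dict for a fix-it line, or None if the line is something else."""
    m = FIXIT_RE.match(line.rstrip(b"\r\n"))
    if m is None:
        return None
    name, l1, c1, l2, c2, text = m.groups()
    return {
        # Filename bytes may not be UTF-8, so keep undecodable bytes as
        # surrogate escapes; they can be turned back into the raw bytes.
        "file": unescape(name).decode("utf-8", errors="surrogateescape"),
        "start": (int(l1), int(c1)),
        "end": (int(l2), int(c2)),
        "text": unescape(text).decode("utf-8"),
    }
```

Because the parser works on raw bytes and decodes each field separately, it accepts the replacement text either as literal UTF-8 or in the older octal-escaped form, and a filename such as \377.txt survives the round trip unchanged. A consumer that decodes the whole stream with one encoding up front would not have that property.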

Hope the above clarifies things.
Dave
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.