Package: emacs
Reported by: Tom Tromey <tom <at> tromey.com>
Date: Sun, 5 Mar 2017 21:49:01 UTC
Severity: wishlist
Merged with 29004
Found in versions 25.2, 27.0.50
Message #127 received at 25987 <at> debbugs.gnu.org:
From: David Malcolm <dmalcolm <at> redhat.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 25987 <at> debbugs.gnu.org
Subject: Re: bug#25987: 25.2; support gcc fixit notes
Date: Sat, 14 Nov 2020 14:46:29 -0500
On Sat, 2020-11-14 at 16:21 +0200, Eli Zaretskii wrote:
> > From: David Malcolm <dmalcolm <at> redhat.com>
> > Cc: 25987 <at> debbugs.gnu.org
> > Date: Fri, 13 Nov 2020 11:47:18 -0500
> >
> > The names are identifiers from the user's program (names of
> > variables, types, macros, etc), where an error has been issued,
> > typically due to a misspelling of an identifier.  For example,
> > somewhere there's a declaration of a constant named "two_π", and
> > later the code erroneously references it as "two_pi"; we want to
> > emit a diagnostic saying:
> >
> >   did you mean "two_π"?
> >
> > and provide a machine-readable fix-it hint suggesting the
> > replacement of the pertinent source range with "two_π".
> >
> > GCC converts the source code from any encoding specified by
> > -finput-charset= to use UTF-8 internally...
> >
> >   https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html
>
> And then GCC outputs these identifiers in UTF-8?  Or does it convert
> back to the original input-charset?

It emits them as UTF-8 when emitting diagnostics.

> > ...however there's a bug in GCC in how we print the source code
> > itself, where we blithely emit the undecoded bytes directly to
> > stderr when quoting the lines of source.  This GCC bug is
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067 (aka PR
> > other/93067).  We ought to encode the source code into UTF-8 when
> > printing it (which may be a no-op for the common case).
>
> I'm not sure you are right here: I think it is better for GCC to use
> the original bytestream, because the user's locale might not support
> UTF-8 well; it is better to show the source to the user in the
> encoding in which it was written.

This seems to me to lead to a bigger question: what should the encoding
of GCC's stderr be?  Right now I believe we emit a mix of UTF-8 and
other encodings, as noted in my earlier post.

> However, I'm not familiar with GCC internals, so it is not clear to
> me whether the bug report will indeed affect the way source fragments
> will be output: the bug report only talks about converting the input,
> and I don't know enough to understand how will that affect output.
>
> > The annotation lines we print under the source lines for fix-it
> > hints and labels are already printed in UTF-8, however.
>
> The annotations are in US English, though, right?  If not, when will
> they include non-ASCII characters?

Annotation lines can contain labels as of GCC 9, and these can contain
identifiers; for example in this C++ type mismatch error, where the
types of the pertinent expressions are labeled:

$ g++ t.cc
t.cc: In function 'int test(const shape&, const shape&)':
t.cc:15:4: error: no match for 'operator+' (operand types are 'boxed_value<double>' and 'boxed_value<double>')
   14 |   return (width(s1) * height(s1)
      |           ~~~~~~~~~~~~~~~~~~~~~~
      |                      |
      |                      boxed_value<[...]>
   15 |    + width(s2) * height(s2));
      |    ^ ~~~~~~~~~~~~~~~~~~~~~~
      |                 |
      |                 boxed_value<[...]>

where "boxed_value" is an identifier and in theory could have non-ASCII
characters in it.
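As a concrete sketch of the misspelling scenario described at the top
of this message, consider a file like the following.  The file name
two_pi.c, the line and column numbers, and the exact diagnostic
behaviour below are assumptions for illustration only, and a GCC
version that accepts UTF-8 directly in identifiers is assumed:

  /* two_pi.c: illustrative only - a constant named "two_π" is
     declared, but the code below misspells it as "two_pi".  */
  const double two_π = 6.283185307179586;

  double circumference (double r)
  {
    return two_pi * r;    /* typo for "two_π" */
  }

Alongside the human-readable "did you mean" diagnostic, the
-fdiagnostics-parseable-fixits option emits the suggested replacement
on stderr as a "fix-it:" line of roughly this shape, with the
replacement text UTF-8 encoded (the column range shown is illustrative;
see the GCC documentation for the exact range convention):

  fix-it:"two_pi.c":{7:10-7:16}:"two_π"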
> > That said, the above bug is orthogonal to the fix-it hint issue,
> > which prints the names in a different way (using UTF-8 encoded
> > strings in GCC's symbol table, rather than scraping them from the
> > filesystem, which is how the buggy source-quoting routines work).
> [...]
> > As far as I can tell GCC handles filenames as raw bytes, and
> > doesn't make any attempt to decode them, and emits them as bytes
> > again in diagnostic messages.
>
> This is okay, but since the other parts are in UTF-8, this will
> complicate things, as I mentioned in my previous message.
>
> > > > I tried creating a file with the name "byte 0xff" .txt, and
> > > > with valid UTF-8 non-ASCII names, and Emacs reported them as
> > > > \377.txt and with the UTF-8 names respectively, so perhaps I
> > > > should simply emit the bytes and pretend they are UTF-8?
> > >
> > > What do you mean by "pretend" in this context?
> >
> > By "pretend" I mean simply re-emitting the bytes of the filename to
> > stderr and ignoring encoding issues in them, despite the fact that
> > the rest of the stream is supposed to be UTF-8-encoded.
>
> As explained, it will be easier for Emacs to process GCC output if
> its encoding is consistent.

Indeed.  I'll raise this issue on the GCC mailing list.

> > Currently the parseable-fixits option uses IS_PRINT on each "char"
> > (i.e. byte) so that any non-printable bytes get octal-escaped.  Is
> > that acceptable for filenames?  The other approach, to "pretend
> > they're UTF-8", would mean to not escape such bytes, so that if
> > they are UTF-8 they are faithfully re-emitted.
> >
> > I think I like the approach where the filename part of the fixit
> > line is octal-escaped, and the replacement text is UTF-8, but I
> > don't know what's going to be best for you.
>
> Given your description, it sounds like it will not be simple whatever
> you do.
>
> I guess we should first try getting the plain-ASCII case to work, as
> that is the most frequent use case anyway.

I added some test cases and posted the patch to the gcc-patches
mailing list here:

  "[PATCH/RFC] Add GCC_EXTRA_DIAGNOSTIC_OUTPUT environment variable
   for fix-it hints"
  https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559105.html

Thanks
Dave
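The octal-escaping rule discussed above (IS_PRINT applied to each byte
of the filename) can be sketched in a few lines of C.  This is only an
illustration of the rule under discussion, with a made-up function
name, not GCC's implementation:

  #include <ctype.h>
  #include <stdio.h>

  /* Sketch: emit printable bytes of a filename as-is and everything
     else as a three-digit octal escape, so the byte 0xff comes out as
     "\377".  (GCC's IS_PRINT is locale-independent; plain isprint is
     used here only to keep the sketch short, and may behave
     differently for bytes above 0x7f in non-C locales.)  */
  static void
  print_escaped_filename (const unsigned char *name)
  {
    for (; *name; name++)
      {
        if (isprint (*name))
          putchar (*name);
        else
          printf ("\\%03o", *name);
      }
  }

Note that under this rule the bytes of a valid UTF-8 filename would be
escaped as well, since they fall outside the printable ASCII range;
that is exactly the trade-off against the "pretend they're UTF-8"
approach weighed above, and the escaped form matches the \377.txt
rendering Emacs showed for the byte-0xff file name in the experiment
quoted earlier.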