On Sunday, 6 October 2019, 19:53:49 CEST, Eli Zaretskii wrote:
> > Date: Sun, 6 Oct 2019 14:31:12 +0200
> > From: Anton Ertl
> > Cc: bernd@net2o.de, 37633@debbugs.gnu.org,
> >     anton@mips.complang.tuwien.ac.at
> >
> > On Sat, Oct 05, 2019 at 07:16:53PM +0300, Eli Zaretskii wrote:
> > > For byte offsets in external text we have bufferpos-to-filepos, but
> > > that requires us to know the encoding of the external text.  We need
> > > to find a reasonable way of getting that.  Suggestions and patches
> > > welcome.
> >
> > It's the encoding that you assumed for the text when you loaded the
> > file into the buffer.
>
> I'm not sure this is correct.  You are saying that the compiler counts
> bytes in the original file, not in its output (which might be encoded
> differently).  Do we have conclusive evidence that this is always
> true?

Almost always.  gcc has a gazillion options that almost nobody uses.
E.g., you can use -finput-charset= to transcode input files on reading.
It is not a well-tested option, as the output (still iso8859-1) shows:

% gcc -finput-charset=iso8859-1 test-iso.c
test-iso.c: In function ‘foo’:
test-iso.c:2:2: warning: implicit declaration of function ‘printf’ [-Wimplicit-function-declaration]
    2 |  printf("test %i", b);
      |  ^~~~~~
test-iso.c:2:2: warning: incompatible implicit declaration of built-in function ‘printf’
test-iso.c:1:1: note: include ‘<stdio.h>’ or provide a declaration of ‘printf’
  +++ |+#include <stdio.h>
    1 | void foo() {
test-iso.c:2:20: error: ‘b’ undeclared (first use in this function)
    2 |  printf("test %i", b);
      |                    ^
test-iso.c:2:20: note: each undeclared identifier is reported only once for each function it appears in
test-iso.c:3:26: error: ‘c’ undeclared (first use in this function)
    3 |  printf("test��� %i", c);
      |                          ^

Here, due to the conversion on reading, the reported position is
different (it was 3:23 before).

This transparent conversion on reading is rarely used.  Or rather: I
could not find a single search result for it on all of GitHub.

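The shift from 3:23 to 3:26 can be reproduced with a small Python
sketch.  It assumes that gcc counts columns in bytes and that line 3 of
test-iso.c has one leading whitespace character and contains "äöü"
(both reconstructed from the columns in the output; the actual file is
not shown here).  The helper name byte_column is hypothetical.

```python
# Assumed reconstruction of line 3 of test-iso.c; \u00e4\u00f6\u00fc is "äöü".
LINE3 = ' printf("test\u00e4\u00f6\u00fc %i", c);'


def byte_column(line: str, needle: str, encoding: str) -> int:
    """Return the 1-based byte column of the last occurrence of
    `needle` in `line` when the line is stored in `encoding`."""
    prefix = line[: line.rindex(needle)]
    return len(prefix.encode(encoding)) + 1


# Column of `c` as stored on disk (iso8859-1: one byte per umlaut).
print(byte_column(LINE3, "c", "iso8859-1"))
# Column of `c` after conversion to UTF-8 (two bytes per umlaut).
print(byte_column(LINE3, "c", "utf-8"))
```

The three umlauts each grow from one byte to two, which accounts for
the difference of three columns.
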
> > the byte position does not depend on the encoding (unlike the
> > character position).
>
> ???  The same Latin-1 characters encoded in ISO-8859-1 and in UTF-8
> will yield a different number of bytes.  So I don't think I understand
> how can you say the above.

What I'm trying to say: the compiler (unless instructed to convert the
file on reading) reports the byte position it found in the file.  That
is the same byte position the editor calculates for that file,
regardless of which encoding the editor assumed.  I.e., if the editor
mistook a UTF-8 file for an iso8859-1 one, it will see the UTF-8 string
"äöü" (6 bytes in UTF-8) as "Ã¤Ã¶Ã¼" (6 bytes in iso8859-1).  But it's
still 6 bytes.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/
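
The 6-byte argument can be checked with a short Python sketch.  Python
is used only for illustration, and the function name is hypothetical;
this says nothing about how Emacs computes positions internally.  It
relies on iso8859-1 mapping every byte to exactly one character, so the
misread text always re-encodes to the original bytes.

```python
RAW = "\u00e4\u00f6\u00fc".encode("utf-8")  # "äöü" as stored on disk in UTF-8


def byte_length_as_seen(raw: bytes, assumed: str) -> int:
    """Decode `raw` with the encoding the editor assumed, then measure
    the text again in bytes under that same assumed encoding."""
    text = raw.decode(assumed)
    return len(text.encode(assumed))


# Correct guess: 3 characters, 6 bytes.
print(len(RAW.decode("utf-8")), byte_length_as_seen(RAW, "utf-8"))
# Wrong guess: 6 characters (mojibake), but still 6 bytes.
print(len(RAW.decode("iso8859-1")), byte_length_as_seen(RAW, "iso8859-1"))
```

The character count changes with the assumed encoding, while the byte
count stays the same as long as the editor measures with the encoding
it used to read the file.
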