GNU bug report logs - #46933
Possible bugs in filepos-to-bufferpos / bufferpos-to-filepos
Message #8 received at 46933 <at> debbugs.gnu.org:
> Date: Thu, 04 Mar 2021 21:21:24 +0000
> From: Gregory Heytings <gregory <at> heytings.org>
>
> (Disclaimer: I have no knowledge whatsoever about the ISO-2022-JP
> encoding, and although this looks like a bug, I'm not sure this is
> actually a bug; I report this at the suggestion of Eli in bug#46859.)
>
> I downloaded the file [1], and converted it to the ISO-2022-JP encoding
> with iconv -t iso-2022-jp one.txt > iso-2022-jp.txt. The resulting file
> is attached to this bug report. It ends with two CRLFs, at byte offsets
> 2993 and 2995. However, after emacs -Q iso-2022-jp.txt, with M-:
> (goto-char (filepos-to-bufferpos POS 'exact)) we get:
>
> POS = 2991, 2992: last but one visible character (HIRAGANA LETTER RU)
> POS = 2993, 2994: last visible character (IDEOGRAPHIC FULL STOP)
> POS = 2995, 2996: first CRLF
> POS = 2997: second CRLF
> POS = 2998: point-max
> POS = 2999: first CRLF
> POS = 3000, 3001: second CRLF
> POS >= 3002: point-max
>
> I would have expected:
>
> POS = 2989, 2990: last but one visible character (HIRAGANA LETTER RU)
> POS = 2991, 2992: last visible character (IDEOGRAPHIC FULL STOP)
> POS = 2993, 2994: first CRLF
> POS = 2995, 2996: second CRLF
> POS >= 2997: point-max
>
> The opposite operation M-: (bufferpos-to-filepos (- (point) POS) 'exact)
> apparently also has bugs; its return values are not coherent with the
> above ones:
>
> POS = 0: 3003
> POS = 1: 3001
> POS = 2: 2999
> POS = 3 (IDEOGRAPHIC FULL STOP): 2997
> POS = 4 (HIRAGANA LETTER RU): 2995
>
> I would have expected:
>
> POS = 0: 2997
> POS = 1: 2995
> POS = 2: 2993
> POS = 3 (IDEOGRAPHIC FULL STOP): 2991
> POS = 4 (HIRAGANA LETTER RU): 2989
>
> [1] https://darza.com/ecbackend/vendor/symfony/mime/Tests/Fixtures/samples/charsets/iso-2022-jp/one.txt
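The individual probes in the quoted report can be collected in one step. The following is only a minimal sketch: my-filepos-table is a hypothetical helper name, not something that exists in Emacs, and the range in the usage comment is simply the span of offsets mentioned in the report.

```elisp
;; Hypothetical helper (not part of Emacs): tabulate how file byte
;; offsets map to buffer positions in the current buffer.
(defun my-filepos-table (from to &optional coding-system)
  "Return an alist of (FILEPOS . BUFPOS) for file offsets FROM to TO.
Both bounds are inclusive.  CODING-SYSTEM is passed through to
`filepos-to-bufferpos'; when nil, that function falls back to the
buffer's own `buffer-file-coding-system'."
  (mapcar (lambda (pos)
            ;; 'exact asks for an exact answer rather than an estimate.
            (cons pos (filepos-to-bufferpos pos 'exact coding-system)))
          (number-sequence from to)))

;; Usage, evaluated with M-: in the buffer visiting iso-2022-jp.txt:
;;   (my-filepos-table 2989 3003)
```

Comparing the resulting alist with the expected table makes it easier to see at which offset the mapping starts to drift.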
There's something strange going on here with encoding of the buffer
using iso-2022-jp-dos: near the end of the encoded bytestream, between
the encoded HIRAGANA LETTER KO (こ) and HIRAGANA LETTER TO (と), we
get 6 extra bytes: "ESC ( B ESC $ B". AFAIU, this sequence means
switch to ASCII and then switch back to Japanese. So together these 6
bytes are a no-op as regards their effect on the text, but they
disrupt the logic of filepos-to-bufferpos because they introduce extra
bytes that aren't there in the original file.
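The no-op claim can be checked in isolation with a small sketch. It builds the ISO-2022-JP bytes for こと by hand, once without and once with the redundant "ESC ( B ESC $ B" pair, and decodes both. The function name my-iso2022-redundant-switch-demo is hypothetical, and the two-character text is only a stand-in for the end of the real file. The 6-byte difference it reports is the same size as the gap between the observed and expected bufferpos-to-filepos values in the quoted report (3003 vs. 2997), though this sketch does not by itself show that this is the cause.

```elisp
;; Hypothetical demo (not part of Emacs): two byte streams that differ
;; only by a redundant "switch to ASCII, switch back" escape pair.
(defun my-iso2022-redundant-switch-demo ()
  "Return (PLAIN-TEXT PADDED-TEXT EXTRA-BYTES) for a two-character sample."
  (let* ((to-ascii "\e(B")     ; ESC ( B : designate ASCII
         (to-jisx0208 "\e$B")  ; ESC $ B : designate JIS X 0208
         (ko "$3")             ; JIS X 0208 bytes for HIRAGANA LETTER KO
         (to "$H")             ; JIS X 0208 bytes for HIRAGANA LETTER TO
         (plain (concat to-jisx0208 ko to to-ascii))
         (padded (concat to-jisx0208 ko
                         to-ascii to-jisx0208 ; the 6 redundant bytes
                         to to-ascii)))
    (list (decode-coding-string plain 'iso-2022-jp)
          (decode-coding-string padded 'iso-2022-jp)
          (- (length padded) (length plain)))))

;; Usage: (my-iso2022-redundant-switch-demo)
;; The first two elements should be the same string; the third is the
;; number of extra bytes in the padded stream.
```

Because both streams decode to the same text, any code that derives file offsets from the re-encoded buffer contents rather than from the bytes on disk will be off by the length of the inserted sequence for every position after it.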
Kenichi, why are these 6 bytes inserted by encode-coding-region, but
not when we encode the same text as part of saving the buffer to its
file? And why does it happen near the end of the text, between those
2 particular letters?
This bug report was last modified 3 years and 53 days ago.