#46933 - Possible bugs in filepos-to-bufferpos / bufferpos-to-filepos

GNU bug report logs - #46933
Possible bugs in filepos-to-bufferpos / bufferpos-to-filepos

Package: emacs

Reported by: Gregory Heytings <gregory <at> heytings.org>

Date: Thu, 4 Mar 2021 21:22:02 UTC

Severity: normal

Message #14 received at 46933 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: handa <handa <at> gnu.org>
Cc: gregory <at> heytings.org, 46933 <at> debbugs.gnu.org
Subject: Re: bug#46933: Possible bugs in filepos-to-bufferpos / bufferpos-to-filepos
Date: Sat, 27 Mar 2021 10:54:28 +0300

> From: handa <handa <at> gnu.org>
> Cc: gregory <at> heytings.org, 46933 <at> debbugs.gnu.org
> Date: Sat, 27 Mar 2021 14:38:56 +0900
>
> In article <83ft0obk7i.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:
>
> > Kenichi, why are these 6 bytes inserted by encode-coding-region, but
> > not when we encode the same text as part of saving the buffer to its
> > file?  And why does it happen near the end of the text, between those
> > 2 particular letters?
>
> There surely exists a bug.  Could you please try the attached patch?
>
> The reason why that bug did not happen on file writing is that the code
> in write_region calls encoding routine repeatedly without
> CODING_MODE_LAST_BLOCK flag, and only in the case that flushing is
> required (e.g. the case of iso-2022-jp), just for flushing, it calls
> encoding routine again with CODING_MODE_LAST_BLOCK flag.  In that case,
> carryover does not happen in encode_coding ().

Thanks.  The patch fixes the problem with the extra 6 bytes, so I
installed it.

The results of filepos-to-bufferpos with the file attached by Gregory
are better now, but there are still problems for some values of the
BYTE argument.  The problem is that the ISO-2022 encoding (and others
like it) includes shift-in and shift-out sequences, used to switch
between character sets.  As a trivial example, each CR+LF sequence has
the "ESC ( B" sequence before it and the "ESC $ B" sequence after it,
to switch to ASCII before the newline, then switch to Japanese after
it.  The same happens whenever there's Latin text within Japanese
(there is quite a lot of it in this particular file).  These shift-in
and shift-out sequences consume bytes, but don't produce any
characters.  So if the BYTE argument of filepos-to-bufferpos specifies
a byte in the middle of one of these shift sequences, the result will
be incorrect, because decoding a partial sequence produces the bytes
of that sequence verbatim, and the logic in filepos-to-bufferpos of
using the length of the decoded text breaks.

We need special handling of this and other similar coding-systems to
fix these corner use cases, similarly to what we do in
filepos-to-bufferpos--dos.  Patches welcome.

I'm leaving this bug open because not all of the problem was fixed.
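
For illustration only, here is a minimal Python sketch (not Emacs code, and not a proposed patch) of why byte offsets and character positions drift apart in such an encoding. It walks a hand-written ISO-2022-JP byte string and records, for every byte offset, how many characters precede it and whether the byte belongs to a designation sequence. The helper name byte_to_char_map and the sample bytes are hypothetical, and the scanner assumes that only the two sequences mentioned above, "ESC ( B" and "ESC $ B", occur in the input.

```python
# The two sequences named in the message above; real ISO-2022 text may
# contain others, which this sketch deliberately does not handle.
TO_ASCII = b"\x1b(B"  # ESC ( B: switch to ASCII (1 byte per character)
TO_JIS = b"\x1b$B"    # ESC $ B: switch to Japanese (2 bytes per character)


def byte_to_char_map(data: bytes):
    """Return one (chars_before, kind) pair per byte of DATA.

    chars_before is the number of characters that start before the unit
    containing this byte; kind is "escape", "ascii" or "kanji".
    """
    table = []
    chars = 0          # characters produced so far
    two_byte = False   # True while the Japanese set is designated
    i = 0
    while i < len(data):
        if data.startswith(TO_ASCII, i) or data.startswith(TO_JIS, i):
            # The sequence consumes 3 bytes but produces no character.
            two_byte = data.startswith(TO_JIS, i)
            table.extend([(chars, "escape")] * 3)
            i += 3
        elif two_byte:
            width = min(2, len(data) - i)  # tolerate a truncated tail
            table.extend([(chars, "kanji")] * width)
            chars += 1
            i += width
        else:
            table.append((chars, "ascii"))
            chars += 1
            i += 1
    return table


# Three kanji, CR+LF, one more kanji: 22 bytes for 6 characters.
SAMPLE = b"\x1b$BF|K\\8l\x1b(B\r\n\x1b$BF|\x1b(B"

if __name__ == "__main__":
    for offset, (chars, kind) in enumerate(byte_to_char_map(SAMPLE)):
        print(f"byte {offset:2d}: {chars} chars before, {kind}")
```

In this sample, 12 of the 22 bytes belong to designation sequences, so a position computed from the length of a decoded prefix is only meaningful when the prefix ends on a sequence or character boundary. A fix along the lines suggested above would presumably need to detect offsets of kind "escape" and treat them specially.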

