GNU bug report logs - #40407
[PATCH] slow ENCODE_FILE and DECODE_FILE

Previous Next

Package: emacs;

Reported by: Mattias Engdegård <mattiase <at> acm.org>

Date: Fri, 3 Apr 2020 16:11:01 UTC

Severity: normal

Tags: patch

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 40407 in the body.
You can then email your comments to 40407 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Fri, 03 Apr 2020 16:11:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattiase <at> acm.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Fri, 03 Apr 2020 16:11:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: bug-gnu-emacs <at> gnu.org
Subject: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Fri, 3 Apr 2020 16:18:43 +0200
[Message part 1 (text/plain, inline)]
ENCODE_FILE and DECODE_FILE turn out to be surprisingly slow, and allocate copious amounts of memory, to the point that they often turn up in both memory and cpu profiles. (This is on macOS; I haven't checked the situation elsewhere.)

For instance, a single call to file-relative-name, with ASCII-only arguments, manages to allocate 140 KiB. There are several conversion steps each involving creating temporary buffers as well as the compilation and execution of very large "quick-check" regexps. Example:

(progn
  (require 'profiler)
  (profiler-reset)
  (garbage-collect)
  (profiler-start 'mem)
  (file-relative-name "abc")
  (profiler-stop)
  (profiler-report))

This applies to just about every function dealing with files or file names.

The attached patch is somewhat conservatively written but at least a starting point. It reduces the memory consumption by file-relative-name in the example above to zero. Perhaps we can assume that file names codings are always ASCII-compatible; if so, the shortcut can be taken in encode_file_name and decode_file_name directly.

There is already a hack in encode_file_name that assumes that no unibyte string ever needs encoding; if so, the shortcut could perhaps be extended to decode_file_name and simplified.

[0001-Avoid-expensive-recoding-for-ASCII-identity-cases.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Fri, 03 Apr 2020 16:25:01 GMT) Full text and rfc822 format available.

Message #8 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Fri, 03 Apr 2020 19:24:09 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Fri, 3 Apr 2020 16:18:43 +0200
> 
> ENCODE_FILE and DECODE_FILE turn out to be surprisingly slow, and allocate copious amounts of memory, to the point that they often turn up in both memory and cpu profiles. (This is on macOS; I haven't checked the situation elsewhere.)

AFAIR, on macOS the situation is worse than elsewhere, because of the
normalization thing.

> For instance, a single call to file-relative-name, with ASCII-only arguments, manages to allocate 140 KiB. There are several conversion steps each involving creating temporary buffers as well as the compilation and execution of very large "quick-check" regexps. Example:
> 
> (progn
>   (require 'profiler)
>   (profiler-reset)
>   (garbage-collect)
>   (profiler-start 'mem)
>   (file-relative-name "abc")
>   (profiler-stop)
>   (profiler-report))

Can you tell more about the conversion steps and the memory each one
allocates?

> Perhaps we can assume that file names codings are always ASCII-compatible

I don't think every encoding is ASCII compatible, so I don't see how
we can assume that in general.  But the check whether an encoding is
ASCII-compatible takes a negligible amount of time, so why bother with
such an assumption?

> There is already a hack in encode_file_name that assumes that no unibyte string ever needs encoding; if so, the shortcut could perhaps be extended to decode_file_name and simplified.

I'm not sure I understand what you mean by extending the shortcut to
decode_file_name.  Please elaborate.

> -  if (BUFFERP (dst_object))
> +  if (EQ (dst_object, Qt))
> +    {
> +      /* Fast path for ASCII-only input and an ASCII-compatible coding:
> +         act as identity.  */
> +      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
> +      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
> +          && (STRING_MULTIBYTE (string)
> +              ? (chars == bytes) : string_ascii_p (string)))
> +        return string;

I don't think we can return the same string if NOCOPY is non-zero.
The callers might not expect that, and you might inadvertently cause
the original string be modified behind the caller's back.

But if NOCOPY is 'false', I think this change is OK.  Just make sure
the test suite doesn't start failing, maybe there's something else we
are missing.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Fri, 03 Apr 2020 22:33:01 GMT) Full text and rfc822 format available.

Message #11 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 4 Apr 2020 00:32:21 +0200
3 apr. 2020 kl. 18.24 skrev Eli Zaretskii <eliz <at> gnu.org>:

> AFAIR, on macOS the situation is worse than elsewhere, because of the
> normalization thing.

Very likely. It's just what I had in my lap.

> Can you tell more about the conversion steps and the memory each one
> allocates?

Courtesy the memory profiler:

         - file-relative-name                                 141,551  15%
          - file-name-case-insensitive-p                      100,613  11%
           - ucs-normalize-hfs-nfd-pre-write-conversion       100,613  11%
            - ucs-normalize-HFS-NFD-region                    100,613  11%
               ucs-normalize-region                           100,613  11%
          - expand-file-name                                   40,828   4%
           - ucs-normalize-hfs-nfd-post-read-conversion        40,828   4%
            - ucs-normalize-HFS-NFC-region                     40,828   4%
               ucs-normalize-region                            40,828   4%

where file_name_case_insensitive_p calls ENCODE_FILE and expand_file_name calls DECODE_FILE. I'm not sure how much each part of ucs-normalize-region actually consumes, but I think we can agree that we don't want it called on any platform unless strictly necessary.

> I don't think every encoding is ASCII compatible, so I don't see how
> we can assume that in general.  But the check whether an encoding is
> ASCII-compatible takes a negligible amount of time, so why bother with
> such an assumption?

Quite, I just thought I'd ask in case there were some unwritten invariant that you knew about.

> I'm not sure I understand what you mean by extending the shortcut to
> decode_file_name.  Please elaborate.

Never mind, it was an under-thought idea. The existing bootstrap hack making encode_file_name identity for any unibyte string does not seem to need or allow any symmetry in decode_file_name.

> I don't think we can return the same string if NOCOPY is non-zero.
> The callers might not expect that, and you might inadvertently cause
> the original string be modified behind the caller's back.

You are no doubt correct, but doesn't it look like the sense of NOCOPY has been inverted here? It runs contrary to the intuitive meaning and to the doc string of {encode,decode}-coding-string. In fact:

(let* ((nocopy nil)
       (x "abc")
       (y (decode-coding-string x nil nocopy nil)))
  (eq x y))
=> t

Looks like we suddenly got more work on our hands. What a surprise.

Since string mutation is so rare, I doubt it has caused any real trouble. Now, do we fix it by inverting the sense of the argument, or by renaming it to COPY? I'm fairly neutral, but there are arguments in either way, both in terms of performance and correctness. And what about internal calls to code_convert_string?

There are 193 calls to {encode, decode}-coding-string in the Emacs tree, and only 14 of them pass a non-nil value to NOCOPY. I'd be inclined to keep the semantics but rename the argument to COPY, on the grounds that no-copy is a better default; then change those 14 calls to pass nil instead, since that obviously was the intent.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sat, 04 Apr 2020 09:27:02 GMT) Full text and rfc822 format available.

Message #14 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 04 Apr 2020 12:26:11 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 4 Apr 2020 00:32:21 +0200
> Cc: 40407 <at> debbugs.gnu.org
> 
>          - file-relative-name                                 141,551  15%
>           - file-name-case-insensitive-p                      100,613  11%
>            - ucs-normalize-hfs-nfd-pre-write-conversion       100,613  11%
>             - ucs-normalize-HFS-NFD-region                    100,613  11%
>                ucs-normalize-region                           100,613  11%
>           - expand-file-name                                   40,828   4%
>            - ucs-normalize-hfs-nfd-post-read-conversion        40,828   4%
>             - ucs-normalize-HFS-NFC-region                     40,828   4%
>                ucs-normalize-region                            40,828   4%
> 
> where file_name_case_insensitive_p calls ENCODE_FILE and expand_file_name calls DECODE_FILE.

DECODE_FILE is called because the file name in question starts with a
"~"?  Otherwise, I don't think I understand why would expand-file-name
need to decode a file name.

> I'm not sure how much each part of ucs-normalize-region actually consumes, but I think we can agree that we don't want it called on any platform unless strictly necessary.

Any expensive code should be avoided if it isn't necessary, so yes, I
agree.  And yes, Unicode normalization is expensive.  If we consider
the macOS filesystem idiosyncrasies important to support efficiently,
perhaps we should rewrite the normalization code in C.

> > I don't think every encoding is ASCII compatible, so I don't see how
> > we can assume that in general.  But the check whether an encoding is
> > ASCII-compatible takes a negligible amount of time, so why bother with
> > such an assumption?
> 
> Quite, I just thought I'd ask in case there were some unwritten invariant that you knew about.

Whether a coding-system is ASCII-compatible is determined by the
definition of that coding-system.  Look in mule-conf.el, and you will
see there several that aren't ASCII-compatible.  UTF-16 is one
example, but there are others.

> > I don't think we can return the same string if NOCOPY is non-zero.
> > The callers might not expect that, and you might inadvertently cause
> > the original string be modified behind the caller's back.
> 
> You are no doubt correct, but doesn't it look like the sense of NOCOPY has been inverted here?

That ship has sailed long ago (I could explain how this "inverted"
meaning could make sense, but I don't think it's relevant to the issue
at hand), and there are several other internal functions that use a
similar argument in the same "inverted" sense.  This is a separate
issue, anyway.

> Since string mutation is so rare, I doubt it has caused any real trouble.

You are wrong here, it can happen very easily, especially when you
manipulate the encoded string in C.  The simplest use case is that you
encode a file name, and then make some change to the encoded string,
like change the letter-case or remove the trailing slash.  Suddenly
the original string is changed as well, and the Lisp caller of the
high-level function might be mightily surprised by the result.

IME, the cases where we can safely assume it's OK to return the same
string are actually very rare.  It is no accident that you saw so few
calls of these functions where we use that optional behavior.

> Now, do we fix it by inverting the sense of the argument, or by renaming it to COPY?

Neither, IMO.  Again, it's a separate problem, and let's keep our
sights squarely on the original issue you wanted to fix.  Let's tackle
the NOCOPY issue in a separate discussion, OK?

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sat, 04 Apr 2020 10:27:02 GMT) Full text and rfc822 format available.

Message #17 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 04 Apr 2020 13:26:20 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 4 Apr 2020 00:32:21 +0200
> Cc: 40407 <at> debbugs.gnu.org
> 
> Since string mutation is so rare, I doubt it has caused any real trouble. Now, do we fix it by inverting the sense of the argument, or by renaming it to COPY? I'm fairly neutral, but there are arguments in either way, both in terms of performance and correctness. And what about internal calls to code_convert_string?
> 
> There are 193 calls to {encode, decode}-coding-string in the Emacs tree, and only 14 of them pass a non-nil value to NOCOPY. I'd be inclined to keep the semantics but rename the argument to COPY, on the grounds that no-copy is a better default; then change those 14 calls to pass nil instead, since that obviously was the intent.

After looking at this for some time, I think the problem is rarely if
ever seen.  The only function which has the NOCOPY sense inverted is
code_convert_string, and it only does that when the CODING_SYSTEM
argument is nil, which should almost never happen.  So I think it's OK
to change code_convert_string on master to use NOCOPY in its correct
sense.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sat, 04 Apr 2020 16:42:02 GMT) Full text and rfc822 format available.

Message #20 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 4 Apr 2020 18:41:39 +0200
[Message part 1 (text/plain, inline)]
4 apr. 2020 kl. 11.26 skrev Eli Zaretskii <eliz <at> gnu.org>:

> DECODE_FILE is called because the file name in question starts with a
> "~"?  Otherwise, I don't think I understand why would expand-file-name
> need to decode a file name.

Maybe it's because default-directory started with a tilde. It doesn't really matter; it's a common case, and the profiler tells us as much.

> IME, the cases where we can safely assume it's OK to return the same
> string are actually very rare.  It is no accident that you saw so few
> calls of these functions where we use that optional behavior.

This does not mean that the remaining 179 calls require a copy; they just use the default value of the parameter.

> Neither, IMO.  Again, it's a separate problem, and let's keep our
> sights squarely on the original issue you wanted to fix.  Let's tackle
> the NOCOPY issue in a separate discussion, OK?

Thank you, a separate bug for it is fine.

Here is a revised patch which takes the nocopy parameter into account (in its inverted sense). Obviously it needs to be adapted if the nocopy inversion is dealt with first; the two bugs do not commute.

[0001-Avoid-expensive-recoding-for-ASCII-identity-cases-bu.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sat, 04 Apr 2020 16:56:02 GMT) Full text and rfc822 format available.

Message #23 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 4 Apr 2020 18:55:26 +0200
4 apr. 2020 kl. 12.26 skrev Eli Zaretskii <eliz <at> gnu.org>:

> After looking at this for some time, I think the problem is rarely if
> ever seen.  The only function which has the NOCOPY sense inverted is
> code_convert_string, and it only does that when the CODING_SYSTEM
> argument is nil, which should almost never happen.

Oh, that's the easy part. It's the ASCII optimisation that makes it interesting.






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sat, 04 Apr 2020 17:05:02 GMT) Full text and rfc822 format available.

Message #26 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 04 Apr 2020 20:04:27 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 4 Apr 2020 18:55:26 +0200
> Cc: 40407 <at> debbugs.gnu.org
> 
> 4 apr. 2020 kl. 12.26 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > After looking at this for some time, I think the problem is rarely if
> > ever seen.  The only function which has the NOCOPY sense inverted is
> > code_convert_string, and it only does that when the CODING_SYSTEM
> > argument is nil, which should almost never happen.
> 
> Oh, that's the easy part. It's the ASCII optimisation that makes it interesting.

How so?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sat, 04 Apr 2020 17:23:02 GMT) Full text and rfc822 format available.

Message #29 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 04 Apr 2020 20:22:37 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 4 Apr 2020 18:41:39 +0200
> Cc: 40407 <at> debbugs.gnu.org
> 
> > DECODE_FILE is called because the file name in question starts with a
> > "~"?  Otherwise, I don't think I understand why would expand-file-name
> > need to decode a file name.
> 
> Maybe it's because default-directory started with a tilde. It doesn't really matter; it's a common case, and the profiler tells us as much.

I think it's important that we understand what happens here to the
last detail, but okay.

> > IME, the cases where we can safely assume it's OK to return the same
> > string are actually very rare.  It is no accident that you saw so few
> > calls of these functions where we use that optional behavior.
> 
> This does not mean that the remaining 179 calls require a copy; they just use the default value of the parameter.

And IMO the default must stay that a copy is returned, except when the
caller says otherwise.

> +  if (EQ (dst_object, Qt))
> +    {
> +      /* Fast path for ASCII-only input and an ASCII-compatible coding:
> +         act as identity.  */
> +      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
> +      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
> +          && (STRING_MULTIBYTE (string)
> +              ? (chars == bytes) : string_ascii_p (string)))
> +	return nocopy ? Fcopy_sequence (string) : string;

I think in the use case where we return a copy, we should make sure
the return value is unibyte when encoding and multibyte when decoding.
Otherwise, I think this is OK (for the master branch, obviously).

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sat, 04 Apr 2020 17:39:02 GMT) Full text and rfc822 format available.

Message #32 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: mattiase <at> acm.org
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 04 Apr 2020 20:37:47 +0300
> Date: Sat, 04 Apr 2020 20:22:37 +0300
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: 40407 <at> debbugs.gnu.org
> 
> > +  if (EQ (dst_object, Qt))
> > +    {
> > +      /* Fast path for ASCII-only input and an ASCII-compatible coding:
> > +         act as identity.  */
> > +      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
> > +      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
> > +          && (STRING_MULTIBYTE (string)
> > +              ? (chars == bytes) : string_ascii_p (string)))
> > +	return nocopy ? Fcopy_sequence (string) : string;
> 
> I think in the use case where we return a copy, we should make sure
> the return value is unibyte when encoding and multibyte when decoding.
> Otherwise, I think this is OK (for the master branch, obviously).

Btw, if we want this particular use case to be as fast as possible,
then Fcopy_sequence is not the best way, because it is not optimized
for the case of copying a single string.  We could do better by
calling make_uninit_multibyte/unibyte_string and memcpy directly.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sat, 04 Apr 2020 18:02:01 GMT) Full text and rfc822 format available.

Message #35 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 4 Apr 2020 20:01:12 +0200
4 apr. 2020 kl. 19.04 skrev Eli Zaretskii <eliz <at> gnu.org>:

> How so?

Because then NOCOPY suddenly matters for almost all coding systems, not just nil. After all, an all-ASCII input string and ASCII-compatible coding is not an unusual combination. This forces us to be careful when correcting the NOCOPY sense, and may expose latent bugs.

But you are right: we should trust calls to {en,de}code-coding-system with NOCOPY=t, and the rest can also remain as they are until someone cares enough to change them.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sat, 04 Apr 2020 18:07:02 GMT) Full text and rfc822 format available.

Message #38 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 4 Apr 2020 20:06:23 +0200
4 apr. 2020 kl. 19.37 skrev Eli Zaretskii <eliz <at> gnu.org>:

> Btw, if we want this particular use case to be as fast as possible,
> then Fcopy_sequence is not the best way, because it is not optimized
> for the case of copying a single string.  We could do better by
> calling make_uninit_multibyte/unibyte_string and memcpy directly.

Yes, if that would provide a benefit. The pattern should probably be encapsulated in copy_string or similar, if it doesn't already exist. (Should it copy properties? Probably not.)





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sat, 04 Apr 2020 18:26:02 GMT) Full text and rfc822 format available.

Message #41 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 04 Apr 2020 21:25:20 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 4 Apr 2020 20:01:12 +0200
> Cc: 40407 <at> debbugs.gnu.org
> 
> 4 apr. 2020 kl. 19.04 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > How so?
> 
> Because then NOCOPY suddenly matters for almost all coding systems, not just nil. After all, an all-ASCII input string and ASCII-compatible coding is not an unusual combination. This forces us to be careful when correcting the NOCOPY sense, and may expose latent bugs.

That's true.  But as a matter of fact, I don't see any calls to
code_convert_string with NOCOPY non-zero, they all pass zero or false
to it.  So none of the existing direct calls from C wants or expects
to get the same string.

> But you are right: we should trust calls to {en,de}code-coding-system with NOCOPY=t, and the rest can also remain as they are until someone cares enough to change them.

Agreed.

Btw, this bug was introduced in commit 4031e2b, 18 years ago.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 02:38:02 GMT) Full text and rfc822 format available.

Message #44 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 05 Apr 2020 05:37:03 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 4 Apr 2020 20:06:23 +0200
> Cc: 40407 <at> debbugs.gnu.org
> 
> 4 apr. 2020 kl. 19.37 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > Btw, if we want this particular use case to be as fast as possible,
> > then Fcopy_sequence is not the best way, because it is not optimized
> > for the case of copying a single string.  We could do better by
> > calling make_uninit_multibyte/unibyte_string and memcpy directly.
> 
> Yes, if that would provide a benefit. The pattern should probably be encapsulated in copy_string or similar, if it doesn't already exist.

I wouldn't make it a separate function for the benefit of just one
caller.  Every function call is a slowdown, albeit a small one.

> (Should it copy properties? Probably not.)

Definitely not.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 03:43:02 GMT) Full text and rfc822 format available.

Message #47 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: bug-gnu-emacs <at> gnu.org,
 Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 05 Apr 2020 06:42:51 +0300
On April 5, 2020 5:37:03 AM GMT+03:00, Eli Zaretskii <eliz <at> gnu.org> wrote:
> > From: Mattias Engdegård <mattiase <at> acm.org>
> > Date: Sat, 4 Apr 2020 20:06:23 +0200
> > Cc: 40407 <at> debbugs.gnu.org
> > 
> > 4 apr. 2020 kl. 19.37 skrev Eli Zaretskii <eliz <at> gnu.org>:
> > 
> > > Btw, if we want this particular use case to be as fast as
> possible,
> > > then Fcopy_sequence is not the best way, because it is not
> optimized
> > > for the case of copying a single string.  We could do better by
> > > calling make_uninit_multibyte/unibyte_string and memcpy directly.
> > 
> > Yes, if that would provide a benefit. The pattern should probably be
> encapsulated in copy_string or similar, if it doesn't already exist.
> 
> I wouldn't make it a separate function for the benefit of just one
> caller.  Every function call is a slowdown, albeit a small one.

However, we already have those functions ready, see make_unibyte_string and make_multibyte_string.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 03:44:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 10:16:01 GMT) Full text and rfc822 format available.

Message #53 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 5 Apr 2020 12:14:59 +0200
[Message part 1 (text/plain, inline)]
4 apr. 2020 kl. 19.22 skrev Eli Zaretskii <eliz <at> gnu.org>:

>> This does not mean that the remaining 179 calls require a copy; they just use the default value of the parameter.
> 
> And IMO the default must stay that a copy is returned, except when the
> caller says otherwise.

Yes, those can be dealt with piecemeal, and we are in no hurry to do so.

> I think in the use case where we return a copy, we should make sure
> the return value is unibyte when encoding and multibyte when decoding.

I'm not necessarily opposed to the suggestion, but why not return a unibyte string in both cases, simplifying the code? In addition, some operations (aref) are faster on unibyte. Either way, it's nothing that a caller could rely on, is there? (In particular when taking NOCOPY into account.)

> Otherwise, I think this is OK (for the master branch, obviously).

Indeed the intention, thanks.

Here is what I would commit, unless you think the string copy should really be multibyte when decoding.

[0001-Avoid-expensive-recoding-for-ASCII-identity-cases-bu.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 10:50:02 GMT) Full text and rfc822 format available.

Message #56 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 5 Apr 2020 12:48:51 +0200
4 apr. 2020 kl. 20.25 skrev Eli Zaretskii <eliz <at> gnu.org>:

> That's true.  But as a matter of fact, I don't see any calls to
> code_convert_string with NOCOPY non-zero, they all pass zero or false
> to it.  So none of the existing direct calls from C wants or expects
> to get the same string.

Right. However, I did some reading and believe that nocopy=true is actually correct for all uses of {EN,DE}CODE_FILE, and in fact all calls to code_convert_string_norecord. One of the reasons is that the callers need to be careful with mutation wrt GC anyway; any post-recoding mutation is done on copies. (Not being able to change the length of strings also helps.)

I pushed what we agreed on in part for the pleasure of resolving such an old-standing bug) to master (962562cde4).
Given the limited scope of the change, would you agree to a backport of that to emacs-27?

For the reasons above, I think it's correct and proper to do (on master)

--- a/src/coding.c
+++ b/src/coding.c
@@ -9554,7 +9554,7 @@ code_convert_string (Lisp_Object string, Lisp_Object coding_system,
 code_convert_string_norecord (Lisp_Object string, Lisp_Object coding_system,
                              bool encodep)
 {
-  return code_convert_string (string, coding_system, Qt, encodep, 0, 1);
+  return code_convert_string (string, coding_system, Qt, encodep, 1, 1);
 }






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 13:29:02 GMT) Full text and rfc822 format available.

Message #59 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 05 Apr 2020 16:28:13 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sun, 5 Apr 2020 12:14:59 +0200
> Cc: 40407 <at> debbugs.gnu.org
> 
> > I think in the use case where we return a copy, we should make sure
> > the return value is unibyte when encoding and multibyte when decoding.
> 
> I'm not necessarily opposed to the suggestion, but why not return a unibyte string in both cases, simplifying the code?

For compatibility with what happens now:

  (multibyte-string-p (decode-coding-string "abc" 'utf-8)) => t

> In addition, some operations (aref) are faster on unibyte. Either way, it's nothing that a caller could rely on, is there? (In particular when taking NOCOPY into account.)

That is true, of course, but many/most of our strings are multibyte
nowadays, even if they are ASCII.  Suddenly getting a unibyte string
instead would be surprising, I think, even if no one should depend on
it not happening.  (NOCOPY case is different: then it's the caller's
responsibility to deal with the issue.)  So I'd rather we produced a
multibyte string when "decoding" by copying.

> +/* Whether a (unibyte) string only contains chars in the 0..127 range.  */

One subtle point regarding this comment: I'd remove the "unibyte"
part, because (1) you apply this test to multibyte strings as well,
and (2) strings encoded in iso-2022 will look "pure-ASCII", but they
aren't.  The latter subtlety doesn't interfere with the caller,
because iso-2022 is not ASCII-compatible, but it's something I'd
mention in the comment, lest someone uses this function for some
other use case.

The patch is OK otherwise.  Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 13:40:01 GMT) Full text and rfc822 format available.

Message #62 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 05 Apr 2020 16:39:25 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sun, 5 Apr 2020 12:48:51 +0200
> Cc: 40407 <at> debbugs.gnu.org
> 
> 4 apr. 2020 kl. 20.25 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > That's true.  But as a matter of fact, I don't see any calls to
> > code_convert_string with NOCOPY non-zero, they all pass zero or false
> > to it.  So none of the existing direct calls from C wants or expects
> > to get the same string.
> 
> Right. However, I did some reading and believe that nocopy=true is actually correct for all uses of {EN,DE}CODE_FILE, and in fact all calls to code_convert_string_norecord.

I don't think I follow.  We call code_convert_string_norecord, which
invokes code_convert_string with NOCOPY set to 'false'.  So all those
users should NOT receive the same string as the argument, and I don't
believe they expect that and can cope with it.

Perhaps what you meant was that NOCOPY = false was actually acting as
if the value were 'true', due to the bug.  But if so, that didn't
affect ENCODE_FILE and DECODE_FILE, because the bug is only visible
when the CODING_SYSTEM argument is nil, and both ENCODE_FILE and
DECODE_FILE never let that happen:

  if (! NILP (Vfile_name_coding_system))
    return code_convert_string_norecord (fname, Vfile_name_coding_system, 1);
  else if (! NILP (Vdefault_file_name_coding_system))
    return code_convert_string_norecord (fname,
					 Vdefault_file_name_coding_system, 1);
  else
    return fname;

So in practice this bug was probably never seen until now.

> One of the reasons is that the callers need to be careful with mutation wrt GC anyway; any post-recoding mutation is done on copies. (Not being able to change the length of strings also helps.)

I don't think I understand your line of reasoning here.  I don't think
GC is relevant, and as long as we are talking about file names, the
first null byte terminates it even though the Lisp string's length
could be larger.

> Given the limited scope of the change, would you agree to a backport of that to emacs-27?

That'd be a mistake, I think.  My reasoning goes like this: If I'm
right that this bug was never seen, fixing it on emacs-27 will have no
visible effect; and if I'm wrong, then we will break the release
branch.  The danger of breakage in the latter case is much more severe
than the gain from the fix in the former case.

> For the reasons above, I think it's correct and proper to do (on master)
> 
> --- a/src/coding.c
> +++ b/src/coding.c
> @@ -9554,7 +9554,7 @@ code_convert_string (Lisp_Object string, Lisp_Object coding_system,
>  code_convert_string_norecord (Lisp_Object string, Lisp_Object coding_system,
>                               bool encodep)
>  {
> -  return code_convert_string (string, coding_system, Qt, encodep, 0, 1);
> +  return code_convert_string (string, coding_system, Qt, encodep, 1, 1);
>  }

I hope you now agree with me that we should not do this.  The default
should stay NOCOPY = false, and any caller that wants otherwise must
explicitly request that by calling code_convert_string.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 13:41:01 GMT) Full text and rfc822 format available.

Message #65 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 5 Apr 2020 15:40:36 +0200
5 apr. 2020 kl. 15.28 skrev Eli Zaretskii <eliz <at> gnu.org>:

> That is true, of course, but many/most of our strings are multibyte
> nowadays, even if they are ASCII.  Suddenly getting a unibyte string
> instead would be surprising, I think, even if no one should depend on
> it not happening.  (NOCOPY case is different: then it's the caller's
> responsibility to deal with the issue.)  So I'd rather we produced a
> multibyte string when "decoding" by copying.

I don't agree fully but it is definitely not a strongly held opinion, so I followed your suggestion.

> One subtle point regarding this comment: I'd remove the "unibyte"
> part

Right, done. Thanks for the reviews!





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 15:04:03 GMT) Full text and rfc822 format available.

Message #68 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 5 Apr 2020 17:03:49 +0200
5 apr. 2020 kl. 15.39 skrev Eli Zaretskii <eliz <at> gnu.org>:

> I don't think I follow.  We call code_convert_string_norecord, which
> invokes code_convert_string with NOCOPY set to 'false'.  So all those
> users should NOT receive the same string as the argument, and I don't
> believe they expect that and can cope with it.

Actually they can, as far as I can tell. Have a look yourself.

> I don't think I understand your line of reasoning here.  I don't think
> GC is relevant, and as long as we are talking about file names, the
> first null byte terminates it even though the Lisp string's length
> could be larger.

It is stated as a reason in Fexpand_file_name for working on copies of strings; see comments therein. But that is not really important in itself.

>> Given the limited scope of the change, would you agree to a backport of that to emacs-27?
> 
> That'd be a mistake, I think.  My reasoning goes like this: If I'm
> right that this bug was never seen, fixing it on emacs-27 will have no
> visible effect; and if I'm wrong, then we will break the release
> branch.  The danger of breakage in the latter case is much more severe
> than the gain from the fix in the former case.

We do fix clear bugs on emacs-27 even when nobody complained about them, but you are right that it's not that important in this case. Let's leave it on master.

> I hope you now agree with me that we should not do this.  The default
> should stay NOCOPY = false, and any caller that wants otherwise must
> explicitly request that by calling code_convert_string.

I disagree -- if the callers handle the situation safely, there is no reason not to to do the change, saving some consing. We do this sort of code improvement all the time; nothing special about this one.

Of course, if you prefer the scenic route, we could add {en,de}code_file_nocopy and replace {EN,DE}CODE_FILE calls one by one until they all are done, and arrive at essentially the same code.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 15:37:02 GMT) Full text and rfc822 format available.

Message #71 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 5 Apr 2020 17:35:55 +0200
5 apr. 2020 kl. 17.03 skrev Mattias Engdegård <mattiase <at> acm.org>:

> Actually they can, as far as I can tell. Have a look yourself.

To clarify, calls to {EN,DE}CODE_FILE are probably safe by design since they already return their argument in several cases. Some calls to code_convert_string_norecord are not safe because they are used with auto lisp strings; I'm going through them to find out just which ones. This can be done piecemeal.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 15:57:02 GMT) Full text and rfc822 format available.

Message #74 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 05 Apr 2020 18:56:13 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sun, 5 Apr 2020 17:35:55 +0200
> Cc: 40407 <at> debbugs.gnu.org
> 
> 5 apr. 2020 kl. 17.03 skrev Mattias Engdegård <mattiase <at> acm.org>:
> 
> > Actually they can, as far as I can tell. Have a look yourself.
> 
> To clarify, calls to {EN,DE}CODE_FILE are probably safe by design since they already return their argument in several cases.

In theory, yes.  In practice, not really.  The cases where they return
their argument are those when we didn't yet set up any encoding for
file names.  This happens only very early into startup, and frankly,
we have nothing else to do at that point.

Once we do set up default-file-name-coding-system, these macros will
never return their argument (unless someone forcefully sets the
encoding to nil, in which case they deserve what they get).  Do you
agree?

> Some calls to code_convert_string_norecord are not safe because they are used with auto lisp strings; I'm going through them to find out just which ones. This can be done piecemeal.

I'm okay with making the safe cases faster, but we'd need to clearly
comment each one, because later changes might make them unsafe.  Any
code that uses the un-encoded string after encoding it, or the
un-decoded string after decoding it, could become broken if these two
are the same string.

And let's please keep in mind that on most modern platforms in most
use cases default-file-name-coding-system is utf-8, so encoding is
fast, and thus we don't need to go overboard here.  IOW, if there's a
doubt, there's no doubt.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Sun, 05 Apr 2020 16:01:01 GMT) Full text and rfc822 format available.

Message #77 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sun, 05 Apr 2020 19:00:17 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sun, 5 Apr 2020 17:03:49 +0200
> Cc: 40407 <at> debbugs.gnu.org
> 
> > I don't think I understand your line of reasoning here.  I don't think
> > GC is relevant, and as long as we are talking about file names, the
> > first null byte terminates it even though the Lisp string's length
> > could be larger.
> 
> It is stated as a reason in Fexpand_file_name for working on copies
> of strings; see comments therein.

That refers to code that keeps C pointers into string text.  This is
not our case here: we are talking about Lisp strings, not C pointers
into them.

> > I hope you now agree with me that we should not do this.  The default
> > should stay NOCOPY = false, and any caller that wants otherwise must
> > explicitly request that by calling code_convert_string.
> 
> I disagree -- if the callers handle the situation safely, there is no reason not to to do the change, saving some consing. We do this sort of code improvement all the time; nothing special about this one.

"If the callers handle the situation safely" is not a trivial
condition.  The programmer will more often than not be unaware of this
subtlety, and may not write such safe code.  Moreover, the callers
will have to handle it safely in the future, or be sure to insist on a
copy if not, and these two macros don't give the callers knobs to
request that.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 10:11:01 GMT) Full text and rfc822 format available.

Message #80 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org,
 Mattias Engdegård <mattiase <at> acm.org>
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Mon, 06 Apr 2020 19:10:48 +0900
Eli Zaretskii <eliz <at> gnu.org> writes:

>> -  if (BUFFERP (dst_object))
>> +  if (EQ (dst_object, Qt))
>> +    {
>> +      /* Fast path for ASCII-only input and an ASCII-compatible coding:
>> +         act as identity.  */
>> +      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
>> +      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
>> +          && (STRING_MULTIBYTE (string)
>> +              ? (chars == bytes) : string_ascii_p (string)))
>> +        return string;

While using the latest master branch, I noticed this became the cause of
decoding error.

The simple reproducible test is,

	(decode-coding-string "&abc" 'utf-7-imap)
        => "&abc"

like the above result, decoding utf-7-imap didn't work.

Because (coding-system-get 'utf-7-imap :ascii-compatible-p) => t.

I'm not sure, 'utf-7* should be fixed as non ascii-compatible, or
string_ascii_p() should check more strictly.

[BTW,

(define-coding-system 'utf-7-imap
  "UTF-7 encoding of Unicode, IMAP version (RFC 2060)"
  :coding-type 'utf-8
  :mnemonic ?u
  :charset-list '(unicode)
  :ascii-compatible-p nil		;; <=== added this line
  :pre-write-conversion 'utf-7-imap-pre-write-conversion
  :post-read-conversion 'utf-7-imap-post-read-conversion)

doesn't work. Because define-coding-system-internal overwrites
ascii-compatible-p.]

Thanks.
-- 
OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 14:22:01 GMT) Full text and rfc822 format available.

Message #83 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>, Kenichi Handa <handa <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org, mattiase <at> acm.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Mon, 06 Apr 2020 17:21:34 +0300
> From: OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>
> Cc: Mattias Engdegård <mattiase <at> acm.org>,
>         40407 <at> debbugs.gnu.org
> Date: Mon, 06 Apr 2020 19:10:48 +0900
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> -  if (BUFFERP (dst_object))
> >> +  if (EQ (dst_object, Qt))
> >> +    {
> >> +      /* Fast path for ASCII-only input and an ASCII-compatible coding:
> >> +         act as identity.  */
> >> +      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
> >> +      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
> >> +          && (STRING_MULTIBYTE (string)
> >> +              ? (chars == bytes) : string_ascii_p (string)))
> >> +        return string;
> 
> While using the latest master branch, I noticed this became the cause of
> decoding error.
> 
> The simple reproducible test is,
> 
> 	(decode-coding-string "&abc" 'utf-7-imap)
>         => "&abc"
> 
> like the above result, decoding utf-7-imap didn't work.
> 
> Because (coding-system-get 'utf-7-imap :ascii-compatible-p) => t.

Thanks.

> I'm not sure, 'utf-7* should be fixed as non ascii-compatible, or
> string_ascii_p() should check more strictly.

The former, since UTF-7 is definitely *not* ASCII-compatible.  Does
the patch below produce good results?

Kenichi, why was coding-type of UTF-7 systems set to 'utf-8'?
Wouldn't it be better to set it to 'utf-16'?  Or is there some
subtlety here that we should be aware of?  Do you have any comments on
the patch below?

Thanks.

diff --git a/src/coding.c b/src/coding.c
index 97a6eb9..71ff93c 100644
--- a/src/coding.c
+++ b/src/coding.c
@@ -11301,7 +11301,10 @@ DEFUN ("define-coding-system-internal", Fdefine_coding_system_internal,
 	  CHECK_CODING_SYSTEM (val);
 	}
       ASET (attrs, coding_attr_utf_bom, bom);
-      if (NILP (bom))
+      if (NILP (bom)
+	  /* UTF-7 has :coding-type set to 'utf-8' (why not
+	     'utf-16'?), but it is definitely NOT ASCII-compatible.  */
+	  && !EQ (name, Qutf_7) && !EQ (name, Qutf_7_imap))
 	ASET (attrs, coding_attr_ascii_compat, Qt);
 
       category = (CONSP (bom) ? coding_category_utf_8_auto
@@ -11673,6 +11676,9 @@ syms_of_coding (void)
   DEFSYM (Qutf_8_unix, "utf-8-unix");
   DEFSYM (Qutf_8_emacs, "utf-8-emacs");
 
+  DEFSYM (Qutf_7, "utf-7");
+  DEFSYM (Qutf_7_imap, "utf-7-imap");
+
 #if defined (WINDOWSNT) || defined (CYGWIN)
   /* No, not utf-16-le: that one has a BOM.  */
   DEFSYM (Qutf_16le, "utf-16le");




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 15:57:03 GMT) Full text and rfc822 format available.

Message #86 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org, Kenichi Handa <handa <at> gnu.org>,
 OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Mon, 6 Apr 2020 17:56:27 +0200
[Message part 1 (text/plain, inline)]
6 apr. 2020 kl. 16.21 skrev Eli Zaretskii <eliz <at> gnu.org>:

> Kenichi, why was coding-type of UTF-7 systems set to 'utf-8'?
> Wouldn't it be better to set it to 'utf-16'?  Or is there some
> subtlety here that we should be aware of?  Do you have any comments on
> the patch below?

There is no reason why utf-7[-imap] should have utf-8 as coding-type, is there? utf-16 is definitely wrong (utf-7* are encoded in ASCII). What about the patch below instead?

By the way, there appears to be another, unrelated bug in utf-7-imap: According to RFC 2060, all C0 controls are base64-encoded, but in Emacs some of them are passed through unchanged (CR, LF and TAB). This is permitted by plain UTF-7 (RFC 1642) but not in the IMAP variant.

[utf-7.diff (application/octet-stream, attachment)]
[Message part 3 (text/plain, inline)]


Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 16:34:02 GMT) Full text and rfc822 format available.

Message #89 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org, handa <at> gnu.org, hirofumi <at> mail.parknet.co.jp
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Mon, 06 Apr 2020 19:33:16 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Mon, 6 Apr 2020 17:56:27 +0200
> Cc: OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>,
>         Kenichi Handa <handa <at> gnu.org>, 40407 <at> debbugs.gnu.org
> 
> > Kenichi, why was coding-type of UTF-7 systems set to 'utf-8'?
> > Wouldn't it be better to set it to 'utf-16'?  Or is there some
> > subtlety here that we should be aware of?  Do you have any comments on
> > the patch below?
> 
> There is no reason why utf-7[-imap] should have utf-8 as coding-type, is there?

I think it might be just some convenience thing: utf-7 and utf-8 have
something in common that made it convenient to treat them the same in
the internal routines.  Or maybe it's just an accident.

> utf-16 is definitely wrong (utf-7* are encoded in ASCII).

Why do you think the ASCII encoding contradicts the utf-16
coding-type?

> What about the patch below instead?

I don't think 'charset' is the right type for this encoding (any
reason why you've chosen it?), but I will let Handa-san comment.
Defining coding-systems is a black art which I don't think I ever
mastered.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 16:56:02 GMT) Full text and rfc822 format available.

Message #92 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org, handa <at> gnu.org, hirofumi <at> mail.parknet.co.jp
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Mon, 6 Apr 2020 18:55:30 +0200
6 apr. 2020 kl. 18.33 skrev Eli Zaretskii <eliz <at> gnu.org>:

> I think it might be just some convenience thing: utf-7 and utf-8 have
> something in common that made it convenient to treat them the same in
> the internal routines.  Or maybe it's just an accident.

There is nothing common between utf-7 and utf-8 at all (apart from a subset of ASCII being encoded in the same way, and the fact that both encode the Unicode repertoire).

> Why do you think the ASCII encoding contradicts the utf-16
> coding-type?

Because :coding type is the first stage of decoding, or the last stage of encoding. It reflects the low-level structure of the encoded data: using utf-16 as :coding-type implies that utf-7 is encoded into 16-bit parcels, but it's not -- the result of utf-7-imap encoding is a sequence of ASCII bytes. (UTF-16 plays a part in an intermediary step for some values before they are base64-encoded, but that's not visible in the final byte stream.)

> I don't think 'charset' is the right type for this encoding (any
> reason why you've chosen it?), but I will let Handa-san comment.

We could use 'raw-text' as well but that implies that any byte value could be part of an utf-7[-imap] text, which is incorrect.
In fact, utf-7-imap only uses codes 0x20-0x7e (utf-7 is allowed to use a few C0 controls too, as mentioned).

Arguably the heuristics of define-coding-system-internal are somewhat inscrutable. There seems to be leaks between layers -- ascii-compatible-p is an end-to-end property and cannot really be set the way it is by that function. But since it is, fixing it afterwards should be the correct way.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 17:19:02 GMT) Full text and rfc822 format available.

Message #95 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org, handa <at> gnu.org, hirofumi <at> mail.parknet.co.jp
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Mon, 06 Apr 2020 20:18:05 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Mon, 6 Apr 2020 18:55:30 +0200
> Cc: hirofumi <at> mail.parknet.co.jp, handa <at> gnu.org, 40407 <at> debbugs.gnu.org
> 
> 6 apr. 2020 kl. 18.33 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> > I think it might be just some convenience thing: utf-7 and utf-8 have
> > something in common that made it convenient to treat them the same in
> > the internal routines.  Or maybe it's just an accident.
> 
> There is nothing common between utf-7 and utf-8 at all (apart from a subset of ASCII being encoded in the same way, and the fact that both encode the Unicode repertoire).

By "in common" in this context I meant from the POV of internal
treating of the two encodings.

> > I don't think 'charset' is the right type for this encoding (any
> > reason why you've chosen it?), but I will let Handa-san comment.
> 
> We could use 'raw-text' as well but that implies that any byte value could be part of an utf-7[-imap] text, which is incorrect.
> In fact, utf-7-imap only uses codes 0x20-0x7e (utf-7 is allowed to use a few C0 controls too, as mentioned).
> 
> Arguably the heuristics of define-coding-system-internal are somewhat inscrutable. There seems to be leaks between layers -- ascii-compatible-p is an end-to-end property and cannot really be set the way it is by that function. But since it is, fixing it afterwards should be the correct way.

I prefer to wait for Handa-san's response, and meanwhile install the
least disruptive change, which just fixes the one aspect that got
broken.  Call me a coward, if you wish.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 17:50:02 GMT) Full text and rfc822 format available.

Message #98 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org, handa <at> gnu.org, hirofumi <at> mail.parknet.co.jp
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Mon, 6 Apr 2020 19:49:19 +0200
6 apr. 2020 kl. 19.18 skrev Eli Zaretskii <eliz <at> gnu.org>:

> By "in common" in this context I meant from the POV of internal
> treating of the two encodings.

So did I.

> I prefer to wait for Handa-san's response, and meanwhile install the
> least disruptive change, which just fixes the one aspect that got
> broken.  Call me a coward, if you wish.

If so, the least disruptive change by far would be the two calls to coding-system-put in my patch (along with the tests, of course).





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 18:14:02 GMT) Full text and rfc822 format available.

Message #101 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Mon, 6 Apr 2020 20:13:43 +0200
[Message part 1 (text/plain, inline)]
5 apr. 2020 kl. 17.56 skrev Eli Zaretskii <eliz <at> gnu.org>:

> Once we do set up default-file-name-coding-system, these macros will
> never return their argument (unless someone forcefully sets the
> encoding to nil, in which case they deserve what they get).  Do you
> agree?

Thank you, and yes, I do agree partly: ENCODE_FILE is the identity for all unibyte strings no matter the coding system in use.

However, my point (which I didn't do a very good job explaining) was that if either ENCODE_FILE or DECODE_FILE are called with the assumption that they return a new string, that is at least a latent bug.

Thus I went through them all once again, and found a few questionable calls that I'd like to fix. They rely on Fexpand_file_name returning a new string, which may or may not be true now but we would be better without such assumptions. (I also stumbled on a potential GC-related bug.) Patch attached!

With these fixed, nothing prevents those two functions from using no-copy semantics. I agree this approach is better and safer than going straight for code_convert_string_norecord in one pass.

[0001-Don-t-rely-on-copying-in-EN-DE-CODE_FILE.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 18:21:02 GMT) Full text and rfc822 format available.

Message #104 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org, handa <at> gnu.org, hirofumi <at> mail.parknet.co.jp
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Mon, 06 Apr 2020 21:20:14 +0300
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Mon, 6 Apr 2020 19:49:19 +0200
> Cc: hirofumi <at> mail.parknet.co.jp, handa <at> gnu.org, 40407 <at> debbugs.gnu.org
> 
> > I prefer to wait for Handa-san's response, and meanwhile install the
> > least disruptive change, which just fixes the one aspect that got
> > broken.  Call me a coward, if you wish.
> 
> If so, the least disruptive change by far would be the two calls to coding-system-put in my patch (along with the tests, of course).

Fine with me, but please say something about why we use 'put' instead
of specifying :ascii-compatible-p directly in the coding-system's
definition, and also add a FIXME there, since I hope this is not the
final word on that matter.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 18:35:01 GMT) Full text and rfc822 format available.

Message #107 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org,
 Mattias Engdegård <mattiase <at> acm.org>, handa <at> gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Tue, 07 Apr 2020 03:34:34 +0900
Eli Zaretskii <eliz <at> gnu.org> writes:

>> > I prefer to wait for Handa-san's response, and meanwhile install the
>> > least disruptive change, which just fixes the one aspect that got
>> > broken.  Call me a coward, if you wish.
>> 
>> If so, the least disruptive change by far would be the two calls to coding-system-put in my patch (along with the tests, of course).
>
> Fine with me, but please say something about why we use 'put' instead
> of specifying :ascii-compatible-p directly in the coding-system's
> definition, and also add a FIXME there, since I hope this is not the
> final word on that matter.

BTW, 

Mattias Engdegård <mattiase <at> acm.org> writes:

>  (define-coding-system 'utf-7-imap
>    "UTF-7 encoding of Unicode, IMAP version (RFC 2060)"
> -  :coding-type 'utf-8
> +  :coding-type 'charset
> +  :charset-list '(ascii)
>    :mnemonic ?u
> -  :charset-list '(unicode)
>    :pre-write-conversion 'utf-7-imap-pre-write-conversion
>    :post-read-conversion 'utf-7-imap-post-read-conversion)
> +;; See comment for utf-7 above.
> +(coding-system-put 'utf-7-imap :ascii-compatible-p nil)

(check-coding-systems-region "あ" nil '(utf-7-imap))
=> ((utf-7-imap 0))

It says "cannot encodable by utf-7-imap", so looks like ":charset-list
'(ascii)" doesn't work at least.

Thanks.
-- 
OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Mon, 06 Apr 2020 21:58:01 GMT) Full text and rfc822 format available.

Message #110 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>
Cc: 40407 <at> debbugs.gnu.org, handa <at> gnu.org, Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Mon, 6 Apr 2020 23:57:14 +0200
6 apr. 2020 kl. 20.34 skrev OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>:

> (check-coding-systems-region "あ" nil '(utf-7-imap))
> => ((utf-7-imap 0))
> 
> It says "cannot encodable by utf-7-imap", so looks like ":charset-list
> '(ascii)" doesn't work at least.

Thanks for catching that! The documentation doesn't explain the role of :charset-list for a :coding-type of 'utf-8, but a minimal patch seemed to work. Your example has been added to the test.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Thu, 09 Apr 2020 11:04:02 GMT) Full text and rfc822 format available.

Message #113 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>
Cc: 40407 <at> debbugs.gnu.org, handa <at> gnu.org, Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Thu, 9 Apr 2020 13:03:51 +0200
Eli, thank you very much for thinking about EOL conversion; obviously I didn't. I fixed a couple of glitches: a variable was used uninitialised, the logic didn't quite work for both unibyte and multibyte, and unless I'm mistaken it's LF that we should look for when encoding, not CR? Anyway, hope you don't mind.

The alternative would be to skip the no-conversion shortcut whenever EOL conversion applies, but memchr is quite fast and it's all likely to be in D1$ by that time.

I also need to thank Ogawa-san again for drawing my attention to check-coding-systems-region which crashes Emacs (SIGABRT) if given an invalid encoding; fixed.

An exhaustive search for other encodings that erroneously were marked as ASCII compatible found only one (chinese-hz); now fixed.

The calls to {ENCODE,DECODE}_FILE messily mutating the returned string have now also been fixed, along with a potential GC bug.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Thu, 09 Apr 2020 14:10:01 GMT) Full text and rfc822 format available.

Message #116 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Kazuhiro Ito <kzhr <at> d1.dion.ne.jp>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Thu, 09 Apr 2020 23:09:06 +0900
On Thu, 09 Apr 2020 20:03:51 +0900,
Mattias Engdegård wrote:
> 
> Eli, thank you very much for thinking about EOL conversion;
> obviously I didn't. I fixed a couple of glitches: a variable was
> used uninitialised, the logic didn't quite work for both unibyte and
> multibyte, and unless I'm mistaken it's LF that we should look for
> when encoding, not CR? Anyway, hope you don't mind.

I noticed that last-coding-system-used was not set when the fast path
was used.

(let ((string "ABCD\r\nEFGH")
      inhibit-eol-conversion)
  (decode-coding-string string 'raw-text-dos)
  (decode-coding-string string 'raw-text-unix)
  last-coding-system-used)

-> raw-text-dos

-- 
Kazuhiro Ito




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Thu, 09 Apr 2020 14:23:02 GMT) Full text and rfc822 format available.

Message #119 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Kazuhiro Ito <kzhr <at> d1.dion.ne.jp>
Cc: 40407 <at> debbugs.gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Thu, 9 Apr 2020 16:22:33 +0200
9 apr. 2020 kl. 16.09 skrev Kazuhiro Ito <kzhr <at> d1.dion.ne.jp>:

> I noticed that last-coding-system-used was not set when the fast path
> was used.

A keen observation -- thank you! Now fixed.





Reply sent to Mattias Engdegård <mattiase <at> acm.org>:
You have taken responsibility. (Sat, 11 Apr 2020 15:10:02 GMT) Full text and rfc822 format available.

Notification sent to Mattias Engdegård <mattiase <at> acm.org>:
bug acknowledged by developer. (Sat, 11 Apr 2020 15:10:02 GMT) Full text and rfc822 format available.

Message #124 received at 40407-done <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: 40407-done <at> debbugs.gnu.org
Cc: Kenichi Handa <handa <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org>,
 OGAWA Hirofumi <hirofumi <at> mail.parknet.co.jp>
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Sat, 11 Apr 2020 17:09:27 +0200
I think we are done here -- now that all calls to ENCODE_FILE and DECODE_FILE have been checked to be safe for no-copy semantics, there is no need to copy in the ASCII identity case; pushed to master.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Thu, 16 Apr 2020 13:12:02 GMT) Full text and rfc822 format available.

Message #127 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: handa <handa <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org, mattiase <at> acm.org, hirofumi <at> mail.parknet.co.jp,
 handa <at> gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Thu, 16 Apr 2020 22:11:37 +0900
In article <83wo6sr0s2.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:

> > > I don't think 'charset' is the right type for this encoding (any
> > > reason why you've chosen it?), but I will let Handa-san comment.
> > 
> > We could use 'raw-text' as well but that implies that any byte value could be part of an utf-7[-imap] text, which is incorrect.
> > In fact, utf-7-imap only uses codes 0x20-0x7e (utf-7 is allowed to use a few C0 controls too, as mentioned).

I don't remember why utf-7 has coding type utf-8.  As main
decoding/encoding routines of utf-7 are by Lisp (in utf-7.el which was
contributed not by me), perhaps, any other ASCII transparent types was
ok.  It seems that we should introduce a new type for such a coding
system.

> > Arguably the heuristics of define-coding-system-internal are somewhat inscrutable. There seems to be leaks between layers -- ascii-compatible-p is an end-to-end property and cannot really be set the way it is by that function. But since it is, fixing it afterwards should be the correct way.

> I prefer to wait for Handa-san's response, and meanwhile install the
> least disruptive change, which just fixes the one aspect that got
> broken.  Call me a coward, if you wish.

I think Mattias' patch is good.

---
K. Handa
handa <at> gnu.org




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Thu, 16 Apr 2020 13:45:01 GMT) Full text and rfc822 format available.

Message #130 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: handa <handa <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org, mattiase <at> acm.org, hirofumi <at> mail.parknet.co.jp,
 handa <at> gnu.org
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Thu, 16 Apr 2020 16:44:07 +0300
> From: handa <handa <at> gnu.org>
> Cc: hirofumi <at> mail.parknet.co.jp, mattiase <at> acm.org, 40407 <at> debbugs.gnu.org,
>  handa <at> gnu.org
> Date: Thu, 16 Apr 2020 22:11:37 +0900
> 
> > I prefer to wait for Handa-san's response, and meanwhile install the
> > least disruptive change, which just fixes the one aspect that got
> > broken.  Call me a coward, if you wish.
> 
> I think Mattias' patch is good.

Including making the coding-type of utf-7 'charset'?  I think that
didn't work, see an earlier message in this discussion:

  https://debbugs.gnu.org/cgi/bugreport.cgi?bug=40407#107

And could you please tell more about the conditions for a
coding-system to be a candidate for 'charset' coding-type? what
exactly is such a coding-system supposed to do/provide/support?  I
don't think this is documented anywhere, and most/all of the
coding-systems that have this type are simple single-byte encodings
that support just 256 codepoints (which is not what utf-7 is).

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40407; Package emacs. (Thu, 16 Apr 2020 14:00:02 GMT) Full text and rfc822 format available.

Message #133 received at 40407 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 40407 <at> debbugs.gnu.org, handa <handa <at> gnu.org>, hirofumi <at> mail.parknet.co.jp
Subject: Re: bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
Date: Thu, 16 Apr 2020 15:59:16 +0200
16 apr. 2020 kl. 15.44 skrev Eli Zaretskii <eliz <at> gnu.org>:

> Including making the coding-type of utf-7 'charset'?  I think that
> didn't work, see an earlier message in this discussion:
> 
>  https://debbugs.gnu.org/cgi/bugreport.cgi?bug=40407#107

Indeed, 'charset' was probably not the right choice. Perhaps we should revisit the decision to set :ascii-compatible-p automatically, since it is an end-to-end property that depends on the semantics of all parts in the coding chain, including the provided conversion functions and translation tables. The caller is in a much better position to know whether that property should be set.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 15 May 2020 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 38 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.