GNU bug report logs - #79296
30.2; format-time-string returns wrongly encoded string in MS Windows Japanese with cp65001 beta config

Package: emacs;

Reported by: Shingo Tanaka <shingo.fg8 <at> gmail.com>

Date: Sun, 24 Aug 2025 02:17:02 UTC

Severity: normal

Found in version 30.2

Done: Eli Zaretskii <eliz <at> gnu.org>

From: Eli Zaretskii <eliz <at> gnu.org>
To: Bruno Haible <bruno <at> clisp.org>
Cc: 79296 <at> debbugs.gnu.org, corwin <at> bru.st, shingo.fg8 <at> gmail.com, eggert <at> cs.ucla.edu
Subject: bug#79296: 30.2; format-time-string returns wrongly encoded string in MS Windows Japanese with cp65001 beta config
Date: Wed, 27 Aug 2025 15:04:45 +0300
> From: Bruno Haible <bruno <at> clisp.org>
> Cc: corwin <at> bru.st, shingo.fg8 <at> gmail.com, eggert <at> cs.ucla.edu,
>  79296 <at> debbugs.gnu.org
> Date: Wed, 27 Aug 2025 02:05:42 +0200
> 
> Eli Zaretskii wrote:
> > Any details beyond that general consideration?  Are you saying that
> > MSVCRT doesn't support codepage 65001 as a codeset of a locale,
> > whereas UCRT does?  Do the tests you wrote fail when linked with
> > MSVCRT?
> 
> Tried it now: running that unit test in the Windows UTF-8 environment, linked
> against MSVCRT:
> 
>   * GetACP() returns 65001. Which is not surprising, since GetACP() is a
>     Windows API, not a libc API.
> 
>   * setlocale (LC_ALL, "") fails. [This was the Gnulib setlocale() override.
>     I assume the MSVCRT setlocale failed in the same way.]
> 
>   * If you ignore the setlocale failure, MB_CUR_MAX is not >= 4. Meaning
>     that the locale encoding is not UTF-8.
> 
> MSVCRT supports only MB_CUR_MAX == 1 or == 2.

Thanks for this info.
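
For readers who want to repeat the check, the two libc-level tests quoted above can be sketched in C roughly as follows.  This is a minimal sketch, not Bruno's actual unit test: the helper names locale_could_be_utf8 and probe_locale are hypothetical.  Note that MB_CUR_MAX >= 4 is only a necessary condition for a UTF-8 locale encoding, not proof of one, since a few other encodings also use 4-byte characters.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Necessary (but not sufficient) condition for a UTF-8 locale
   encoding: a UTF-8 character can occupy up to 4 bytes.  */
static int
locale_could_be_utf8 (void)
{
  return MB_CUR_MAX >= 4;
}

/* Hypothetical helper: switch to the locale NAME ("" means the
   user's environment).  Return -1 if setlocale fails, otherwise
   1 or 0 depending on whether the encoding could be UTF-8.  */
static int
probe_locale (const char *name)
{
  assert (name != NULL);  /* NULL would only query the locale, not set it */
  if (setlocale (LC_ALL, name) == NULL)
    return -1;
  return locale_could_be_utf8 ();
}
```

Going by the results quoted above, a call such as probe_locale ("") would be expected to report failure when linked against MSVCRT in the Windows UTF-8 environment; I have not verified that with this exact code.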

> Looking at the output of "dumpbin /imports emacs.exe", I see that the Emacs
> binary uses the following symbols from MSVCRT:
> 
>                           6C ___lc_codepage_func
>                           6F ___mb_cur_max_func
>                          188 _getmbcp
>                          240 _mbschr
>                          252 _mbsinc
>                          256 _mbslwr
>                          27A _mbsncpy
>                          27E _mbsnextc
>                          28C _mbspbrk
>                          28E _mbsrchr
>                          302 _snprintf
>                          33C _stricmp
>                          343 _strlwr
>                          34A _strnicmp
>                          4B1 fprintf
>                          4D4 isalpha
>                          4DC isspace
>                          4EB isxdigit
>                          4EF localeconv
>                          51E setlocale
>                          534 strerror
>                          535 strftime
>                          556 tolower
>                          557 toupper
>                          55D vfprintf

Those are in most cases used only when w32-unicode-filenames is turned
off, which is supposed to happen only on Windows 9X (or in debugging).
The rest are used at startup, when the system locale and the
corresponding encoding machinery is not yet set up.

But yes, if turning on this UTF-8 feature doesn't make these functions
in MSVCRT use UTF-8 as the multibyte encoding, things will fall apart
in subtle ways when non-ASCII strings are involved.

> Additionally, the Emacs binary uses several DLLs, some of which
> also use locale-aware functions from libc. These DLLs will not
> work as expected either.

That's a separate issue, and it doesn't get resolved by linking Emacs
with UCRT.  That's because, AFAIK, if a DLL was linked against MSVCRT
at its build time, it will continue using MSVCRT even when called from
a program that uses UCRT.  So a person who wants to use UTF-8 as the
system codepage will need to make sure _all_ of the optional libraries
used by Emacs were also linked with UCRT.  Moreover, the source code
of those libraries should be UTF-8 aware.  For example, it should use
multibyte-aware functions for walking a string by character, instead
of assuming that each byte is a separate character.  And how many
ported Unix and GNU libraries are aware of that?  As a simple example,
it's enough to have something like

  char filename[MAX_PATH];

to run the risk of blowing up the stack if the file name is non-ASCII,
encoded in UTF-8, and is long enough.  (Emacs handles this particular
problem in its own code, but many external libraries don't.)
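
To make the two points above concrete, here is a minimal sketch in C, under stated assumptions.  It is not how Emacs or any particular library handles this: the names NAME_BUF_SIZE, utf8_char_count and copy_name_checked are hypothetical, and the counting function assumes its input is valid UTF-8.  The first function walks a string by character rather than by byte; the second refuses to copy a name whose byte length does not fit the buffer, instead of overrunning it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for MAX_PATH, which is 260 on Windows.  */
#define NAME_BUF_SIZE 260

/* Count the characters in a UTF-8 string by skipping continuation
   bytes (those of the form 10xxxxxx).  Assumes valid UTF-8.  */
static size_t
utf8_char_count (const char *s)
{
  size_t n = 0;
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}

/* Copy SRC into DST only if it fits, including the terminating
   null byte.  Return 0 on success, -1 if SRC is too long.  */
static int
copy_name_checked (char *dst, size_t dst_size, const char *src)
{
  size_t len;

  assert (dst != NULL && src != NULL);
  len = strlen (src);
  if (len >= dst_size)
    return -1;
  memcpy (dst, src, len + 1);
  return 0;
}
```

A name of 100 Japanese characters takes 200 bytes in cp932 but 300 bytes in UTF-8, so it fits a 260-byte buffer in the former encoding and not in the latter; code that copies it unchecked is what blows up the stack.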

> So, the only reasonable way forward, for supporting the Windows UTF-8
> environment, is to produce two sets of binaries for Emacs:
>   - one set of .exe and .dlls linked with MSVCRT, for use on old
>     Windows versions,
>   - one set of .exe and .dlls linked with UCRT, for use on Windows
>     versions from 2019 or newer [1].

The Emacs project doesn't produce binaries.  That is left to distros.
The MS-Windows binaries on the GNU FTP site are produced by Corwin who
volunteered for this job, so it is up to him what he wants to support
and how much he would agree to complicate his job.  Windows versions
before Vista (perhaps even before Windows 8.1) are already unsupported
by those binaries, since MSYS2 tossed them, so the resulting binaries
depend on APIs and DLLs that older systems don't have, and will thus
refuse to run on those older systems.

In addition, linking Emacs itself against UCRT is not enough, see
above.

For these reasons, I stand by my opinion that UTF-8 support on Windows
is not yet ready for prime time, and advise against turning it on if
one wants to use Emacs reliably on MS-Windows.  MS knew what they were
doing when they designated this feature "Beta".

As a stopgap, we could introduce Windows-specific variables in Emacs
through which users could specify the encoding to decode time strings
and perhaps other strings if needed, instead of automatically falling
back on locale-coding-system.  Then users like Shingo Tanaka could say

  (setq w32-time-coding-system 'cp932)

and have the time strings decoded correctly.




This bug report was last modified 21 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.