#79296 - 30.2; format-time-string returns wrongly encoded string in MS Windows Japanese with cp65001 beta config


Package: emacs;

Reported by: Shingo Tanaka <shingo.fg8 <at> gmail.com>

Date: Sun, 24 Aug 2025 02:17:02 UTC

Severity: normal

Found in version 30.2

Done: Eli Zaretskii <eliz <at> gnu.org>


From: Eli Zaretskii <eliz <at> gnu.org>
To: Bruno Haible <bruno <at> clisp.org>
Cc: 79296 <at> debbugs.gnu.org, corwin <at> bru.st, shingo.fg8 <at> gmail.com, eggert <at> cs.ucla.edu
Subject: bug#79296: 30.2; format-time-string returns wrongly encoded string in MS Windows Japanese with cp65001 beta config
Date: Wed, 27 Aug 2025 15:04:45 +0300

> From: Bruno Haible <bruno <at> clisp.org>
> Cc: corwin <at> bru.st, shingo.fg8 <at> gmail.com, eggert <at> cs.ucla.edu,
>  79296 <at> debbugs.gnu.org
> Date: Wed, 27 Aug 2025 02:05:42 +0200
>
> Eli Zaretskii wrote:
> > Any details beyond that general consideration? Are you saying that
> > MSVCRT doesn't support codepage 65001 as a codeset of a locale,
> > whereas UCRT does? Do the tests you wrote fail when linked with
> > MSVCRT?
>
> Tried it now: running that unit test in the Windows UTF-8 environment, linked
> against MSVCRT:
>
>   * GetACP() returns 65001. Which is not surprising, since GetACP() is a
>     Windows API, not a libc API.
>
>   * setlocale (LC_ALL, "") fails. [This was the Gnulib setlocale() override.
>     I assume the MSVCRT setlocale failed in the same way.]
>
>   * If you ignore the setlocale failure, MB_CUR_MAX is not >= 4. Meaning
>     that the locale encoding is not UTF-8.
>
> MSVCRT supports only MB_CUR_MAX == 1 or == 2.

Thanks for this info.

> Looking at the output of "dumpbin /imports emacs.exe", I see that the Emacs
> binary uses the following symbols from MSVCRT:
>
>     6C  ___lc_codepage_func
>     6F  ___mb_cur_max_func
>    188  _getmbcp
>    240  _mbschr
>    252  _mbsinc
>    256  _mbslwr
>    27A  _mbsncpy
>    27E  _mbsnextc
>    28C  _mbspbrk
>    28E  _mbsrchr
>    302  _snprintf
>    33C  _stricmp
>    343  _strlwr
>    34A  _strnicmp
>    4B1  fprintf
>    4D4  isalpha
>    4DC  isspace
>    4EB  isxdigit
>    4EF  localeconv
>    51E  setlocale
>    534  strerror
>    535  strftime
>    556  tolower
>    557  toupper
>    55D  vfprintf

Those are in most cases used only when w32-unicode-filenames is turned
off, which is supposed to happen only on Windows 9X (or in debugging).
The rest are used at startup, when the system locale and the
corresponding encoding machinery are not yet set up.

But yes, if turning on this UTF-8 feature doesn't make these functions
in MSVCRT use UTF-8 as the multibyte encoding, things will fall apart
in subtle ways when non-ASCII strings are involved.
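
The check described in the quoted test report (call setlocale, then look at MB_CUR_MAX) can be sketched in portable C roughly as follows. This is only an illustration, not the Gnulib unit test and not Emacs code; the names locale_probe and probe_locale are hypothetical, and the ">= 4" threshold is the heuristic from the report, which can rule UTF-8 out but cannot prove it.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Result of probing the C runtime's locale (hypothetical helper type). */
struct locale_probe {
    int setlocale_ok;  /* nonzero if setlocale (LC_ALL, "") succeeded */
    int mb_cur_max;    /* MB_CUR_MAX after the setlocale attempt */
    int maybe_utf8;    /* nonzero if mb_cur_max >= 4 */
};

/* Ask the C runtime to adopt the environment's locale, then record
   MB_CUR_MAX.  A value below 4 rules out UTF-8 as the locale encoding;
   a value of 4 or more is only a hint, since other multibyte encodings
   can also need 4 bytes per character. */
static void probe_locale(struct locale_probe *out)
{
    assert(out != NULL);
    out->setlocale_ok = setlocale(LC_ALL, "") != NULL;
    out->mb_cur_max = (int) MB_CUR_MAX;
    out->maybe_utf8 = out->mb_cur_max >= 4;
}
```

Going by the report above, a program linked against MSVCRT would be expected to see setlocale_ok == 0 and mb_cur_max of 1 or 2 in the Windows UTF-8 environment, whatever GetACP() returns.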
> Additionally, the Emacs binary uses several DLLs, some of which
> also use locale-aware functions from libc. These DLLs will not
> work as expected either.

That's a separate issue, and it doesn't get resolved by linking Emacs
with UCRT. That's because, AFAIK, if a DLL was linked against MSVCRT
at its build time, it will continue using MSVCRT even when called from
a program that uses UCRT. So a person who wants to use UTF-8 as the
system codepage will need to make sure _all_ of the optional libraries
used by Emacs were also linked with UCRT.

Moreover, the source code of those libraries should be UTF-8 aware.
For example, it should use multibyte-aware functions for walking a
string by character, instead of assuming that each byte is a separate
character. And how many ported Unix and GNU libraries are aware of
that? As a simple example, it's enough to have something like

  char filename[MAX_PATH];

to run the risk of blowing up the stack if the file name is non-ASCII,
encoded in UTF-8, and is long enough. (Emacs handles this particular
problem in its own code, but many external libraries don't.)

> So, the only reasonable way forward, for supporting the Windows UTF-8
> environment, is to produce two sets of binaries for Emacs:
>   - one set of .exe and .dlls linked with MSVCRT, for use on old
>     Windows versions,
>   - one set of .exe and .dlls linked with UCRT, for use on Windows
>     versions from 2019 or newer [1].

The Emacs project doesn't produce binaries. That is left to distros.
The MS-Windows binaries on the GNU FTP site are produced by Corwin, who
volunteered for this job, so it is up to him what he wants to support
and how much he is willing to complicate his job. Windows versions
before Vista (perhaps even before Windows 8.1) are already unsupported
by those binaries, since MSYS2 tossed them, so the resulting binaries
depend on APIs and DLLs that older systems don't have, and will thus
refuse to run on those older systems.
In addition, linking Emacs itself against UCRT is not enough, see
above.

For these reasons, I stand by my opinion that UTF-8 support on Windows
is not yet ready for prime time, and advise against turning it on if
one wants to use Emacs reliably on MS-Windows. MS knew what they were
doing when they designated this feature "Beta".

As a stopgap, we could introduce Windows-specific variables in Emacs
through which users could specify the encoding to decode time strings
and perhaps other strings if needed, instead of automatically falling
back on locale-coding-system. Then users like Shingo Tanaka could say

  (setq w32-time-coding-system 'cp932)

and have the time strings decoded correctly.
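
The fixed-size buffer risk raised earlier in this message (a file name held in something like char filename[MAX_PATH]) comes down to the difference between counting bytes and counting characters. The sketch below is a minimal illustration under stated assumptions: it is not code from Emacs or from any of the libraries discussed, the helper names utf8_char_count and copy_name_checked are hypothetical, and the character counter assumes valid UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a UTF-8 string by skipping continuation bytes
   (those of the form 10xxxxxx).  Assumes the input is valid UTF-8. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Copy src into dst, whose capacity is dst_size bytes including the
   terminating NUL.  Returns 0 on success, or -1 if src does not fit,
   in which case dst is left untouched.  The check is done in bytes,
   because that is what the buffer actually holds. */
static int copy_name_checked(char *dst, size_t dst_size, const char *src)
{
    size_t len;
    assert(dst != NULL && src != NULL);
    len = strlen(src);           /* length in bytes, not characters */
    if (len >= dst_size)
        return -1;
    memcpy(dst, src, len + 1);   /* include the terminating NUL */
    return 0;
}
```

A three-character Japanese name such as "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E" occupies 9 bytes in UTF-8, so a buffer sized by character count would be too small. Code that copies without a byte-length check of this kind is what runs the risk described above.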

This bug report was last modified 21 days ago.

