#67841 - [PATCH] Clarify error messages for misuse of m4_warn and --help for -W.

GNU bug report logs - #67841

Reported by: Zack Weinberg <zack <at> owlfolio.org>

Date: Fri, 15 Dec 2023 20:45:02 UTC

Severity: normal

Tags: patch

Done: Karl Berry <karl <at> freefriends.org>

Bug is archived. No further changes may be made.

From: Jacob Bachmeyer <jcb62281 <at> gmail.com>
To: Karl Berry <karl <at> freefriends.org>
Cc: zack <at> owlfolio.org, 67841 <at> debbugs.gnu.org
Subject: [bug#67841] [PATCH] Clarify error messages for misuse of m4_warn and --help for -W.
Date: Tue, 19 Dec 2023 21:05:46 -0600

Karl Berry wrote:
>     "All possible characters have a UTF-8 representation so this function
>     [encode_utf8] cannot fail."
>
> What about non-characters, i.e., byte sequences that are invalid UTF-8?

Each individual byte gets encoded as UTF-8. 0x00..0x7F are an identity map, while 0x80..0xFF are translated to 2-octet sequences. /Decoding/ UTF-8 can blow up or produce bogus results (I think Perl might just drop in the "substitute" character and emit a warning) but /encoding/ UTF-8 always works, even on data that is already UTF-8. Remember that Perl could handle arbitrary binary data long before it had Unicode support.

> What I found was that using \N{...} implies a Unicode string. From the
> charnames(3) man page (strangely not named "perlcharnames"):
>
>     Otherwise, any string that includes a "\N{charname}" or "\N{U+code
>     point}" will automatically have Unicode rules (see "Byte and
>     Character Semantics" in perlunicode).

That page is named "charnames" because it documents the "charnames" pragmatic module. The man page version was translated from the perldoc system when perl was built/installed/packaged. The "perlunicode" page documents general Unicode support in Perl.

> Maybe pack("C") somehow can get to the bytes from a Unicode string?

All strings in Perl are Unicode now, internally stored as UTF-8 or, as an optimization if no codepoints exceed 255, raw octets. (A string of raw octets is considered to be a sequence of characters in the range [0,255].) The "utf8 flag" on a string indicates which of those forms is in use on any particular string. Using encode_utf8 simply gives you the internal encoding, converting an octet string to UTF-8 if needed, with the result marked as an octet string. If the string is already UTF-8, encode_utf8 simply clears the utf8 flag so you get access to the raw bytes. (Brain twisted yet? Mine was when I first looked at this...)
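To make the two cases above concrete, here is a minimal sketch, assuming a Perl recent enough to ship the Encode module and utf8::is_utf8. The variable names are illustrative only and do not come from the patch under discussion.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# An octet string: every character is in the range [0,255].
my $octets = "\x41\x7F\x80\xFF";

# 0x00..0x7F map to themselves; 0x80..0xFF become two octets each.
my $encoded = encode_utf8($octets);
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $encoded;
printf "%d characters -> %d octets: %s\n",
    length($octets), length($encoded), $hex;

# A code point above 255 forces the internal UTF-8 form (utf8 flag set).
my $wide = "\x{263A}";
my $flag_before = utf8::is_utf8($wide) ? 1 : 0;

# encode_utf8 returns the same bytes as an octet string (flag cleared).
my $bytes = encode_utf8($wide);
my $flag_after = utf8::is_utf8($bytes) ? 1 : 0;

print "utf8 flag before: $flag_before, after: $flag_after\n";
print "length in octets: ", length($bytes), "\n";
```

The first line printed should show four characters becoming six octets (41 7F C2 80 C3 BF), and the wide character should come back as three octets with the utf8 flag cleared.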
Perl's Unicode handling is fun because Perl could always handle binary data, and Unicode support was more-or-less retrofitted on top of that support for binary data. In other words, if your program does not handle Unicode properly (or if you are running on Perl 5.6 and your program does not do the Perl 5.6 magic Unicode dances), Perl will treat "Unicode" data as its underlying octet sequence; thus my earlier advice to conditionally import Encode and shim encode_utf8 with an identity function if Encode is not available.

-- Jacob
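A conditional import with an identity shim, as described in the message above, might look roughly like the following. This is only a sketch under stated assumptions: it is not necessarily the code proposed earlier in the thread, and the sample string is hypothetical.

```perl
use strict;
use warnings;

BEGIN {
    # Load Encode if this Perl has it; otherwise install a fallback.
    if (eval { require Encode; 1 }) {
        Encode->import('encode_utf8');
    }
    else {
        # Identity shim (assumption): without Encode, strings are
        # treated as their underlying octet sequence already.
        no strict 'refs';
        *{'main::encode_utf8'} = sub { return $_[0] };
    }
}

# Callers use encode_utf8 the same way in either case.
my $sample = encode_utf8("plain ASCII text");
print "$sample\n";
```

Doing the check in a BEGIN block means encode_utf8 is defined before the rest of the file is compiled, so later calls look the same whether the real function or the shim is in place.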

