GNU bug report logs - #67841
[PATCH] Clarify error messages for misuse of m4_warn and --help for -W.

Package: automake-patches

Reported by: Zack Weinberg <zack <at> owlfolio.org>

Date: Fri, 15 Dec 2023 20:45:02 UTC

Severity: normal

Tags: patch

Done: Karl Berry <karl <at> freefriends.org>

Bug is archived. No further changes may be made.

From: Jacob Bachmeyer <jcb62281 <at> gmail.com>
To: Karl Berry <karl <at> freefriends.org>
Cc: zack <at> owlfolio.org, 67841 <at> debbugs.gnu.org
Subject: [bug#67841] [PATCH] Clarify error messages for misuse of m4_warn and --help for -W.
Date: Tue, 19 Dec 2023 21:05:46 -0600

Karl Berry wrote:
>     "All possible characters have a UTF-8 representation so this function 
>     [encode_utf8] cannot fail."
>
> What about non-characters, i.e., byte sequences that are invalid UTF-8?
>   

Each individual byte gets encoded as UTF-8.  0x00..0x7F are an identity 
map, while 0x80..0xFF are translated to 2-octet sequences.  /Decoding/ 
UTF-8 can blow up or produce bogus results (I think Perl might just drop 
in the "substitute" character and emit a warning) but /encoding/ UTF-8 
always works, even on UTF-8.  Remember that Perl could handle arbitrary 
binary data long before it had Unicode support.
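
To make the byte-by-byte behavior concrete, here is a minimal sketch, assuming a Perl with the Encode module installed (variable names are illustrative only).  It feeds every octet value through encode_utf8 and also re-encodes bytes that are already UTF-8:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# A string holding every octet value 0x00..0xFF exactly once.
my $octets = join '', map { chr } 0 .. 255;

# Each octet is treated as a code point in the range 0..255.
my $encoded = encode_utf8($octets);

# 0x00..0x7F stay one octet each; 0x80..0xFF become two octets each.
printf "input length:  %d\n", length $octets;    # 256
printf "output length: %d\n", length $encoded;   # 128 + 2*128 = 384

# Identity map below 0x80, 2-octet sequence above it.
my $ascii = unpack 'H*', encode_utf8("\x41");    # "41"
my $high  = unpack 'H*', encode_utf8("\xE9");    # "c3a9"

# Encoding bytes that are already UTF-8 still succeeds; it just encodes
# each of those bytes again (so-called double encoding).
my $twice = unpack 'H*', encode_utf8(encode_utf8("\xE9"));   # "c383c2a9"

print "$ascii $high $twice\n";
```

The last line shows why "cannot fail" is not the same as "does what you meant": the call succeeds on any input, but applying it to already-encoded data yields double-encoded output.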

> What I found was that using \N{...} implies a Unicode string. From the
> charnames(3) man page (strangely not named "perlcharnames"):
>
>      Otherwise, any string that includes a "\N{charname}" or "\N{U+code
>      point}" will automatically have Unicode rules (see "Byte and
>      Character Semantics" in perlunicode).
>   

That page is named "charnames" because it documents the "charnames" 
pragmatic module.  The man page version was translated from the perldoc 
system when perl was built/installed/packaged.  The "perlunicode" page 
documents general Unicode support in Perl.

> Maybe pack("C") somehow can get to the bytes from a Unicode string?
>   

All strings in Perl are Unicode now, internally stored as UTF-8 or, as 
an optimization if no codepoints exceed 255, raw octets.  (A string of 
raw octets is considered to be a sequence of characters in the range 
[0,255].)  The "utf8 flag" on a string indicates which of those forms is 
in use on any particular string.  Using encode_utf8 simply gives you the 
internal encoding, converting an octet string to UTF-8 if needed, marked 
as an octet string.  If the string is already UTF-8, encode_utf8 simply 
clears the utf8 flag so you get access to the raw bytes.  (Brain twisted 
yet?  Mine was when I first looked at this...)
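
The two internal forms can be observed with utf8::is_utf8 and utf8::upgrade, which are built into perl.  The following is only a sketch for illustration (the variable names are made up), assuming no "use utf8" or "unicode_strings" feature is in effect:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $latin = "caf\xE9";     # all code points <= 255: stored as raw octets
my $wide  = "\x{263A}";    # code point > 255: must be stored as UTF-8

printf "latin flag: %d\n", utf8::is_utf8($latin) ? 1 : 0;   # 0
printf "wide flag:  %d\n", utf8::is_utf8($wide)  ? 1 : 0;   # 1

# Force the UTF-8 internal form on a copy; the characters are unchanged.
my $upgraded = $latin;
utf8::upgrade($upgraded);
printf "same characters: %d\n", $upgraded eq $latin ? 1 : 0;   # 1

# encode_utf8 gives the same octets for either internal form,
# and the result is an octet string (flag off).
my $from_octets   = encode_utf8($latin);
my $from_upgraded = encode_utf8($upgraded);
my $wide_bytes    = encode_utf8($wide);

printf "wide: %d character, %d octets\n",
    length $wide, length $wide_bytes;                          # 1, 3
printf "result flag: %d\n", utf8::is_utf8($wide_bytes) ? 1 : 0; # 0
```

The point of the upgraded copy is that the internal form is not visible at the character level: both copies compare equal and encode to the same octets.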

Perl's Unicode handling is fun because Perl could always handle binary 
data, and Unicode support was more-or-less retrofitted on top of that 
support for binary data.  In other words, if your program does not 
handle Unicode properly (or if you are running on Perl 5.6 and your 
program does not do the Perl 5.6 magic Unicode dances), Perl will treat 
"Unicode" data as its underlying octet sequence; thus my earlier advice 
to conditionally import Encode and shim encode_utf8 with an identity 
function if Encode is not available.
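
A rough sketch of what such a shim could look like follows.  This is an illustration under the assumptions stated above, not code taken from the patch under discussion:

```perl
use strict;
use warnings;

BEGIN {
    # Load Encode if this perl has it; otherwise fall back to an
    # identity function, since a perl without Encode treats the data
    # as its underlying octet sequence anyway.
    if (eval { require Encode; 1 }) {
        Encode->import('encode_utf8');
    }
    else {
        *encode_utf8 = sub { $_[0] };
    }
}

# Callers use encode_utf8 the same way in either case.
my $out = encode_utf8("plain ASCII text");
print "$out\n";
```

Doing the check in a BEGIN block means encode_utf8 is defined before the rest of the file is compiled, so later calls parse the same way whether or not Encode was found.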


-- Jacob




This bug report was last modified 1 year and 151 days ago.

