#6283 - doc/lispref/searching.texi reference to octal code `0377' correct?

GNU bug report logs - #6283
doc/lispref/searching.texi reference to octal code `0377' correct?

Package: emacs;

Reported by: MON KEY <monkey <at> sandpframing.com>

Date: Thu, 27 May 2010 17:29:02 UTC

Severity: minor

Done: Chong Yidong <cyd <at> stupidchicken.com>

Bug is archived. No further changes may be made.


From: Eli Zaretskii <eliz <at> gnu.org>
To: MON KEY <monkey <at> sandpframing.com>
Cc: 6283 <at> debbugs.gnu.org
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Tue, 01 Jun 2010 21:38:41 +0300

> Date: Mon, 31 May 2010 20:24:00 -0400
> From: MON KEY <monkey <at> sandpframing.com>
> Cc: 6283 <at> debbugs.gnu.org
>
> If I evaluate the following:
>
> (progn
>   (save-excursion
>     (insert-byte (multibyte-char-to-unibyte 4194221) 1)
>     (insert-byte (multibyte-char-to-unibyte 4194303) 1))
>   (search-forward-regexp "ÿ" nil t))
>
> I don't match.

Because ÿ is a character, whereas `(multibyte-char-to-unibyte
4194303)' is a raw byte.  Emacs can distinguish between these two
because it uses a special multibyte representation for raw bytes,
which is different from any other Unicode character.  See this
fragment from the ELisp manual:

     Emacs defines several special character sets.  The character set
     `unicode' includes all the characters whose Emacs code points are
     in the range `0..#x10FFFF'.  The character set `emacs' includes
     all ASCII and non-ASCII characters.  Finally, the `eight-bit'
     charset includes the 8-bit raw bytes; Emacs uses it to represent
     raw bytes encountered in text.

and also this one:

     To support this multitude of characters and scripts, Emacs
     closely follows the "Unicode Standard".  The Unicode Standard
     assigns a unique number, called a "codepoint", to each and every
     character.  The range of codepoints defined by Unicode, or the
     Unicode "codespace", is `0..#x10FFFF' (in hexadecimal notation),
     inclusive.  Emacs extends this range with codepoints in the range
     `#x110000..#x3FFFFF', which it uses for representing characters
     that are not unified with Unicode and "raw 8-bit bytes" that
     cannot be interpreted as characters.  Thus, a character codepoint
     in Emacs is a 22-bit integer number.

> Whereas if I evaluate:
>
> (progn
>   (save-excursion (insert 10 #o377))
>   (search-forward-regexp "ÿ" nil t))
>
> I get a match.

Because `(insert 10 #o377)' inserts LATIN SMALL LETTER Y WITH
DIAERESIS, by design.

> Likewise, if I evaluate
>
> (progn (save-excursion (insert 10 4194303))
>   (search-forward-regexp "\377" nil t))
>
> I get a match.
>
> Which is to say, given the example regexp from the manual, i.e:
>
> ,----
> | You cannot always match all non-ASCII characters with the regular
> | expression `"[\200-\377]"'
> `----
>
> I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
> LATIN SMALL LETTER Y WITH DIAERESIS

Sounds like a bug to me --- not in the conventions used by the manual,
but rather in regexp search in Emacs.  Feel free to file a separate
bug about that.

> To be clear, my issue isn't that I am not able to match `ÿ' but rather
> that I am able to match the raw-byte character representation with a
> visual appearance which coincides with the octal value for the `ÿ'
> character code i.e. #o377 this being otherwise widely understood as
> `octal 0377'.
>
> I hope this is more clear than the previous mail. I apologize if it is not.

I hope my answers make this issue more clear.  (Did I say that use of
raw bytes is complicated and full of subtleties?)
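
The distinction drawn above between the character ÿ and the raw byte
#xFF can be checked by looking at the codes Emacs stores in a buffer.
What follows is only a minimal sketch, assuming Emacs 23 or later and
a multibyte buffer; `demo-char-vs-raw-byte' is a hypothetical helper
name used for illustration and is not part of Emacs.

```elisp
;; Sketch only; assumes Emacs 23 or later.
;; `demo-char-vs-raw-byte' is a hypothetical name, not an Emacs function.
(defun demo-char-vs-raw-byte ()
  "Return the codes of a character and a raw byte inserted into a buffer."
  (with-temp-buffer
    ;; Temporary buffers are multibyte unless the default was changed.
    (insert #o377)        ; the character LATIN SMALL LETTER Y WITH DIAERESIS
    (insert-byte #xff 1)  ; one raw byte #xFF
    (list (char-after 1) (char-after 2))))

;; Raw bytes #x80..#xFF sit at the top of Emacs's 22-bit code space.
(unibyte-char-to-multibyte #xff)     ; raw byte -> its character code
(multibyte-char-to-unibyte 4194303)  ; and back to the byte
(= (max-char) #x3fffff 4194303)      ; largest character code
```

Evaluating `(demo-char-vs-raw-byte)' should return two different
codes, 255 for the character and 4194303 for the raw byte, which is
consistent with a search for "ÿ" not finding the raw byte.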

This bug report was last modified 15 years and 41 days ago.

