GNU bug report logs - #6283
doc/lispref/searching.texi reference to octal code `0377' correct?
Reported by: MON KEY <monkey <at> sandpframing.com>
Date: Thu, 27 May 2010 17:29:02 UTC
Severity: minor
Done: Chong Yidong <cyd <at> stupidchicken.com>
Bug is archived. No further changes may be made.
> Date: Mon, 31 May 2010 20:24:00 -0400
> From: MON KEY <monkey <at> sandpframing.com>
> Cc: 6283 <at> debbugs.gnu.org
>
> If I evaluate the following:
>
> (progn
>   (save-excursion
>     (insert-byte (multibyte-char-to-unibyte 4194221) 1)
>     (insert-byte (multibyte-char-to-unibyte 4194303) 1))
>   (search-forward-regexp "ÿ" nil t))
>
> I don't match.
Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
is a raw byte. Emacs can distinguish between the two because it uses
a special multibyte representation for raw bytes, distinct from that
of any Unicode character. See this fragment from the ELisp manual:
     Emacs defines several special character sets. The character set
  `unicode' includes all the characters whose Emacs code points are in
  the range `0..#x10FFFF'. The character set `emacs' includes all ASCII
  and non-ASCII characters. Finally, the `eight-bit' charset includes
  the 8-bit raw bytes; Emacs uses it to represent raw bytes encountered
  in text.
and also this one:
     To support this multitude of characters and scripts, Emacs closely
  follows the "Unicode Standard". The Unicode Standard assigns a unique
  number, called a "codepoint", to each and every character. The range
  of codepoints defined by Unicode, or the Unicode "codespace", is
  `0..#x10FFFF' (in hexadecimal notation), inclusive. Emacs extends this
  range with codepoints in the range `#x110000..#x3FFFFF', which it uses
  for representing characters that are not unified with Unicode and "raw
  8-bit bytes" that cannot be interpreted as characters. Thus, a
  character codepoint in Emacs is a 22-bit integer number.
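As a rough illustration of the distinction described above (a sketch,
not taken from the manual; the function name `demo-raw-byte-vs-char' is
made up for this example), the raw byte #xFF and the character with
code #xFF can be compared by their code points:

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-raw-byte-vs-char ()
  "Compare raw byte #xFF with the character whose code is #xFF."
  (let ((raw (unibyte-char-to-multibyte #xFF)) ; raw byte 255 as a multibyte char
        (chr #xFF))                            ; LATIN SMALL LETTER Y WITH DIAERESIS
    (list raw                                  ; code point of the raw byte
          (= raw chr)                          ; are the two code points equal?
          (char-charset raw)                   ; charset of the raw byte
          (multibyte-char-to-unibyte raw)      ; back to the byte value
          (multibyte-char-to-unibyte 4194221)))) ; the other value from the report

(demo-raw-byte-vs-char)
;; => (4194303 nil eight-bit 255 173)
```

So 4194303 is #x3FFFFF, the top of the extended range, and it stands
for the byte 255 rather than for the character ÿ.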
> Whereas if I evaluate:
>
> (progn
>   (save-excursion (insert 10 #o377))
>   (search-forward-regexp "ÿ" nil t))
>
> I get a match.
Because `(insert 10 #o377)' inserts LATIN SMALL LETTER Y WITH
DIAERESIS, by design.
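A small sketch of that difference inside a buffer, assuming a default
multibyte temporary buffer (the function name
`demo-insert-char-vs-raw-byte' is hypothetical, used only here):

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-insert-char-vs-raw-byte ()
  "Insert character 255 and raw byte 255, then compare them."
  (with-temp-buffer
    (insert 10 #o377)   ; newline, then the character with code 255
    (insert 4194303)    ; the multibyte form of raw byte #xFF
    ;; Position 2 holds the character, position 3 the raw byte.
    (list (eq (char-charset (char-after 2)) 'eight-bit)
          (eq (char-charset (char-after 3)) 'eight-bit)
          (= (char-after 2) (char-after 3)))))

(demo-insert-char-vs-raw-byte)
;; => (nil t nil)
```

Only the second insertion lands in the `eight-bit' charset; the first
is an ordinary character.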
> Likewise, if I evaluate
>
> (progn (save-excursion (insert 10 4194303))
>        (search-forward-regexp "\377" nil t))
>
> I get a match.
>
> Which is to say, given the example regexp from the manual, i.e:
>
> ,----
> | You cannot always match all non-ASCII characters with the regular
> | expression `"[\200-\377]"'
> `----
>
> I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
> LATIN SMALL LETTER Y WITH DIAERESIS
Sounds like a bug to me --- not in the conventions used by the
manual, but rather in regexp search in Emacs. Feel free to file a
separate bug about that.
> To be clear, my issue isn't that I am not able to match `ÿ' but rather
> that I am able to match the raw-byte character representation with a
> visual appearance which coincides with the octal value for the `ÿ'
> character code i.e. #o377 this being otherwise widely understood as
> `octal 0377'.
>
> I hope this is more clear than the previous mail. I apologize if it is not.
I hope my answers make this issue more clear. (Did I say that use of
raw bytes is complicated and full of subtleties?)
This bug report was last modified 14 years and 358 days ago.