GNU bug report logs - #6283
doc/lispref/searching.texi reference to octal code `0377' correct?


Package: emacs

Reported by: MON KEY <monkey <at> sandpframing.com>

Date: Thu, 27 May 2010 17:29:02 UTC

Severity: minor

Done: Chong Yidong <cyd <at> stupidchicken.com>

Bug is archived. No further changes may be made.


From: Eli Zaretskii <eliz <at> gnu.org>
To: MON KEY <monkey <at> sandpframing.com>
Cc: 6283 <at> debbugs.gnu.org
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Tue, 01 Jun 2010 21:38:41 +0300
> Date: Mon, 31 May 2010 20:24:00 -0400
> From: MON KEY <monkey <at> sandpframing.com>
> Cc: 6283 <at> debbugs.gnu.org
> 
> If I evaluate the following:
> 
>  (progn
>    (save-excursion
>      (insert-byte (multibyte-char-to-unibyte 4194221) 1)
>      (insert-byte (multibyte-char-to-unibyte 4194303) 1))
>    (search-forward-regexp "ÿ" nil t))
> 
> I don't match.

Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
is a raw byte.  Emacs can distinguish between the two because it uses
a special multibyte representation for raw bytes, which is different
from the representation of any Unicode character.  See this fragment
from the ELisp manual:

     Emacs defines several special character sets.  The character set
  `unicode' includes all the characters whose Emacs code points are in
  the range `0..#x10FFFF'.  The character set `emacs' includes all ASCII
  and non-ASCII characters.  Finally, the `eight-bit' charset includes
  the 8-bit raw bytes; Emacs uses it to represent raw bytes encountered
  in text.

and also this one:

     To support this multitude of characters and scripts, Emacs closely
  follows the "Unicode Standard".  The Unicode Standard assigns a unique
  number, called a "codepoint", to each and every character.  The range
  of codepoints defined by Unicode, or the Unicode "codespace", is
  `0..#x10FFFF' (in hexadecimal notation), inclusive.  Emacs extends this
  range with codepoints in the range `#x110000..#x3FFFFF', which it uses
  for representing characters that are not unified with Unicode and "raw
  8-bit bytes" that cannot be interpreted as characters.  Thus, a
  character codepoint in Emacs is a 22-bit integer number.
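
A minimal sketch of the distinction these two fragments describe,
assuming Emacs 23 or later.  `my-raw-byte-p' is a hypothetical helper
name used only for illustration, not an Emacs function; everything it
calls is built in.

```elisp
;; `my-raw-byte-p' is a hypothetical helper, not part of Emacs.
(defun my-raw-byte-p (char)
  "Return non-nil if CHAR belongs to the `eight-bit' charset (a raw byte)."
  (eq (char-charset char) 'eight-bit))

(my-raw-byte-p #xff)      ; LATIN SMALL LETTER Y WITH DIAERESIS => nil
(my-raw-byte-p #x3fffff)  ; raw byte #xff in its multibyte form => t

;; Converting between the two representations of a raw byte.
(unibyte-char-to-multibyte #xff)     ; => 4194303 (#x3fffff)
(multibyte-char-to-unibyte 4194303)  ; => 255 (#xff)
(multibyte-char-to-unibyte 4194221)  ; => 173 (#xad)
```

The raw bytes #x80..#xff occupy the top of the extended range, at
#x3fff80..#x3fffff, which is why 4194303 and 255 name different things
even though both are displayed in terms of the octal value 377.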

> Whereas if I evaluate:
> 
>  (progn
>    (save-excursion (insert 10 #o377))
>    (search-forward-regexp "ÿ" nil t))
> 
> I get a match.

Because `(insert 10 #o377)' inserts LATIN SMALL LETTER Y WITH
DIAERESIS, by design.
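
The difference between the two insertion functions can be sketched as
follows, assuming a multibyte buffer.  `my-char-after-insert' is a
hypothetical helper introduced only for this illustration.

```elisp
;; `my-char-after-insert' is a hypothetical helper, not part of Emacs.
(defun my-char-after-insert (inserter)
  "Call INSERTER in a fresh multibyte buffer; return the char before point."
  (with-temp-buffer
    (set-buffer-multibyte t)
    (funcall inserter)
    (char-before)))

;; `insert' treats the integer as a character code point.
(my-char-after-insert (lambda () (insert #o377)))         ; => 255

;; `insert-byte' treats it as a byte; in a multibyte buffer a byte in
;; the range 128..255 becomes the corresponding `eight-bit' character.
(my-char-after-insert (lambda () (insert-byte #o377 1)))  ; => 4194303
```

So the same number, #o377, ends up as two different buffer characters
depending on which function inserted it.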

> Likewise, if I evaluate
> 
>  (progn (save-excursion (insert 10 4194303))
>         (search-forward-regexp "\377" nil t))
> 
> I get a match.
> 
> Which is to say, given the example regexp from the manual, i.e:
> 
> ,----
> | You cannot always match all non-ASCII characters with the regular
> | expression `"[\200-\377]"'
> `----
> 
> I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
> LATIN SMALL LETTER Y WITH DIAERESIS

Sounds like a bug to me --- not in the conventions used by the
manual, but rather in regexp search in Emacs.  Feel free to file a
separate bug about that.

> To be clear, my issue isn't that I am not able to match `ÿ' but rather
> that I am able to match the raw-byte character representation with a
> visual appearance which coincides with the octal value for the `ÿ'
> character code i.e. #o377 this being otherwise widely understood as
> `octal 0377'.
> 
> I hope this is more clear than the previous mail. I apologize if it is not.

I hope my answers make this issue more clear.  (Did I say that use of
raw bytes is complicated and full of subtleties?)





This bug report was last modified 14 years and 358 days ago.


