GNU bug report logs - #16046
Bug with Regexp Containing only a Character Class with a Caret


Package: emacs

Reported by: Cameron Desautels <camdez <at> gmail.com>

Date: Wed, 4 Dec 2013 10:06:03 UTC

Severity: normal

Done: Stefan Monnier <monnier <at> iro.umontreal.ca>

Bug is archived. No further changes may be made.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#16046; Package emacs. (Wed, 04 Dec 2013 10:06:03 GMT)

Acknowledgement sent to Cameron Desautels <camdez <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 04 Dec 2013 10:06:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Cameron Desautels <camdez <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: Bug with Regexp Containing only a Character Class with a Caret
Date: Tue, 3 Dec 2013 22:57:56 -0600

Hi all,

I've run across a dilemma, in the most literal sense: either there's a
problem in Emacs's regexp engine or there's an issue with
`regexp-opt-charset`---I'm not sure which.

The issue has to do with regular expressions containing character
classes with only a caret character.  I know this seems like a rather
silly case (why not just use "\\^"?) but it came up in the context of
trying to track down a bug in ruby-mode, so it does occur in real (and
particularly *programmatic*) settings.

The simplest case to reproduce is the following:

    (re-search-forward "[^]")
    ; => Debugger entered--Lisp error: (invalid-regexp "Unmatched [ or [^")
    ;   re-search-forward("[^]")
    ;   eval((re-search-forward "[^]") nil)
    ;   eval-last-sexp-1(t)
    ;   eval-last-sexp(t)
    ;   eval-print-last-sexp()
    ;   call-interactively(eval-print-last-sexp record nil)
    ;   command-execute(eval-print-last-sexp record)
    ;   execute-extended-command(nil "eval-print-last-sexp")
    ;   call-interactively(execute-extended-command nil nil)

Now, you can make a compelling case that that's not a valid regexp
(and the Emacs Lisp Reference Manual doesn't seem to *directly*
contradict this argument), but that presents a problem when paired
with `regexp-opt-charset`:

    (regexp-opt-charset '(?^))
    ; => "[^]"

Note that this produces the problematic regexp, which is to say that
the following code is bound to fail when it should succeed:

    (re-search-forward (regexp-opt-charset '(?^)))

What's the correct behavior? I'd be happy to offer a patch for either
side of the equation but I'm not sure which one to target.

All the best.

-- Cameron


In GNU Emacs 24.3.1 (x86_64-apple-darwin11.4.2, Carbon Version 1.6.0
AppKit 1138.51)
 of 2013-05-13 on atago
Windowing system distributor `Apple Inc.', version 10.9.0
Configured using:
 `configure '--with-mac'
 '--enable-mac-app=/Users/xin/Documents/emacs-mac-port/build'
 '--prefix=/Users/xin/Documents/emacs-mac-port/build''

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix
  default enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Load-path shadows:
/Applications/Emacs.app/Contents/Resources/lisp/.dir-locals hides
/Applications/Emacs.app/Contents/Resources/lisp/gnus/.dir-locals

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils help-mode easymenu debug time-date tooltip
ediff-hook vc-hooks lisp-float-type mwheel mac-win tool-bar dnd fontset
image regexp-opt fringe tabulated-list newcomment lisp-mode register
page menu-bar rfn-eshadow timer select scroll-bar mouse jit-lock
font-lock syntax facemenu font-core frame cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese hebrew
greek romanian slovak czech european ethiopic indian cyrillic chinese
case-table epa-hook jka-cmpr-hook help simple abbrev minibuffer loaddefs
button faces cus-face macroexp files text-properties overlay sha1 md5
base64 format env code-pages mule custom widget hashtable-print-readable
backquote mac multi-tty make-network-process emacs)




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#16046; Package emacs. (Thu, 05 Dec 2013 19:28:01 GMT)

Message #8 received at 16046 <at> debbugs.gnu.org:

From: Cameron Desautels <camdez <at> gmail.com>
To: 16046 <at> debbugs.gnu.org
Subject: Bug with Regexp Containing only a Character Class with a Caret (PATCH)
Date: Thu, 5 Dec 2013 13:26:58 -0600

After further experimentation, I suspect that "[^]" is simply not
a valid regular expression.  For instance, grep(1) gives the
following behavior:

    $ echo "^" | grep "[^]"
    grep: brackets ([ ]) not balanced

This suggests that the broken behavior is within
`regexp-opt-charset`.  I've attached a patch for that function.

Here are some test cases which reveal the behavior of the unpatched
and patched versions of the function (the only difference is the
handling of the "[^]" case):

    ;; Pre-patch
    (regexp-opt-charset (list ?^))          ; "[^]"
    (regexp-opt-charset (list ?^ ?a))       ; "[a^]"
    (regexp-opt-charset (list ?^ ?-))       ; "[-^]"
    (regexp-opt-charset (list ?^ ?\]))      ; "[]^]"
    (regexp-opt-charset (list ?^ ?- ?\]))   ; "[]^-]"

    ;; Post-patch
    (regexp-opt-charset (list ?^))          ; "\\^"
    (regexp-opt-charset (list ?^ ?a))       ; "[a^]"
    (regexp-opt-charset (list ?^ ?-))       ; "[-^]"
    (regexp-opt-charset (list ?^ ?\]))      ; "[]^]"
    (regexp-opt-charset (list ?^ ?- ?\]))   ; "[]^-]"
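
The post-patch behavior listed above can be approximated without
modifying `regexp-opt.el`.  The following is a minimal sketch, not the
attached patch: `my-charset-regexp` is a hypothetical wrapper name, and
the only assumption is that `regexp-opt-charset` handles every input
other than a lone caret correctly, as the test cases suggest.

```elisp
(require 'regexp-opt)

(defun my-charset-regexp (chars)
  "Return a regexp matching any single character in the list CHARS.
Hypothetical helper for illustration, not part of Emacs.  A lone
caret is special-cased because \"[^]\" is not a valid regexp."
  (if (equal chars '(?^))
      ;; Escape the caret instead of wrapping it in brackets.
      "\\^"
    ;; Every other case is delegated unchanged.
    (regexp-opt-charset chars)))

;; Usage: search for a literal caret in a string.
(string-match (my-charset-regexp '(?^)) "a^b")
```

Special-casing in a wrapper keeps callers working on an unpatched
Emacs; once the fix is installed, the wrapper becomes a no-op.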

--
Cameron Desautels <camdez <at> gmail.com>
[regexp-opt.el.diff (text/plain, attachment)]

Reply sent to Stefan Monnier <monnier <at> iro.umontreal.ca>:
You have taken responsibility. (Thu, 05 Dec 2013 20:27:01 GMT)

Notification sent to Cameron Desautels <camdez <at> gmail.com>:
bug acknowledged by developer. (Thu, 05 Dec 2013 20:27:02 GMT)

Message #13 received at 16046-done <at> debbugs.gnu.org:

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Cameron Desautels <camdez <at> gmail.com>
Cc: 16046-done <at> debbugs.gnu.org
Subject: Re: bug#16046: Bug with Regexp Containing only a Character Class with
 a Caret (PATCH)
Date: Thu, 05 Dec 2013 15:26:38 -0500

> After further experimentation, I suspect that "[^]" is simply not
> a valid regular expression.

Indeed, according to the documentation, for ^ to be treated as itself,
it needs to be "not the first char", but since we have nothing else to
put there, we're kind of screwed.
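
The placement rule can be checked directly.  This is only an
illustrative sketch; `my-regexp-valid-p` is a hypothetical helper, not
an Emacs function, and it simply reports whether a regexp compiles.

```elisp
(defun my-regexp-valid-p (regexp)
  "Return t if REGEXP compiles, nil if it signals `invalid-regexp'.
Hypothetical helper for illustration, not part of Emacs."
  (condition-case nil
      (progn (string-match regexp "") t)
    (invalid-regexp nil)))

;; A caret that is not the first character in the brackets is literal.
(string-match "[a^]" "x^")   ; matches the caret at index 1

;; A leading caret complements the set: any character except `a'.
(string-match "[^a]" "ab")   ; matches `b' at index 1

;; Outside brackets, a backslash-escaped caret is literal.
(string-match "\\^" "x^")    ; matches the caret at index 1

;; With nothing else in the set, the brackets never close.
(my-regexp-valid-p "[^]")    ; nil
(my-regexp-valid-p "[a^]")   ; t
```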

> This suggests that the broken behavior is within
> `regexp-opt-charset`.  I've attached a patch for that function.

Thank you for tracking down the problem and providing a fix.  I just
installed it in trunk, closing,


        Stefan




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 03 Jan 2014 12:24:03 GMT)
