GNU bug report logs - #5812
expr: Difference in behavior of match and :
Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#5812; Package coreutils. (Wed, 31 Mar 2010 14:05:01 GMT)
Acknowledgement sent to Adil Mujeeb <mujeeb.adil <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 31 Mar 2010 14:05:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello team,
I have tried the following snippet in a bash script:
-bash-3.1$ userid=`expr "uid=11008(ADILM) gid=1200(cvs),1400(build)" : ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"`
-bash-3.1$ echo $userid
ADILM
-bash-3.1$
To my knowledge it should not be able to extract ADILM, as the regex does
not include uppercase letters (A-Z).
In the expr man page it is mentioned that:
-----8<----------
match STRING REGEXP
same as STRING : REGEXP
-----8<----------
So I tried the following snippet:-
-bash-3.1$ userid=`expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"`
-bash-3.1$ echo $userid
-bash-3.1$
I changed the regex and added uppercase letters:-
-bash-3.1$ userid=`expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[0-9]*(\(.[0-9A-Za-z]*\)) .*"`
-bash-3.1$ echo $userid
ADILM
-bash-3.1$
So it means that match is not the same as ":". From what I observe, ":"
uses case-insensitive matching while match is strictly case-sensitive.
Can you update the man page OR let me know if I am doing anything wrong?
Package:-
-bash-3.1$ rpm -qf /usr/bin/expr
coreutils-5.97-12.1.el5
-bash-3.1$
Thanks and Regards,
Adil Mujeeb
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#5812; Package coreutils. (Wed, 31 Mar 2010 21:51:02 GMT)
Message #8 received at 5812 <at> debbugs.gnu.org (full text, mbox):
On 03/31/2010 07:05 AM, Adil Mujeeb wrote:
> Hello team,
>
> I have tried the following snippet in a bash script:
>
> -bash-3.1$ userid=`expr "uid=11008(ADILM) gid=1200(cvs),1400(build)" : ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"`
> -bash-3.1$ echo $userid
> ADILM
> -bash-3.1$
I cannot repeat your results with 7.6 (the version in fedora 12) or the
latest coreutils.git.
$ expr "uid=11008(ADILM) gid=1200(cvs),1400(build)" : \
".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"
$ expr "uid=11008(ADILM) gid=1200(cvs),1400(build)" : \
".*uid=[0-9]*(\(.[0-9a-zA-Z]*\)) .*"
ADILM
Perhaps you have a locale issue at play?
> -bash-3.1$ rpm -qf /usr/bin/expr
> coreutils-5.97-12.1.el5
That's rather old. Perhaps it might be a bug that has been fixed in the
meantime, in which case, you would want to upgrade to 8.4.
At any rate, there's nothing in the source code that introduces any case
insensitivity, and the documentation is correct, that match and : behave
identically.
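
As a quick check of the point above, here is a minimal sketch that runs both spellings under one explicitly pinned locale and compares the results. It assumes GNU expr (the `match` keyword is a GNU extension and may be missing elsewhere); the variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare 'expr STRING : REGEXP' with 'expr match STRING REGEXP'
# under one fixed locale. Assumes GNU expr; variable names are illustrative.
str='uid=11008(ADILM) gid=1200(cvs),1400(build)'
re='.*uid=[0-9]*(\(.[0-9A-Za-z]*\))'

# Pin the locale for both calls so that collation cannot differ between them.
colon_result=$(LC_ALL=C expr "$str" : "$re")
match_result=$(LC_ALL=C expr match "$str" "$re")

if [ "$colon_result" = "$match_result" ]; then
    echo "same: $colon_result"
else
    echo "differ: ':' gave '$colon_result', match gave '$match_result'"
fi
```

If the two ever differ when run in separate sessions, comparing the output of `locale` in each session is probably the first thing to try.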
--
Eric Blake eblake <at> redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#5812; Package coreutils. (Sat, 03 Apr 2010 22:34:01 GMT)
Message #11 received at 5812 <at> debbugs.gnu.org (full text, mbox):
tags 5812 + moreinfo unreproducible
thanks
Adil Mujeeb wrote:
> I have tried the following snippet in a bash script:
>
> -bash-3.1$userid=`expr "uid=11008(ADILM) gid=1200(cvs),1400(build)" : ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"`
> -bash-3.1$echo $userid
> ADILM
> -bash-3.1$
>
> To my knowledge it should not be able to extract ADILM, as the regex does
> not include uppercase letters (A-Z).
Thank you for the bug report. It stands out as being exceptionally
well written and covering the needed information. However I believe
what you are seeing is intended behavior. It is an effect of the
character collation sequence chosen by your locale setting.
What is your locale?
$ locale
Your sort order depends upon your locale. You didn't say what your
locale was and therefore I assume that you were not aware that it
had an effect.
If your locale is set to a dictionary collation sequence such as
en_US.UTF-8 then this is the expected (not necessarily desired but
expected) behavior. You probably expected a US-ASCII sort ordering
but the powers that be (in the system, in libc, not in coreutils) have
decided that the collation ordering (sort ordering) for data should be
dictionary sort ordering. In dictionary ordering case is folded
together and punctuation is ignored. By having LANG set to any of the
"en*" locales the system is instructed to use dictionary sort
ordering. This affects almost everything on the system that sorts.
This includes commands such as 'ls' and also your shell (e.g. 'echo
*') too. Plus things like 'expr'.
The collation sequence of [a-z] in dictionary ordering is really
"aAbBcC...xXyYzZ" and not "abc...z". So when you say "[a-z]" you are
getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
really getting "AbBcC...xXyYzZ" without 'a'!
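
One way to see which letters a range really covers is to probe it one character at a time. The following is only a sketch: `probe_locale` is an illustrative name, and it is set to C because that locale always exists.

```shell
#!/bin/sh
# Sketch: list which sample letters the range [a-z] matches in one locale.
# 'probe_locale' is an illustrative name; C is used because it always exists.
probe_locale=C
matched=''
for ch in a A b B y Y z Z; do
    # expr exits 0 when the anchored pattern matches the single character.
    if LC_ALL=$probe_locale expr "$ch" : '[a-z]$' >/dev/null; then
        matched="$matched$ch"
    fi
done
echo "[a-z] in $probe_locale matches: $matched"
```

With the C locale this prints "[a-z] in C matches: abyz". Setting probe_locale to a name listed by `locale -a` shows what that locale does; the result depends on the locale data and libc version installed, so no output is claimed for other locales here.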
Here is what I see with your case example:
$ LC_ALL=C expr "uid=11008(ADILM) gid=1200(cvs),1400(build)" : ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"
...no output...
$ LC_ALL=en_US.UTF-8 expr "uid=11008(ADILM) gid=1200(cvs),1400(build)" : ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"
ADILM
> In the expr man page it is mentioned that:
> -----8<----------
> match STRING REGEXP
> same as STRING : REGEXP
> -----8<----------
> So I tried the following snippet:-
> -bash-3.1$ userid=`expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"`
> -bash-3.1$ echo $userid
> -bash-3.1$
> I changed the regex and added uppercase letters:-
> -bash-3.1$ userid=`expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[0-9]*(\(.[0-9A-Za-z]*\)) .*"`
> -bash-3.1$ echo $userid
> ADILM
> -bash-3.1$
> So it means that match is not the same as ":". From what I observe, ":"
> uses case-insensitive matching while match is strictly case-sensitive.
I cannot reproduce this behavior. But I am impressed that you went
looking for it. :-)
Was this perhaps tested on different machines? Or on any different
login account where different locale settings may have been in effect?
$ LC_ALL=C expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[0-9]*(\(.[0-9a-z]*\))"
...no output...
$ LC_ALL=en_US.UTF-8 expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[0-9]*(\(.[0-9a-z]*\))"
ADILM
In addition to setting LC_ALL=C in scripts that need standard behavior
you may want to use POSIX character classes here. They may help with
situations such as yours.
$ LC_ALL=C expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[[:digit:]]*(\(.[[:digit:][:upper:]]*\))"
ADILM
$ LC_ALL=en_US.UTF-8 expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[[:digit:]]*(\(.[[:digit:][:upper:]]*\))"
ADILM
$ LC_ALL=C expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[[:digit:]]*(\(.[[:digit:][:lower:]]*\))"
...no output...
$ LC_ALL=en_US.UTF-8 expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[[:digit:]]*(\(.[[:digit:][:lower:]]*\))"
...no output...
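
Putting the two suggestions together, a script might wrap the call as in the sketch below. `extract_user` is a hypothetical helper, not anything shipped with coreutils, and the pattern uses [[:alnum:]] rather than the leading "." of the original, which is an assumption about what the names look like.

```shell
#!/bin/sh
# Sketch: extract the name between parentheses after "uid=NNN".
# 'extract_user' is a hypothetical helper, not part of coreutils.
extract_user() {
    # LC_ALL=C pins collation; the POSIX classes avoid range expressions.
    LC_ALL=C expr "$1" : '.*uid=[[:digit:]]*(\([[:alnum:]]*\))'
}

user=$(extract_user 'uid=11008(ADILM) gid=1200(cvs),1400(build)')
echo "$user"
```

Names that contain characters such as "_" or "-" would not match this class and would need the bracket expression extended.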
> Can you update the man page OR let me know if I am doing anything wrong?
This behavior is so global that the hard part is deciding where to
document it. It shouldn't be documented everywhere. It is a libc
behavior, and everything that uses libc (everything!) gets the same
behavior. But 'sort' has taken the full force of it, so you might look
there for the best explanations.
The sort documentation says:
Unless otherwise specified, all comparisons use the character
collating sequence specified by the `LC_COLLATE' locale.(1)
...
(1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
`en_US'), then `sort' may produce output that is sorted differently
than you're accustomed to. In that case, set the `LC_ALL'
environment variable to `C'. Note that setting only `LC_COLLATE'
has two problems. First, it is ineffective if `LC_ALL' is also set.
Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
`LC_CTYPE' is unset) is set to an incompatible value. For example,
you get undefined behavior if `LC_CTYPE' is `ja_JP.PCK' but
`LC_COLLATE' is `en_US.UTF-8'.
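
The first problem in that footnote, that LC_COLLATE is ineffective while LC_ALL is set, can be seen with a small sketch. The sample letters are arbitrary, and the en_US.UTF-8 value is only there to be overridden.

```shell
#!/bin/sh
# Sketch: LC_ALL takes precedence, so the LC_COLLATE value here is not used.
sorted=$(printf '%s\n' b A a B |
    LC_ALL=C LC_COLLATE=en_US.UTF-8 sort |
    paste -s -d ' ' -)
echo "$sorted"
```

This prints "A B a b", the byte order of the C locale, even though LC_COLLATE names a different locale.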
Personally I have the following in my $HOME/.bashrc file.
export LANG=en_US.UTF-8
export LC_COLLATE=C
That sets most of my locale to a UTF-8 one but forces sorting to be
standard C/POSIX. This probably won't work in the general case since
I have no idea how that would interact with all character sets.
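
To see why those two lines work only while LC_ALL is empty, here is a sketch of the lookup order POSIX describes for a locale category: LC_ALL first, then the category variable, then LANG. `collate_source` is a made-up helper that takes the three values as arguments instead of reading the environment.

```shell
#!/bin/sh
# Sketch: which variable decides collation, following the POSIX order
# LC_ALL, then LC_COLLATE, then LANG. 'collate_source' is a made-up helper
# that takes the three values as arguments instead of reading the environment.
collate_source() {
    if [ -n "$1" ]; then
        echo "LC_ALL=$1"
    elif [ -n "$2" ]; then
        echo "LC_COLLATE=$2"
    elif [ -n "$3" ]; then
        echo "LANG=$3"
    else
        echo "none set (implementation default)"
    fi
}

# The .bashrc settings above, with LC_ALL left empty:
bashrc_case=$(collate_source '' 'C' 'en_US.UTF-8')
echo "$bashrc_case"

# To inspect the live environment instead:
collate_source "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"
```

The first echo prints "LC_COLLATE=C". If something later exports a non-empty LC_ALL, the first branch wins and the LC_COLLATE setting stops having any effect.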
You may want to look at the FAQ.
http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021
Notes:
* You don't need to include a trailing ".*" or " .*" in your pattern.
It won't affect your match and it will be slightly more efficient
without it.
* You don't need to capture the output with backticks and then echo it.
You can just run the command and display the output.
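
Both notes can be tried out with the sketch below. The variables exist only to put the two results side by side; the locale is pinned to C and the pattern includes A-Z so that a match is expected either way.

```shell
#!/bin/sh
# Sketch: the captured group is the same with and without the trailing " .*".
str='uid=11008(ADILM) gid=1200(cvs),1400(build)'

with_tail=$(LC_ALL=C expr "$str" : '.*uid=[0-9]*(\(.[0-9A-Za-z]*\)) .*')
without_tail=$(LC_ALL=C expr "$str" : '.*uid=[0-9]*(\(.[0-9A-Za-z]*\))')

echo "with tail:    $with_tail"
echo "without tail: $without_tail"

# Per the second note, the command can also be run directly, no backticks needed:
LC_ALL=C expr "$str" : '.*uid=[0-9]*(\(.[0-9A-Za-z]*\))'
```

All three commands print ADILM for this input.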
Thanks again for the very nice bug report!
Bob
Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org: bug#5812; Package coreutils. (Mon, 05 Apr 2010 04:34:02 GMT)
Message #14 received at 5812 <at> debbugs.gnu.org (full text, mbox):
Thanks, Bob, for such a nice explanation. Your instinct is right: it is
a locale problem.
-bash-3.1$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
-bash-3.1$
And the other point you made is also right. I didn’t realize that I was
using another session for comparing the result with match, which has a
different locale:-
-bash-3.1$ locale
LANG=ja_JP.UTF-8
LC_CTYPE="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_PAPER="ja_JP.UTF-8"
LC_NAME="ja_JP.UTF-8"
LC_ADDRESS="ja_JP.UTF-8"
LC_TELEPHONE="ja_JP.UTF-8"
LC_MEASUREMENT="ja_JP.UTF-8"
LC_IDENTIFICATION="ja_JP.UTF-8"
LC_ALL=
-bash-3.1$
I never knew that the locale has an effect on this behavior. We can close this bug.
Thank you so much for your time and details, I have learnt a new thing :)
Also, thanks for correcting my regex.
Thanks and Regards,
Adil Mujeeb
Reply sent to Bob Proulx <bob <at> proulx.com>: You have taken responsibility. (Mon, 05 Apr 2010 04:43:02 GMT)
Notification sent to Adil Mujeeb <mujeeb.adil <at> gmail.com>: bug acknowledged by developer. (Mon, 05 Apr 2010 04:43:02 GMT)
Message #19 received at 5812-done <at> debbugs.gnu.org (full text, mbox):
Adil Mujeeb wrote:
> Thanks, Bob, for such a nice explanation. Your instinct is right: it is
> a locale problem.
> ...
> And the other point you made is also right. I didn’t realize that I was
> using another session for comparing the result with match, which has a
> different locale:-
I thought it might have been something like that.
> I never knew that the locale has an effect on this behavior. We can close this bug.
I will close the bug with this message then.
> Thank you so much for your time and details, I have learnt a new thing :)
I am glad to have helped!
> Also, thanks for correcting my regex.
Sure thing!
Bob
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 03 May 2010 11:24:03 GMT)