#5812 - expr: Difference in behavior of match and :

GNU bug report logs - #5812
expr: Difference in behavior of match and :

Reported by: Adil Mujeeb <mujeeb.adil <at> gmail.com>

Date: Wed, 31 Mar 2010 14:05:01 UTC

Severity: normal

Done: Bob Proulx <bob <at> proulx.com>

Bug is archived. No further changes may be made.

Message #11 received at 5812 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: Adil Mujeeb <mujeeb.adil <at> gmail.com>
Cc: 5812 <at> debbugs.gnu.org
Subject: Re: bug#5812: expr: Difference in behavior of match and :
Date: Sat, 3 Apr 2010 16:33:53 -0600

tags 5812 + moreinfo unreproducible
thanks

Adil Mujeeb wrote:
> I have tried following snippet in a bash script:
>
> -bash-3.1$userid=`expr "uid=11008(ADILM) gid=1200(cvs),1400(build)" : ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"`
> -bash-3.1$echo $userid
> ADILM
> -bash-3.1$
>
> To my knowledge it should not able to extract ADILM as the regex does not
> include uppercase letters (A-Z).

Thank you for the bug report.  It stands out as being exceptionally
well written and covering the needed information.  However I believe
what you are seeing is intended behavior.  It is an effect of the
character collation sequence chosen by your locale setting.  What is
your locale?

  $ locale

Your sort order depends upon your locale.  You didn't say what your
locale was and therefore I assume that you were not aware that it had
an effect.  If your locale is set to a dictionary collation sequence
such as en_US.UTF-8 then this is the expected (not necessarily desired
but expected) behavior.

You probably expected a US-ASCII sort ordering but the powers that be
(in the system, in libc, not in coreutils) have decided that the
collation ordering (sort ordering) for data should be dictionary sort
ordering.  In dictionary ordering case is folded together and
punctuation is ignored.  By having LANG set to any of the "en*"
locales the system is instructed to use dictionary sort ordering.

This affects almost everything on the system that sorts.  This
includes commands such as 'ls' and also your shell (e.g. 'echo *')
too.  Plus things like 'expr'.

The collation sequence of [a-z] in dictionary ordering is really
"aAbBcC...xXyYzZ" and not "abc...z".  So when you say "[a-z]" you are
getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
really getting "AbBcC...xXyYzZ" with 'A'!

Here is what I see with your case example:

  $ LC_ALL=C expr "uid=11008(ADILM) gid=1200(cvs),1400(build)" : ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"
  ...no output...
  $ LC_ALL=en_US.UTF-8 expr "uid=11008(ADILM) gid=1200(cvs),1400(build)" : ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"
  ADILM

> In the expr man page it is mentioned that:
> -----8<----------
> match STRING REGEXP
>        same as STRING : REGEXP
> -----8<----------
> So i tried following snippet:-
> -bash-3.1$ userid=`expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[0-9]*(\(.[0-9a-z]*\)) .*"`
> -bash-3.1$ echo $userid
> -bash-3.1$
> I changed the regex and added uppercase letters:-
> -bash-3.1$ userid=`expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[0-9]*(\(.[0-9A-Za-z]*\)) .*"`
> -bash-3.1$ echo $userid
> ADILM
> -bash-3.1$
> So it means that match is not same as ":". As per observation ":" uses
> case-insensitive matching while match is strict case sensitive matching.

I cannot reproduce this behavior.  But I am impressed that you went
looking for it.  :-)

Was this perhaps tested on different machines?  Or on any different
login account where different locale settings may have been in effect?

  $ LC_ALL=C expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[0-9]*(\(.[0-9a-z]*\))"
  ...no output...

  $ LC_ALL=en_US.UTF-8 expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[0-9]*(\(.[0-9a-z]*\))"
  ADILM

In addition to setting LC_ALL=C in scripts that need standard behavior
you may want to use POSIX character classes here.  They may help with
situations such as yours.

  $ LC_ALL=C expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[[:digit:]]*(\(.[[:digit:][:upper:]]*\))"
  ADILM

  $ LC_ALL=en_US.UTF-8 expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[[:digit:]]*(\(.[[:digit:][:upper:]]*\))"
  ADILM

  $ LC_ALL=C expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[[:digit:]]*(\(.[[:digit:][:lower:]]*\))"
  ...no output...

  $ LC_ALL=en_US.UTF-8 expr match "uid=11008(ADILM) gid=1200(cvs),1400(build)" ".*uid=[[:digit:]]*(\(.[[:digit:][:lower:]]*\))"
  ...no output...

> Can you update the man page OR let me know if i am doing anything wrong?
This is something that has such global behavior that the problem comes
in where do you document it?  It shouldn't be documented everywhere.
It is a libc behavior and everything that uses libc (everything!) will
get the same behavior.  But 'sort' has taken the full force of it and
so you might look there for the best explanations.

The sort documentation says:

     Unless otherwise specified, all comparisons use the character
     collating sequence specified by the `LC_COLLATE' locale.(1)
  ...
     (1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
     `en_US'), then `sort' may produce output that is sorted differently
     than you're accustomed to.  In that case, set the `LC_ALL'
     environment variable to `C'.  Note that setting only `LC_COLLATE'
     has two problems.  First, it is ineffective if `LC_ALL' is also set.
     Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
     `LC_CTYPE' is unset) is set to an incompatible value.  For example,
     you get undefined behavior if `LC_CTYPE' is `ja_JP.PCK' but
     `LC_COLLATE' is `en_US.UTF-8'.

Personally I have the following in my $HOME/.bashrc file.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

That sets most of my locale to a UTF-8 one but forces sorting to be
standard C/POSIX.  This probably won't work in the general case since
I have no idea how that would interact with all character sets.

You may want to look at the FAQ.

  http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

Notes:

* You don't need to include a trailing ".*" or " .*" in your pattern.
  It won't affect your match and it will be slightly more efficient
  without it.

* You don't need to capture the output with backticks and then echo
  it.  You can just run the command and display the output.

Thanks again for the very nice bug report!
Bob
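
The two suggestions in the message above (pin the locale with LC_ALL=C
and prefer POSIX character classes over ranges) can be combined in a
script.  The following is only a minimal sketch: the helper name
extract_userid and the choice of [:alnum:] are illustrative assumptions
and do not come from the original report.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): print the name inside the
# parentheses that follow "uid=NNN" in an id(1)-style line.
# LC_ALL=C pins collation, and [:alnum:] avoids range semantics entirely.
extract_userid() {
    LC_ALL=C expr "$1" : '.*uid=[[:digit:]]*(\([[:alnum:]]*\))'
}

id_line='uid=11008(ADILM) gid=1200(cvs),1400(build)'

# The character class covers both cases, so the uppercase name is found.
extract_userid "$id_line"

# A lowercase-only range under LC_ALL=C does not match "ADILM"; expr
# then prints an empty line and exits with a non-zero status.
LC_ALL=C expr "$id_line" : '.*uid=[0-9]*(\([0-9a-z]*\))' || echo "no match"
```

The pattern is single-quoted so the shell passes the backslashes to expr
unchanged, and the trailing " .*" is left off as recommended in the
notes above.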

This bug report was last modified 15 years and 110 days ago.

