GNU bug report logs - #16812
Eszett handling
To reply to this bug, email your comments to 16812 AT debbugs.gnu.org.
Report forwarded to bug-grep <at> gnu.org: bug#16812; Package grep. (Wed, 19 Feb 2014 19:04:01 GMT)
Acknowledgement sent to mathstuf <at> gmail.com: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 19 Feb 2014 19:04:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
[ I am not subscribed; please keep me on the CC. ]
Hi,
From the new grep announcement on LWN[1], I had a thought about how the
German eszett was handled. It seems that it hasn't been handled at all.
This may fall to the same resolution as the recent LJ/Lj thread[2]
though.
Basically, it seems that grep doesn't support alternates when changing
case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
context[3]. From some poking, only the latter is supported. My
thought[4] was that the code would generate '[ßSS]' which would be wrong
when matching and would instead need to do '(ß|SS)'. It now seems that
'(ß|SS|ẞ)' or even '(ß|[sS][sS]|ẞ)' would need to be generated instead
using the new code.
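The difference between the bracket expression and the alternation can be checked in a few lines. This is only a sketch using Python's `re` module, not grep's matcher, so it illustrates regex semantics rather than grep's behavior; the sample strings are made up for illustration.

```python
import re

# A bracket expression matches exactly ONE character from its set,
# so '[ßSS]' is just the set {ß, S} and can never match the pair "SS".
bracket = re.compile(r"[ßSS]")

# An alternation can match alternatives of different lengths.
alternation = re.compile(r"(ß|[sS][sS]|ẞ)")

samples = ["ß", "SS", "ss", "Ss", "ẞ", "S"]
bracket_hits = [s for s in samples if bracket.fullmatch(s)]
alternation_hits = [s for s in samples if alternation.fullmatch(s)]

print("bracket:    ", bracket_hits)
print("alternation:", alternation_hits)
```

The bracket form accepts a lone 'S', which is wrong, and rejects 'SS'; the alternation accepts every spelling of the eszett and nothing else from the sample list.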
I've attached a test case I wrote based on 'turkish-eyes'. I release it
to the public domain.
Thanks,
--Ben
[1]https://lwn.net/Articles/586899/
[2]https://lists.gnu.org/archive/html/bug-grep/2014-02/msg00004.html
[3]https://en.wikipedia.org/wiki/Capital_%C3%9F
[4]https://lwn.net/Articles/587010/
[german-eszett (text/plain, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#16812; Package grep. (Wed, 19 Feb 2014 20:29:01 GMT)
Message #8 received at 16812 <at> debbugs.gnu.org:
On 02/19/2014 11:59 AM, Ben Boeckel wrote:
> [ I am not subscribed; please keep me on the CC. ]
>
> Hi,
>
> From the new grep announcement on LWN[1], I had a thought about how the
> German eszett was handled. It seems that it hasn't been handled at all.
> This may fall to the same resolution as the recent LJ/Lj thread[2]
> though.
>
> Basically, it seems that grep doesn't support alternates when changing
> case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
> context[3].
Alas, in terms of POSIX functionality, we can only change case between
single-character entities. Changing ß to SS is a
single->multi-character change; it is DIFFERENT than the Turkish i
situation (there, although we change between single-byte and multi-byte,
the changes are still always single character). Similar problems apply
to Greek trailing sigma, which is also a context-sensitive change operation.
As long as we are stuck using the POSIX definition of case changes on a
character-by-character basis, where the input and output are 1:1
character mappings, we cannot handle the German eszett case specially.
For PROPER handling of locale-sensitive case rules, we'd need full
Unicode rules that operate on words, rather than characters, which
quickly gets out of scope of what we can do in POSIX regex.
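The 1:1 restriction can be made concrete by comparing string lengths before and after a full Unicode case conversion. This is a minimal sketch in Python, whose `str.upper()` and `str.lower()` apply Unicode's full case mappings; it does not model the POSIX per-character interface, which is the point: a function that maps one character to one character cannot return 'SS'. The sample characters and the Greek word are chosen only for illustration.

```python
# Full Unicode uppercasing may produce more characters than it consumed.
results = {ch: ch.upper() for ch in ["i", "ı", "ß"]}
for ch, upper in results.items():
    kind = "1:1" if len(upper) == 1 else "1:many"
    print(f"{ch!r} -> {upper!r} ({kind})")

# Sigma is context-sensitive: the capital letter lowercases to a
# different character at the end of a word than in the middle.
word = "ΟΔΟΣ"
lowered = word.lower()
print(f"{word!r} -> {lowered!r}")
```

Only the eszett changes length here; the sigma example shows the other problem, where the correct result depends on neighboring characters rather than on the character alone.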
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#16812; Package grep. (Thu, 20 Feb 2014 16:55:02 GMT)
Message #11 received at submit <at> debbugs.gnu.org:
Hello,
On Feb 19 13:59 Ben Boeckel wrote (excerpt):
> [ I am not subscribed; please keep me on the CC. ]
...
> I had a thought about how the German eszett was handled
...
> Basically, it seems that grep doesn't support alternates when changing
> case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
> context
As far as I understand it you are talking about
"Unicode case folding".
As far as I know grep does not support "Unicode case folding".
Currently grep works on a purely character-by-character basis,
where each character may be UTF-8 encoded (one possible
encoding for Unicode characters). So grep supports
the UTF-8 encoding, which could be misunderstood to mean that
grep supports Unicode, but the latter is not true.
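For comparison, this is roughly what Unicode case folding does with the characters in question. The sketch uses Python's `str.casefold()`, which implements Unicode full case folding; it says nothing about grep's internals, and the list of variants is chosen only for illustration.

```python
# str.casefold() applies Unicode full case folding, which may change
# the length of a string (unlike a character-by-character mapping).
variants = ["ß", "ẞ", "ss", "SS", "Ss"]
folded = {v: v.casefold() for v in variants}
for original, result in folded.items():
    print(f"{original!r} folds to {result!r}")

# Every variant compares equal once folded.
all_equal = len(set(folded.values())) == 1
print("all equal after folding:", all_equal)
```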
For more details see the various (usually very long) mail threads
regarding "grep -i", in particular together with UTF-8.
For example, on
http://lists.gnu.org/archive/html/bug-grep/2012-06/threads.html#00011
see mail threads like
"Ignore case handling of special unicode characters (case folding)",
which is
http://savannah.gnu.org/bugs/?36682
or the mail thread
"grep -i (case-insensitive) is broken with UTF8".
Kind Regards
Johannes Meixner
--
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer
Information forwarded to bug-grep <at> gnu.org: bug#16812; Package grep. (Sat, 08 Mar 2014 18:53:01 GMT)
Message #14 received at 16812 <at> debbugs.gnu.org:
'grep' is conforming to its specification, even though it's not as
useful as it might be when searching German text. The situation with
'ß'/'SS' is different than the situation with 'lj'/'Lj'/'LJ' because in the
latter case 'grep' is dealing only with individual characters.
There's a related issue with 'ß' versus the recently-introduced capital
sharp-S 'ẞ'. These do not match each other with 'grep --ignore-case' in
the current savannah git master. This is an unfortunate property of how
the glibc regex code behaves: the regex code uppercases both pattern and
data before comparing, but in the standard German locale 'ß' is
unchanged by uppercasing.
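The effect of comparing after uppercasing can be modeled in a few lines. This is a rough Python sketch, not the glibc regex code: `simple_upper` and `matches_ignoring_case` are hypothetical helpers, and `simple_upper` only approximates a 1:1 uppercase mapping by leaving a character unchanged whenever its full uppercase form is not a single character.

```python
def simple_upper(ch: str) -> str:
    """Hypothetical stand-in for a 1:1 uppercase mapping."""
    upper = ch.upper()
    # 'ß'.upper() is 'SS' (two characters), so a 1:1 mapping keeps 'ß'.
    return upper if len(upper) == 1 else ch


def matches_ignoring_case(pattern: str, data: str) -> bool:
    """Uppercase both sides character by character, then compare."""
    return [simple_upper(c) for c in pattern] == [simple_upper(c) for c in data]


print(matches_ignoring_case("ä", "Ä"))  # both sides become 'Ä'
print(matches_ignoring_case("ß", "ẞ"))  # 'ß' stays 'ß', 'ẞ' stays 'ẞ'
```

Under this model 'ä' and 'Ä' meet at the same uppercase character, while 'ß' and 'ẞ' never do, because neither is changed by the mapping.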
I'll leave this bug open as it is an awkward situation. Fixing it would
require changing the glibc regex code, which is a big deal -- it would
have some performance implications in a lot of programs. So I'm not
optimistic about fixing it any time soon.
Severity set to 'wishlist' from 'normal'. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sun, 27 Apr 2014 01:15:02 GMT)
This bug report was last modified 11 years and 53 days ago.