GNU bug report logs - #16812
Eszett handling

Package: grep

Reported by: mathstuf <at> gmail.com

Date: Wed, 19 Feb 2014 19:04:01 UTC

Severity: wishlist

To reply to this bug, email your comments to 16812 AT debbugs.gnu.org.


Report forwarded to bug-grep <at> gnu.org:
bug#16812; Package grep. (Wed, 19 Feb 2014 19:04:01 GMT)

Acknowledgement sent to mathstuf <at> gmail.com:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 19 Feb 2014 19:04:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ben Boeckel <mathstuf <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Eszett handling
Date: Wed, 19 Feb 2014 13:59:18 -0500
[ I am not subscribed; please keep me on the CC. ]

Hi,

From the new grep announcement on LWN[1], I had a thought about how the
German eszett was handled. It seems that it hasn't been handled at all.
This may fall to the same resolution as the recent LJ/Lj thread[2]
though.

Basically, it seems that grep doesn't support alternates when changing
case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
context[3]. From some poking, only the latter is supported. My
thought[4] was that the code would generate '[ßSS]' which would be wrong
when matching and would instead need to do '(ß|SS)'. It now seems that
'(ß|SS|ẞ)' or even '(ß|[sS][sS]|ẞ)' would need to be generated instead
using the new code.
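
The difference between the bracket expression and the alternation can be sketched with Python's `re` module rather than with grep itself. The helper `expand_eszett` is hypothetical and not part of grep; it only illustrates the shape of pattern that would have to be generated, under the assumption that every eszett in a literal is expanded the same way.

```python
import re

# Alternation from the report: lowercase eszett, any-case "ss", or capital eszett.
ESZETT_ALTERNATION = "(?:ß|[sS][sS]|ẞ)"


def expand_eszett(literal: str) -> str:
    """Hypothetical helper: turn a literal string into a regex in which
    every eszett is replaced by the multi-character alternation."""
    parts = []
    for ch in literal:
        if ch in ("ß", "ẞ"):
            parts.append(ESZETT_ALTERNATION)
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


# A bracket expression matches exactly one character, so it cannot match "SS".
bracket = re.compile("[ßSS]")
bracket_matches_ss = bracket.fullmatch("SS") is not None

# The alternation can consume two characters where needed.
pattern = re.compile(expand_eszett("straße"), re.IGNORECASE)
results = {
    word: pattern.fullmatch(word) is not None
    for word in ("straße", "STRASSE", "STRAẞE", "Strasse", "strase")
}
```

Here `bracket_matches_ss` is false, while `results` is true for every spelling except the misspelled "strase".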

I've attached a test case I wrote based on 'turkish-eyes'. I release it
to the public domain.

Thanks,

--Ben

[1]https://lwn.net/Articles/586899/
[2]https://lists.gnu.org/archive/html/bug-grep/2014-02/msg00004.html
[3]https://en.wikipedia.org/wiki/Capital_%C3%9F
[4]https://lwn.net/Articles/587010/
[german-eszett (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16812; Package grep. (Wed, 19 Feb 2014 20:29:01 GMT)

Message #8 received at 16812 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: mathstuf <at> gmail.com, 16812 <at> debbugs.gnu.org
Subject: Re: bug#16812: Eszett handling
Date: Wed, 19 Feb 2014 13:27:58 -0700
On 02/19/2014 11:59 AM, Ben Boeckel wrote:
> [ I am not subscribed; please keep me on the CC. ]
> 
> Hi,
> 
> From the new grep announcement on LWN[1], I had a thought about how the
> German eszett was handled. It seems that it hasn't been handled at all.
> This may fall to the same resolution as the recent LJ/Lj thread[2]
> though.
> 
> Basically, it seems that grep doesn't support alternates when changing
> case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
> context[3].

Alas, in terms of POSIX functionality, we can only change case between
single-character entities.  Changing ß to SS is a
single->multi-character change; it is DIFFERENT than the Turkish i
situation (there, although we change between single-byte and multi-byte,
the changes are still always single character).  Similar problems apply
to Greek trailing sigma, which is also a context-sensitive change operation.

As long as we are stuck using the POSIX definition of case changes on a
character-by-character basis, where the input and output are 1:1
character mappings, we cannot handle the German eszett case specially.
For PROPER handling of locale-sensitive case rules, we'd need full
Unicode rules that operate on words, rather than characters, which
quickly gets out of scope of what we can do in POSIX regex.
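
The 1:1 restriction can be illustrated outside of grep. The sketch below uses Python's built-in, locale-independent Unicode case conversions, which are an assumption standing in for "full Unicode rules"; they are not the POSIX per-character functions grep relies on. The helper name `has_one_to_one_case_mapping` is made up for this example.

```python
def has_one_to_one_case_mapping(ch: str) -> bool:
    """True if both full case conversions of a single character
    yield exactly one character."""
    return len(ch.upper()) == 1 and len(ch.lower()) == 1


# Plain 'i' and Turkish dotless 'ı' stay single characters when converted;
# 'ß' does not, because its full uppercase form is the two characters "SS".
one_to_one = {ch: has_one_to_one_case_mapping(ch) for ch in ("i", "ı", "ß")}

# Greek sigma is context-sensitive: Python lowercases a capital sigma
# to the final form at the end of a word, and to the ordinary form otherwise.
word_final = "ΟΔΟΣ".lower()
isolated = "Σ".lower()
```

A per-character mapping table can express neither case: the first needs an output longer than the input, and the second needs to look at neighbouring characters.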


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16812; Package grep. (Thu, 20 Feb 2014 16:55:02 GMT)

Message #11 received at submit <at> debbugs.gnu.org:

From: Johannes Meixner <jsmeix <at> suse.de>
To: bug-grep <at> gnu.org
Cc: mathstuf <at> gmail.com
Subject: Re: bug#16812: Eszett handling
Date: Thu, 20 Feb 2014 11:07:34 +0100 (CET)
Hello,

On Feb 19 13:59 Ben Boeckel wrote (excerpt):
> [ I am not subscribed; please keep me on the CC. ]
...
> I had a thought about how the German eszett was handled
...
> Basically, it seems that grep doesn't support alternates when changing
> case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
> context

As far as I understand it you are talking about
"Unicode case folding".

As far as I know grep does not support "Unicode case folding".

Currently grep works on a purely character-by-character basis,
where each character may be UTF-8 encoded (one possible
encoding for Unicode characters). grep therefore supports
the UTF-8 encoding, which could be misread as support for
Unicode, but that is not the case.
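
For comparison, this is roughly what a comparison based on Unicode case folding looks like. It is a minimal sketch using Python's `str.casefold`, which applies full case folding; the function name `equal_ignoring_case` is an assumption for illustration and says nothing about how grep is implemented.

```python
def equal_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after full Unicode case folding."""
    return a.casefold() == b.casefold()


# Full case folding maps both 'ß' and 'ẞ' to "ss", so these compare equal.
folded_ss = equal_ignoring_case("Straße", "STRASSE")
folded_capital = equal_ignoring_case("straße", "STRAẞE")

# Plain lowercasing keeps 'ß' as a single character, so "straße" and
# "strasse" remain different strings.
lowered_ss = "Straße".lower() == "STRASSE".lower()
```

Folding changes the length of the string, which is exactly what a character-by-character matcher cannot accommodate.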

For more details see the various (usually very long) mail threads
regarding "grep -i", in particular together with UTF-8.

For example, see

http://lists.gnu.org/archive/html/bug-grep/2012-06/threads.html#00011

for mail threads such as
"Ignore case handling of special unicode characters (case folding)",
which is also tracked as
http://savannah.gnu.org/bugs/?36682
and the mail thread
"grep -i (case-insensitive) is broken with UTF8".


Kind Regards
Johannes Meixner
-- 
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer

Information forwarded to bug-grep <at> gnu.org:
bug#16812; Package grep. (Sat, 08 Mar 2014 18:53:01 GMT)

Message #14 received at 16812 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 16812 <at> debbugs.gnu.org
Subject: Re:  Eszett handling
Date: Sat, 08 Mar 2014 10:52:49 -0800
'grep' is conforming to its specification, even though it's not as 
useful as it might be when searching German text.  The situation with 
'ß'/'SS' is different than the situation with 'lj'/'Lj'/'LJ' because in the 
latter case 'grep' is dealing only with individual characters.

There's a related issue with 'ß' versus the recently-introduced capital 
sharp-S 'ẞ'.  These do not match each other with 'grep --ignore-case' in 
the current savannah git master.  This is an unfortunate property of how 
the glibc regex code behaves: the regex code uppercases both pattern and 
data before comparing, but in the standard German locale 'ß' is 
unchanged by uppercasing.
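
The effect of comparing after uppercasing can be mimicked in a few lines. This is only a sketch of the behaviour described above, not the glibc regex code: `simple_upper` and `simple_lower` are hypothetical helpers that approximate a one-character-to-one-character mapping by leaving a character unchanged whenever Python's full conversion would expand it.

```python
def simple_upper(ch: str) -> str:
    """Approximate a 1:1 uppercase mapping: keep the character
    unchanged when full uppercasing would produce several characters."""
    upper = ch.upper()
    return upper if len(upper) == 1 else ch


def simple_lower(ch: str) -> str:
    """Same approximation for lowercasing."""
    lower = ch.lower()
    return lower if len(lower) == 1 else ch


def match_by_uppercasing(a: str, b: str) -> bool:
    return [simple_upper(c) for c in a] == [simple_upper(c) for c in b]


def match_by_lowercasing(a: str, b: str) -> bool:
    return [simple_lower(c) for c in a] == [simple_lower(c) for c in b]


# 'ß' stays 'ß' and 'ẞ' stays 'ẞ' under 1:1 uppercasing, so they differ.
eszett_upper = match_by_uppercasing("ß", "ẞ")
# Lowercasing maps 'ẞ' to 'ß', so the two would compare equal.
eszett_lower = match_by_lowercasing("ß", "ẞ")
# The digraph letters U+01C9 (small) and U+01C8 (titlecase) both
# uppercase to U+01C7, so per-character uppercasing handles them.
digraph_upper = match_by_uppercasing("\u01c9", "\u01c8")
```

Under these assumptions the uppercasing comparison fails only for the eszett pair, which matches the behaviour reported for `grep --ignore-case`.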

I'll leave this bug open as it is an awkward situation.  Fixing it would 
require changing the glibc regex code, which is a big deal -- it would 
have some performance implications in a lot of programs.  So I'm not 
optimistic about fixing it any time soon.




Severity set to 'wishlist' from 'normal'. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sun, 27 Apr 2014 01:15:02 GMT)

This bug report was last modified 11 years and 53 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.