GNU bug report logs -
#16812
Eszett handling
Previous Next
Full log
Message #8 received at 16812 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 02/19/2014 11:59 AM, Ben Boeckel wrote:
> [ I am not subscribed; please keep me on the CC. ]
>
> Hi,
>
>>From the new grep announcement on LWN[1], I had a thought about how the
> German eszett was handled. It seems that it hasn't been handled at all.
> This may fall to the same resolution as the recent LJ/Lj thread[2]
> though.
>
> Basically, it seems that grep doesn't support alternates when changing
> case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
> context[3].
Alas, in terms of POSIX functionality, we can only change case between
single-character entities. Changing ß to SS is a
single->multi-character change; it is DIFFERENT than the Turkish i
situation (there, although we change between single-byte and multi-byte,
the changes are still always single character). Similar problems apply
to Greek trailing sigma, which is also a context-sensitive change operation.
As long as we are stuck using the POSIX definition of case changes on a
character-by-character basis, where the input and output are 1:1
character mappings, we cannot handle the German eszett case specially.
For PROPER handling of locale-sensitive case rules, we'd need full
Unicode rules that operate on words, rather than characters, which
quickly gets out of scope of what we can do in POSIX regex.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
This bug report was last modified 11 years and 53 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.