GNU bug report logs - #55331
Improved support for combining diacritics


Package: grep

Reported by: Benson Muite <benson_muite <at> emailplus.org>

Date: Mon, 9 May 2022 07:04:02 UTC

Severity: wishlist


From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Benson Muite <benson_muite <at> emailplus.org>
Cc: 55331 <at> debbugs.gnu.org
Subject: bug#55331: Improved support for combining diacritics
Date: Mon, 9 May 2022 11:30:28 -0700
On 5/8/22 23:38, Benson Muite wrote:
> When using
> 
> grep -E "\s[a-z\`\'āáàēéèīíìịị̄ị́ị̀ōóòọọ̄ọọ́ọ̀ūúùụ̄ụ́ụ̀n̄ńǹm̄ḿm̀]{4}$"
> 
> to extract 4 letter Igbo words

The {4} means "4 characters", not "4 letters", and a combining character 
counts as a character.

It might be nice for 'grep' to have ways to perform Unicode 
normalization before matching. In the meantime perhaps you can get what 
you want by normalizing the text before running it through 'grep'.
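
A minimal sketch of the "normalize first" idea, assuming Python's 
standard unicodedata and re modules rather than anything in 'grep' 
itself. The helper name nfc, the sample string, and the tiny character 
class are made up for illustration; the sample is not meant to be a 
real Igbo word.

```python
import re
import unicodedata


def nfc(text):
    """Hypothetical helper: compose to NFC where a precomposed form exists."""
    return unicodedata.normalize("NFC", text)


# Made-up sample: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "ne\u0301ta"
composed = nfc(decomposed)

# A four-character pattern in the spirit of the report; the class is
# deliberately small and contains the precomposed U+00E9.
four = re.compile("[a-z\u00e9]{4}")

# Before normalization the string has 5 code points, so {4} cannot match.
print(len(decomposed), bool(four.fullmatch(decomposed)))
# After NFC the "e" and its accent are one code point, so {4} can match.
print(len(composed), bool(four.fullmatch(composed)))
```

Normalization only helps where Unicode defines a precomposed character. 
A sequence such as "i" + U+0323 + U+0301 still occupies two code points 
after NFC, so a pattern that counts letters would also have to allow 
for the remaining combining marks.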




This bug report was last modified 3 years and 40 days ago.
