GNU bug report logs - #55331
Improved support for combining diacritics

Package: grep

Reported by: Benson Muite <benson_muite <at> emailplus.org>

Date: Mon, 9 May 2022 07:04:02 UTC

Severity: wishlist

To reply to this bug, email your comments to 55331 AT debbugs.gnu.org.


Report forwarded to bug-grep <at> gnu.org:
bug#55331; Package grep. (Mon, 09 May 2022 07:04:02 GMT)

Acknowledgement sent to Benson Muite <benson_muite <at> emailplus.org>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 09 May 2022 07:04:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Benson Muite <benson_muite <at> emailplus.org>
To: bug-grep <at> gnu.org
Subject: Improved support for combining diacritics
Date: Mon, 9 May 2022 09:38:26 +0300
Hi,

Unicode allows for combining diacritics. When using

grep -E "\s[a-z\`\'āáàēéèīíìịị̄ị́ị̀ōóòọọ̄ọọ́ọ̀ūúùụ̄ụ́ụ̀n̄ńǹm̄ḿm̀]{4}$"

to extract four-letter Igbo words from a text, akụ̀ is incorrectly
classified as a four-letter word when it is a three-letter word. Would a
patch to fix this be accepted?

Regards,
Benson Muite




Information forwarded to bug-grep <at> gnu.org:
bug#55331; Package grep. (Mon, 09 May 2022 18:31:02 GMT)

Message #8 received at 55331 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Benson Muite <benson_muite <at> emailplus.org>
Cc: 55331 <at> debbugs.gnu.org
Subject: Re: bug#55331: Improved support for combining diacritics
Date: Mon, 9 May 2022 11:30:28 -0700
On 5/8/22 23:38, Benson Muite wrote:
> When using
> 
> grep -E "\s[a-z\`\'āáàēéèīíìịị̄ị́ị̀ōóòọọ̄ọọ́ọ̀ūúùụ̄ụ́ụ̀n̄ńǹm̄ḿm̀]{4}$"
> 
> to extract 4 letter Igbo words

The {4} means "4 characters", not "4 letters", and a combining character 
counts as a character.
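
To illustrate the distinction between characters and letters, here is a minimal Python sketch (not part of the original thread) that lists the code points of akụ̀ using only the standard unicodedata module. It assumes the word is stored as a, k, U+1EE5, U+0300; the reporter's text may use a different but canonically equivalent sequence.

```python
import unicodedata

# Assumed encoding of "akụ̀": a, k, u-with-dot-below (U+1EE5), combining grave (U+0300)
word = "ak\u1ee5\u0300"

for ch in word:
    # unicodedata.combining() returns a non-zero class for combining marks
    kind = "combining mark" if unicodedata.combining(ch) else "base character"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({kind})")

code_points = len(word)  # what a {4} repetition counts
base_letters = sum(1 for ch in word if not unicodedata.combining(ch))
print(code_points, base_letters)
```

Under that assumption the word has four code points but only three base characters, which is why a four-character repetition matches it.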

It might be nice for 'grep' to have ways to perform Unicode 
normalization before matching. In the meantime perhaps you can get what 
you want by normalizing the text before running it through 'grep'.
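
One possible way to preprocess the text is sketched below in Python, using only the standard re and unicodedata modules. It is an illustration under stated assumptions, not something proposed in the thread: the function name four_letter_word is hypothetical, and the pattern assumes every letter of interest decomposes (under NFD) into an ASCII base character followed by marks from the Combining Diacritical Marks block (U+0300 to U+036F). Composing with NFC alone may not be enough here, since ụ̀ does not appear to have a single precomposed code point, so the sketch decomposes the text and treats a base character plus its marks as one letter.

```python
import re
import unicodedata

# One "letter": a base character (the ASCII part of the original bracket
# expression) followed by any number of combining diacritical marks.
LETTER = r"[a-z`'][\u0300-\u036f]*"
FOUR_LETTER_WORD = re.compile(rf"\s((?:{LETTER}){{4}})$")


def four_letter_word(line):
    """Return the trailing four-letter word of `line` (NFC form), or None."""
    decomposed = unicodedata.normalize("NFD", line)
    match = FOUR_LETTER_WORD.search(decomposed)
    if match is None:
        return None
    return unicodedata.normalize("NFC", match.group(1))


# NFC still leaves the grave accent as a separate code point
nfc_length = len(unicodedata.normalize("NFC", "aku\u0323\u0300"))

three_letters = four_letter_word("x ak\u1ee5\u0300")   # three letters: no match
four_letters = four_letter_word("x abk\u1ee5\u0300")   # arbitrary four-letter test string
```

The test strings are arbitrary examples, not claimed to be Igbo vocabulary. A grep-based pipeline would need an equivalent normalization step before matching, for example with a tool such as uconv.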




Information forwarded to bug-grep <at> gnu.org:
bug#55331; Package grep. (Mon, 09 May 2022 18:50:03 GMT)

Message #11 received at 55331 <at> debbugs.gnu.org:

From: Benson Muite <benson_muite <at> emailplus.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 55331 <at> debbugs.gnu.org
Subject: Re: bug#55331: Improved support for combining diacritics
Date: Mon, 9 May 2022 21:44:17 +0300
On 5/9/22 21:30, Paul Eggert wrote:
> On 5/8/22 23:38, Benson Muite wrote:
> 
> It might be nice for 'grep' to have ways to perform Unicode 
> normalization before matching. In the meantime perhaps you can get what 
> you want by normalizing the text before running it through 'grep'.
Thanks for the advice. uconv should work.




Severity set to 'wishlist' from 'normal'. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 09 May 2022 19:12:02 GMT)

This bug report was last modified 3 years and 40 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.