From debbugs-submit-bounces@debbugs.gnu.org Mon May 09 03:03:39 2022
Message-ID: <55709462-5ea6-ff90-a0bc-5c919cb1af47@emailplus.org>
Date: Mon, 9 May 2022 09:38:26 +0300
From: Benson Muite
Subject: Improved support for combining diacritics
To: bug-grep@gnu.org

Hi,

Unicode allows for combining diacritics. When using

grep -E "\s[a-z\`\'āáàēéèīíìịị̄ị́ị̀ōóòọọ̄ọọ́ọ̀ūúùụ̄ụ́ụ̀n̄ńǹm̄ḿm̀]{4}$"

to extract 4 letter Igbo words from a text, akụ̀ is incorrectly
classified as a 4 letter word, when it is a three letter word.

Would a patch to fix this be accepted?

Regards,
Benson Muite

From debbugs-submit-bounces@debbugs.gnu.org Mon May 09 14:30:38 2022
Message-ID: <85688b8d-04ff-bcfa-814a-a8415d9df291@cs.ucla.edu>
Date: Mon, 9 May 2022 11:30:28 -0700
Subject: Re: bug#55331: Improved support for combining diacritics
To: Benson Muite
Cc: 55331@debbugs.gnu.org
From: Paul Eggert
Organization: UCLA Computer Science Department
References: <55709462-5ea6-ff90-a0bc-5c919cb1af47@emailplus.org>
In-Reply-To: <55709462-5ea6-ff90-a0bc-5c919cb1af47@emailplus.org>

On 5/8/22 23:38, Benson Muite wrote:
> When using
>
> grep -E "\s[a-z\`\'āáàēéèīíìịị̄ị́ị̀ōóòọọ̄ọọ́ọ̀ūúùụ̄ụ́ụ̀n̄ńǹm̄ḿm̀]{4}$"
>
> to extract 4 letter Igbo words

The {4} means "4 characters", not "4 letters", and a combining character
counts as a character.

It might be nice for 'grep' to have ways to perform Unicode
normalization before matching.
In the meantime perhaps you can get what
you want by normalizing the text before running it through 'grep'.

From debbugs-submit-bounces@debbugs.gnu.org Mon May 09 14:50:01 2022
Message-ID: <86421642-9579-a9bb-8ef0-61c9cfcbee8f@emailplus.org>
Date: Mon, 9 May 2022 21:44:17 +0300
Subject: Re: bug#55331: Improved support for combining diacritics
To: Paul Eggert
Cc: 55331@debbugs.gnu.org
From: Benson Muite
References: <55709462-5ea6-ff90-a0bc-5c919cb1af47@emailplus.org> <85688b8d-04ff-bcfa-814a-a8415d9df291@cs.ucla.edu>
In-Reply-To: <85688b8d-04ff-bcfa-814a-a8415d9df291@cs.ucla.edu>

On 5/9/22 21:30, Paul Eggert wrote:
> On 5/8/22 23:38, Benson Muite wrote:
>
> It might be nice for 'grep' to have ways to perform Unicode
> normalization before matching. In the meantime perhaps you can get what
> you want by normalizing the text before running it through 'grep'.

Thanks for the advice. uconv should work.

From debbugs-submit-bounces@debbugs.gnu.org Mon May 09 15:11:09 2022
Date: Mon, 9 May 2022 12:11:02 -0700
To: control@debbugs.gnu.org
From: Paul Eggert
Subject: 55331 is wishlist
Organization: UCLA Computer Science Department

severity 55331 wishlist
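
The difference between "4 characters" and "4 letters" raised in the thread
can be illustrated outside grep. What follows is a minimal Python sketch,
not a grep patch and not the normalization feature discussed above. The
helper names count_letters and words_with_letters are hypothetical. The
sketch assumes that a "letter" is a base character plus any combining marks
that follow it, which is a simplification of Unicode grapheme clusters but
should hold for the Latin diacritics in the pattern. It also assumes the
word akụ̀ is encoded the way ụ̀ appears in the quoted pattern (U+1EE5
followed by U+0300).

```python
import unicodedata


def count_letters(word: str) -> int:
    """Count base characters, treating combining marks as part of the
    preceding letter (an approximation of grapheme clusters)."""
    decomposed = unicodedata.normalize("NFD", word)
    # combining() returns 0 for base characters and a non-zero
    # canonical combining class for marks such as U+0300 or U+0323.
    return sum(1 for ch in decomposed if unicodedata.combining(ch) == 0)


def words_with_letters(text: str, n: int) -> list[str]:
    """Return the whitespace-separated words that have exactly n letters."""
    return [w for w in text.split() if count_letters(w) == n]


# "akụ̀" written with escapes: a, k, u-with-dot-below, combining grave.
word = "ak\u1ee5\u0300"

print(len(word))                                # code points as stored: 4
print(len(unicodedata.normalize("NFC", word)))  # still 4 after NFC
print(count_letters(word))                      # base letters: 3
print(words_with_letters("ak\u1ee5\u0300 anya", 4))
```

In this sketch NFC leaves the word at four code points, so for this
particular word normalization alone would not make {4} stop matching;
counting only base characters is one way around that.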