From unknown Sat Aug 16 21:20:52 2025 X-Loop: help-debbugs@gnu.org Subject: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 23 Sep 2013 05:18:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 15440 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: 15440@debbugs.gnu.org X-Debbugs-Original-To: bug-grep@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.137991345827535 (code B ref -1); Mon, 23 Sep 2013 05:18:02 +0000 Received: (at submit) by debbugs.gnu.org; 23 Sep 2013 05:17:38 +0000 Received: from localhost ([127.0.0.1]:56992 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VNyWf-0007A2-4q for submit@debbugs.gnu.org; Mon, 23 Sep 2013 01:17:37 -0400 Received: from eggs.gnu.org ([208.118.235.92]:36690) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VNyWd-00079p-HO for submit@debbugs.gnu.org; Mon, 23 Sep 2013 01:17:35 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1VNyWX-0005ff-Am for submit@debbugs.gnu.org; Mon, 23 Sep 2013 01:17:30 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-0.5 required=5.0 tests=BAYES_05,FREEMAIL_FROM, T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:55372) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1VNyWX-0005fa-7K for submit@debbugs.gnu.org; Mon, 23 Sep 2013 01:17:29 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:38406) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1VNyWV-0002Ko-UK for bug-grep@gnu.org; Mon, 23 Sep 2013 01:17:29 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1VNyWR-0005ea-Bi for bug-grep@gnu.org; Mon, 23 Sep 2013 01:17:27 -0400 Received: from 
mail-pd0-x231.google.com ([2607:f8b0:400e:c02::231]:52266) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1VNyWR-0005eP-0U for bug-grep@gnu.org; Mon, 23 Sep 2013 01:17:23 -0400 Received: by mail-pd0-f177.google.com with SMTP id y10so2752459pdj.36 for ; Sun, 22 Sep 2013 22:17:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:from:date:message-id:subject:to:content-type; bh=rFqHR7S5A6hDlOSke4ovHO5taZTATUUDtbbkfDS1m0I=; b=lDySmu6MZsS/rMGnypPez+aMecxDTyN4RXyr/ofNzR9yNHgziHWZcdao7t7PnerJNl Lgx7HlFHGaA2fatCgKcuRPY0tQ6FkKEVT+3BY+U3YkYMAAxw1S++FK7SyGH4vdW2VA4l xSKx9rFgTd8+J1xtyGWaJ9ElHddX4D8lyPN8Gowpetosf2+21viBigFqU3a+/QRwdm4P gQKCKFeHnjgVtSyZW9UAvx6gCysLqrCRJz2AhJqpklKRVjBDaEnuB8ndorQOKaTSLkqA I8gaO8X8Lw9qPmLZ9IIQyPk/IfKbSSHTbiV5zDmgernygsk0kGaqa/VCXzWjyv5j1mu9 UqXw== X-Received: by 10.66.162.195 with SMTP id yc3mr22707536pab.64.1379913442159; Sun, 22 Sep 2013 22:17:22 -0700 (PDT) MIME-Version: 1.0 Received: by 10.68.6.66 with HTTP; Sun, 22 Sep 2013 22:17:01 -0700 (PDT) From: Jim Meyering Date: Sun, 22 Sep 2013 22:17:01 -0700 X-Google-Sender-Auth: NRQKu1xj6NzyYJmRTnisVZc5OrA Message-ID: Content-Type: multipart/mixed; boundary=047d7b6dc1a88a84c704e70622ed X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----) --047d7b6dc1a88a84c704e70622ed Content-Type: text/plain; charset=ISO-8859-1 This one really surprised me. Learning that multibyte \s and \S had been broken since grep-2.6 did not make my day. But fixing it helped. 
Here's how it started:

To demonstrate the (first) bug, set up to use a UTF-8 locale:

export LC_ALL=en_US.UTF-8

then run this and note that it matches:

$ printf '\x82\n' > in; ./grep -q '\S' in && echo match
match

Now, require a back-reference (forcing a switch from grep's DFA matcher
to the regex functions), and you see there is no match:

$ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
$

One fix would be to make it so dfaexec's \S-processing fails to match
an invalid multibyte sequence, just as its "."-processing does.

That led me to this realization:

Uh oh. This is worse: \s is not multi-byte aware.
The two-byte "NO-BREAK SPACE" character is not matched by \s.

This fails:
$ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
$

This matches in spite of the fact that grep.texi says \s is
equivalent to [[:space:]]:
$ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
a b

GNU grep fails:
(but if I do s/\\s/[[:space:]]/ to the RE, then it does match)
$ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1' grep:
$

Patch attached:

--047d7b6dc1a88a84c704e70622ed
Content-Type: application/octet-stream; name="0003-dfa-fix-s-and-S-to-work-for-multibyte.patch"
Content-Disposition: attachment; filename="0003-dfa-fix-s-and-S-to-work-for-multibyte.patch"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_hlx8t1bz0

RnJvbSA2NzQ1NmUxZjA2YWEwYzc1MTk3MDk1ZmMwMTBjMDIwNzE5NDg4ZTExIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBKaW0gTWV5ZXJpbmcgPG1leWVyaW5nQGZiLmNvbT4KRGF0ZTog U3VuLCAyMiBTZXAgMjAxMyAxMDo1MDowNSAtMDcwMApTdWJqZWN0OiBbUEFUQ0hdIGRmYTogZml4 IFxzIGFuZCBcUyB0byB3b3JrIGZvciBtdWx0aWJ5dGUKCiogc3JjL2RmYS5jIChsZXgpOiBJbiBt dWx0aWJ5dGUgbW9kZSwgd2UgY2FuJ3QgdHJlYXQgXHMgYW5kIFxTIGFzIHdlIGRvCmluIHNpbmds ZS1ieXRlIG1vZGUuICBNYXAgdGhlbSB0byBbWzpzcGFjZTpdXSBhbmQgW15bOnNwYWNlOl1dIHJl c3BlY3RpdmVseSwKdG8gbWFrZSB0aGUgREZBIG1hdGNoZXIgdXNlIHRoZSByZWdleC1tYXRjaGVy IGZvciB0aGlzIHRlcm0uCiogdGVzdHMvbXVsdGlieXRlLXdoaXRlLXNwYWNlOiBOZXcgZmlsZS4g
IFRlc3QgZm9yIHRoZSBidWcuCiogdGVzdHMvTWFrZWZpbGUuYW0gKFRFU1RTKTogQWRkIGl0LgpU aGlzIGJ1ZyB3YXMgaW50cm9kdWNlZCB3aXRoIHRoZSBhZGRpdGlvbiBvZiBERkEgc3VwcG9ydApm b3IgXHMgYW5kIFxTIGluIGNvbW1pdCB2Mi41LjQtMTEyLWdmOTc5Y2EwLgotLS0KIE5FV1MgICAg ICAgICAgICAgICAgICAgICAgICB8ICA3ICsrKysrKysKIHNyYy9kZmEuYyAgICAgICAgICAgICAg ICAgICB8IDQ3ICsrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKystLS0tLS0tLQog dGVzdHMvTWFrZWZpbGUuYW0gICAgICAgICAgIHwgIDEgKwogdGVzdHMvbXVsdGlieXRlLXdoaXRl LXNwYWNlIHwgNDUgKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKwog NCBmaWxlcyBjaGFuZ2VkLCA5MiBpbnNlcnRpb25zKCspLCA4IGRlbGV0aW9ucygtKQogY3JlYXRl IG1vZGUgMTAwNzU1IHRlc3RzL211bHRpYnl0ZS13aGl0ZS1zcGFjZQoKZGlmZiAtLWdpdCBhL05F V1MgYi9ORVdTCmluZGV4IGZiYTMwOTQuLjc1NjljY2YgMTAwNjQ0Ci0tLSBhL05FV1MKKysrIGIv TkVXUwpAQCAtNCw2ICs0LDEzIEBAIEdOVSBncmVwIE5FV1MgICAgICAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAtKi0gb3V0bGluZSAtKi0KCiAqKiBCdWcgZml4ZXMKCisgIGdyZXAncyBc cyBhbmQgXFMgZmFpbGVkIHRvIHdvcmsgd2l0aCBtdWx0aS1ieXRlIHdoaXRlIHNwYWNlIGNoYXJh Y3RlcnMuCisgIEZvciBleGFtcGxlLCBccyB3b3VsZCBmYWlsIHRvIG1hdGNoIGEgbm9uLWJyZWFr aW5nIHNwYWNlLCBhbmQgdGhpcworICB3b3VsZCBwcmludCBub3RoaW5nOiBMQ19BTEw9ZW5fVVMu dXRmOCBwcmludGYgJ1x4YzJceGEwJyB8IGdyZXAgJ1xzJworICBBIHJlbGF0ZWQgYnVnIGlzIHRo YXQgXFMgd291bGQgbWlzdGFrZW5seSBtYXRjaCBhbiBpbnZhbGlkIG11bHRpYnl0ZQorICBjaGFy YWN0ZXIuICBFLmcuIHRoaXMgd291bGQgbWF0Y2g6IHByaW50ZiAnXHg4MlxuJyB8IHNyYy9ncmVw ICdeXFMkJworICBbYnVnIHByZXNlbnQgc2luY2UgMi42XQorCiAgIGdyZXAgLWkgd291bGQgc2Vn ZmF1bHQgb24gc3lzdGVtcyB1c2luZyBVVEYtMTYtYmFzZWQgd2NoYXJfdCAoQ3lnd2luKQogICB3 aGVuIGNvbnZlcnRpbmcgYW4gaW5wdXQgc3RyaW5nIGNvbnRhaW5pbmcgY2VydGFpbiA0LWJ5dGUg VVRGLTgKICAgc2VxdWVuY2VzIHRvIGxvd2VyIGNhc2UuICBUaGUgY29udmVyc2lvbnMgdG8gd2No YXJfdCBhbmQgYmFjayB0bwpkaWZmIC0tZ2l0IGEvc3JjL2RmYS5jIGIvc3JjL2RmYS5jCmluZGV4 IGU0NjRmYTEuLmRlNmM2NzEgMTAwNjQ0Ci0tLSBhL3NyYy9kZmEuYworKysgYi9zcmMvZGZhLmMK QEAgLTE0MzUsMTQgKzE0MzUsNDUgQEAgbGV4ICh2b2lkKQogICAgICAgICBjYXNlICdTJzoKICAg 
ICAgICAgICBpZiAoIWJhY2tzbGFzaCB8fCAoc3ludGF4X2JpdHMgJiBSRV9OT19HTlVfT1BTKSkK ICAgICAgICAgICAgIGdvdG8gbm9ybWFsX2NoYXI7Ci0gICAgICAgICAgemVyb3NldCAoY2NsKTsK LSAgICAgICAgICBmb3IgKGMyID0gMDsgYzIgPCBOT1RDSEFSOyArK2MyKQotICAgICAgICAgICAg aWYgKGlzc3BhY2UgKGMyKSkKLSAgICAgICAgICAgICAgc2V0Yml0IChjMiwgY2NsKTsKLSAgICAg ICAgICBpZiAoYyA9PSAnUycpCi0gICAgICAgICAgICBub3RzZXQgKGNjbCk7Ci0gICAgICAgICAg bGFzdHN0YXJ0ID0gMDsKLSAgICAgICAgICByZXR1cm4gbGFzdHRvayA9IENTRVQgKyBjaGFyY2xh c3NfaW5kZXggKGNjbCk7CisgICAgICAgICAgaWYgKE1CX0NVUl9NQVggPT0gMSkKKyAgICAgICAg ICAgIHsKKyAgICAgICAgICAgICAgemVyb3NldCAoY2NsKTsKKyAgICAgICAgICAgICAgZm9yIChj MiA9IDA7IGMyIDwgTk9UQ0hBUjsgKytjMikKKyAgICAgICAgICAgICAgICBpZiAoaXNzcGFjZSAo YzIpKQorICAgICAgICAgICAgICAgICAgc2V0Yml0IChjMiwgY2NsKTsKKyAgICAgICAgICAgICAg aWYgKGMgPT0gJ1MnKQorICAgICAgICAgICAgICAgIG5vdHNldCAoY2NsKTsKKyAgICAgICAgICAg ICAgbGFzdHN0YXJ0ID0gMDsKKyAgICAgICAgICAgICAgcmV0dXJuIGxhc3R0b2sgPSBDU0VUICsg Y2hhcmNsYXNzX2luZGV4IChjY2wpOworICAgICAgICAgICAgfQorCisjZGVmaW5lIFBVU0hfTEVY X1NUQVRFKHMpCQkJXAorICBkbwkJCQkJCVwKKyAgICB7CQkJCQkJXAorICAgICAgY2hhciBjb25z dCAqbGV4cHRyX3NhdmVkID0gbGV4cHRyOwlcCisgICAgICBzaXplX3QgbGV4bGVmdF9zYXZlZCA9 IGxleGxlZnQ7CQlcCisgICAgICBsZXhwdHIgPSAocyk7CQkJCVwKKyAgICAgIGxleGxlZnQgPSBz dHJsZW4gKGxleHB0cikKKworI2RlZmluZSBQT1BfTEVYX1NUQVRFKCkJCQkJXAorICAgICAgbGV4 cHRyID0gbGV4cHRyX3NhdmVkOwkJCVwKKyAgICAgIGxleGxlZnQgPSBsZXhsZWZ0X3NhdmVkOwkJ CVwKKyAgICB9CQkJCQkJXAorICB3aGlsZSAoMCkKKworICAgICAgICAgIC8qIEZJWE1FOiBzZWUg aWYgb3B0aW1pemluZyB0aGlzLCBhcyBpcyBkb25lIHdpdGggQU5ZQ0hBUiBhbmQKKyAgICAgICAg ICAgICBhZGRfdXRmOF9hbnljaGFyLCBtYWtlcyBzZW5zZS4gICovCisKKyAgICAgICAgICAvKiBc cyBhbmQgXFMgYXJlIGRvY3VtZW50ZWQgdG8gYmUgZXF1aXZhbGVudCB0byBbWzpzcGFjZTpdXSBh bmQKKyAgICAgICAgICAgICBbXls6c3BhY2U6XV0gcmVzcGVjdGl2ZWx5LCBzbyB0ZWxsIHRoZSBs ZXhlciB0byBwcm9jZXNzIHRob3NlCisgICAgICAgICAgICAgc3RyaW5ncywgZWFjaCBtaW51cyBp dHMgImFscmVhZHkgcHJvY2Vzc2VkIiAnWycuICAqLworICAgICAgICAgIFBVU0hfTEVYX1NUQVRF 
IChjID09ICdzJyA/ICJbOnNwYWNlOl1dIiA6ICJeWzpzcGFjZTpdXSIpOworCisgICAgICAgICAg bGFzdHRvayA9IHBhcnNlX2JyYWNrZXRfZXhwICgpOworCisgICAgICAgICAgUE9QX0xFWF9TVEFU RSAoKTsKKworICAgICAgICAgIHJldHVybiBsYXN0dG9rOwoKICAgICAgICAgY2FzZSAndyc6CiAg ICAgICAgIGNhc2UgJ1cnOgpkaWZmIC0tZ2l0IGEvdGVzdHMvTWFrZWZpbGUuYW0gYi90ZXN0cy9N YWtlZmlsZS5hbQppbmRleCA1ODFmNjg4Li43NjBmNzkzIDEwMDY0NAotLS0gYS90ZXN0cy9NYWtl ZmlsZS5hbQorKysgYi90ZXN0cy9NYWtlZmlsZS5hbQpAQCAtNzAsNiArNzAsNyBAQCBURVNUUyA9 CQkJCQkJXAogICBpbnZhbGlkLW11bHRpYnl0ZS1pbmZsb29wCQkJXAogICBraGFkYWZ5CQkJCQlc CiAgIG1heC1jb3VudC12cy1jb250ZXh0CQkJCVwKKyAgbXVsdGlieXRlLXdoaXRlLXNwYWNlCQkJ CVwKICAgZW1wdHktbGluZS1tYgkJCQkJXAogICB1bmlieXRlLWJyYWNrZXQtZXhwcgkJCQlcCiAg IGhpZ2gtYml0LXJhbmdlCQkJCVwKZGlmZiAtLWdpdCBhL3Rlc3RzL211bHRpYnl0ZS13aGl0ZS1z cGFjZSBiL3Rlc3RzL211bHRpYnl0ZS13aGl0ZS1zcGFjZQpuZXcgZmlsZSBtb2RlIDEwMDc1NQpp bmRleCAwMDAwMDAwLi5hMTk2MGExCi0tLSAvZGV2L251bGwKKysrIGIvdGVzdHMvbXVsdGlieXRl LXdoaXRlLXNwYWNlCkBAIC0wLDAgKzEsNDUgQEAKKyMhIC9iaW4vc2gKKyMgVGVzdCB3aGV0aGVy IFxzIG1hdGNoZXMgbXVsdGlieXRlIHdoaXRlIHNwYWNlIGNoYXJhY3RlcnMuCisjCisjIENvcHly aWdodCAoQykgMjAxMyBGcmVlIFNvZnR3YXJlIEZvdW5kYXRpb24sIEluYy4KKyMKKyMgQ29weWlu ZyBhbmQgZGlzdHJpYnV0aW9uIG9mIHRoaXMgZmlsZSwgd2l0aCBvciB3aXRob3V0IG1vZGlmaWNh dGlvbiwKKyMgYXJlIHBlcm1pdHRlZCBpbiBhbnkgbWVkaXVtIHdpdGhvdXQgcm95YWx0eSBwcm92 aWRlZCB0aGUgY29weXJpZ2h0CisjIG5vdGljZSBhbmQgdGhpcyBub3RpY2UgYXJlIHByZXNlcnZl ZC4KKworLiAiJHtzcmNkaXI9Ln0vaW5pdC5zaCI7IHBhdGhfcHJlcGVuZF8gLi4vc3JjCisKK3Jl cXVpcmVfZW5fdXRmOF9sb2NhbGVfCisKK0xDX0FMTD1lbl9VUy5VVEYtOAorZXhwb3J0IExDX0FM TAorCit1dGY4X3NwYWNlX2NoYXJhY3RlcnM9JChzZWQgJ3MvLio6IC8vO3MvXi9cXHgvO3MvIC9c XHgvZycgPDxcRU9GCitVKzAwMjAgU1BBQ0U6IDIwCitVKzAwQTAgTk8tQlJFQUsgU1BBQ0U6IGMy IGEwCitVKzE2ODAgT0dIQU0gU1BBQ0UgTUFSSzogZTEgOWEgODAKK1UrMjAwMCBFTiBRVUFEOiBl MiA4MCA4MAorVSsyMDAxIEVNIFFVQUQ6IGUyIDgwIDgxCitVKzIwMDIgRU4gU1BBQ0U6IGUyIDgw IDgyCitVKzIwMDMgRU0gU1BBQ0U6IGUyIDgwIDgzCitVKzIwMDQgVEhSRUUtUEVSLUVNIFNQQUNF 
OiBlMiA4MCA4NAorVSsyMDA1IEZPVVItUEVSLUVNIFNQQUNFOiBlMiA4MCA4NQorVSsyMDA2IFNJ WC1QRVItRU0gU1BBQ0U6IGUyIDgwIDg2CitVKzIwMDcgRklHVVJFIFNQQUNFOiBlMiA4MCA4Nwor VSsyMDA4IFBVTkNUVUFUSU9OIFNQQUNFOiBlMiA4MCA4OAorVSsyMDA5IFRISU4gU1BBQ0U6IGUy IDgwIDg5CitVKzIwMEEgSEFJUiBTUEFDRTogZTIgODAgOGEKK1UrMjAwQiBaRVJPIFdJRFRIIFNQ QUNFOiBlMiA4MCA4YgorVSsyMDJGIE5BUlJPVyBOTy1CUkVBSyBTUEFDRTogZTIgODAgYWYKK1Ur MjA1RiBNRURJVU0gTUFUSEVNQVRJQ0FMIFNQQUNFOiBlMiA4MSA5ZgorVSszMDAwIElERU9HUkFQ SElDIFNQQUNFOiBlMyA4MCA4MAorRU9GCispCisKK2ZhaWw9MAorCitmb3IgaSBpbiAkdXRmOF9z cGFjZV9jaGFyYWN0ZXJzOyBkbworICBwcmludGYgIiRpXG4iIHwgZ3JlcCAtcSAnXlxzJCcgfHwg eyB3YXJuXyAnJXNcbicgIiRpIEZBSUxFRCI7IGZhaWw9MTsgfQorZG9uZQorCitFeGl0ICRmYWls Ci0tCjEuOC40LjI5OS5nYjNlN2QyNAo= --047d7b6dc1a88a84c704e70622ed-- From unknown Sat Aug 16 21:20:52 2025 X-Loop: help-debbugs@gnu.org Subject: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte References: In-Reply-To: Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 23 Sep 2013 21:05:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 15440 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Aharon Robbins , 15440@debbugs.gnu.org Received: via spool by 15440-submit@debbugs.gnu.org id=B15440.137997028028089 (code B ref 15440); Mon, 23 Sep 2013 21:05:01 +0000 Received: (at 15440) by debbugs.gnu.org; 23 Sep 2013 21:04:40 +0000 Received: from localhost ([127.0.0.1]:58994 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VODJ9-0007Iy-S5 for submit@debbugs.gnu.org; Mon, 23 Sep 2013 17:04:40 -0400 Received: from mail-pa0-f41.google.com ([209.85.220.41]:52262) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VODJ7-0007Ii-7T for 15440@debbugs.gnu.org; Mon, 23 Sep 2013 17:04:37 -0400 Received: by mail-pa0-f41.google.com with SMTP id bj1so4103499pad.28 for <15440@debbugs.gnu.org>; Mon, 23 Sep 2013 14:04:31 -0700 (PDT) DKIM-Signature: v=1; 
a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:from:date:message-id:subject:to:content-type; bh=g6e0qgl9jbmBUPThdBESiyzQ511qL9cb/Zh7+w+y3yY=; b=A12Yzeh47z0/hYkquDet+zlZFmDgMh4i2y2H+8Knkf4scDOgfEfiZP0spxJKeCLbOv iCUP6ONHpqFc+55KqH40Ee+H2FPI6qXe5bvMWNgbDJL/E7t9czIKLxf07mgMi4C4q6i8 XZMbbtgXY0T4lupRGHoyhnrH78rxiIhmcpTIABlRDl5YnpekODVlgFN9xxN1RLVkxt43 IwVYhbmzeThpcUAFfHQAQVKJNs0k8/8xvdE+HXWcH04hUDHJmM92i7sya24lAGeoLdU0 SM7Q3eYH0vFnmbirGXHjTrFYWLjSE5zQlN9D4GOAio9rBlBuJmQX4ac0r7oy1AlS/yW2 qQ3A== X-Received: by 10.68.252.33 with SMTP id zp1mr25481151pbc.95.1379970269418; Mon, 23 Sep 2013 14:04:29 -0700 (PDT) MIME-Version: 1.0 Received: by 10.68.6.66 with HTTP; Mon, 23 Sep 2013 14:04:09 -0700 (PDT) From: Jim Meyering Date: Mon, 23 Sep 2013 14:04:09 -0700 X-Google-Sender-Auth: yXTWF0GhMkIM0eiFaPNA8Hgp34s Message-ID: Content-Type: text/plain; charset=ISO-8859-1 X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) [using the right bug address, this time] On Mon, Sep 23, 2013 at 11:26 AM, Aharon Robbins wrote: > Hi. > >> $ printf '\x82\n' > in; ./grep -q '\S' in && echo match >> match >> >> Now, require a back-reference (forcing switch from grep's DFA matcher >> to use of the regex functions), and you see there is no match: >> >> $ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match >> $ > > I see similar results with gawk, accounting for syntactic difference > and a different way to force the regex matcher. > > So far so good. > >> Uh oh. This is worse: \s is not multi-byte aware. >> The two-byte "NO-BREAK SPACE" character is not matched by \s. 
>>
>> This fails:
>> $ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
>> $
>>
>> This matches in spite of the fact that grep.texi says \s is
>> equivalent to [[:space:]] :
>> $ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
>> a b
>>
>> GNU grep fails:
>> (but if I do s/\\s/[[:space:]]/ to the RE, then it does match)
>> $ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1' grep:
>> $
>
> I cannot reproduce this with gawk. Setting GAWK_NO_DFA=1 in the
> environment causes gawk to bypass dfa. For these it makes no
> difference:
>
> $ printf 'a\xc2\xa0b\n' | ./gawk '/a\sb/'
> $ printf 'a\xc2\xa0b\n' | GAWK_NO_DFA=1 ./gawk '/a\sb/'
>
> No result from either, and similar results for [[:space:]].

Hi Arnold,
[re-adding CC to the bug tracker]

Thanks for testing.
When I test on glibc, I confirm what you report: [[:space:]] fails to
match NBSP. Makes me think either glibc's UTF8 attribute tables are
wrong, or there's a bug in regex:

$ printf 'a\xc2\xa0b\n'|LC_ALL=en_US.UTF-8 grep 'a[[:space:]]b'
[Exit 1]

Initially, I considered constructing a DFA that would match all UTF8
white space characters (see the FIXME comment), and another that would
match the complement of that set minus the set of invalid UTF8 bytes,
but ended up preferring the simpler change.

FTR, I tested this only on a system for which all tests passed (OS/X).
Very surprised to find it doesn't work on a glibc-based system.
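The \s versus [[:space:]] comparison discussed above can be repeated on
any system with a few lines of shell. What follows is only a minimal
sketch, not part of the patch: "probe" is a made-up helper name, the
script assumes an en_US.UTF-8 locale is installed, and it writes the
character in octal (\302\240 is NO-BREAK SPACE, bytes c2 a0) because not
every printf accepts \xHH escapes. The result depends on the local grep
and C library, so no expected output is given.

```shell
#!/bin/sh
# Sketch only; "probe" is an illustrative name, not part of grep or its tests.
probe() {
  # $1: label, $2: one character written as a printf escape sequence (octal)
  for re in 'a\sb' 'a[[:space:]]b'; do
    if printf "a${2}b\n" | LC_ALL=en_US.UTF-8 grep -q "$re"; then
      result=match
    else
      result='no match'
    fi
    # One tab-separated line per pattern: label, pattern, outcome.
    printf '%s\t%s\t%s\n' "$1" "$re" "$result"
  done
}

probe 'U+0020 SPACE' ' '
probe 'U+00A0 NO-BREAK SPACE' '\302\240'
```

If the two patterns disagree for the same character, the difference is
in how \s is handled; if both fail, the character class itself does not
include that character on this system.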
From unknown Sat Aug 16 21:20:52 2025 X-Loop: help-debbugs@gnu.org Subject: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte Resent-From: Aharon Robbins Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Tue, 24 Sep 2013 12:25:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 15440 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: jim@meyering.net, arnold@skeeve.com, 15440@debbugs.gnu.org Received: via spool by 15440-submit@debbugs.gnu.org id=B15440.138002550021172 (code B ref 15440); Tue, 24 Sep 2013 12:25:02 +0000 Received: (at 15440) by debbugs.gnu.org; 24 Sep 2013 12:25:00 +0000 Received: from localhost ([127.0.0.1]:60219 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VORfl-0005VM-9g for submit@debbugs.gnu.org; Tue, 24 Sep 2013 08:24:58 -0400 Received: from mxout4.netvision.net.il ([194.90.9.27]:33116) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VORfi-0005V7-Fa for 15440@debbugs.gnu.org; Tue, 24 Sep 2013 08:24:55 -0400 MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; CHARSET=US-ASCII Received: from skeeve.com ([89.139.60.110]) by mxout4.netvision.net.il (Oracle Communications Messaging Server 7u4-24.01(7.0.4.24.0) 64bit (built Nov 17 2011)) with ESMTP id <0MTM009UDQHBOV00@mxout4.netvision.net.il> for 15440@debbugs.gnu.org; Tue, 24 Sep 2013 15:24:48 +0300 (IDT) Received: from skeeve.com (skeeve.com [127.0.0.1]) by skeeve.com (8.14.4/8.14.4/Debian-2ubuntu2) with ESMTP id r8OCOkh8003360; Tue, 24 Sep 2013 15:24:47 +0300 Received: (from arnold@localhost) by skeeve.com (8.14.4/8.14.4/Submit) id r8OCOjBs003359; Tue, 24 Sep 2013 15:24:45 +0300 From: Aharon Robbins Message-id: <201309241224.r8OCOjBs003359@skeeve.com> Date: Tue, 24 Sep 2013 15:24:45 +0300 References: In-reply-to: User-Agent: Heirloom mailx 12.5 6/20/10 X-Spam-Score: 0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 
Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/)

Hi Jim.

I should note that gawk uses its own regex, although it does rely
on glibc for isspace / iswspace etc...

Can you test gawk (using the master branch is fine) on Mac OS X?
Basically you'd want to enclose the pattern in /.../ on the command line
and use GAWK_NO_DFA=1 to force use of regex.

In any case, once you push the changes I'll pick them up.

Thanks,

Arnold

P.S. To test gawk, cut and paste:

git clone git://git.savannah.gnu.org/gawk.git
cd gawk
./bootstrap.sh && ./configure && make -j 10 # or whatever
make check # optional
printf '....' | ./gawk '/.../' # your tests here. :-)

Much thanks!
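The gawk check described in this exchange (the same pattern run once
normally and once with GAWK_NO_DFA=1) could be wrapped in a small shell
function. This is a hedged sketch: "gawk_probe" and the GAWK override
variable are hypothetical names, an en_US.UTF-8 locale is assumed, and
the code assumes gawk only looks at whether GAWK_NO_DFA is present in
the environment, so the variable is unset for the normal run.

```shell
#!/bin/sh
# Sketch only: gawk_probe and the GAWK variable are illustrative names.
gawk_probe() {
  # $1: regexp body (placed between /.../), $2: printf format for the input
  awk_bin=${GAWK:-gawk}
  if ! command -v "$awk_bin" >/dev/null 2>&1; then
    echo "skipped: $awk_bin not found"
    return 0
  fi
  for mode in dfa regex; do
    if [ "$mode" = regex ]; then
      # Per the thread, GAWK_NO_DFA=1 makes gawk bypass dfa.
      out=$(printf "$2" | GAWK_NO_DFA=1 LC_ALL=en_US.UTF-8 "$awk_bin" "/$1/")
    else
      # Assumption: presence of the variable is what matters, so unset it.
      out=$(printf "$2" | (unset GAWK_NO_DFA; LC_ALL=en_US.UTF-8 "$awk_bin" "/$1/"))
    fi
    if [ -n "$out" ]; then result=match; else result='no match'; fi
    printf '%s\t/%s/\t%s\n' "$mode" "$1" "$result"
  done
}

# NO-BREAK SPACE (bytes c2 a0) written in octal for portability.
gawk_probe 'a\sb' 'a\302\240b\n'
gawk_probe 'a[[:space:]]b' 'a\302\240b\n'
```

To point it at a freshly built binary, something like
GAWK=./gawk sh script.sh should work, given the override above.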
From unknown Sat Aug 16 21:20:52 2025 X-Loop: help-debbugs@gnu.org Subject: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Wed, 02 Oct 2013 00:40:03 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 15440 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Aharon Robbins Cc: 15440@debbugs.gnu.org Received: via spool by 15440-submit@debbugs.gnu.org id=B15440.13806743507365 (code B ref 15440); Wed, 02 Oct 2013 00:40:03 +0000 Received: (at 15440) by debbugs.gnu.org; 2 Oct 2013 00:39:10 +0000 Received: from localhost ([127.0.0.1]:48214 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VRAT8-0001ui-7U for submit@debbugs.gnu.org; Tue, 01 Oct 2013 20:39:10 -0400 Received: from mail-pd0-f169.google.com ([209.85.192.169]:46016) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VRAT4-0001uX-77 for 15440@debbugs.gnu.org; Tue, 01 Oct 2013 20:39:06 -0400 Received: by mail-pd0-f169.google.com with SMTP id r10so116558pdi.14 for <15440@debbugs.gnu.org>; Tue, 01 Oct 2013 17:39:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=jdPc4vxVOQ+L4im5rtLeLzV8+kDD/lDekiwZDjjYyR0=; b=xo5Qc3E7KVPogxDaI2RMttKQ1Mu85VO8WaZK9/ro+aBxkMjuRFUZ5oOuCT9B1xWXaS UkxmdBn8ccLufqoDigFcp3qNYXuzwdLgVk8uKD5wybgoD51zhJf7SCPnWCbj8OQg+3Ke 1q/jNiG7Cw5jAJOH6i0FDENa8nimqG50TPJ1k0OPz3dhg8DRObQjQ2VwxKeQpEVtd0mA PrZ+nsvJlPUx0QRiwX6ehrSkRnaKNEdJEBivmtfUGvyQbLSE9oYBLknhc498UxmM1cn7 5cN9TLx6z2idxlM+5e/SJeTpnQe8bb46fD+u+MTn0CChlxZ2GwZdxHL4pH0HAQH+sLx/ R0Cg== X-Received: by 10.68.125.129 with SMTP id mq1mr129166pbb.174.1380674342026; Tue, 01 Oct 2013 17:39:02 -0700 (PDT) MIME-Version: 1.0 Received: by 10.68.6.66 with HTTP; Tue, 1 Oct 2013 17:38:40 -0700 (PDT) In-Reply-To: 
<201309241224.r8OCOjBs003359@skeeve.com> References: <201309241224.r8OCOjBs003359@skeeve.com> From: Jim Meyering Date: Tue, 1 Oct 2013 17:38:40 -0700 X-Google-Sender-Auth: OdFHXt2H-4cEEpPNN8iIolSzuME Message-ID: Content-Type: text/plain; charset=ISO-8859-1 X-Spam-Score: 0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/)

On Tue, Sep 24, 2013 at 5:24 AM, Aharon Robbins wrote:
> Hi Jim.
>
> I should note that gawk uses its own regex, although it does rely
> on glibc for isspace / iswspace etc...
...

close 15440
thanks

I've pushed my grep patches, but chose to omit 4 multibyte space
characters from the list in the test, since each of those would
provoke a failure on recent glibc-based systems (Fedora 19).
That seems to be due to errors in glibc's UTF-8 multibyte flags
(wrong whitespace bit) for those characters.
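To see which characters the local grep and C library accept as
[[:space:]], a loop over a labelled list is enough. The sketch below is
an illustration under stated assumptions, not the test from the patch:
the function names are invented, the list is a short subset picked for
the example, and the UTF-8 bytes are written in octal so that a plain
POSIX printf can emit them. It makes no claim about which four
characters were dropped from the test.

```shell
#!/bin/sh
# Sketch only; all function names here are illustrative.
space_candidates() {
  # Code point and name, then the UTF-8 bytes in octal.
  cat <<'EOF'
U+0020 SPACE: 040
U+00A0 NO-BREAK SPACE: 302 240
U+2003 EM SPACE: 342 200 203
U+3000 IDEOGRAPHIC SPACE: 343 200 200
EOF
}

to_escape() {
  # "302 240" -> "\302\240", usable as a printf format string.
  printf '%s\n' "$1" | sed 's/^/\\/; s/ /\\/g'
}

report() {
  space_candidates | while IFS=: read -r label bytes; do
    bytes=${bytes# }              # drop the space that follows the colon
    esc=$(to_escape "$bytes")
    if printf "${esc}\n" | LC_ALL=en_US.UTF-8 grep -q '^[[:space:]]$'; then
      echo "space:     $label"
    else
      echo "not space: $label"
    fi
  done
}

report
```

Lines reported as "not space" are the ones a test like the one in the
patch would fail on for that system.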

Arnold, I tried your latest gawk on a Fedora 19 system, and see
the same failure for those four characters, e.g.,

$ printf '\xc2\xa0\n' | LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 ./gawk '/[[:space:]]/'|wc -c
0

From debbugs-submit-bounces@debbugs.gnu.org Sun Oct 27 20:21:05 2013
Received: (at control) by debbugs.gnu.org; 28 Oct 2013 00:21:05 +0000 Received: from localhost ([127.0.0.1]:47905 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VaaZs-0000zb-7L for submit@debbugs.gnu.org; Sun, 27 Oct 2013 20:21:04 -0400 Received: from mail-pb0-f45.google.com ([209.85.160.45]:62646) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VaaZp-0000z1-S7 for control@debbugs.gnu.org; Sun, 27 Oct 2013 20:21:02 -0400 Received: by mail-pb0-f45.google.com with SMTP id ma3so2155892pbc.18 for ; Sun, 27 Oct 2013 17:20:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:from:date:message-id:subject:to:content-type; bh=xIAbIUvdHxtLXoUME1Jcui6vvsX/6CcaQ0Jk9CH5L3Q=; b=fcAa7hIAvzKMcoyBWCdaJDUwrXkUbxj1Nj8nhElglI+cJYB/XhuODt4U1FBm0rTcqN bgFeZKW8t6HFbdvGSdCu8/3gFVf63CoMarodzOX/ZYe2MOM7XB6Nmw114sk9mAZyMfOB qxJ/qzNy8N/YqMzBKb9/mRYqGDxTf451kpwWx30VsblZoDpT9rEyASrY63ROnv8/mbjr ITqja9Y7DnGfL5YBH35LR8IlyD8AV3pbBIxkIhboLJVj463mT7XtL32tpkyAw47dXKYb chDcJ6LLjZpto0/5jJlXdUZn4bGDW6LWp6u470RFhAWl5LzdzGcNlP+Nu2LZGMtHUI9j t9NQ== X-Received: by 10.67.4.197 with SMTP id cg5mr22525916pad.10.1382919656110; Sun, 27 Oct 2013 17:20:56 -0700 (PDT) MIME-Version: 1.0 Received: by 10.68.6.66 with HTTP; Sun, 27 Oct 2013 17:20:35 -0700 (PDT) From: Jim Meyering Date: Sun, 27 Oct 2013 17:20:35 -0700 X-Google-Sender-Auth: Q1i4AgwbYEKddh9oMQA4K320oHs Message-ID: Subject: mark many issues as non-bugs, and close even more To: control@debbugs.gnu.org Content-Type: text/plain; charset=ISO-8859-1 X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15
Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

If you think I've marked or closed a bug inappropriately,
please let me know.

tags 15438 15439 15441 15486 15656 15664 15677 15690 15726 notabug
close 15307
close 15438
close 15439
close 15440
close 15441
close 15486
close 15527
close 15656
close 15664
close 15677
close 15690
close 15724
close 15726
done