From unknown Thu Aug 14 21:52:01 2025
X-Loop: help-debbugs@gnu.org
Subject: bug#25366: 26.0.50; [:blank:] character class should match all Unicode horizontal whitespace
Resent-From: Philipp Stephani
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Thu, 05 Jan 2017 14:46:01 +0100
>
> (string-match-p "[[:blank:]]" "\N{HAIR SPACE}")
> => nil, expected 0
>
> [[:blank:]] should be the same as \h in PCRE.
We are consistent with our documentation, but I agree that it would be
good to extend [:blank:], as proposed here:
  http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
Patches to that effect are welcome.
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Fri, 06 Jan 2017 15:00:22 +0000
> Cc: 25366@debbugs.gnu.org
>
>  http://www.unicode.org/reports/tr18/tr18-19.html#Compatibility_Properties
>
>  Patches to that effect are welcome.
>
> Here's a patch.
Thanks.  A few minor comments below.
> +/* Return true if C is a horizontal whitespace character, as defined
> +   by http://www.unicode.org/reports/tr18/tr18-19.html#blank.  */
> +bool
> +blankp (int c)
> +{
> +  if (c == '\t')
> +    return true;
Why does this test explicitly only for a TAB?  What about SPC, for
example?
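
To illustrate the point, here is a minimal, self-contained sketch of a predicate that names TAB explicitly and lets SPC fall out of the spacing-separator check.  The name blankp_sketch and the hard-coded list of code points are assumptions made for this example only.  The quoted patch is not shown in full here, and a real implementation in Emacs would presumably consult the Unicode general-category data rather than a fixed table.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the patch's blankp; not the actual
   Emacs code.  Returns true if C is TAB or a character whose
   Unicode general category is Zs (Space_Separator).  */
bool
blankp_sketch (int c)
{
  /* TAB is the one member of the UTS#18 "blank" set that is not
     in general category Zs, so it needs an explicit test.  */
  if (c == '\t')
    return true;

  /* Assumption: inclusive code point ranges for category Zs,
     hard-coded here so that the example stands alone.  */
  static const struct { int first, last; } zs_ranges[] = {
    { 0x0020, 0x0020 },  /* SPACE */
    { 0x00A0, 0x00A0 },  /* NO-BREAK SPACE */
    { 0x1680, 0x1680 },  /* OGHAM SPACE MARK */
    { 0x2000, 0x200A },  /* EN QUAD .. HAIR SPACE */
    { 0x202F, 0x202F },  /* NARROW NO-BREAK SPACE */
    { 0x205F, 0x205F },  /* MEDIUM MATHEMATICAL SPACE */
    { 0x3000, 0x3000 },  /* IDEOGRAPHIC SPACE */
  };

  for (size_t i = 0; i < sizeof zs_ranges / sizeof zs_ranges[0]; i++)
    if (c >= zs_ranges[i].first && c <= zs_ranges[i].last)
      return true;

  return false;
}
```

With this shape, SPC (U+0020) and HAIR SPACE (U+200A) are both accepted through the Zs check, while newline and other vertical whitespace are rejected.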
> --- a/doc/lispref/searching.texi
> +++ b/doc/lispref/searching.texi
> @@ -553,7 +553,10 @@ Char Classes
>  (@pxref{Character Properties}) indicates they are alphabetic
>  characters.
>  @item [:blank:]
> -This matches space and tab only.
> +This matches horizontal whitespace, as defined by Unicode Technical
> +Standard #18.  In particular, it matches tabs and characters whose
> +Unicode @samp{general-category} property (@pxref{Character
> +Properties}) indicates they are spacing separators.
Similarly here: I find the lack of reference to a space potentially
confusing.
> +** The regular expression character class [:blank:] now matches
> +Unicode horizontal whitespace as defined in
> +http://www.unicode.org/reports/tr18/tr18-19.html#blank.
The reference to a particular version of UTS#18 might become obsolete
when a new version is released.  So I suggest providing a general
reference to the report and its section, not an exact URL.