GNU bug report logs - #27978
Detection of section name in man.el
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#27978; Package emacs. (Sat, 05 Aug 2017 23:58:02 GMT)
Acknowledgement sent to Grégory Mounié <Gregory.Mounie <at> imag.fr>: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sat, 05 Aug 2017 23:58:03 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
When parsing manual pages in languages with non-ASCII letters, section
names containing non-ASCII letters are not added to the table of contents.
I noticed the bug while reading the French bash manual: the quite useful
"COMMANDES INTERNES DE l'INTERPRÉTEUR" section (SHELL BUILTIN COMMANDS
in English) does not appear, because of the letter É.
I propose using character classes instead of ASCII ranges in the
relevant regexp defvars. This should not change anything for English
manuals, and it should work for many other languages.
It works well for the bash manual in French.
Grégory Mounié
[0001-Unicode-support-for-man-section-name-detection.patch (text/x-patch, attachment)]
Reply sent to Eli Zaretskii <eliz <at> gnu.org>: You have taken responsibility. (Fri, 18 Aug 2017 08:51:02 GMT)
Notification sent to Grégory Mounié <Gregory.Mounie <at> imag.fr>: bug acknowledged by developer. (Fri, 18 Aug 2017 08:51:02 GMT)
Message #10 received at 27978-done <at> debbugs.gnu.org (full text, mbox):
> From: Grégory Mounié <Gregory.Mounie <at> imag.fr>
> Date: Sun, 6 Aug 2017 01:44:19 +0200
>
> When parsing manual pages in languages with non-ASCII letters, section
> names containing non-ASCII letters are not added to the table of contents.
>
> I noticed the bug while reading the French bash manual: the quite useful
> "COMMANDES INTERNES DE l'INTERPRÉTEUR" section (SHELL BUILTIN COMMANDS
> in English) does not appear, because of the letter É.
>
> I propose using character classes instead of ASCII ranges in the
> relevant regexp defvars. This should not change anything for English
> manuals, and it should work for many other languages.
Thanks, I pushed these changes with some minor adjustments.
Specifically:
> -(defvar Man-section-regexp "[0-9][a-zA-Z0-9+]*\\|[LNln]"
> +(defvar Man-section-regexp "[[:digit:]][[:alnum:]+]*\\|[LNln]"
> "Regular expression describing a manpage section within parentheses.")
I didn't change this one, because I think a section always uses only
ASCII letters and numbers, as in ".1n". If you disagree, can you show
an example where this is not so?
> -(defvar Man-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
> +(defvar Man-heading-regexp "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
> "Regular expression describing a manpage heading entry.")
I see no reason to replace 0-9 with [:digit:] here, since I think
non-ASCII digits will never be used in this context. Do you agree?
Incidentally, I see quite a few similar regexps elsewhere in man.el.
Did you audit all of them and establish that they don't need similar
changes? If not, would you like to propose similar changes there?
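
The effect of the Man-heading-regexp change discussed above can be checked with a short sketch. The names prefixed with my/ are hypothetical and do not exist in man.el, and "VALEUR RENVOYÉE" is only an assumed example of a heading containing É. The second pattern follows the suggestion to use [:upper:] for letters while keeping the ASCII 0-9 range; the thread does not quote the regexp as finally committed, so treat this as an illustration rather than the pushed code.

```elisp
;; Hypothetical names (my/...) are for illustration only; they are not in man.el.
(defconst my/ascii-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
  "The ASCII-only heading pattern quoted in the report.")

(defconst my/unicode-heading-regexp "^\\([[:upper:]][[:upper:]0-9 /-]+\\)$"
  "Assumed variant: [:upper:] for letters, ASCII range kept for digits.")

(defun my/heading-p (regexp line)
  "Return t if LINE matches REGEXP, with case folding disabled."
  ;; With case folding enabled, [A-Z] and [:upper:] would also accept
  ;; lowercase letters, so bind `case-fold-search' to nil explicitly.
  (let ((case-fold-search nil))
    (and (string-match-p regexp line) t)))

(my/heading-p my/ascii-heading-regexp "DESCRIPTION")       ; ASCII heading matches
(my/heading-p my/ascii-heading-regexp "VALEUR RENVOYÉE")   ; É is outside A-Z, no match
(my/heading-p my/unicode-heading-regexp "VALEUR RENVOYÉE") ; [:upper:] accepts É
```

Binding case-fold-search to nil matters here: otherwise the comparison between the two patterns would be blurred by case-insensitive matching.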
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#27978; Package emacs. (Fri, 18 Aug 2017 19:24:02 GMT)
Message #13 received at 27978 <at> debbugs.gnu.org (full text, mbox):
[Please keep the bug address on the CC list.]
> From: Grégory Mounié <Gregory.Mounie <at> imag.fr>
> Date: Fri, 18 Aug 2017 19:53:44 +0200
>
> In brief, I would not change the other a-zA-Z regexps (details below).
>
> But I would change the SEE ALSO regexp (around line 298) to add other
> languages. Should I file another bug report with another patch?
>
> (defvar Man-see-also-regexp "SEE ALSO"
> "Regular expression for SEE ALSO heading (or your equivalent).
> This regexp should not start with a `^' character.")
>
> Using the Debian manpages translations as a reference, and running
> "zgrep -h SH man*/* | sort | uniq -c | sort -n" inside the appropriate
> /usr/share/man subdirectories to infer the values, I propose:
>
> "SEE ALSO\|VOIR AUSSI\|SIEHE AUCH\|VÉASE TAMBIÉN\|VEJA TAMBÉM\|VEDERE
> ANCHE\|ZOBACZ TAKŻE\|İLGİLİ BELGELER\|参照\|参见 SEE ALSO\|參見 SEE ALSO"
>
> (French, German, Spanish, Portuguese, Italian, Polish, Turkish,
> Japanese, Chinese CN, Chinese TW)
OK. If no one objects, I will make this change soon. Thanks.
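
As a rough sketch of how such a multilingual alternation could be assembled and checked, the snippet below builds the regexp from a list of the proposed headings. The my/ names are hypothetical and not part of man.el. The docstring says the regexp should not start with `^', so the sketch assumes a caller may prepend one; under that assumption the alternatives are wrapped in a shy group so the anchor applies to every alternative and not only the first.

```elisp
;; Hypothetical names (my/...) are for illustration only; they are not in man.el.
(defconst my/see-also-headings
  '("SEE ALSO" "VOIR AUSSI" "SIEHE AUCH" "VÉASE TAMBIÉN" "VEJA TAMBÉM"
    "VEDERE ANCHE" "ZOBACZ TAKŻE" "İLGİLİ BELGELER" "参照"
    "参见 SEE ALSO" "參見 SEE ALSO")
  "SEE ALSO headings proposed in the thread, one per language.")

(defconst my/see-also-regexp
  ;; In a Lisp string the alternation operator \| is written "\\|".
  ;; The shy group \(?: ... \) keeps a prepended ^ binding to all alternatives.
  (concat "\\(?:"
          (mapconcat #'regexp-quote my/see-also-headings "\\|")
          "\\)")
  "Alternation of all headings in `my/see-also-headings'.")

(defun my/see-also-heading-p (line)
  "Return t if LINE starts with one of the SEE ALSO headings."
  (let ((case-fold-search nil))
    (and (string-match-p (concat "^" my/see-also-regexp) line) t)))

(my/see-also-heading-p "VOIR AUSSI")    ; French heading
(my/see-also-heading-p "ZOBACZ TAKŻE")  ; Polish heading
(my/see-also-heading-p "DESCRIPTION")   ; not a SEE ALSO heading
```

Without the group, "^SEE ALSO\\|VOIR AUSSI" would anchor only the first alternative, and the others could match anywhere in a line.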
> Details below about the a-zA-Z regexps:
>
> On 18/08/2017 at 10:49, Eli Zaretskii wrote:
> >
> > Thanks, I pushed these changes with some minor adjustments.
> > Specifically:
> >
> >> -(defvar Man-section-regexp "[0-9][a-zA-Z0-9+]*\\|[LNln]"
> >> +(defvar Man-section-regexp "[[:digit:]][[:alnum:]+]*\\|[LNln]"
> >> "Regular expression describing a manpage section within parentheses.")
> >
> > I didn't change this one, because I think a section always uses only
> > ASCII letters and numbers, as in ".1n". If you disagree, can you show
> > an example where this is not so?
> >
>
> I have installed the various multilingual standard manpages on my Debian
> system, and grep found no counterexample, so I guess it is fine as is.
>
> >> -(defvar Man-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
> >> +(defvar Man-heading-regexp "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
> >> "Regular expression describing a manpage heading entry.")
> >
> > I see no reason to replace 0-9 with [:digit:] here, since I think
> > non-ASCII digits will never be used in this context. Do you agree?
> >
> > Incidentally, I see quite a few similar regexps elsewhere in man.el.
> > Did you audit all of them and establish that they don't need similar
> > changes? If not, would you like to propose similar changes there?
> >
>
> There are 18 occurrences of a-zA-Z. They look like detection logic
> carefully refined over time, so I would not change them without a
> counterexample either.
>
> The first four occurrences seem related to the parsing of external
> commands, with particularities in the Windows port, so I would not
> recommend changing them.
> Occurrences 5 to 18 try to guess the manpage name around POS. The main
> pattern is "-a-zA-Z0-9._+:"
>
> With the same set of multilingual manpages, I have found only one
> character that is used in a manpage name and is not in the set: "["
> ("man [" takes you to the test manpage). I suspect that adding "["
> would cause more regressions than it fixes.
>
> Note that at line 720 the pattern is slightly different (it lacks "-._:").
> I do not really understand why.
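
The observation about "[" can be reproduced with a small sketch that treats the quoted character set as a bracket expression. The my/ names and the sample page names are hypothetical, and how man.el actually uses this set around POS is not shown in the thread, so this checks only the character set itself.

```elisp
;; Hypothetical names (my/...) are for illustration only; they are not in man.el.
(defconst my/page-name-chars "-a-zA-Z0-9._+:"
  "Character set quoted in the thread as the main manpage-name pattern.")

(defun my/page-name-p (name)
  "Return t if NAME consists only of characters in `my/page-name-chars'."
  (let ((case-fold-search nil))
    ;; \\` and \\' anchor the match to the whole string.
    (and (string-match-p (concat "\\`[" my/page-name-chars "]+\\'") name) t)))

(my/page-name-p "ssh_config")  ; letters and underscore are in the set
(my/page-name-p "g++")         ; plus is in the set
(my/page-name-p "[")           ; "[" is outside the set, so this is rejected
```

The leading "-" in the set is a literal hyphen because it comes first inside the brackets.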
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 16 Sep 2017 11:24:05 GMT)
This bug report was last modified 7 years and 274 days ago.