GNU bug report logs - #27978
Detection of section name in man.el

Package: emacs

Reported by: Grégory Mounié <Gregory.Mounie <at> imag.fr>

Date: Sat, 5 Aug 2017 23:58:02 UTC

Severity: minor

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 27978 in the body.
You can then email your comments to 27978 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#27978; Package emacs. (Sat, 05 Aug 2017 23:58:02 GMT)

Acknowledgement sent to Grégory Mounié <Gregory.Mounie <at> imag.fr>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sat, 05 Aug 2017 23:58:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Grégory Mounié <Gregory.Mounie <at> imag.fr>
To: bug-gnu-emacs <at> gnu.org
Subject: Detection of section name in man.el
Date: Sun, 6 Aug 2017 01:44:19 +0200
 When parsing manual pages in languages with non-ASCII letters, section
names containing non-ASCII letters are not added to the table of contents.

 I noticed the bug while reading the French bash manual: the quite useful
"COMMANDES INTERNES DE l'INTERPRÉTEUR" section (SHELL BUILTIN COMMANDS)
does not appear, because of the letter É.

 I propose using character classes instead of ASCII ranges in the
appropriate regexp defvars. This should not change anything for English
manuals, and it should work for many other languages.

 It works great for the bash manual in French.
 Grégory Mounié
[0001-Unicode-support-for-man-section-name-detection.patch (text/x-patch, attachment)]

Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Fri, 18 Aug 2017 08:51:02 GMT)

Notification sent to Grégory Mounié <Gregory.Mounie <at> imag.fr>:
bug acknowledged by developer. (Fri, 18 Aug 2017 08:51:02 GMT)

Message #10 received at 27978-done <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Grégory Mounié <Gregory.Mounie <at> imag.fr>
Cc: 27978-done <at> debbugs.gnu.org
Subject: Re: bug#27978: Detection of section name in man.el
Date: Fri, 18 Aug 2017 11:49:57 +0300
> From: Grégory Mounié
> 	<Gregory.Mounie <at> imag.fr>
> Date: Sun, 6 Aug 2017 01:44:19 +0200
> 
>   When parsing manual pages in languages with non-ASCII letters, section
> names containing non-ASCII letters are not added to the table of contents.
> 
>   I noticed the bug while reading the French bash manual: the quite useful
> "COMMANDES INTERNES DE l'INTERPRÉTEUR" section (SHELL BUILTIN COMMANDS)
> does not appear, because of the letter É.
> 
>   I propose using character classes instead of ASCII ranges in the
> appropriate regexp defvars. This should not change anything for English
> manuals, and it should work for many other languages.

Thanks, I pushed these changes with some minor adjustments.
Specifically:

> -(defvar Man-section-regexp "[0-9][a-zA-Z0-9+]*\\|[LNln]"
> +(defvar Man-section-regexp "[[:digit:]][[:alnum:]+]*\\|[LNln]"
>    "Regular expression describing a manpage section within parentheses.")

I didn't change this one, because I think a section always uses only
ASCII letters and numbers, as in ".1n".  If you disagree, can you show
an example where this is not so?

> -(defvar Man-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
> +(defvar Man-heading-regexp "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
>    "Regular expression describing a manpage heading entry.")

I see no reason to replace 0-9 with [:digit:] here, since I think
non-ASCII digits will never be used in this context.  Do you agree?

Incidentally, I see quite a few similar regexps elsewhere in man.el.
Did you audit all of them and establish that they don't need similar
changes?  If not, would you like to propose similar changes there?
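
The difference between the two heading regexps discussed above can be
checked with a minimal sketch.  The helper and constant names below are
hypothetical and not part of man.el, and the second regexp only assumes
that the adjusted form keeps 0-9 and swaps the letter range for
[:upper:], as the reply above suggests.  case-fold-search is bound to
nil so the result does not depend on user settings.

```elisp
;; Hypothetical helper, not part of man.el: return t if LINE matches REGEXP.
(defun my-man-heading-p (regexp line)
  (let ((case-fold-search nil))         ; match case-sensitively
    (and (string-match-p regexp line) t)))

;; The original ASCII-only heading regexp, as quoted in the patch.
(defconst my-ascii-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$")

;; Assumed adjusted form: [:upper:] for letters, ASCII 0-9 for digits.
(defconst my-upper-heading-regexp "^\\([[:upper:]][[:upper:]0-9 /-]+\\)$")

;; An ASCII heading should match both regexps.
(my-man-heading-p my-ascii-heading-regexp "SEE ALSO")
(my-man-heading-p my-upper-heading-regexp "SEE ALSO")

;; A heading with accented capitals should match only the second one.
(my-man-heading-p my-ascii-heading-regexp "VÉASE TAMBIÉN")
(my-man-heading-p my-upper-heading-regexp "VÉASE TAMBIÉN")
```

Evaluating the last two forms should show the ASCII range rejecting the
Spanish heading while the character class accepts it, which is the
behavior the report describes for the French manual.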




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27978; Package emacs. (Fri, 18 Aug 2017 19:24:02 GMT)

Message #13 received at 27978 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Grégory Mounié <Gregory.Mounie <at> imag.fr>
Cc: 27978 <at> debbugs.gnu.org
Subject: Re: bug#27978: Detection of section name in man.el
Date: Fri, 18 Aug 2017 22:23:10 +0300
[Please keep the bug address on the CC list.]

> From: Grégory Mounié <Gregory.Mounie <at> imag.fr>
> Date: Fri, 18 Aug 2017 19:53:44 +0200
> 
>   In brief, I would not change the other a-zA-Z regexps (details below).
> 
>   But I would change the SEE ALSO regexp (around line 298) to add other
> languages. Should I file another bug report with another patch?
> 
> (defvar Man-see-also-regexp "SEE ALSO"
>    "Regular expression for SEE ALSO heading (or your equivalent).
> This regexp should not start with a `^' character.")
> 
>   Using the Debian manpage translations as a reference, and running
>   "zgrep -h SH man*/*  | sort | uniq -c | sort -n" inside the appropriate
> /usr/share/man subdirectories to infer the values, I propose:
> 
>   "SEE ALSO\|VOIR AUSSI\|SIEHE AUCH\|VÉASE TAMBIÉN\|VEJA TAMBÉM\|VEDERE ANCHE\|ZOBACZ TAKŻE\|İLGİLİ BELGELER\|参照\|参见 SEE ALSO\|參見 SEE ALSO"
> 
>   (French, German, Spanish, Portuguese, Italian, Polish, Turkish,
> Japanese, Chinese CN, Chinese TW)

OK.  If no one objects, I will make this change soon.  Thanks.
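
One way to build such an alternation without writing each \| by hand is
regexp-opt.  The following is a minimal sketch using the translations
proposed above; the names my-see-also-headings and my-see-also-regexp
are hypothetical, and this is not necessarily how the change ends up in
man.el.

```elisp
;; Hypothetical names; the strings are the translations proposed in this thread.
(defconst my-see-also-headings
  '("SEE ALSO" "VOIR AUSSI" "SIEHE AUCH" "VÉASE TAMBIÉN" "VEJA TAMBÉM"
    "VEDERE ANCHE" "ZOBACZ TAKŻE" "İLGİLİ BELGELER"
    "参照" "参见 SEE ALSO" "參見 SEE ALSO"))

;; regexp-opt returns a regexp matching any of the given strings.
;; The non-nil second argument wraps the result in a group, so it can be
;; embedded in a larger regexp without changing its meaning.
(defconst my-see-also-regexp (regexp-opt my-see-also-headings t))

;; The regexp itself does not start with ^, as the docstring requires;
;; the anchors are added only for this check.
(let ((case-fold-search nil))
  (string-match-p (concat "^" my-see-also-regexp "$") "VOIR AUSSI"))
```

Keeping the headings in a list also makes it easier to add a language
later than editing one long string.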

> Details below about the a-zA-Z regexps:
> 
> On 18/08/2017 at 10:49, Eli Zaretskii wrote:
> > 
> > Thanks, I pushed these changes with some minor adjustments.
> > Specifically:
> > 
> >> -(defvar Man-section-regexp "[0-9][a-zA-Z0-9+]*\\|[LNln]"
> >> +(defvar Man-section-regexp "[[:digit:]][[:alnum:]+]*\\|[LNln]"
> >>     "Regular expression describing a manpage section within parentheses.")
> > 
> > I didn't change this one, because I think a section always uses only
> > ASCII letters and numbers, as in ".1n".  If you disagree, can you show
> > an example where this is not so?
> > 
> 
>   I have installed the various multilingual standard manpages on my Debian
> system and grep did not turn up a counterexample, so I guess it is perfect.
> 
> >> -(defvar Man-heading-regexp "^\\([A-Z][A-Z0-9 /-]+\\)$"
> >> +(defvar Man-heading-regexp "^\\([[:upper:]][[:upper:][:digit:] /-]+\\)$"
> >>     "Regular expression describing a manpage heading entry.")
> > 
> > I see no reason to replace 0-9 with [:digit:] here, since I think
> > non-ASCII digits will never be used in this context.  Do you agree?
> > 
> > Incidentally, I see quite a few similar regexps elsewhere in man.el.
> > Did you audit all of them and establish that they don't need similar
> > changes?  If not, would you like to propose similar changes there?
> > 
> 
>   There are 18 occurrences of a-zA-Z. They look like detection patterns
> carefully crafted over time, so I would not change them without a
> counterexample either.
> 
>   The first four a-zA-Z occurrences seem related to parsing external
> commands, with particularities in the Windows port, so I would not
> recommend changing them.
>   Occurrences 5-18 try to guess the manpage around POS. The main pattern
>   is "-a-zA-Z0-9._+:"
> 
>   With the same set of multilingual manpages, I found only one
> character used in a manpage name that is not in the set: "[" (man [ leads
> you to test). I suspect that adding "[" would cause more regressions than
> it fixes.
> 
>   Note that on line 720 the pattern is slightly different (it lacks
> "-._:"). I do not really understand why.
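
The observation about "[" can be reproduced with a small sketch.  The
helper name is hypothetical, and the regexp only wraps the "main
pattern" character set quoted above in a bracket expression; the actual
regexps in man.el add more context around it.

```elisp
;; Hypothetical helper: t if NAME consists only of characters from the
;; set "-a-zA-Z0-9._+:" quoted in the message above.
(defun my-man-name-chars-ok-p (name)
  (let ((case-fold-search nil))
    ;; \` and \' anchor at the start and end of the string; the leading
    ;; "-" inside the brackets is a literal hyphen.
    (and (string-match-p "\\`[-a-zA-Z0-9._+:]+\\'" name) t)))

(my-man-name-chars-ok-p "bash")      ; ordinary page name
(my-man-name-chars-ok-p "git-log")   ; hyphenated page name
(my-man-name-chars-ok-p "[")         ; the one name found outside the set
```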




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 16 Sep 2017 11:24:05 GMT)
