GNU bug report logs - #30814
Please increase the value of MAX_MON_WIDTH in ls.c

Package: coreutils;

Reported by: Rafal Luzynski <digitalfreak <at> lingonborough.com>

Date: Wed, 14 Mar 2018 00:08:01 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 30814 in the body.
You can then email your comments to 30814 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#30814; Package coreutils. (Wed, 14 Mar 2018 00:08:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Rafal Luzynski <digitalfreak <at> lingonborough.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 14 Mar 2018 00:08:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Rafal Luzynski <digitalfreak <at> lingonborough.com>
To: bug-coreutils <at> gnu.org
Subject: Please increase the value of MAX_MON_WIDTH in ls.c
Date: Wed, 14 Mar 2018 01:06:46 +0100 (CET)
As we have introduced the support of nominative and genitive
month names in glibc [1] and we are going to provide the updated
locale data for Catalan language [2] it has been discovered [3]
that the current limit of the maximum length of the abbreviated
month name as displayed by "ls -l" will not work with the new
data for Catalan.  It is obligatory to precede the month name
with "de " (note: the space) so the abbreviated month names limited
to 5 characters will be ambiguous and therefore unreadable:

de ma  (should be "de mar" at least)
d’abr  (correct)
de ma  (should be "de mai" at least)
de ju  (should be "de jun" at least)
de ju  (should be "de jul" at least)

Increasing the value of MAX_MON_WIDTH to 6 characters will fix
the problem. The location of the constant is here: [4]
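
The collision can be reproduced with a minimal sketch. The helper name ambiguous_prefixes is hypothetical, and the input is limited to the four colliding abbreviations listed above; d’abr is left out so the result does not depend on how awk handles multibyte characters. This is not how ls itself truncates: ls is understood to work in display columns, whereas awk's substr counts characters or bytes.

```shell
# Print every prefix of the given width that is shared by more than one name.
# Names are read from stdin, one per line. Hypothetical helper for illustration.
ambiguous_prefixes() {
  awk -v w="$1" '
    { count[substr($0, 1, w)]++ }
    END { for (p in count) if (count[p] > 1) print p }
  ' | LC_ALL=C sort
}

names='de mar
de mai
de jun
de jul'

echo "width 5:"; printf '%s\n' "$names" | ambiguous_prefixes 5   # lists "de ju" and "de ma"
echo "width 6:"; printf '%s\n' "$names" | ambiguous_prefixes 6   # lists nothing
```

At width 5 two prefixes each stand for two different months; at width 6 no prefix is shared.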

However, it has also been suggested in the same bug report that
there should be no additional limit on the month length.

This bug may be related to coreutils bug #29377. [5]

Regards,

Rafal Luzynski


[1] https://sourceware.org/bugzilla/show_bug.cgi?id=10871
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=22848
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=22848#c6
[4] http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ls.c#n1099
[5] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29377




Information forwarded to bug-coreutils <at> gnu.org:
bug#30814; Package coreutils. (Wed, 14 Mar 2018 18:41:01 GMT) Full text and rfc822 format available.

Message #8 received at 30814 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Rafal Luzynski <digitalfreak <at> lingonborough.com>, 30814 <at> debbugs.gnu.org
Subject: Re: bug#30814: Please increase the value of MAX_MON_WIDTH in ls.c
Date: Wed, 14 Mar 2018 11:40:31 -0700
On 13/03/18 17:06, Rafal Luzynski wrote:
> As we have introduced the support of nominative and genitive
> month names in glibc [1] and we are going to provide the updated
> locale data for Catalan language [2] it has been discovered [3]
> that the current limit of the maximum length of the abbreviated
> month name as displayed by "ls -l" will not work with the new
> data for Catalan.  It is obligatory to precede the month name
> with "de " (note: the space) so the abbreviated month names limited
> to 5 characters will be ambiguous and therefore unreadable:

It's a bit surprising that _abbreviations_ all need the "de " prefix,
but fair enough.

> de ma  (should be "de mar" at least)
> d’abr  (correct)
> de ma  (should be "de mai" at least)
> de ju  (should be "de jun" at least)
> de ju  (should be "de jul" at least)
> 
> Increasing the value of MAX_MON_WIDTH to 6 characters will fix
> the problem. The location of the constant is here: [4]
> 
> However, it has also been suggested in the same bug report that
> there should be no additional limit on the month length.
> 
> This bug may be related to coreutils bug #29377. [5]
> 
> Regards,
> 
> Rafal Luzynski
> 
> 
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=10871
> [2] https://sourceware.org/bugzilla/show_bug.cgi?id=22848
> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=22848#c6
> [4] http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ls.c#n1099
> [5] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29377
> 
> 
> 
> 


Thanks for the careful analysis.

5 was chosen as a max width for abmon
as that was seen to be unambiguous and
also to truncate overly long abbreviations.

One can browse the abbreviations by length using:

  locale -a | grep utf8 |
  while read l; do LC_ALL=$l locale abmon; done |
  tr ';' '\n' | sort -u | grep '.\{5,\}' |
  while read mon; do
    printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
  done |
  sort -n | less

That shows a couple of existing issues with the limit of 5.
ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7 to be unambiguous,
while Arabic needs 12!
I don't remember Arabic being so long at the time I implemented
the alignment/truncation in ls (9 years ago), but we should probably
expand to account for that.

$ LC_ALL=ln_CD.utf8 locale abmon
sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.

$ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
كانون الثاني
شباط
آذار
نيسان
نوار
حزيران
تموز
آب
أيلول
تشرين الأول
تشرين الثاني
كانون الأول

Given that the increase in supported size should only impact relatively few languages,
it probably makes sense to increase to 12. The attached patch does that
and also augments the test to find ambiguous cases.

cheers,
Pádraig
[ls-abmon-width.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#30814; Package coreutils. (Wed, 14 Mar 2018 22:54:01 GMT) Full text and rfc822 format available.

Message #11 received at 30814 <at> debbugs.gnu.org (full text, mbox):

From: Rafal Luzynski <digitalfreak <at> lingonborough.com>
To: 30814 <at> debbugs.gnu.org, Pádraig Brady <P <at> draigBrady.com>
Subject: Re: bug#30814: Please increase the value of MAX_MON_WIDTH in ls.c
Date: Wed, 14 Mar 2018 23:53:20 +0100 (CET)
14.03.2018 19:40 Pádraig Brady <P <at> draigBrady.com> wrote:
> [...]
> One can browse the abbreviations by length using:
>
> locale -a | grep utf8 |
> while read l; do LC_ALL=$l locale abmon; done |
> tr ';' '\n' | sort -u | grep '.\{5,\}' |
> while read mon; do
> printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
> done |
> sort -n | less
>
> That shows a couple of existing issues with the limit of 5.
> ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7 to be
> unambiguous, while Arabic needs 12!
> [...]
>
> $ LC_ALL=ln_CD.utf8 locale abmon
> sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.

Nice script, thank you. :-) The issue with ln_CD no longer
applies; it was fixed in June/July 2017. Please see the output
on Fedora 28 (beta) with glibc 2.27:

$ LC_ALL=ln_CD.utf8 locale abmon
yan;fbl;msi;apl;mai;yun;yul;agt;stb;ɔtb;nvb;dsb

but it does not help because some Arabic languages still need 12.
Even worse, your script run on the same machine gives the following
output (only the final lines):

...
11 siakwa kati
11 yahbra kati
11 تشرين الأول
11 كانون الأول
12 kakamuk kati
12 pastara kati
12 waupasa kati
12 تشرين الثاني
12 كانون الثاني
15 lî wainhka kati
15 lih mairin kati
(END)

Those with 15 characters come from miq_NI language which has been
introduced in September 2017 (glibc 2.27, released Feb 1, 2018):

$ LC_ALL=miq_NI.utf8 locale abmon
siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
$ LC_ALL=miq_NI.utf8 locale mon
siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati

But, as you can see, this locale data should be fixed because abmon
and mon are the same; at least the " kati" suffix, which appears everywhere,
could probably be removed. Also, truncating the strings to 12 characters
probably still leaves them unambiguous.
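
A rough way to check the suffix idea against the miq_NI list quoted above is sketched here. The helper count_distinct_without_suffix is hypothetical, and the sketch only confirms that the names stay distinct once " kati" is dropped; it says nothing about what the locale data ought to contain.

```shell
# Count distinct names left after dropping a literal suffix from each one.
# $1: semicolon-separated names; $2: suffix (assumed free of regex metacharacters).
count_distinct_without_suffix() {
  printf '%s\n' "$1" | tr ';' '\n' | sed "s/$2\$//" | LC_ALL=C sort -u | wc -l
}

# abmon for miq_NI as quoted above (glibc 2.27).
miq_abmon='siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati'

count_distinct_without_suffix "$miq_abmon" ' kati'   # 12: every month stays distinct
```

LC_ALL=C is set for sort so that uniqueness is decided byte by byte rather than by locale collation.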

While we are at it: I have not checked, but does your tests/ls/abmon-align.sh
script check for the length required to make all abbreviated month
names unambiguous (i.e., how far we can truncate while the month
names remain unambiguous) or just for the longest
abbreviated month name?
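
The first half of that question can be answered for a single list with a sketch like the one below. It is not taken from tests/ls/abmon-align.sh; min_unambiguous_width is a hypothetical helper, and it counts characters (or bytes, depending on the awk implementation and locale) rather than display columns.

```shell
# Print the smallest truncation width at which all names are still distinct.
# Input: one semicolon-separated list on stdin, as printed by "locale abmon".
min_unambiguous_width() {
  tr ';' '\n' | awk '
    { name[NR] = $0; if (length($0) > max) max = length($0) }
    END {
      for (w = 1; w <= max; w++) {
        split("", seen)          # portable way to clear the array
        dup = 0
        for (i = 1; i <= NR; i++) {
          p = substr(name[i], 1, w)
          if (p in seen) { dup = 1; break }
          seen[p] = 1
        }
        if (!dup) { print w; exit }
      }
      print "none"               # identical names: no width separates them
    }'
}

# Illustrative ASCII input (English three-letter abbreviations).
printf 'jan;feb;mar;apr;may;jun;jul;aug;sep;oct;nov;dec\n' | min_unambiguous_width   # prints 3

# Against installed locale data (result depends on the system):
# LC_ALL=ar_SY.utf8 locale abmon | min_unambiguous_width
```

Comparing this number with the longest name shows how much of an abbreviation could be cut before two months collide.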

> $ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
> [...]

This is still true, although again mon and abmon seem to be the same
in ar_SY, which is probably not the best we can have. I would fix it
if only I knew how. :) (BTW, other Arabic variants seem to have
shorter abbreviated month names.)

> [...]
> Given that the increase in supported size should only impact relatively few
> languages, it probably makes sense to increase to 12. The attached patch does
> that and also augments the test to find ambiguous cases.

12 is more than I asked for but that's definitely not destructive.
My only remark is: please remove "Lingala" from the commit comment
because it is no longer true. Otherwise the patch seems to be OK.

Thank you and best regards,

Rafal




Information forwarded to bug-coreutils <at> gnu.org:
bug#30814; Package coreutils. (Fri, 16 Mar 2018 10:16:02 GMT) Full text and rfc822 format available.

Message #14 received at 30814 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Rafal Luzynski <digitalfreak <at> lingonborough.com>, 30814 <at> debbugs.gnu.org
Subject: Re: bug#30814: Please increase the value of MAX_MON_WIDTH in ls.c
Date: Fri, 16 Mar 2018 03:15:04 -0700
On 14/03/18 15:53, Rafal Luzynski wrote:
> 14.03.2018 19:40 Pádraig Brady <P <at> draigBrady.com> wrote:
>> [...]
>> One can browse the abbreviations by length using:
>>
>> locale -a | grep utf8 |
>> while read l; do LC_ALL=$l locale abmon; done |
>> tr ';' '\n' | sort -u | grep '.\{5,\}' |
>> while read mon; do
>> printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
>> done |
>> sort -n | less
>>
>> That shows a couple of existing issues with the limit of 5.
>> ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7 to be
>> unambiguous, while Arabic needs 12!
>> [...]
>>
>> $ LC_ALL=ln_CD.utf8 locale abmon
>> sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.
> 
> Nice script, thank you. :-) The issue with ln_CD no longer
> applies; it was fixed in June/July 2017. Please see the output
> on Fedora 28 (beta) with glibc 2.27:
> 
> $ LC_ALL=ln_CD.utf8 locale abmon
> yan;fbl;msi;apl;mai;yun;yul;agt;stb;ɔtb;nvb;dsb
> 
> but it does not help because some Arabic languages still need 12.
> Even worse, your script run on the same machine gives the following
> output (only the final lines):
> 
> ...
> 11 siakwa kati
> 11 yahbra kati
> 11 تشرين الأول
> 11 كانون الأول
> 12 kakamuk kati
> 12 pastara kati
> 12 waupasa kati
> 12 تشرين الثاني
> 12 كانون الثاني
> 15 lî wainhka kati
> 15 lih mairin kati
> (END)
> 
> Those with 15 characters come from miq_NI language which has been
> introduced in September 2017 (glibc 2.27, released Feb 1, 2018):
> 
> $ LC_ALL=miq_NI.utf8 locale abmon
> siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
> $ LC_ALL=miq_NI.utf8 locale mon
> siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
> 
> But, as you can see, this locale data should be fixed because abmon
> and mon are the same; at least the " kati" suffix, which appears everywhere,
> could probably be removed. Also, truncating the strings to 12 characters
> probably still leaves them unambiguous.
> 
> While we are at it: I have not checked, but does your tests/ls/abmon-align.sh
> script check for the length required to make all abbreviated month
> names unambiguous (i.e., how far we can truncate while the month
> names remain unambiguous) or just for the longest
> abbreviated month name?

It checks that the 12 months are unambiguous for a few sample languages.

> 
>> $ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
>> [...]
> 
> This is still true, although again mon and abmon seem to be the same
> in ar_SY, which is probably not the best we can have. I would fix it
> if only I knew how. :)

A patch to glibc would be most appreciated, but as for content I don't know.
I see ICU has narrow, short, long variants, but for ar_SY the narrow are
ambiguous, and the short are copies of the long ones:
http://demo.icu-project.org/icu-bin/locexp?d_=en&_=ar_SY

> (BTW, other Arabic variants seem to have
> shorter abbreviated month names.)

Right, I see the long Arabic names are derived from Aramaic:
https://en.wikipedia.org/wiki/Arabic_names_of_calendar_months

>> [...]
>> Given that the increase in supported size should only impact relatively few
>> languages, it probably makes sense to increase to 12. The attached patch does
>> that and also augments the test to find ambiguous cases.
> 
> 12 is more than I asked for but that's definitely not destructive.
> My only remark is: please remove "Lingala" from the commit comment
> because it is no longer true. Otherwise the patch seems to be OK.

Given this is usually a deficiency in the locale rather than inherent
in the language, I'm definitely not going above 12.
I'd even drop it to 8 if there were apparent short abmons for
all languages, but will leave at 12 as this isn't the case for ar_SY at least.

cheers,
Pádraig




bug closed, send any further explanations to 30814 <at> debbugs.gnu.org and Rafal Luzynski <digitalfreak <at> lingonborough.com> Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Fri, 16 Mar 2018 10:19:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-coreutils <at> gnu.org:
bug#30814; Package coreutils. (Fri, 16 Mar 2018 12:32:02 GMT) Full text and rfc822 format available.

Message #19 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ruediger Meier <sweet_f_a <at> gmx.de>
To: bug-coreutils <at> gnu.org
Cc: 30814 <at> debbugs.gnu.org, Pádraig Brady <P <at> draigbrady.com>,
 Rafal Luzynski <digitalfreak <at> lingonborough.com>
Subject: Re: bug#30814: Please increase the value of MAX_MON_WIDTH in ls.c
Date: Fri, 16 Mar 2018 13:30:53 +0100
On Wednesday 14 March 2018, Pádraig Brady wrote:
> On 13/03/18 17:06, Rafal Luzynski wrote:
> > As we have introduced the support of nominative and genitive
> > month names in glibc [1] and we are going to provide the updated
> > locale data for Catalan language [2] it has been discovered [3]
> > that the current limit of the maximum length of the abbreviated
> > month name as displayed by "ls -l" will not work with the new
> > data for Catalan.  It is obligatory to precede the month name
> > with "de " (note: the space) so the abbreviated month names limited
> > to 5 characters will be ambiguous and therefore unreadable:
>
> It's a bit surprising that _abbreviations_ all need the "de " prefix,
> but fair enough.

Most of the "abbreviations" used in our locales do not follow the language
rules anyway. Even in English we would need to add dots, and some month
abbreviations simply do not exist.

Below are 3 examples of the correct abbreviations for English, Spanish, and
German:

English  Spanish  German
Jan.     enero    Jan.
Feb.     feb.     Feb.
Mar.     marzo    März
Apr.     abr.     Apr.
May      mayo     Mai
June     jun.     Jun.
July     jul.     Jul.
Aug.     agosto   Aug.
Sept.    set.     Sept.
Oct.     oct.     Okt.
Nov.     nov.     Nov.
Dec.     dic.     Dez.

Thankfully all 3 locales just use the first three letters. Note that in
Spanish you would also need to add such a genitive "de", but of course
nobody wants to see it when printing short dates to a terminal.

While I see a benefit in having the correct abbreviations *somewhere* in
the locale, I don't think they should be used in tools like ls by
default. The output should IMHO not be longer than that of --time-style=long-iso
or --full-time.
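
For comparison, here is a small sketch of the two numeric styles mentioned above. It assumes GNU ls (for --time-style and --full-time) and a POSIX touch; iso_listing is a hypothetical helper, and the timestamp is simply this report's submission time.

```shell
# List one file with a numeric, locale-independent timestamp (GNU ls assumed).
iso_listing() {
  TZ=UTC ls -l --time-style=long-iso -- "$1"
}

tmp=$(mktemp)
TZ=UTC touch -t 201803140008.01 "$tmp"   # 2018-03-14 00:08:01 UTC

iso_listing "$tmp"                        # date column reads 2018-03-14 00:08
TZ=UTC ls -l --full-time -- "$tmp"        # same file with a full-precision timestamp

rm -f -- "$tmp"
```

Neither style prints a month name, so the width of abmon does not affect the output.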

> > de ma  (should be "de mar" at least)
> > d’abr  (correct)
> > de ma  (should be "de mai" at least)
> > de ju  (should be "de jun" at least)
> > de ju  (should be "de jul" at least)

I don't speak Catalan, but I can't believe that "de jun" is a correct 
abbreviation following the language rules.


> > Increasing the value of MAX_MON_WIDTH to 6 characters will fix
> > the problem. The location of the constant is here: [4]
> >
> > However, it has also been suggested in the same bug report that
> > there should be no additional limit on the month length.
> >
> > This bug may be related to coreutils bug #29377. [5]
> >
> > Regards,
> >
> > Rafal Luzynski
> >
> >
> > [1] https://sourceware.org/bugzilla/show_bug.cgi?id=10871
> > [2] https://sourceware.org/bugzilla/show_bug.cgi?id=22848
> > [3] https://sourceware.org/bugzilla/show_bug.cgi?id=22848#c6
> > [4] http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ls.c#n1099
> > [5] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29377
>
> Thanks for the careful analysis.
>
> 5 was chosen as a max width for abmon
> as that was seen to be unambiguous and
> also to truncate overly long abbreviations.
>
> One can browse the abbreviations by length using:
>
>   locale -a | grep utf8 |
>   while read l; do LC_ALL=$l locale abmon; done |
>   tr ';' '\n' | sort -u | grep '.\{5,\}' |
>   while read mon; do
>     printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
>   done |
>   sort -n | less
>
> That shows a couple of existing issues with the limit of 5.
> ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7 to
> be unambiguous, while Arabic needs 12!
> I don't remember Arabic being so long at the time I implemented
> the alignment/truncation in ls (9 years ago), but we should probably
> expand to account for that.
>
> $ LC_ALL=ln_CD.utf8 locale abmon
> sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.
>
> $ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
> كانون الثاني
> شباط
> آذار
> نيسان
> نوار
> حزيران
> تموز
> آب
> أيلول
> تشرين الأول
> تشرين الثاني
> كانون الأول
>
> Given that the increase in supported size should only impact relatively
> few languages, it probably makes sense to increase to 12. The attached patch
> does that and also augments the test to find ambiguous cases.
>
> cheers,
> Pádraig






bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 14 Apr 2018 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 7 years and 67 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.