GNU bug report logs - #30814: Please increase the value of MAX_MON_WIDTH in ls.c
Report forwarded to bug-coreutils <at> gnu.org: bug#30814; Package coreutils. (Wed, 14 Mar 2018 00:08:01 GMT)
Acknowledgement sent to Rafal Luzynski <digitalfreak <at> lingonborough.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 14 Mar 2018 00:08:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
As we have introduced the support of nominative and genitive
month names in glibc [1] and we are going to provide the updated
locale data for Catalan language [2] it has been discovered [3]
that the current limit of the maximum length of the abbreviated
month name as displayed by "ls -l" will not work with the new
data for Catalan. It is obligatory to precede the month name
with "de " (note: the space) so the abbreviated month names limited
to 5 characters will be ambiguous and therefore unreadable:
de ma (should be "de mar" at least)
d’abr (correct)
de ma (should be "de mai" at least)
de ju (should be "de jun" at least)
de ju (should be "de jul" at least)
Increasing the value of MAX_MON_WIDTH to 6 characters will fix
the problem. The location of the constant is here: [4]
However, it has also been suggested in the same bug report that
there should be no additional limit on the month length.
This bug may be related to coreutils bug #29377. [5]
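To make the ambiguity concrete, here is a minimal sketch that truncates a list of month names to a given width and prints every truncated form shared by more than one name. The helper name `colliding_prefixes` is hypothetical, and the sample data is only the five Catalan forms quoted in this report (with an ASCII apostrophe), not the full locale data. It assumes an awk whose `substr` counts characters; a byte-oriented awk would count bytes for non-ASCII names.

```shell
# Hypothetical helper: print each truncated form that two or more names share.
# $1 = semicolon-separated names (the format printed by `locale abmon`)
# $2 = maximum width to truncate to
colliding_prefixes() {
  printf '%s\n' "$1" | tr ';' '\n' |
  awk -v w="$2" '{ print substr($0, 1, w) }' |  # truncate every name
  sort | uniq -d                                # keep only repeated forms
}

# Sample data: only the five forms quoted in this report (ASCII apostrophe).
sample="de mar;d'abr;de mai;de jun;de jul"

colliding_prefixes "$sample" 5   # lists the forms that collide at width 5
colliding_prefixes "$sample" 6   # prints nothing if width 6 is unambiguous
```

With this sample, a width of 5 reports two colliding forms, while a width of 6 reports none, which matches the request above.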
Regards,
Rafal Luzynski
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=10871
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=22848
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=22848#c6
[4] http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ls.c#n1099
[5] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29377
Information forwarded to bug-coreutils <at> gnu.org: bug#30814; Package coreutils. (Wed, 14 Mar 2018 18:41:01 GMT)
Message #8 received at 30814 <at> debbugs.gnu.org (full text, mbox):
On 13/03/18 17:06, Rafal Luzynski wrote:
> As we have introduced the support of nominative and genitive
> month names in glibc [1] and we are going to provide the updated
> locale data for Catalan language [2] it has been discovered [3]
> that the current limit of the maximum length of the abbreviated
> month name as displayed by "ls -l" will not work with the new
> data for Catalan. It is obligatory to precede the month name
> with "de " (note: the space) so the abbreviated month names limited
> to 5 characters will be ambiguous and therefore unreadable:
It's a bit surprising that _abbreviations_ all need the "de " prefix,
but fair enough.
> de ma (should be "de mar" at least)
> d’abr (correct)
> de ma (should be "de mai" at least)
> de ju (should be "de jun" at least)
> de ju (should be "de jul" at least)
>
> Increasing the value of MAX_MON_WIDTH to 6 characters will fix
> the problem. The location of the constant is here: [4]
>
> Although it has been also suggested in the same bug report that
> there should be no additional limit for the month length.
>
> This bug may be related with the coreutils bug #29377. [5]
>
> Regards,
>
> Rafal Luzynski
>
>
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=10871
> [2] https://sourceware.org/bugzilla/show_bug.cgi?id=22848
> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=22848#c6
> [4] http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ls.c#n1099
> [5] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29377
>
>
>
>
Thanks for the careful analysis.
5 was chosen as the max width for abmon
as that was seen to be unambiguous while
also truncating overly long abbreviations.
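As an illustration of the truncate-then-align idea only (this is not the ls.c code), a minimal sketch might look like the following. The helper name `align_names` is hypothetical, the trailing `|` is there just to make the padding visible, and the sample names are ASCII because awk's `length` and `printf` padding may count bytes rather than display columns.

```shell
# Hypothetical helper: truncate each name to at most $2 characters, then pad
# every name to the width of the longest truncated one.
# $1 = semicolon-separated names, $2 = maximum width
align_names() {
  printf '%s\n' "$1" | tr ';' '\n' |
  awk -v max="$2" '
    {
      name[NR] = substr($0, 1, max)                 # truncate
      if (length(name[NR]) > width) width = length(name[NR])
    }
    END {
      fmt = "%-" width "s|\n"                       # left-align to common width
      for (i = 1; i <= NR; i++) printf fmt, name[i]
    }'
}

align_names "Jan;Sept;June" 3
align_names "Jan;Sept;June" 5
```

A smaller maximum keeps the column narrow at the cost of cutting longer names, which is the trade-off discussed in this thread.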
One can browse the abbreviations by length using:
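  # List each distinct abmon of 5+ characters, prefixed by its display width
  locale -a | grep utf8 |
  while read -r l; do LC_ALL=$l locale abmon; done |
  tr ';' '\n' | sort -u | grep '.\{5,\}' |
  while read -r mon; do
    # wc -L (GNU) prints the display width of the longest input line
    printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
  done |
  sort -n | less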
That shows a couple of existing issues with the limit of 5.
ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7 to be unambiguous,
while Arabic needs 12!
I don't remember Arabic being so long at the time I implemented
the alignment/truncation in ls (9 years ago), but we should probably
expand to account for that.
$ LC_ALL=ln_CD.utf8 locale abmon
sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.
$ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
كانون الثاني
شباط
آذار
نيسان
نوار
حزيران
تموز
آب
أيلول
تشرين الأول
تشرين الثاني
كانون الأول
Given that the increase in supported size should only impact relatively few languages,
it probably makes sense to increase to 12. The attached does that
and also augments the test to find ambiguous cases.
cheers,
Pádraig
[ls-abmon-width.patch (text/x-patch, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#30814; Package coreutils. (Wed, 14 Mar 2018 22:54:01 GMT)
Message #11 received at 30814 <at> debbugs.gnu.org (full text, mbox):
14.03.2018 19:40 Pádraig Brady <P <at> draigBrady.com> wrote:
> [...]
> One can browse the abbreviations by length using:
>
> locale -a | grep utf8 |
> while read l; do LC_ALL=$l locale abmon; done |
> tr ';' '\n' | sort -u | grep '.\{5,\}' |
> while read mon; do
> printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
> done |
> sort -n | less
>
> That shows a couple of existing issues with the limit of 5.
> ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7 to be
> unambiguous,
> while Arabic needs 12!
> [...]
>
> $ LC_ALL=ln_CD.utf8 locale abmon
> sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.
Nice script, thank you. :-) The issue with ln_CD no longer
applies; it was fixed in June/July 2017. Please see the output
on Fedora 28 (beta) with glibc 2.27:
$ LC_ALL=ln_CD.utf8 locale abmon
yan;fbl;msi;apl;mai;yun;yul;agt;stb;ɔtb;nvb;dsb
But that does not help because some Arabic locales still need 12.
Even worse, your script run on the same machine gives the following
output (only the final lines):
...
11 siakwa kati
11 yahbra kati
11 تشرين الأول
11 كانون الأول
12 kakamuk kati
12 pastara kati
12 waupasa kati
12 تشرين الثاني
12 كانون الثاني
15 lî wainhka kati
15 lih mairin kati
(END)
Those with 15 characters come from the miq_NI locale, which was
introduced in September 2017 (glibc 2.27, released Feb 1, 2018):
$ LC_ALL=miq_NI.utf8 locale abmon
siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
$ LC_ALL=miq_NI.utf8 locale mon
siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
But, as you can see, this locale data should be fixed because abmon
and mon are the same; at least the " kati" that appears everywhere
could probably be removed. Also, truncating the strings to 12 characters
probably still leaves them unambiguous.
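One way to spot such locale data is to count how many positions hold the same string in abmon and mon. The following is a minimal sketch; the helper name `count_same_entries` is hypothetical, and it assumes both lists are single semicolon-separated lines, as `locale abmon` and `locale mon` print them.

```shell
# Hypothetical helper: count positions where the abbreviated and the full
# month name are the same string.
# $1 = abmon list, $2 = mon list (both semicolon-separated, one line each)
count_same_entries() {
  printf '%s\n%s\n' "$1" "$2" |
  awk '
    NR == 1 { n = split($0, ab, ";") }
    NR == 2 {
      split($0, mo, ";")
      for (i = 1; i <= n; i++) if (ab[i] == mo[i]) same++
    }
    END { print same + 0 }'
}

# English names as sample data: only "May" is identical in both lists.
count_same_entries \
  "Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec" \
  "January;February;March;April;May;June;July;August;September;October;November;December"
```

On a system with the locale installed, it could be fed real data with `count_same_entries "$(LC_ALL=miq_NI.utf8 locale abmon)" "$(LC_ALL=miq_NI.utf8 locale mon)"`. A result of 12 would mean that no month is actually abbreviated.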
While at it: I have not checked, but does your tests/ls/abmon-align.sh
script check for the length required to make all abbreviated month
names unambiguous (i.e., how far we can truncate while the month
names remain unambiguous), or just for the longest
abbreviated month name?
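The two quantities can differ a lot, as a minimal sketch shows. The helper name `width_report` is hypothetical and this is not what tests/ls/abmon-align.sh does. The sample is the ASCII-only subset of the miq_NI names quoted above, so that byte and character counts agree even in an awk that is not multibyte-aware.

```shell
# Hypothetical helper: report the smallest truncation width at which all names
# are still distinct, next to the length of the longest name.
# $1 = semicolon-separated names
width_report() {
  printf '%s\n' "$1" | tr ';' '\n' |
  awk '
    { name[NR] = $0; if (length($0) > longest) longest = length($0) }
    END {
      min = "none"                      # stays "none" if two names are identical
      for (w = 1; w <= longest && min == "none"; w++) {
        split("", seen); dup = 0        # reset the set of prefixes seen so far
        for (i = 1; i <= NR; i++) {
          p = substr(name[i], 1, w)
          if (p in seen) { dup = 1; break }
          seen[p] = 1
        }
        if (!dup) min = w
      }
      print "min=" min " longest=" (longest + 0)
    }'
}

# ASCII-only subset of the miq_NI abmon values quoted in this thread.
width_report "siakwa kati;kuswa kati;kakamuk kati;lih mairin kati;pastara kati;sikla kati;waupasa kati;yahbra kati;trisu kati"
```

For this subset the minimum unambiguous width is much smaller than the longest name, so a limit derived from the longest name alone would be far wider than needed to tell the months apart.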
> $ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
> [...]
This is still true although again, mon and abmon seem to be the same
in ar_SY which is probably not the best we can have. I wish I could
fix it if I only knew how. :) (BTW, other Arabic variants seem to have
the abbreviated month names shorter.)
> [...]
> Given the increase in supported size should only impact relatively few
> languages
> it probably makes sense to increase to 12. The attached does that
> and also augments the test to find ambiguous cases.
12 is more than I asked for but that's definitely not destructive.
My only remark is: please remove "Lingala" from the commit comment
because it is no longer true. Otherwise the patch seems to be OK.
Thank you and best regards,
Rafal
Information forwarded to bug-coreutils <at> gnu.org: bug#30814; Package coreutils. (Fri, 16 Mar 2018 10:16:02 GMT)
Message #14 received at 30814 <at> debbugs.gnu.org (full text, mbox):
On 14/03/18 15:53, Rafal Luzynski wrote:
> 14.03.2018 19:40 Pádraig Brady <P <at> draigBrady.com> wrote:
>> [...]
>> One can browse the abbreviations by length using:
>>
>> locale -a | grep utf8 |
>> while read l; do LC_ALL=$l locale abmon; done |
>> tr ';' '\n' | sort -u | grep '.\{5,\}' |
>> while read mon; do
>> printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
>> done |
>> sort -n | less
>>
>> That shows a couple of existing issues with the limit of 5.
>> ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7 to be
>> unambiguous,
>> while Arabic needs 12!
>> [...]
>>
>> $ LC_ALL=ln_CD.utf8 locale abmon
>> sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.
>
> Nice, script, thank you. :-) The issue with ln_CD is no longer
> true, it has been fixed in June/July 2017. Please see the output
> on Fedora 28 (beta) with glibc 2.27:
>
> $ LC_ALL=ln_CD.utf8 locale abmon
> yan;fbl;msi;apl;mai;yun;yul;agt;stb;ɔtb;nvb;dsb
>
> but it does not help because some Arabic languages still need 12.
> Even worse, your script ran at the same machine gives the following
> output (only the final lines):
>
> ...
> 11 siakwa kati
> 11 yahbra kati
> 11 تشرين الأول
> 11 كانون الأول
> 12 kakamuk kati
> 12 pastara kati
> 12 waupasa kati
> 12 تشرين الثاني
> 12 كانون الثاني
> 15 lî wainhka kati
> 15 lih mairin kati
> (END)
>
> Those with 15 characters come from miq_NI language which has been
> introduced in September 2017 (glibc 2.27, released Feb 1, 2018):
>
> $ LC_ALL=miq_NI.utf8 locale abmon
> siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
> $ LC_ALL=miq_NI.utf8 locale mon
> siakwa kati;kuswa kati;kakamuk kati;lî wainhka kati;lih mairin kati;lî kati;pastara kati;sikla kati;wîs kati;waupasa kati;yahbra kati;trisu kati
>
> But, as you can see, this locale data should be fixed because abmon
> and mon are the same;
> at least " kati" which appears everywhere may
> be probably removed. Also truncating the string to 12 characters
> probably still makes it unambiguous.
>
> While at this, I have not checked but does your tests/ls/abmon-align.sh
> script check for the length required to make all abbreviated month
> names unambiguous (i.e., how many letters can we truncate to ensure
> that the month names are still unambiguous) or just the longest
> abbreviated month name?
It checks that the 12 months for a few sample languages are unambiguous.
>
>> $ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
>> [...]
>
> This is still true although again, mon and abmon seem to be the same
> in ar_SY which is probably not the best we can have. I wish I could
> fix it if I only knew how. :)
A patch to glibc would be most appreciated, but as for content I don't know.
I see ICU has narrow, short, long variants, but for ar_SY the narrow are
ambiguous, and the short are copies of the long ones:
http://demo.icu-project.org/icu-bin/locexp?d_=en&_=ar_SY
> (BTW, other Arabic variants seem to have
> the abbreviated month names shorter.)
Right, I see the long Arabic names are derived from Aramaic:
https://en.wikipedia.org/wiki/Arabic_names_of_calendar_months
>> [...]
>> Given the increase in supported size should only impact relatively few
>> languages
>> it probably makes sense to increase to 12. The attached does that
>> and also augments the test to find ambiguous cases.
>
> 12 is more than I asked for but that's definitely not destructive.
> My only remark is: please remove "Lingala" from the commit comment
> because it is no longer true. Otherwise the patch seems to be OK.
Given this is usually a deficiency in the locale rather than inherent
in the language, I'm definitely not going above 12.
I'd even drop it to 8 if there were apparent short abmons for
all languages, but will leave at 12 as this isn't the case for ar_SY at least.
cheers,
Pádraig
bug closed, send any further explanations to 30814 <at> debbugs.gnu.org and Rafal Luzynski <digitalfreak <at> lingonborough.com>. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Fri, 16 Mar 2018 10:19:01 GMT)
Information forwarded to bug-coreutils <at> gnu.org: bug#30814; Package coreutils. (Fri, 16 Mar 2018 12:32:02 GMT)
Message #19 received at submit <at> debbugs.gnu.org (full text, mbox):
On Wednesday 14 March 2018, Pádraig Brady wrote:
> On 13/03/18 17:06, Rafal Luzynski wrote:
> > As we have introduced the support of nominative and genitive
> > month names in glibc [1] and we are going to provide the updated
> > locale data for Catalan language [2] it has been discovered [3]
> > that the current limit of the maximum length of the abbreviated
> > month name as displayed by "ls -l" will not work with the new
> > data for Catalan. It is obligatory to precede the month name
> > with "de " (note: the space) so the abbreviated month names limited
> > to 5 characters will be ambiguous and therefore unreadable:
>
> It's a bit surprising that _abbreviations_ all need the "de " prefix,
> but fair enough.
Most "abbreviations" used in our locales do not follow the language
rules anyway. Even in English we would need to add dots, and some month
abbreviations just do not exist.
Below are three examples of the correct abbreviations for English,
Spanish, and German:
English  Spanish  German
Jan.     enero    Jan.
Feb.     feb.     Feb.
Mar.     marzo    März
Apr.     abr.     Apr.
May      mayo     Mai
June     jun.     Jun.
July     jul.     Jul.
Aug.     agosto   Aug.
Sept.    set.     Sept.
Oct.     oct.     Okt.
Nov.     nov.     Nov.
Dec.     dic.     Dez.
Thankfully all three locales just use the first three letters. Note that in
Spanish you would also need to add such a genitive "de", but of course
nobody wants to see it when printing short dates to a terminal.
While I see a benefit in having the correct abbreviations *somewhere* in
the locale, I don't think they should be used in tools like ls by
default. The output should IMHO not be longer than --time-style=long-iso
or --full-time.
> > de ma (should be "de mar" at least)
> > d’abr (correct)
> > de ma (should be "de mai" at least)
> > de ju (should be "de jun" at least)
> > de ju (should be "de jul" at least)
I don't speak Catalan, but I can't believe that "de jun" is a correct
abbreviation following the language rules.
> > Increasing the value of MAX_MON_WIDTH to 6 characters will fix
> > the problem. The location of the constant is here: [4]
> >
> > Although it has been also suggested in the same bug report that
> > there should be no additional limit for the month length.
> >
> > This bug may be related with the coreutils bug #29377. [5]
> >
> > Regards,
> >
> > Rafal Luzynski
> >
> >
> > [1] https://sourceware.org/bugzilla/show_bug.cgi?id=10871
> > [2] https://sourceware.org/bugzilla/show_bug.cgi?id=22848
> > [3] https://sourceware.org/bugzilla/show_bug.cgi?id=22848#c6
> > [4] http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ls.c#n1099
> > [5] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29377
>
> Thanks for the careful analysis.
>
> 5 was chosen as a max width for abmon
> as that was seen to be unambiguous and
> also truncate overly long abbreviations.
>
> One can browse the abbreviations by length using:
>
> locale -a | grep utf8 |
> while read l; do LC_ALL=$l locale abmon; done |
> tr ';' '\n' | sort -u | grep '.\{5,\}' |
> while read mon; do
> printf '%02d %s\n' "$(echo "$mon" | wc -L)" "$mon"
> done |
> sort -n | less
>
> That shows a couple of existing issues with the limit of 5.
> ln_CD.utf8 (Democratic Republic of the Congo) needs a length of 7 to
> be unambiguous, while Arabic needs 12!
> I don't remember arabic being so long at the time I implemented
> the alignment/truncation in ls (9 years ago), but we should probably
> expand to account for that.
>
> $ LC_ALL=ln_CD.utf8 locale abmon
> sánzá1.;sánzá2.;sánzá3.;sánzá4.;sánzá5.;sánzá6.;sánzá7.;sánzá8.;sánzá9.;sánz10.;sánzá11.;sánzá12.
>
> $ LC_ALL=ar_SY.utf8 locale abmon | tr ';' '\n'
> كانون الثاني
> شباط
> آذار
> نيسان
> نوار
> حزيران
> تموز
> آب
> أيلول
> تشرين الأول
> تشرين الثاني
> كانون الأول
>
> Given the increase in supported size should only impact relatively
> few languages it probably makes sense to increase to 12. The attached
> does that and also augments the test to find ambiguous cases.
>
> cheers,
> Pádraig
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 14 Apr 2018 11:24:04 GMT)
This bug report was last modified 7 years and 67 days ago.