GNU bug report logs -
#6020
coreutils-8.x: a simple feature enhancement, and how to do it
Previous Next
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 6020 in the body.
You can then email your comments to 6020 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Sat, 24 Apr 2010 01:31:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
"Nelson H. F. Beebe" <beebe <at> math.utah.edu>
:
New bug report received and forwarded. Copy sent to
bug-coreutils <at> gnu.org
.
(Sat, 24 Apr 2010 01:31:01 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
In 1981, 29 years ago, Intel introduced the 8087 floating-point
coprocessor that implemented an early draft of the 1985 IEEE 754
standard for binary floating-point arithmetic. That chip, and all
subsequent Intel IA-32, IA-64, EM64T, and AMD AMD64 (aka x86_64)
architectures provide three floating-point formats in hardware:
32-bit 24-bit significand, number range ~= 1.4e-45 .. 3.40e38,
roughly 7 decimal digits
C type float
64-bit 53-bit significand, number range ~= 4.94e-324 .. 1.80e308
roughly 16 decimal digits
C type double
80-bit (variously stored in 10, 12, or 16-byte memory blocks)
64-bit significand, number range ~= 3.64e-4951 .. 1.19e+4932
roughly 19 decimal digits
C type long double
Several other CPU platforms provide a 128-bit format instead of the
80-bit format, with these properties:
128-bit 113-bit significand, number range ~= 3.64e-4951 .. 1.19e+4932,
roughly 34 decimal digits
C type long double
In 2009, the IEEE 754 Standard was revised to include the above, plus
decimal arithmetic, the latter with these properties:
32-bit 7 digits, number range 1e-101 .. 9.999_999e+96
64-bit 16 digits, number range 1e-398 .. 9.999_999_999_999_999e+384
128-bit 34 digits, number range 1e-6176 .. 9.999_999_999_999_999_999_999_999_999_999_999e+6144
At present, up to version 8.5, coreutils uses only type double in its
implementation of the -g sort-ordering option. The result is that it
is unable to correctly sort files that use the entire number range of
IEEE 754 binary arithmetic; indeed, the double format covers only
about 6% of the possible binary range, and 5% of the decimal range.
Please extend the next version of coreutils to use "long double"
instead of "double" in this operation. Here is a patch that worked
for one recent coreutils release:
*** src/sort.c.~1~ Sun Jan 3 10:06:20 2010
--- src/sort.c Mon Jan 18 08:24:18 2010
***************
*** 1792,1799 ****
--- 1792,1805 ----
char *ea;
char *eb;
+
+ #if 0
double a = strtod (sa, &ea);
double b = strtod (sb, &eb);
+ #else
+ long double a = strtold (sa, &ea);
+ long double b = strtold (sb, &eb);
+ #endif
/* Put conversion errors at the start of the collating sequence. */
if (sa == ea)
The "long double" type is required by both C89 and C99, but the
strtold() function appeared first in C99 (although many vendors
supplied it before then). If strtold() is absent, then
"long double x; if (sscanf(s, "%Lg", &x) == 1) {...}" is often
a reasonable replacement.
However, note that some aberrant systems implement "long double" as
"double" (e.g., DEC Alpha OSF/1 4.x, Minix, and most *BSD
distributions), and some implement it in doubled-double format, which
increases the precision, but leaves the range at that of double.
Examples of the latter include Apple Mac OS X on PowerPC, IBM AIX on
PowerPC, and SGI IRIX MIPS.
I suggest a configure-time check for strtold(), and if that works,
then use "long double" in sort.c.
-------------------------------------------------------------------------------
- Nelson H. F. Beebe Tel: +1 801 581 5254 -
- University of Utah FAX: +1 801 581 4148 -
- Department of Mathematics, 110 LCB Internet e-mail: beebe <at> math.utah.edu -
- 155 S 1400 E RM 233 beebe <at> acm.org beebe <at> computer.org -
- Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Wed, 28 Apr 2010 11:11:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 6020 <at> debbugs.gnu.org (full text, mbox):
On 24/04/10 02:30, Nelson H. F. Beebe wrote:
[snipped very useful floating point info]
> At present, up to version 8.5, coreutils uses only type double in its
> implementation of the -g sort-ordering option. The result is that it
> is unable to correctly sort files that use the entire number range of
> IEEE 754 binary arithmetic; indeed, the double format covers only
> about 6% of the possible binary range, and 5% of the decimal range.
This should do it.
Using long double has no impact on performance on my pentium-m linux laptop.
Note I tried converting to use xstrtold(), but that added about 25% overhead :(
diff --git a/src/sort.c b/src/sort.c
index 6d47b79..a815244 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1855,10 +1855,16 @@ general_numcompare (const char *sa, const char *sb)
/* FIXME: maybe add option to try expensive FP conversion
only if A and B can't be compared more cheaply/accurately. */
+#if HAVE_C99_STRTOLD /* provided by c-strtold module. */
+# define STRTOD strtold
+#else
+# define STRTOD strtod
+#endif
+
char *ea;
char *eb;
- double a = strtod (sa, &ea);
- double b = strtod (sb, &eb);
+ long double a = STRTOD (sa, &ea);
+ long double b = STRTOD (sb, &eb);
/* Put conversion errors at the start of the collating sequence. */
if (sa == ea)
> However, note that some aberrant systems implement "long double" as
> "double" (e.g., DEC Alpha OSF/1 4.x, Minix, and most *BSD
> distributions), and some implement it in doubled-double format, which
> increases the precision, but leaves the range at that of double.
> Examples of the latter include Apple Mac OS X on PowerPC, IBM AIX on
> PowerPC, and SGI IRIX MIPS.
I was wondering about a test for this:
$ printf "3.64e-4951\n3.63e-4950\n" | ./sort -g
3.64e-4951
3.63e-4950
However I'm worried that will fail because of what you mention above.
I probably need to add LDBL_{MIN,MAX} to getlimits.
cheers,
Pádraig.
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Wed, 28 Apr 2010 23:41:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 6020 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 28/04/10 21:01, Nelson H. F. Beebe wrote:
>>> ...
>>> I was wondering about a test for this:
>>>
>>> $ printf "3.64e-4951\n3.63e-4950\n" | ./sort -g
>>> 3.64e-4951
>>> 3.63e-4950
>>>
>>> However I'm worried that will fail because of what you mention above.
>>> I probably need to add LDBL_{MIN,MAX} to getlimits.
>>> ...
>
> Here is what I see with the version that I patched some time ago
> according to the proposal posted last week:
>
> % printf "3.64e-4951\n3.63e-4950\n" | sort-8.4 -g
> 3.64e-4951
> 3.63e-4950
>
> Why should getlimits() even be used? Surely it is enough to ask
> strtold() to just return its best answer for the conversion of a
> human-readable number string to (we hope the nearest) machine number.
getlimits is just used in our tests.
Because of the implicit rounding in strtold I'd need something
independent of `sort` to output LDBL_MIN and LDBL_MAX to verify that
sort is actually using long double if available on the platform.
> You should not worry about execution time; there is a current huge
> hole in the coverage of floating-point numbers with coreutil's "sort
> -g" option that badly needs repair. Getting the right answer a bit
> more slowly is much more important than getting the wrong answer fast.
I'm always wary of performance.
I was just pointing out that there is no slow down on my system.
I'll push the attached sometime tomorrow.
cheers,
Pádraig
[sort-long-double.diff (text/x-patch, attachment)]
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Thu, 29 Apr 2010 06:30:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 6020 <at> debbugs.gnu.org (full text, mbox):
Hi,
two nit-picks regarding the test script below:
On Thu, Apr 29, 2010 at 12:39:46AM +0100, Pádraig Brady wrote:
> [...]
> @@ -0,0 +1,51 @@
> +#!/bin/sh
> +# Ensure sort -g sorts floating point limits correctly
> [...]
> +if test "$VERBOSE" = yes; then
> + set -x
> + mv --version
^^
sort
would be nicer.
> +fi
> +
> +. $srcdir/test-lib.sh
> +getlimits_
> +
> +# See if sort should be using long doubles
> +grep '^#define HAVE_C99_STRTOLD 1' $CONFIG_HEADER > /dev/null ||
^^^^^^^^^^^
-q
would be more concise.
Regards,
Erik
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Thu, 29 Apr 2010 08:35:01 GMT)
Full text and
rfc822 format available.
Message #17 received at 6020 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 29/04/10 07:26, Erik Auerswald wrote:
> Hi,
>
> two nit-picks regarding the test script below:
>
> On Thu, Apr 29, 2010 at 12:39:46AM +0100, Pádraig Brady wrote:
>> [...]
>> @@ -0,0 +1,51 @@
>> +#!/bin/sh
>> +# Ensure sort -g sorts floating point limits correctly
>> [...]
>> +if test "$VERBOSE" = yes; then
>> + set -x
>> + mv --version
> ^^
> sort
> would be nicer.
Heh, I noticed that :)
>> +# See if sort should be using long doubles
>> +grep '^#define HAVE_C99_STRTOLD 1' $CONFIG_HEADER > /dev/null ||
> ^^^^^^^^^^^
> -q
> would be more concise.
and efficient (it exits on first match).
However, even though POSIX specifies -q, it's not portable.
Solaris' grep for example, does not support -q.
We'll start using it at some stage though.
My latest patch is attached which corrects the info docs
to mention strtold() not strtod().
Also the test is updated to exclude floats in non standard formats
just in case, and also checks the fr_FR locale where the RADIXCHAR is ','
cheers,
Pádraig.
[sort-long-double.diff (text/x-patch, attachment)]
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Thu, 29 Apr 2010 09:05:02 GMT)
Full text and
rfc822 format available.
Message #20 received at 6020 <at> debbugs.gnu.org (full text, mbox):
Pádraig Brady wrote:
> On 29/04/10 07:26, Erik Auerswald wrote:
>> Hi,
>>
>> two nit-picks regarding the test script below:
>>
>> On Thu, Apr 29, 2010 at 12:39:46AM +0100, Pádraig Brady wrote:
>>> [...]
>>> @@ -0,0 +1,51 @@
>>> +#!/bin/sh
>>> +# Ensure sort -g sorts floating point limits correctly
>>> [...]
>>> +if test "$VERBOSE" = yes; then
>>> + set -x
>>> + mv --version
>> ^^
>> sort
>> would be nicer.
>
> Heh, I noticed that :)
>
>>> +# See if sort should be using long doubles
>>> +grep '^#define HAVE_C99_STRTOLD 1' $CONFIG_HEADER > /dev/null ||
>> ^^^^^^^^^^^
>> -q
>> would be more concise.
>
> and efficient (it exits on first match).
When efficiency is important, sometimes I've used -l. (e.g., in maint.mk)
That is portable and also makes grep stop searching upon first match.
However, you'd probably still want to redirect its stdout.
Useful when grep would otherwise search much more or
generate much more output.
> However, even though POSIX specifies -q, it's not portable.
> Solaris' grep for example, does not support -q.
> We'll start using it at some stage though.
>
> My latest patch is attached which corrects the info docs
> to mention strtold() not strtod().
Thanks for writing that.
bug marked as fixed in version 8.6, send any further explanations to "Nelson H. F. Beebe" <beebe <at> math.utah.edu>
Request was from
Pádraig Brady <P <at> draigBrady.com>
to
control <at> debbugs.gnu.org
.
(Thu, 29 Apr 2010 23:51:02 GMT)
Full text and
rfc822 format available.
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Fri, 30 Apr 2010 18:23:01 GMT)
Full text and
rfc822 format available.
Message #25 received at 6020 <at> debbugs.gnu.org (full text, mbox):
Pádraig Brady <P <at> draigBrady.com> writes:
> +#if HAVE_C99_STRTOLD /* provided by c-strtold module. */
> +# define STRTOD strtold
> +#else
> +# define STRTOD strtod
> +#endif
> +
> char *ea;
> char *eb;
> - double a = strtod (sa, &ea);
> - double b = strtod (sb, &eb);
> + long double a = STRTOD (sa, &ea);
> + long double b = STRTOD (sb, &eb);
This could cause performance problems on machines that have slow
long-double operations (implemented via traps, say) and that lack
strtold.
How about doing something like this instead? It tries to move
as much of the mess as possible to the #if part.
#if HAVE_C99_STRTOLD
# define long_double long double
#else
# define long_double double
# undef strtold
# define strtold strtod
#endif
...
long_double a = strtold (sa, &ea);
long_double a = strtold (sa, &ea);
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Fri, 30 Apr 2010 18:37:02 GMT)
Full text and
rfc822 format available.
Message #28 received at 6020 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert writes about the proposed use of long double instead of
double in sort's -g option:
>> ...
>> This could cause performance problems on machines that have slow
>> long-double operations (implemented via traps, say) and that lack
>> strtold.
>> ...
Because using double instead of long double cripples sorting of
numerical data by drastically reducing the number range, I would much
rather pay a premium in run time to get the right answer, rather than
a useless wrong answer as GNU sort currently does.
If you folks want to consider alternatives, we could have something
like this:
-g, --general-numeric-sort compare according to general numerical
value in type double
-gg same as before, but with type long double
-ggg same as before, but with general multiple-precision floating
arithmetic using the gmp library
However, I'd much prefer a single option, and correct output. Most
people don't need floating-point comparisons in their sorts, but for
those of us who do, correctness trumps speed every time.
There should not be a problem in using -lgmp on modern systems,
because (a) it is very portable, and (b) it is required by all gcc-4.x
installations (and we reached the first release of the gcc-4.6 family
on 16-Apr-2010). Indeed, -lgmp has been tested for back to at least
coreutils-7.0.
-------------------------------------------------------------------------------
- Nelson H. F. Beebe Tel: +1 801 581 5254 -
- University of Utah FAX: +1 801 581 4148 -
- Department of Mathematics, 110 LCB Internet e-mail: beebe <at> math.utah.edu -
- 155 S 1400 E RM 233 beebe <at> acm.org beebe <at> computer.org -
- Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Fri, 30 Apr 2010 18:58:02 GMT)
Full text and
rfc822 format available.
Message #31 received at 6020 <at> debbugs.gnu.org (full text, mbox):
"Nelson H. F. Beebe" <beebe <at> math.utah.edu> writes:
> Because using double instead of long double cripples sorting of
> numerical data by drastically reducing the number range, I would much
> rather pay a premium in run time to get the right answer, rather than
> a useless wrong answer as GNU sort currently does.
Yes, that's right. But my note was not about that. It was about
improving performance while not affecting behavior. If all we have is
strtod, there's no point doing a long-double compare.
> -g, --general-numeric-sort compare according to general numerical
> value in type double
> -gg same as before, but with type long double
> -ggg same as before, but with general multiple-precision floating
> arithmetic using the gmp library
Even better, just compare the numbers as text, without converting them
to internal format. This is already done for -n, and would be much
faster than any of the above approaches. If done right it would be just
as accurate as gmp. (It wouldn't be trivial, though.)
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Fri, 30 Apr 2010 22:13:01 GMT)
Full text and
rfc822 format available.
Message #34 received at 6020 <at> debbugs.gnu.org (full text, mbox):
>> If all we have is strtod, there's no point doing a long-double compare.
True, UNLESS sscanf("%lg", &x) works: I've had to use that alternative
in the past.
>> ...
>> Even better, just compare the numbers as text, without converting them
>> to internal format. This is already done for -n, and would be much
>> faster than any of the above approaches. If done right it would be just
>> as accurate as gmp. (It wouldn't be trivial, though.)
>> ...
Yes, that would work, but care needs to be taken to handle the formats
that strtold() supports (decimal 1.2345e+678 and C99 hexadecimal
0x1.f3dp-2047), and that essentially mentions having to convert from
hexadecimal to decimal, and effectively deal with many of the same
problems that strtold() handles. The code also must deal with issues
of leading and trailing zeros in significand and exponent.
Quick: how are these ordered?
1.00000000000000000000000000000000000000000000000000e-50
0.00000000000000000000000000000000000000000000000001e+00
Answer: they are equal.
And these:
+0.00000000000000000000000000000000000000000000000000e-50
-0.00000000000000000000000000000000000000000000000000e+00
Answer: both are zero, but the negative one should precede the
positive one in sort (on the grounds that negative zero really means
tiny negative that was too small to represent).
So, indeed, a string-based comparison is not trivial.
-------------------------------------------------------------------------------
- Nelson H. F. Beebe Tel: +1 801 581 5254 -
- University of Utah FAX: +1 801 581 4148 -
- Department of Mathematics, 110 LCB Internet e-mail: beebe <at> math.utah.edu -
- 155 S 1400 E RM 233 beebe <at> acm.org beebe <at> computer.org -
- Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------
Information forwarded
to
owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org
:
bug#6020
; Package
coreutils
.
(Fri, 30 Apr 2010 22:35:01 GMT)
Full text and
rfc822 format available.
Message #37 received at 6020 <at> debbugs.gnu.org (full text, mbox):
On 30/04/10 19:16, Paul Eggert wrote:
> Pádraig Brady <P <at> draigBrady.com> writes:
>
>> +#if HAVE_C99_STRTOLD /* provided by c-strtold module. */
>> +# define STRTOD strtold
>> +#else
>> +# define STRTOD strtod
>> +#endif
>> +
>> char *ea;
>> char *eb;
>> - double a = strtod (sa, &ea);
>> - double b = strtod (sb, &eb);
>> + long double a = STRTOD (sa, &ea);
>> + long double b = STRTOD (sb, &eb);
>
> This could cause performance problems on machines that have slow
> long-double operations (implemented via traps, say) and that lack
> strtold.
An unusual combination, buy you're right.
I'll push this soon in your name.
thanks for the review,
Pádraig.
commit 1a2afe2adde1f4e864ac098e5d0ede6fcc6b46db
Author: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Fri Apr 30 23:23:38 2010 +0100
sort: use long doubles only when effective
* src/sort.c (general_numcompare): Don't use long double if strtold
is not available, as it may introduce needless overhead.
diff --git a/src/sort.c b/src/sort.c
index a815244..54b97e2 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1856,15 +1856,17 @@ general_numcompare (const char *sa, const char *sb)
only if A and B can't be compared more cheaply/accurately. */
#if HAVE_C99_STRTOLD /* provided by c-strtold module. */
-# define STRTOD strtold
+# define long_double long double
#else
-# define STRTOD strtod
+# define long_double double
+# undef strtold
+# define strtold strtod
#endif
char *ea;
char *eb;
- long double a = STRTOD (sa, &ea);
- long double b = STRTOD (sb, &eb);
+ long_double a = strtold (sa, &ea);
+ long_double b = strtold (sb, &eb);
/* Put conversion errors at the start of the collating sequence. */
if (sa == ea)
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Sat, 29 May 2010 11:24:03 GMT)
Full text and
rfc822 format available.
This bug report was last modified 15 years and 83 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.