GNU bug report logs - #6789
propose renaming gnulib memxfrm to amemxfrm (naming collision with coreutils)

Reported by: Paul Eggert <eggert <at> CS.UCLA.EDU>

Date: Tue, 3 Aug 2010 19:47:01 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 6789 in the body.
You can then email your comments to 6789 AT debbugs.gnu.org in the normal way.

Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Tue, 03 Aug 2010 19:47:01 GMT)

Acknowledgement sent to Paul Eggert <eggert <at> CS.UCLA.EDU>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 03 Aug 2010 19:47:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: Bug-gnulib <bug-gnulib <at> gnu.org>
Cc: Bruno Haible <bruno <at> clisp.org>, Bug-coreutils <bug-coreutils <at> gnu.org>
Subject: propose renaming gnulib memxfrm to amemxfrm (naming collision with
	coreutils)
Date: Tue, 03 Aug 2010 12:46:21 -0700

On 2009-03-07 Bruno Haible wrote:

> Paul Eggert has written the module 'memcoll', which generalizes the 'strcoll'
> function to work on strings with embedded NULs.

> Here is the generalization of 'strxfrm' to strings with embedded NUL bytes.

Sorry, I didn't really notice this email until just now.  As it happens,
coreutils has had an memxfrm implementation since 2006, which
it never exported to gnulib.  The coreutils memxfrm is closer to how
strxfrm behaves, in that it does not allocate memory: it relies on the
caller to do memory allocation.  The signatures differ as follows:

  // coreutils returns number of bytes that were translated,
  // (or would be translated if there were enough room).
  // It also sets errno on error.
  size_t memxfrm (char *restrict dst, size_t dstsize,
		  char *restrict src, size_t srcsize);

  // gnulib returns pointer to destination, which is possibly-different if
  // the destination wasn't large enough.  It updates *DSTSIZEPTR to
  // the newly allocated size, if it allocated storage.  It returns
  // NULL (setting errno) on error.
  char *memxfrm (char *src, size_t srcsize, char *dst, size_t *dstsizeptr);

For coreutils, the coreutils interface is more memory-efficient,
because malloc is invoked at most once when comparing two lines.  If
the small buffer on the stack isn't large enough to hold the
translated output for both strings, the two calls to memxfrm will tell
sort.c exactly how big the buffer should be, and it can invoke malloc
just once and then invoke memxfrm again (twice) to successfully do the
translation.
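
A minimal sketch of that calling pattern, for illustration only. It uses
the standard strxfrm as a stand-in for the coreutils memxfrm (both report
the size they need when the buffer is too small), so it handles only
NUL-terminated strings. The name xfrm_compare, the 64-byte stack buffer,
and the `failed` out-parameter are assumptions, not code from sort.c;
errno checking and size overflow checks are omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compare A and B by their strxfrm images.
   Returns <0, 0 or >0 like strcmp.  On malloc failure it sets
   *FAILED to 1 and returns 0.  */
int
xfrm_compare (const char *a, const char *b, int *failed)
{
  char stackbuf[64];
  char *buf = stackbuf;
  size_t bufsize = sizeof stackbuf;
  int result;

  assert (a != NULL && b != NULL && failed != NULL);
  *failed = 0;

  /* First pass: try the stack buffer.  Each call returns the length it
     needs (excluding the trailing NUL) even when the result did not fit.  */
  size_t alen = strxfrm (buf, a, bufsize);
  size_t blen = (alen < bufsize
                 ? strxfrm (buf + alen + 1, b, bufsize - alen - 1)
                 : strxfrm (NULL, b, 0));

  size_t needed = alen + blen + 2;
  if (needed > bufsize)
    {
      /* At most one malloc, sized exactly, then translate both again.  */
      buf = malloc (needed);
      if (buf == NULL)
        {
          *failed = 1;
          return 0;
        }
      strxfrm (buf, a, alen + 1);
      strxfrm (buf + alen + 1, b, blen + 1);
    }

  result = strcmp (buf, buf + alen + 1);
  if (buf != stackbuf)
    free (buf);
  return result;
}
```

The cost of this design is visible in the slow path: both strings are
translated twice whenever the stack buffer is too small.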

The gnulib interface is more convenient for applications that don't
care about this sort of memory optimization, and I expect that for
some (large) cases it is faster because it sometimes avoids translating
the same chunk twice.  So it's useful as well.
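
For contrast, a rough sketch of an allocating interface in the style
described above for gnulib. This is not the gnulib implementation: the
name xfrm_alloc is hypothetical, it wraps the standard strxfrm, and it
handles only NUL-terminated strings. It follows the contract as stated
earlier: return DST if the result fits, otherwise return newly allocated
storage and update *DSTSIZEPTR, or return NULL with errno set on error.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocating wrapper around strxfrm.  DST may be NULL,
   in which case the result is always freshly allocated.  */
char *
xfrm_alloc (const char *src, char *dst, size_t *dstsizeptr)
{
  assert (src != NULL && dstsizeptr != NULL);

  size_t avail = dst != NULL ? *dstsizeptr : 0;
  size_t len = strxfrm (dst, src, avail);
  if (len < avail)
    return dst;                 /* It fit; no allocation needed.  */

  /* Too small: allocate exactly what strxfrm asked for and retry.  */
  char *result = malloc (len + 1);
  if (result == NULL)
    {
      errno = ENOMEM;
      return NULL;
    }
  strxfrm (result, src, len + 1);
  *dstsizeptr = len + 1;
  return result;
}
```

The caller needs only one call and one check for NULL, and frees the
result when it differs from the buffer it passed in.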

However, the name "memxfrm" isn't well-chosen for the gnulib interface.
As a general rule, the mem* functions do not allocate memory, and
it's confusing that memxfrm is an exception.

So I propose that the gnulib memxfrm be renamed to something else, to
reflect the fact that it allocates memory.  I suggest the name
"amemxfrm", as a leading "a" is the usual convention for variants that
allocate memory (e.g., "asprintf").

I guess the coreutils memxfrm could also be migrated into gnulib,
afterwards.

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Tue, 03 Aug 2010 23:34:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Bruno Haible <bruno <at> clisp.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Bug-coreutils <bug-coreutils <at> gnu.org>, Bug-gnulib <bug-gnulib <at> gnu.org>
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Wed, 4 Aug 2010 01:33:11 +0200

Hi Paul,

> > Here is the generalization of 'strxfrm' to strings with embedded NUL bytes.
> 
> Sorry, I didn't really notice this email until just now.  As it happens,
> coreutils has had an memxfrm implementation since 2006, which
> it never exported to gnulib.

And I'm sorry that I overlooked yours in coreutils when I contributed
memxfrm to gnulib in 2009.

> The coreutils memxfrm is closer to how 
> strxfrm behaves, in that it does not allocate memory: it relies on the
> caller to do memory allocation.  The signatures differ as follows:
> 
>   // coreutils returns number of bytes that were translated,
>   // (or would be translated if there were enough room).
>   // It also sets errno on error.
>   size_t memxfrm (char *restrict dst, size_t dstsize,
> 		  char *restrict src, size_t srcsize);
> 
>   // gnulib returns pointer to destination, which is possibly-different if
>   // the destination wasn't large enough.  It updates *DSTSIZEPTR to
>   // the newly allocated size, if it allocated storage.  It returns
>   // NULL (setting errno) on error.
>   char *memxfrm (char *src, size_t srcsize, char *dst, size_t *dstsizeptr);

Indeed the algorithm is virtually identical, and the only difference is
the calling convention.

> So I propose that the gnulib memxfrm be renamed to something else, to
> reflect the fact that it allocates memory.  I suggest the name
> "amemxfrm", as a leading "a" is the usual convention for variants that
> allocate memory (e.g., "asprintf").
> 
> I guess the coreutils memxfrm could also be migrated into gnulib,
> afterwards.

This approach would make sense if the two functions had different
functionality. But they effectively do the same thing, only with different
calling conventions. Therefore I believe gnulib should have only one
of these functions: either the better of the two, or one that
combines the best properties of both.

> For coreutils, the coreutils interface is more memory-efficient,
> because malloc is invoked at most once when comparing two lines.  If
> the small buffer on the stack isn't large enough to hold the
> translated output for both strings, the two calls to memxfrm will tell
> sort.c exactly how big the buffer should be, and it can invoke malloc
> just once and then invoke memxfrm again (twice) to successfully do the
> translation.
> 
> The gnulib interface is more convenient for applications that don't
> care about this sort of memory optimization, and I expect that for
> some (large) cases it is faster because it sometimes avoids translating
> the same chunk twice.  So it's useful as well.

Since you want to let the two functions compete on performance, find
attached a program that exercises both with a small string 3 times,
then with a large string 3 times, at 1000 calls per round.

Compiled like this:
$ gcc -O2 -Wall coreutils-memxfrm.c gnulib-memxfrm.c compare.c -I. -Drestrict=
I observe timings like this:
Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036003
Time for gnulib_memxfrm: 0,032002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 2,65217
Time for coreutils_memxfrm: 3,45622
Time for gnulib_memxfrm: 1,97612
Time for coreutils_memxfrm: 3,42021
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,42021

This means that when the stack buffer is sufficient (no mallocs needed on
either side), the timings are the same: 36 μsec per call on each side.

But when the stack buffer is not sufficient, then the use of coreutils memxfrm
is 30% to 70% slower than the use of gnulib memxfrm, with a difference of
700 μsec at least. You argue that the benefit of coreutils' memxfrm is that it
requires one less malloc. True, but a malloc of 40 KB is much much cheaper
than a call to memxfrm on 40 KB (think of all the locale dependent processing
that it must do). To get figures about this, I added an extra strdup + free to
the first loop in compare(). The timings are indistinguishable:

$ ./a.out 
Time for gnulib_memxfrm: 0,032002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,032002
Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036003
Time for gnulib_memxfrm: 2,18814
Time for coreutils_memxfrm: 3,41621
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,42021
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,42021

In summary, I think that gnulib memxfrm is more performant than coreutils
memxfrm. It is also easier to use: 3 lines of code for gnulib memxfrm vs.
7 lines of code for coreutils memxfrm.

I'd therefore suggest keeping the gnulib one and having coreutils start
using it (via a modified xmemxfrm wrapper).

Bruno

[compare.tar.gz (application/x-tgz, attachment)]

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Wed, 04 Aug 2010 23:22:02 GMT)

Message #11 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: Bruno Haible <bruno <at> clisp.org>
Cc: Bug-coreutils <bug-coreutils <at> gnu.org>, Bug-gnulib <bug-gnulib <at> gnu.org>
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Wed, 04 Aug 2010 16:21:28 -0700

On 08/03/10 16:33, Bruno Haible wrote:
> But when the stack buffer is not sufficient, then the use of coreutils memxfrm
> is 30% to 70% slower than the use of gnulib memxfrm, with a difference of
> 700 μsec at least.

(Ooo! Ooo! Performance measurements! I love this stuff!)

It depends on the data. In the typical case, "sort" is applied to
text data, which does not contain NUL bytes. The data in that
benchmark contained many NUL bytes. If you take the same benchmark
and uniformly replace "\0" with "\t" in compare.c, then the situation
is much different: coreutils memxfrm is about 3 times faster than
gnulib memxfrm on the larger test cases (this platform is Ubuntu
10.04, gcc 4.5.0, 2.4 GHz Pentium 4):

503-penguin $ gcc -std=gnu99 -O2 -Wall coreutils-memxfrm.c gnulib-memxfrm.c compare1.c -I.
504-penguin $ ./a.out
Time for gnulib_memxfrm: 0,032002
Time for coreutils_memxfrm: 0,028001
Time for gnulib_memxfrm: 0,024002
Time for coreutils_memxfrm: 0,024001
Time for gnulib_memxfrm: 0,036003
Time for coreutils_memxfrm: 0,032002
Time for gnulib_memxfrm: 18,2051
Time for coreutils_memxfrm: 5,48834
Time for gnulib_memxfrm: 16,045
Time for coreutils_memxfrm: 5,50034
Time for gnulib_memxfrm: 15,837
Time for coreutils_memxfrm: 5,44834

I expect that this performance glitch in gnulib memxfrm could be
improved, as it shouldn't simply double buffer sizes when they're too
small, as at that point it already knows what the final buffer size
should be. Doing this should bring up gnulib memxfrm to be as fast as
coreutils xmemxfrm for this benchmark. Also, I agree that gnulib
memxfrm is faster in some important cases. Still, gnulib memxfrm is
problematic, because it insists on managing memory itself.

Come to think of it, looking at gnulib memxfrm gave me an idea
to improve the performance of GNU sort by bypassing the need for an
memxfrm-like function entirely. I pushed a patch to do that at
<http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=2b49b140cc13cf36ec5ee5acaca5ac7bfeed6366>.

This avoids any potential naming collision for now. The point
remains, though, that it's confusing that gnulib memxfrm's name begins
with "mem", as the mem* functions don't allocate memory. Would you
consider a patch that renames gnulib memxfrm to amemxfrm, or to some
other such name?

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Thu, 05 Aug 2010 00:40:02 GMT)

Message #14 received at submit <at> debbugs.gnu.org:

From: Simon Josefsson <simon <at> josefsson.org>
To: Paul Eggert <eggert <at> CS.UCLA.EDU>
Cc: Bug-coreutils <bug-coreutils <at> gnu.org>, Bug-gnulib <bug-gnulib <at> gnu.org>,
	Bruno Haible <bruno <at> clisp.org>
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Thu, 05 Aug 2010 01:44:30 +0200

Paul Eggert <eggert <at> CS.UCLA.EDU> writes:

> Come to think of it, looking at gnulib memxfrm gave me an idea
> to improve the performance of GNU sort by bypassing the need for an
> memxfrm-like function entirely.  I pushed a patch to do that at
> <http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=2b49b140cc13cf36ec5ee5acaca5ac7bfeed6366>.

I don't know this code at all, but would your approach lead to problems
if two different strings have the same MD5 hash?  MD5 is broken, and
finding collisions takes just seconds on a normal PC.  See:
http://en.wikipedia.org/wiki/MD5#Security

/Simon

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Thu, 05 Aug 2010 02:59:02 GMT)

Message #17 received at submit <at> debbugs.gnu.org:

From: Paolo Bonzini <bonzini <at> gnu.org>
To: bug-gnulib <at> gnu.org, Bug-coreutils <bug-coreutils <at> gnu.org>, 
	Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Thu, 05 Aug 2010 04:58:26 +0200

On 08/05/2010 01:44 AM, Simon Josefsson wrote:
> Paul Eggert<eggert <at> CS.UCLA.EDU>  writes:
>
>> Come to think of it, looking at gnulib memxfrm gave me an idea
>> to improve the performance of GNU sort by bypassing the need for an
>> memxfrm-like function entirely.  I pushed a patch to do that at
>> <http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=2b49b140cc13cf36ec5ee5acaca5ac7bfeed6366>.
>
> I don't know this code at all, but would your approach lead to problems
> if two different strings have the same MD5 hash?  MD5 is broken, and
> finding collisions takes just seconds on normal PC.  See:
> http://en.wikipedia.org/wiki/MD5#Security

MD5 is used simply as a kind of "random key generator", so it doesn't 
matter.  I wonder two things however:

1) why bother with memxfrm as a tie-breaker? isn't memcmp good enough?

2) maybe there's something cheaper than md5 that can be used?  For 
example you could compare a^x and b^x where x is the output of a fast 
32-bit random number generator?  It doesn't need to be cryptographic, 
I'd pick http://en.wikipedia.org/wiki/Xorshift.

Paolo

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Thu, 05 Aug 2010 23:30:02 GMT)

Message #20 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: Bug-coreutils <bug-coreutils <at> gnu.org>, bug-gnulib <at> gnu.org
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Thu, 05 Aug 2010 16:29:37 -0700

On 08/04/10 19:58, Paolo Bonzini wrote:

> MD5 is used simply as a kind of "random key generator", so it doesn't
> matter.

That depends on what one is using "sort -R" for.  If one uses it to
choose data that are relevant for cryptographic purposes, it might
matter.  (Admittedly this is unlikely.)

I'm not that familiar with cracking MD5, but I would guess that the
cracking methods are rendered ineffective by the 128-bit salt that
"sort -R" uses.  If so, then there's no real problem.

If the fact that MD5 is crackable is a problem, it'd be trivial to
substitute (say) SHA256.  However, this would slow down 'sort -R'
considerably: switching to SHA256 would slow down 'sort -R' by a factor of
2.5 on the little million-line benchmark that I just tried it on ("seq
1000000", x86-64, Xeon E5620, gcc 4.5.1).

> 1) why bother with memxfrm as a tie-breaker? isn't memcmp good enough?

If two keys K1 and K2 compare equal, their random hashes are supposed
to compare equal too.  So if memcoll(K1,K2)==0, the random hashes must
be the same.  Hence we can't just do a memcmp on K1 and K2; we need to
do a memcmp on strxfrm(K1) and strxfrm(K2).
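
A small sketch of why the hash has to be taken over the strxfrm image
rather than the raw key. This is not the sort.c code: the function name
random_sort_key is hypothetical, and a salted 64-bit FNV-1a hash stands in
for the salted MD5 that "sort -R" uses. Keys that collate as equal produce
identical strxfrm output, so hashing that output gives them identical
hashes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 64-bit FNV-1a, continuing from state H.  */
static uint64_t
fnv1a (uint64_t h, const unsigned char *data, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      h ^= data[i];
      h *= UINT64_C (0x100000001b3);
    }
  return h;
}

/* Hypothetical helper: hash the collation image of KEY together with
   SALT.  Returns 0 on success, -1 if memory allocation fails.  */
int
random_sort_key (const char *key, uint64_t salt, uint64_t *out)
{
  assert (key != NULL && out != NULL);

  /* Ask strxfrm for the size first, then transform into exact storage.  */
  size_t len = strxfrm (NULL, key, 0);
  char *xfrm = malloc (len + 1);
  if (xfrm == NULL)
    return -1;
  strxfrm (xfrm, key, len + 1);

  /* Mix in the salt first, then the transformed key.  */
  uint64_t h = fnv1a (UINT64_C (0xcbf29ce484222325),
                      (const unsigned char *) &salt, sizeof salt);
  *out = fnv1a (h, (const unsigned char *) xfrm, len);

  free (xfrm);
  return 0;
}
```

Sorting by these values and falling back to a comparison of the
transformed keys on a hash tie keeps equal keys adjacent while still
shuffling distinct ones.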

> 2) maybe there's something cheaper than md5 that can be used?  For
> example you could compare a^x and b^x where x is the output of a fast
> 32-bit random number generator?

That wouldn't be sufficiently random, even for non-cryptographic
purposes, since keys that are natively nearby would tend to sort near
to each other after being exclusive-ORed.

But I see your point: perhaps there is something faster than MD5 for
this sort of thing, and which is "secure" enough.  Perhaps the
ISAAC / ISAAC64 code that is already in GNU coreutils would work
for that?

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Fri, 06 Aug 2010 08:23:02 GMT)

Message #23 received at submit <at> debbugs.gnu.org:

From: Paolo Bonzini <bonzini <at> gnu.org>
To: Paul Eggert <eggert <at> CS.UCLA.EDU>
Cc: bug-gnulib <at> gnu.org, Bug-coreutils <bug-coreutils <at> gnu.org>
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Fri, 06 Aug 2010 10:22:50 +0200

On 08/06/2010 01:29 AM, Paul Eggert wrote:

>> 1) why bother with memxfrm as a tie-breaker? isn't memcmp good enough?
>
> If two keys K1 and K2 compare equal, their random hashes are supposed
> to compare equal too.  So if memcoll(K1,K2)==0, the random hashes must
> be the same.  Hence we can't just do a memcmp on K1 and K2; we need to
> do a memcmp on strxfrm(K1) and strxfrm(K2).

I see.  In practice, this is because "you cannot separate straße and 
strasse".

>> 2) maybe there's something cheaper than md5 that can be used?  For
>> example you could compare a^x and b^x where x is the output of a fast
>> 32-bit random number generator?
>
> That wouldn't be sufficiently random, even for non-cryptographic
> purposes, since keys that are natively nearby would tend to sort near
> to each other after being exclusive-ORed.

You're right, keys that differ only in the leading or trailing bits 
would tend to stay respectively very far and very near, though you 
cannot say anything about the order.

> But I see your point: perhaps there is something faster than MD5 for
> this sort of thing, and which is "secure" enough.  Perhaps the
> ISAAC / ISAAC64 code that is already in GNU coreutils would work
> for that?

ISAAC is a RNG, so wouldn't that have the same problem above?  You 
definitely need to use a hash function, it's just that you do not need a 
cryptographic one.

Paolo

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Fri, 06 Aug 2010 17:54:01 GMT)

Message #26 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: bug-gnulib <at> gnu.org, Bug-coreutils <bug-coreutils <at> gnu.org>
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Fri, 06 Aug 2010 10:53:55 -0700

On 08/06/10 01:22, Paolo Bonzini wrote:
> ISAAC is a RNG, so wouldn't that have the same problem above?  You
> definitely need to use a hash function, it's just that you do not need a
> cryptographic one.

I had been thinking of using ISAAC by making the key its seed, and
asking it to generate some random values, and then comparing the
random values.  Any RNG can be used (or abused :-) in this way.

I just now tried that, though, and discovered that on my million-line
benchmark the MD5 method is about 4 times faster than the ISAAC-based
method.  So that idea was not a good one.  I suppose we could try a
non-cryptographic hash function at some point.

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Sun, 08 Aug 2010 16:44:02 GMT)

Message #29 received at submit <at> debbugs.gnu.org:

From: Bruno Haible <bruno <at> clisp.org>
To: Simon Josefsson <simon <at> josefsson.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, bug-coreutils <at> gnu.org
Subject: Re: MD5 is broken
Date: Sun, 8 Aug 2010 15:26:15 +0200

Simon Josefsson wrote:
> MD5 is broken, and
> finding collisions takes just seconds on normal PC.  See:
> http://en.wikipedia.org/wiki/MD5#Security

Here is a suggested patch to improve the awareness of this issue in
coreutils.
  - The documentation of md5sum currently says "modifying a file
    so as to retain its MD5 [is] considered infeasible at the moment",
    but the research results of 2008 mentioned in
    <http://en.wikipedia.org/wiki/MD5#Security> showed how to manipulate
    a digital certificate so that the validity of its MD5 signature can
    be retained.
  - The documentation of md5sum says "For more secure hashes, consider
    using SHA-1 or SHA-2." Well, researchers have already discovered
    security weaknesses in <http://en.wikipedia.org/wiki/SHA-1>, therefore
    it does not seem adequate to recommend SHA-1 any more.
  - The 'md5sum --help' output and, with it, the manual page are silent
    about the security problems.

Here is a proposed patch to make this clearer.


2010-08-08  Bruno Haible  <bruno <at> clisp.org>

	md5sum: Put more emphasis on security weaknesses.
	* doc/coreutils.texi (md5sum invocation): Mention currently known
	security problems. Don't recommend SHA-1 as alternative.
	* src/md5sum.c (usage): Mention that MD5 is not secure. Recommend
	SHA-2 as alternative.
	Reported by Simon Josefsson <simon <at> josefsson.org>.

--- doc/coreutils.texi.orig	Sun Aug  8 15:13:06 2010
+++ doc/coreutils.texi	Sun Aug  8 15:10:26 2010
@@ -3414,14 +3414,13 @@
 Note: The MD5 digest is more reliable than a simple CRC (provided by
 the @command{cksum} command) for detecting accidental file corruption,
 as the chances of accidentally having two files with identical MD5
-are vanishingly small.  However, it should not be considered truly
-secure against malicious tampering: although finding a file with a
-given MD5 fingerprint, or modifying a file so as to retain its MD5 are
-considered infeasible at the moment, it is known how to produce
-different files with identical MD5 (a ``collision''), something which
-can be a security issue in certain contexts.  For more secure hashes,
-consider using SHA-1 or SHA-2.  @xref{sha1sum invocation}, and
-@ref{sha2 utilities}.
+are vanishingly small.  However, it should not be considered secure
+against malicious tampering: although finding a file with a given MD5
+fingerprint is considered infeasible at the moment, it is known how
+to modify certain files, including digital certificates, so that they
+appear valid when signed with an MD5 digest.  (See
+@url{http://en.wikipedia.org/wiki/MD5#Security} for details.)
+For more secure hashes, consider using SHA-2.  @xref{sha2 utilities}.
 
 If a @var{file} is specified as @samp{-} or if no files are given
 @command{md5sum} computes the checksum for the standard input.
--- src/md5sum.c.orig	Sun Aug  8 15:13:06 2010
+++ src/md5sum.c	Sun Aug  8 14:48:57 2010
@@ -196,6 +196,15 @@
 a line with checksum, a character indicating type (`*' for binary, ` ' for\n\
 text), and name for each FILE.\n"),
               DIGEST_REFERENCE);
+#if HASH_ALGO_MD5
+      printf (_("\
+\n\
+The MD5 algorithm should not be used any more for security related purposes,\n\
+see <%s>.\n\
+Instead, better use an SHA-2 algorithm, implemented in the programs\n\
+sha224sum, sha256sum, sha384sum, sha512sum.\n"),
+              "http://en.wikipedia.org/wiki/MD5#Security");
+#endif
       emit_ancillary_info ();
     }

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Sun, 08 Aug 2010 16:47:01 GMT)

Message #32 received at submit <at> debbugs.gnu.org:

From: Bruno Haible <bruno <at> clisp.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Bug-coreutils <bug-coreutils <at> gnu.org>, Bug-gnulib <bug-gnulib <at> gnu.org>
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Sun, 8 Aug 2010 12:31:29 +0200

Hi Paul,

> (Ooo!  Ooo!  Performance measurements!  I love this stuff!)

Me too :-)

> It depends on the data.  In the typical case, "sort" is applied to
> text data, which does not contain NUL bytes.  The data in that
> benchmark contained many NUL bytes.  If you take the same benchmark
> and uniformly replace "\0" with "\t" in compare.c, then the situation
> is much different: coreutils memxfrm is about 3 times faster than
> gnulib memxfrm on the larger test cases

Indeed. By changing the compare.c program to try
  - first, a small string, 3 times,
  - then, a 40 KB string with \t separators, 3 times,
  - then, the same string with \0 separators, 3 times,
I confirm your timings:

Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,032002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,036003
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 10,9647
Time for coreutils_memxfrm: 3,54422
Time for gnulib_memxfrm: 10,4247
Time for coreutils_memxfrm: 3,54422
Time for gnulib_memxfrm: 10,4407
Time for coreutils_memxfrm: 3,54422
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,42021
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,41621
Time for gnulib_memxfrm: 1,98412
Time for coreutils_memxfrm: 3,41621

The reason is that gnulib_memxfrm doubles the allocated memory size, from
4 KB to 8 KB to 16 KB etc., ignoring the expected result size of 108 KB that
strxfrm returned. After this is fixed, I get better timings.
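
To make the cost of blind doubling concrete, here is a small model of the
two retry policies. It is only a sketch: the function names are made up,
and it counts transform attempts for a single chunk rather than calling
strxfrm. An attempt succeeds when the needed length is smaller than the
buffer, the same "k >= allocated" test a strxfrm caller uses.

```c
#include <assert.h>
#include <stddef.h>

/* Attempts made when the buffer starts at ALLOCATED bytes and doubles
   after every failure.  NEEDED excludes the trailing NUL.  */
unsigned int
attempts_when_doubling (size_t allocated, size_t needed)
{
  assert (allocated > 0);
  unsigned int attempts = 1;
  while (needed >= allocated)
    {
      allocated *= 2;
      attempts++;
    }
  return attempts;
}

/* Attempts made when a failed call's return value is used to size the
   buffer exactly: at most one retry.  */
unsigned int
attempts_when_sizing_exactly (size_t allocated, size_t needed)
{
  assert (allocated > 0);
  return needed >= allocated ? 2 : 1;
}
```

With a 4 KB starting buffer and a 108 KB result, doubling goes through
4, 8, 16, 32, 64 and 128 KB, which is six attempts, while exact sizing
needs two.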

> I expect that this performance glitch in gnulib memxfrm could be
> improved, as it shouldn't simply double buffer sizes when they're too
> small, as at that point it already knows what the final buffer size
> should be.  Doing this should bring up gnulib memxfrm to be as fast as
> coreutils xmemxfrm for this benchmark.

Yes:

Without strdup:

Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036003
Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 4,29627
Time for coreutils_memxfrm: 3,55222
Time for gnulib_memxfrm: 3,54422
Time for coreutils_memxfrm: 3,55222
Time for gnulib_memxfrm: 3,54022
Time for coreutils_memxfrm: 3,55222
Time for gnulib_memxfrm: 1,98412
Time for coreutils_memxfrm: 3,42421
Time for gnulib_memxfrm: 1,98412
Time for coreutils_memxfrm: 3,42421
Time for gnulib_memxfrm: 1,98812
Time for coreutils_memxfrm: 3,42021

With strdup:

Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,032002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,036003
Time for coreutils_memxfrm: 0,032002
Time for gnulib_memxfrm: 4,39227
Time for coreutils_memxfrm: 3,54822
Time for gnulib_memxfrm: 3,62823
Time for coreutils_memxfrm: 3,54822
Time for gnulib_memxfrm: 3,64023
Time for coreutils_memxfrm: 3,54822
Time for gnulib_memxfrm: 1,98412
Time for coreutils_memxfrm: 3,41621
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,42021
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,41621

So my estimate of the strdup overhead was incorrect: it is noticeable.
In the second of the three cases, it is about 2.5% of the 3.55 seconds.
Each call in that case runs strxfrm twice, so a malloc + memmove costs
about 5% of one strxfrm call.

But this means that reducing the average number of strxfrm calls from 2 to 1,
at the cost of some more malloc calls, will be a speed-up. This is what I did
in the patch below, and now gnulib_memxfrm is consistently the winner:

Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,032002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,036003
Time for coreutils_memxfrm: 0,032002
Time for gnulib_memxfrm: 2,90418
Time for coreutils_memxfrm: 3,54022
Time for gnulib_memxfrm: 1,97212
Time for coreutils_memxfrm: 3,54022
Time for gnulib_memxfrm: 1,99212
Time for coreutils_memxfrm: 3,54022
Time for gnulib_memxfrm: 1,84011
Time for coreutils_memxfrm: 3,41621
Time for gnulib_memxfrm: 1,82811
Time for coreutils_memxfrm: 3,41621
Time for gnulib_memxfrm: 1,83611
Time for coreutils_memxfrm: 3,41621

> Still, gnulib memxfrm is
> problematic, because it insists on managing memory itself.

But by doing so it now takes about 44% less time than coreutils_memxfrm
on long strings, with or without NULs.

Starting from a certain complexity, doing malloc as part of the API is
a win. Only people who do some kind of embedded systems programming
need to be able to allocate memory statically. For example,
GNU libiconv now has a function iconv_open_into, which is like iconv_open
without memory allocation. But very, very few people need that.

Even the most basic functions in libunistring (unistr.in.h: u8_to_u16,
u8_to_u32, ..., uniconv.in.h: u8_conv_from_encoding, u8_conv_to_encoding,
etc.) rely on implicit dynamic memory allocation. No one has complained
about this.

> The point
> remains, though, that it's confusing that gnulib memxfrm's name begins
> with "mem", as the mem* functions don't allocate memory.  Would you
> consider a patch that renames gnulib memxfrm to amemxfrm, or to some
> other such name?

No, this is not good. The variant which never allocates memory by itself
would be more complex to use and slower on average than gnulib's function.
Also, functions like 'strdup', 'putenv', 'setenv', 'scandir', 'fts' all
do dynamic memory allocation without having an 'a' in their name to
indicate this.


2010-08-08  Bruno Haible  <bruno <at> clisp.org>

	memxfrm: Speed up.
	* lib/memxfrm.c (memxfrm): Allocate enough memory ahead of time, so
	that usually only one call to strxfrm is necessary for each string
	part.
	Reported by Paul Eggert <eggert <at> cs.ucla.edu>.

--- lib/memxfrm.c.orig	Sun Aug  8 11:56:19 2010
+++ lib/memxfrm.c	Sun Aug  8 11:55:46 2010
@@ -64,12 +64,40 @@
     for (;;)
       {
         /* Search next NUL byte.  */
-        const char *q = p + strlen (p);
+        size_t l = strlen (p);
 
         for (;;)
           {
             size_t k;
 
+            /* A call to strxfrm costs about 20 times more than a call to
+               strdup of the result.  Therefore it is worth to try to avoid
+               calling strxfrm more than once on a given string, by making
+               enough room before calling strxfrm.
+               The size of the strxfrm result, k, is likely to be between
+               l and 3 * l.  */
+            if (3 * l >= allocated - length)
+              {
+                /* Grow the result buffer.  */
+                size_t new_allocated;
+                char *new_result;
+
+                new_allocated = length + 3 * l + 1;
+                if (new_allocated < 2 * allocated)
+                  new_allocated = 2 * allocated;
+                if (new_allocated < 64)
+                  new_allocated = 64;
+                if (result == resultbuf)
+                  new_result = (char *) malloc (new_allocated);
+                else
+                  new_result = (char *) realloc (result, new_allocated);
+                if (new_result != NULL)
+                  {
+                    allocated = new_allocated;
+                    result = new_result;
+                  }
+              }
+
             errno = 0;
             k = strxfrm (result + length, p, allocated - length);
             if (errno != 0)
@@ -77,17 +105,21 @@
             if (k >= allocated - length)
               {
                 /* Grow the result buffer.  */
+                size_t new_allocated;
                 char *new_result;
 
-                allocated = 2 * allocated;
-                if (allocated < 64)
-                  allocated = 64;
+                new_allocated = length + k + 1;
+                if (new_allocated < 2 * allocated)
+                  new_allocated = 2 * allocated;
+                if (new_allocated < 64)
+                  new_allocated = 64;
                 if (result == resultbuf)
-                  new_result = (char *) malloc (allocated);
+                  new_result = (char *) malloc (new_allocated);
                 else
-                  new_result = (char *) realloc (result, allocated);
+                  new_result = (char *) realloc (result, new_allocated);
                 if (new_result == NULL)
                   goto out_of_memory_1;
+                allocated = new_allocated;
                 result = new_result;
               }
             else
@@ -97,7 +129,7 @@
               }
           }
 
-        p = q + 1;
+        p = p + l + 1;
         if (p == p_end)
           break;
         result[length] = '\0';
@@ -105,12 +137,23 @@
       }
   }
 
-  /* Shrink the allocated memory if possible.  */
-  if (result != resultbuf && (length > 0 ? length : 1) < allocated)
+  /* Shrink the allocated memory if possible.
+     It is not worth calling realloc when length + 1 == allocated; it would
+     save just one byte.  */
+  if (result != resultbuf && length + 1 < allocated)
     {
-      char *memory = (char *) realloc (result, length > 0 ? length : 1);
-      if (memory != NULL)
-        result = memory;
+      if ((length > 0 ? length : 1) <= *lengthp)
+        {
+          memcpy (resultbuf, result, length);
+          free (result);
+          result = resultbuf;
+        }
+      else
+        {
+          char *memory = (char *) realloc (result, length > 0 ? length : 1);
+          if (memory != NULL)
+            result = memory;
+        }
     }
 
   s[n] = orig_sentinel;

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Sun, 08 Aug 2010 17:27:02 GMT) Full text and rfc822 format available.

Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-coreutils <at> gnu.org
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Sun, 8 Aug 2010 14:24:40 +0200

[Message part 1 (text/plain, inline)]

Hi Paul,

> I pushed a patch to do that at
> <http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=2b49b140cc13cf36ec5ee5acaca5ac7bfeed6366>.

The idea to allocate enough memory before calling strxfrm also gives
a speedup in this case. Done through the attached patch.

I called 'sort' like this:
  $ for i in `seq 10`; do
      time LC_ALL=de_DE.UTF-8 ./sort -R < input100 > output;
    done

where the input100 file contains 100 copies of the attached 2-lines file.
Timings before the patch:

real    0m9.512s
user    0m18.401s
sys     0m0.468s

real    0m8.871s
user    0m17.033s
sys     0m0.544s

real    0m8.742s
user    0m16.777s
sys     0m0.472s

real    0m8.784s
user    0m16.829s
sys     0m0.480s

real    0m8.657s
user    0m16.665s
sys     0m0.452s

real    0m8.700s
user    0m16.737s
sys     0m0.484s

real    0m8.665s
user    0m16.569s
sys     0m0.500s

real    0m8.826s
user    0m16.937s
sys     0m0.464s

real    0m8.827s
user    0m16.985s
sys     0m0.428s

real    0m8.680s
user    0m16.765s
sys     0m0.356s

Timings with the patch:

real    0m5.886s
user    0m11.161s
sys     0m0.384s

real    0m5.137s
user    0m9.705s
sys     0m0.408s

real    0m5.150s
user    0m9.753s
sys     0m0.404s

real    0m5.090s
user    0m9.697s
sys     0m0.348s

real    0m5.158s
user    0m9.753s
sys     0m0.420s

real    0m5.149s
user    0m9.825s
sys     0m0.360s

real    0m5.134s
user    0m9.765s
sys     0m0.364s

real    0m5.080s
user    0m9.669s
sys     0m0.332s

real    0m5.052s
user    0m9.625s
sys     0m0.336s

real    0m5.084s
user    0m9.713s
sys     0m0.288s

Total user time before:         169.698 sec
Total user time with the patch:  98.666 sec
Speedup: factor 1.72.


2010-08-08  Bruno Haible  <bruno <at> clisp.org>

	sort: reduce number of strxfrm calls
	* src/sort.c (compare_random): Allocate enough memory ahead of time, so
        that usually only one call to strxfrm is necessary for each string
        part.

*** src/sort.c.orig	Sun Aug  8 13:11:01 2010
--- src/sort.c	Sun Aug  8 13:10:45 2010
***************
*** 2047,2052 ****
--- 2047,2080 ----
  
            /* Store the transformed data into a big-enough buffer.  */
  
+           /* A call to strxfrm costs about 20 times more than a call to
+              strdup of the result.  Therefore it is worth to try to avoid
+              calling strxfrm more than once on a given string, by making
+              enough room before calling strxfrm.
+              The size of the strxfrm result of a string of length len is
+              likely to be between len and 3 * len.  */
+           if (lena + lenb >= lena && lena + lenb < SIZE_MAX / 3)
+             {
+               size_t new_bufsize = 3 * (lena + lenb) + 1; /* no overflow */
+               if (new_bufsize > bufsize)
+                 {
+                   if (bufsize < SIZE_MAX / 3 * 2)
+                     {
+                       /* Ensure proportional growth of bufsize.  */
+                       if (new_bufsize < bufsize + bufsize / 2)
+                         new_bufsize = bufsize + bufsize / 2;
+                     }
+                   char *new_buf = malloc (new_bufsize);
+                   if (new_buf != NULL)
+                     {
+                       if (buf != stackbuf)
+                         free (buf);
+                       buf = new_buf;
+                       bufsize = new_bufsize;
+                     }
+                 }
+             }
+ 
            size_t sizea =
              (texta < lima ? xstrxfrm (buf, texta, bufsize) + 1 : 0);
            bool a_fits = sizea <= bufsize;

[input (text/html, attachment)]

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Sun, 08 Aug 2010 17:27:03 GMT) Full text and rfc822 format available.

Message #38 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Simon Josefsson <simon <at> josefsson.org>
To: Paul Eggert <eggert <at> CS.UCLA.EDU>
Cc: Bug-coreutils <bug-coreutils <at> gnu.org>, Paolo Bonzini <bonzini <at> gnu.org>,
	bug-gnulib <at> gnu.org
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Sat, 07 Aug 2010 13:40:46 +0200

Paul Eggert <eggert <at> CS.UCLA.EDU> writes:

> On 08/06/10 01:22, Paolo Bonzini wrote:
>> ISAAC is a RNG, so wouldn't that have the same problem above?  You
>> definitely need to use a hash function, it's just that you do not need a
>> cryptographic one.
>
> I had been thinking of using ISAAC by making the key its seed, and
> asking it to generate some random values, and then comparing the
> random values.  Any RNG can be used (or abused :-) in this way.
>
> I just now tried that, though, and discovered that on my million-line
> benchmark the MD5 method is about 4 times faster than the ISAAC-based
> method.  So that idea was not a good one.  I suppose we could try a
> non-cryptographic hash function at some point.

I suspect FNV or Xorshift would be faster, since they are so simple:

http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
http://en.wikipedia.org/wiki/Xorshift

/Simon
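
For reference, FNV-1a is small enough to sketch in a few lines. The
following is a minimal illustration of the 32-bit variant, using the
published FNV offset basis and prime. The function name fnv1a_32 is made
up for this example, and nothing in the thread indicates that this code
was tried in sort.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative 32-bit FNV-1a hash over an arbitrary byte range.
   The name fnv1a_32 is hypothetical; it is not part of coreutils
   or gnulib.  Embedded NUL bytes are hashed like any other byte.  */
static uint32_t
fnv1a_32 (const void *data, size_t len)
{
  const unsigned char *p = data;
  uint32_t h = 2166136261u;     /* FNV-1a 32-bit offset basis */

  for (size_t i = 0; i < len; i++)
    {
      h ^= p[i];                /* mix in one input byte */
      h *= 16777619u;           /* multiply by the 32-bit FNV prime */
    }
  return h;
}
```

Hashing zero bytes returns the offset basis, 0x811c9dc5, and hashing the
single byte "a" gives 0xe40c292c. Whether such a hash would beat MD5 on
the benchmark above would have to be measured.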

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Mon, 09 Aug 2010 06:22:01 GMT) Full text and rfc822 format available.

Message #41 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: Bruno Haible <bruno <at> clisp.org>
Cc: bug-coreutils <at> gnu.org
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Sun, 08 Aug 2010 23:21:29 -0700

On 08/08/10 05:24, Bruno Haible wrote:
> sort: reduce number of strxfrm calls

Thanks for that suggestion.  Amusingly enough, it made 'sort -R'
slower on the first benchmark I tried it on, which was 'sort -R *'.
But that's an unfair benchmark, since '*' expanded to executables and
other non-text files.  Overall, it's a good idea.  However, the code
need not be quite that long, since there's no need to do size_t
overflow checking.  I pushed this:

From 0061819c7e1bbc26586cc5977ea96da016f7cea2 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Sun, 8 Aug 2010 23:14:38 -0700
Subject: [PATCH] sort: speed up -R with long lines in hard locales

* src/sort.c (compare_random): Guess that the output will be
3X the input.  This avoids the overhead of calling strxfrm
twice on typical implementations.  Suggested by Bruno Haible.
---
 src/sort.c |   18 +++++++++++++-----
 1 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index dcfd24f..148ed3e 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -2024,6 +2024,7 @@ compare_random (char *restrict texta, size_t lena,
   char stackbuf[4000];
   char *buf = stackbuf;
   size_t bufsize = sizeof stackbuf;
+  void *allocated = NULL;
   uint32_t dig[2][MD5_DIGEST_SIZE / sizeof (uint32_t)];
   struct md5_ctx s[2];
   s[0] = s[1] = random_md5_state;
@@ -2047,6 +2048,16 @@ compare_random (char *restrict texta, size_t lena,
 
           /* Store the transformed data into a big-enough buffer.  */
 
+          /* A 3X size guess avoids the overhead of calling strxfrm
+             twice on typical implementations.  Don't worry about
+             size_t overflow, as the guess need not be correct.  */
+          size_t guess_bufsize = 3 * (lena + lenb) + 2;
+          if (bufsize < guess_bufsize)
+            {
+              bufsize = MAX (guess_bufsize, bufsize * 3 / 2);
+              buf = allocated = xrealloc (allocated, bufsize);
+            }
+
           size_t sizea =
             (texta < lima ? xstrxfrm (buf, texta, bufsize) + 1 : 0);
           bool a_fits = sizea <= bufsize;
@@ -2062,9 +2073,7 @@ compare_random (char *restrict texta, size_t lena,
               bufsize = sizea + sizeb;
               if (bufsize < SIZE_MAX / 3)
                 bufsize = bufsize * 3 / 2;
-              buf = (buf == stackbuf
-                     ? xmalloc (bufsize)
-                     : xrealloc (buf, bufsize));
+              buf = allocated = xrealloc (allocated, bufsize);
               if (texta < lima)
                 strxfrm (buf, texta, sizea);
               if (textb < limb)
@@ -2119,8 +2128,7 @@ compare_random (char *restrict texta, size_t lena,
       diff = xfrm_diff;
     }
 
-  if (buf != stackbuf)
-    free (buf);
+  free (allocated);
 
   return diff;
 }
-- 
1.7.2

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Mon, 09 Aug 2010 06:29:02 GMT) Full text and rfc822 format available.

Message #44 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: Bruno Haible <bruno <at> clisp.org>
Cc: Simon Josefsson <simon <at> josefsson.org>, bug-coreutils <at> gnu.org
Subject: Re: MD5 is broken
Date: Sun, 08 Aug 2010 23:28:34 -0700

On 08/08/10 06:26, Bruno Haible wrote:
> Here is a proposed patch to make this clearer.

I like this patch, except I have qualms about
putting a Wikipedia URL in the documentation, as
Wikipedia is not that stable.  Perhaps
<http://www.kb.cert.org/vuls/id/836068> would
be a better URL.  Also, the --help output shouldn't
point to Wikipedia (or to CERT, for that matter);
it should at most refer to the coreutils manual.

Jim and/or Pádraig may have better advice here.

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Mon, 09 Aug 2010 06:41:01 GMT) Full text and rfc822 format available.

Message #47 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: Bruno Haible <bruno <at> clisp.org>
Cc: Bug-coreutils <bug-coreutils <at> gnu.org>, Bug-gnulib <bug-gnulib <at> gnu.org>
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Sun, 08 Aug 2010 23:41:23 -0700

On 08/08/10 03:31, Bruno Haible wrote:

>> The point
>> remains, though, that it's confusing that gnulib memxfrm's name begins
>> with "mem", as the mem* functions don't allocate memory.  Would you
>> consider a patch that renames gnulib memxfrm to amemxfrm, or to some
>> other such name?
> 
> No, this is not good. The variant which never allocates memory by itself
> would be more complex to use and slower on average than gnulib's function.

Sorry, but this doesn't seem to address the point.  The name for
gnulib's strxfrm variant should be chosen so that it's not confusing,
regardless of whether some other strxfrm variant exists.  Currently,
no other variant exists in coreutils and I think it unlikely that
coreutils will use any similar variant any time soon, but removing
coreutils memxfrm didn't fix the gnulib confusion.

> Also, functions like 'strdup', 'putenv', 'setenv', 'scandir', 'fts', all
> do dynamic memory allocation without having an 'a' in their name to
> indicate this.

My point was not that the function must start with "a".  After all,
lots of functions allocate memory without having "a" at the front: malloc
is just one example.  All I'm saying is that the gnulib variant shouldn't
use a name starting with "mem", because the mem* names have similar
properties and the gnulib variant departs dramatically from these
properties.

The "strdup"/"strndup" functions are cases in point.  Their names were
controversial, and they had quite some trouble getting into POSIX, precisely
because their names began with "str" but (unlike the other str* functions)
they allocated memory.  It would be better to not go down that same road
again.

Thanks for improving the performance of the gnulib variant, by the way.

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Mon, 09 Aug 2010 07:25:02 GMT) Full text and rfc822 format available.

Message #50 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-coreutils <at> gnu.org
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Mon, 9 Aug 2010 09:25:04 +0200

Hi Paul,

> +              buf = allocated = xrealloc (allocated, bufsize);

The contents of the 'allocated' buffer are scratch, therefore malloc + free
should be faster than realloc (except maybe on Linux systems, due to the
mremap() system call).

Also, the '3 * (lena + lenb)' guess is pessimistic; it is possible that
it may return with ENOMEM when in fact strxfrm's real needs would not
lead to ENOMEM.

Bruno
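
The two points above can be sketched together: drop the old scratch block
instead of letting realloc copy it, and treat a failed guess-sized
allocation as non-fatal by falling back to a buffer that already exists.
This is only an illustration. The helper grow_scratch and its parameters
are hypothetical and do not appear in src/sort.c.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not taken from sort.c.
   Try to make *BUF at least WANT bytes large.  *HEAP is the current
   heap block, or NULL if none.  The old contents are scratch, so the
   old block is freed rather than reallocated, which avoids copying.
   If malloc fails, fall back to FALLBACK (for example a stack buffer)
   instead of reporting an error.  Return the usable size of *BUF.  */
static size_t
grow_scratch (char **buf, void **heap, size_t cursize, size_t want,
              char *fallback, size_t fallback_size)
{
  if (want <= cursize)
    return cursize;             /* already big enough */

  free (*heap);                 /* nothing worth preserving */
  *heap = malloc (want);
  if (*heap == NULL)
    {
      /* The size was only a guess, so failure here is not fatal.  */
      *buf = fallback;
      return fallback_size;
    }
  *buf = *heap;
  return want;
}
```

A caller would then run strxfrm on the returned size and handle a result
that does not fit in the usual way, so a wrong guess costs time but not
correctness.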

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Tue, 10 Aug 2010 01:06:02 GMT) Full text and rfc822 format available.

Message #53 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> CS.UCLA.EDU>
Cc: Simon Josefsson <simon <at> josefsson.org>, bug-coreutils <at> gnu.org,
	Bruno Haible <bruno <at> clisp.org>
Subject: Re: bug#6789: MD5 is broken
Date: Tue, 10 Aug 2010 02:06:18 +0100

On 09/08/10 07:28, Paul Eggert wrote:
> On 08/08/10 06:26, Bruno Haible wrote:
>> Here is a proposed patch to make this clearer.
> 
> I like this patch, except I have qualms about
> putting a Wikipedia URL in the documentation, as
> Wikipedia is not that stable.  Perhaps
> <http://www.kb.cert.org/vuls/id/836068> would
> be a better URL.  Also, the --help output shouldn't
> point to Wikipedia (or to CERT, for that matter);
> it should at most refer to the coreutils manual.
> 
> Jim and/or Pádraig may have better advice here.

We don't need to hand hold people interested
in the details of MD5 weaknesses. They'll be well
able to find the pertinent info. Therefore in the
amended patch below I've just removed the URL.
I also removed the addition to --help
(and consequently the man page), as I think it's overkill.
If we were to add something to --help it should
probably be also done for sha1sum, but the amended
texinfo is enough I think.

cheers,
Pádraig.

commit 4caf1adec8e6ce0cb7ab75365ab312411b2d47bd
Author: Bruno Haible <bruno <at> clisp.org>
Date:   Tue Aug 10 01:56:36 2010 +0100

    doc: improve the info on md5sum security weaknesses

    * doc/coreutils.texi (md5sum invocation): Mention currently known
    security problems. Don't recommend SHA-1 as alternative.
    Reported by Simon Josefsson

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 942978f..e0e308b 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3414,14 +3414,12 @@ options}.
 Note: The MD5 digest is more reliable than a simple CRC (provided by
 the @command{cksum} command) for detecting accidental file corruption,
 as the chances of accidentally having two files with identical MD5
-are vanishingly small.  However, it should not be considered truly
-secure against malicious tampering: although finding a file with a
-given MD5 fingerprint, or modifying a file so as to retain its MD5 are
-considered infeasible at the moment, it is known how to produce
-different files with identical MD5 (a ``collision''), something which
-can be a security issue in certain contexts.  For more secure hashes,
-consider using SHA-1 or SHA-2.  @xref{sha1sum invocation}, and
-@ref{sha2 utilities}.
+are vanishingly small.  However, it should not be considered secure
+against malicious tampering: although finding a file with a given MD5
+fingerprint is considered infeasible at the moment, it is known how
+to modify certain files, including digital certificates, so that they
+appear valid when signed with an MD5 digest.
+For more secure hashes, consider using SHA-2.  @xref{sha2 utilities}.

 If a @var{file} is specified as @samp{-} or if no files are given
 @command{md5sum} computes the checksum for the standard input.

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Tue, 10 Aug 2010 20:54:01 GMT) Full text and rfc822 format available.

Message #56 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: Bruno Haible <bruno <at> clisp.org>
Cc: bug-coreutils <at> gnu.org
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Tue, 10 Aug 2010 22:53:41 +0200

On 08/09/10 09:25, Bruno Haible wrote:

> The contents of the 'allocated' buffer are scratch, therefore malloc + free
> should be faster than realloc...
> 
> Also, the '3 * (lena + lenb)' guess is pessimistic; it is possible that
> it may return with ENOMEM when in fact strxfrm's real needs would not
> lead to ENOMEM.

Thanks again; I installed this:

* src/sort.c (compare_random): Use free/xmalloc rather than
xrealloc, since the old buffer contents need not be preserved.
Also, don't fail if the guessed-sized malloc fails.  Suggested by
Bruno Haible.
---
 src/sort.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 084f4e3..3dc7ae0 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -2056,7 +2056,13 @@ compare_random (char *restrict texta, size_t lena,
           if (bufsize < guess_bufsize)
             {
               bufsize = MAX (guess_bufsize, bufsize * 3 / 2);
-              buf = allocated = xrealloc (allocated, bufsize);
+              free (allocated);
+              buf = allocated = malloc (bufsize);
+              if (! buf)
+                {
+                  buf = stackbuf;
+                  bufsize = sizeof stackbuf;
+                }
             }
 
           size_t sizea =
@@ -2074,7 +2080,8 @@ compare_random (char *restrict texta, size_t lena,
               bufsize = sizea + sizeb;
               if (bufsize < SIZE_MAX / 3)
                 bufsize = bufsize * 3 / 2;
-              buf = allocated = xrealloc (allocated, bufsize);
+              free (allocated);
+              buf = allocated = xmalloc (bufsize);
               if (texta < lima)
                 strxfrm (buf, texta, sizea);
               if (textb < limb)
-- 
1.7.2

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Wed, 11 Aug 2010 00:38:01 GMT) Full text and rfc822 format available.

Message #59 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-coreutils <at> gnu.org, bug-gnulib <at> gnu.org
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Wed, 11 Aug 2010 02:38:11 +0200

Hi Paul,

> All I'm saying is that the gnulib variant shouldn't
> use a name starting with "mem", because the mem* names have similar
> properties and the gnulib variant departs dramatically from these
> properties.
> 
> The "strdup"/"strndup" functions are cases in point.  Their names were
> controversial, and they had quite some trouble getting into POSIX, precisely
> because their names began with "str" but (unlike the other str* functions)
> they allocated memory.

But now they are in POSIX, so a precedent exists.

On the other hand, it has now appeared that strxfrm would be easier to use
efficiently if it had a wrapper that incorporated the "allocate 3 * len
bytes before calling strxfrm" heuristic. If we add such a wrapper to gnulib,
it could be called 'astrxfrm'

  extern char * astrxfrm (const char *s, char *resultbuf, size_t *lengthp);

and then I would agree to renaming memxfrm -> amemxfrm, for consistency.

Bruno
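
A wrapper with the proposed signature might look roughly like the sketch
below. It is not the gnulib implementation. The name sketch_astrxfrm is
hypothetical, and the semantics are assumptions chosen to mirror the
memxfrm patch earlier in this thread: the caller's buffer is used when the
result fits, otherwise a malloc'd string is returned, and *lengthp is set
to the result size including the terminating NUL.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: hypothetical name, assumed semantics (see above).
   Returns NULL with errno set on failure.  */
static char *
sketch_astrxfrm (const char *s, char *resultbuf, size_t *lengthp)
{
  size_t len = strlen (s);
  size_t orig = resultbuf != NULL ? *lengthp : 0;
  size_t avail = orig;
  char *buf = resultbuf;
  char *heap = NULL;

  /* Heuristic from the thread: the result is likely to need between
     len and 3 * len bytes, so make that much room up front.  */
  if (len < (SIZE_MAX - 1) / 3 && 3 * len + 1 > avail)
    {
      heap = malloc (3 * len + 1);
      if (heap != NULL)
        {
          buf = heap;
          avail = 3 * len + 1;
        }
      /* On failure, continue with what we have; it was only a guess.  */
    }

  errno = 0;
  size_t k = strxfrm (buf, s, avail);
  if (errno != 0)
    {
      int err = errno;
      free (heap);
      errno = err;
      return NULL;
    }

  if (k >= avail)
    {
      /* The guess was too small.  The exact size is now known, so a
         second strxfrm call is needed.  */
      char *bigger = malloc (k + 1);
      free (heap);
      if (bigger == NULL)
        {
          errno = ENOMEM;
          return NULL;
        }
      heap = buf = bigger;
      strxfrm (buf, s, k + 1);
    }

  /* Prefer the caller's buffer when the result fits after all.  */
  if (heap != NULL && k + 1 <= orig)
    {
      memcpy (resultbuf, heap, k + 1);
      free (heap);
      buf = resultbuf;
    }

  *lengthp = k + 1;
  return buf;
}
```

The caller frees the result only when it differs from resultbuf. In the
common case the function calls strxfrm once, which is the point of the
heuristic.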

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Wed, 11 Aug 2010 03:42:02 GMT) Full text and rfc822 format available.

Message #62 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> CS.UCLA.EDU>
To: Bruno Haible <bruno <at> clisp.org>
Cc: bug-coreutils <at> gnu.org, bug-gnulib <at> gnu.org
Subject: Re: propose renaming gnulib memxfrm to amemxfrm (naming collision
	with coreutils)
Date: Wed, 11 Aug 2010 05:42:10 +0200

On 08/11/10 02:38, Bruno Haible wrote:
>   extern char * astrxfrm (const char *s, char *resultbuf, size_t *lengthp);

Yes, that looks like a useful addition.  Thanks for the suggestion.

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Sat, 14 Aug 2010 17:19:02 GMT) Full text and rfc822 format available.

Message #65 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: Simon Josefsson <simon <at> josefsson.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
	bug-coreutils <at> gnu.org
Subject: Re: bug#6789: MD5 is broken
Date: Sat, 14 Aug 2010 19:19:04 +0200

Hi Pádraig,

> I also removed the addition to --help
> (and consequently the man page), as I think it's overkill.

It's common to list important issues with a program or function
in the BUGS section of the manual page. For example,

  $ man 3 tempnam
  ...
  BUGS
  ...
         Never use this function.  Use mkstemp(3) or tmpfile(3) instead.

In particular if the use of a program may have severe security implications,
I would expect to know about it from the manual page.

> If we were to add something to --help it should
> probably be also done for sha1sum

The attacks on SHA-1 are less advanced than those on MD5, currently.
But if you would warn against use of SHA-1 also, please go ahead.

> commit 4caf1adec8e6ce0cb7ab75365ab312411b2d47bd
> Author: Bruno Haible <bruno <at> clisp.org>
> Date:   Tue Aug 10 01:56:36 2010 +0100
> 
>     doc: improve the info on md5sum security weaknesses
> 
>     * doc/coreutils.texi (md5sum invocation): Mention currently known
>     security problems. Don't recommend SHA-1 as alternative.
>     Reported by Simon Josefsson

You haven't pushed this so far, I think?

Bruno

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6789; Package coreutils. (Sat, 14 Aug 2010 22:57:01 GMT) Full text and rfc822 format available.

Message #68 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Bruno Haible <bruno <at> clisp.org>
Cc: Report bugs to <bug-coreutils <at> gnu.org>
Subject: Re: bug#6789: MD5 is broken
Date: Sat, 14 Aug 2010 23:56:59 +0100

On 14/08/10 18:19, Bruno Haible wrote:
> Hi Pádraig,
> 
>> I also removed the addition to --help
>> (and consequently the man page), as I think it's overkill.
> 
> It's common to list important issues with a program or function
> in the BUGS section of the manual page. For example,
> 
>   $ man 3 tempnam
>   ...
>   BUGS
>   ...
>          Never use this function.  Use mkstemp(3) or tmpfile(3) instead.
> 
> In particular if the use of a program may have severe security implications,
> I would expect to know about it from the manual page.

OK cool. I was thinking that warnings would be more appropriate
in library docs rather than the user util, but I will add
the warning to BUGS in man/md5sum.x and leave --help unchanged.

>> If we were to add something to --help it should
>> probably be also done for sha1sum
> 
> The attacks on SHA-1 are less advanced than those on MD5, currently.
> But if you would warn against use of SHA-1 also, please go ahead.
> 
>> commit 4caf1adec8e6ce0cb7ab75365ab312411b2d47bd
>> Author: Bruno Haible <bruno <at> clisp.org>
>> Date:   Tue Aug 10 01:56:36 2010 +0100
>>
>>     doc: improve the info on md5sum security weaknesses
>>
>>     * doc/coreutils.texi (md5sum invocation): Mention currently known
>>     security problems. Don't recommend SHA-1 as alternative.
>>     Reported by Simon Josefsson
> 
> You haven't pushed this so far, I think?

I only added it to my local queue in case there was
feedback on my amendments. I will apply the update now.

thanks,
Pádraig.

Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Sun, 07 Aug 2011 15:54:02 GMT) Full text and rfc822 format available.

Notification sent to Paul Eggert <eggert <at> CS.UCLA.EDU>:
bug acknowledged by developer. (Sun, 07 Aug 2011 15:54:02 GMT) Full text and rfc822 format available.

Message #73 received at 6789-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> CS.UCLA.EDU>
Cc: 6789-done <at> debbugs.gnu.org
Subject: Re: bug#6789: propose renaming gnulib memxfrm to amemxfrm (naming
	collision with coreutils)
Date: Sun, 07 Aug 2011 17:52:24 +0200

Paul Eggert wrote:
> On 08/09/10 09:25, Bruno Haible wrote:
>
>> The contents of the 'allocated' buffer are scratch, therefore malloc + free
>> should be faster than realloc...
>>
>> Also, the '3 * (lena + lenb)' guess is pessimistic; it is possible that
>> it may return with ENOMEM when in fact strxfrm's real needs would not
>> lead to ENOMEM.
>
> Thanks again; I installed this:
>
> * src/sort.c (compare_random): Use free/xmalloc rather than
> xrealloc, since the old buffer contents need not be preserved.
> Also, don't fail if the guessed-sized malloc fails.  Suggested by
> Bruno Haible.

This was resolved a year ago.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 05 Sep 2011 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 14 years and 5 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
