GNU bug report logs - #32993
Pathologically slow operation

Previous Next

Package: diffutils;

Reported by: Stefan Monnier <monnier <at> IRO.UMontreal.CA>

Date: Mon, 8 Oct 2018 21:35:01 UTC

Severity: normal

To reply to this bug, email your comments to 32993 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-diffutils <at> gnu.org:
bug#32993; Package diffutils. (Mon, 08 Oct 2018 21:35:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Monnier <monnier <at> IRO.UMontreal.CA>:
New bug report received and forwarded. Copy sent to bug-diffutils <at> gnu.org. (Mon, 08 Oct 2018 21:35:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: bug-diffutils <at> gnu.org
Subject: Pathologically slow operation
Date: Mon, 08 Oct 2018 17:34:18 -0400
I recently bumped into a `diff` operation that I killed after several
minutes while diffing two files (on 3.7GHz core i3, which is the fastest
machine I have).

These files were generated as part of Emacs's "refine-hunk" processing
which tries to do word-level diffs (by basically turning every word
into N copies of this word, each one on its own line (where N is the
number of chars in the word, used to indicate to `diff` that long words
are "more costly" than short ones)).

So the files's sizes were:

    % wc tmp/diff-bug-* 
    1038026  851160 4963190 tmp/diff-bug-1
      65041   54877  314788 tmp/diff-bug-2
    1103067  906037 5277978 total
    %

With --speed-large-files, diff still took almost a minute to return an
answer (which is 973026 lines long).

Those file aren't exactly security sensitive, but they contain personal
info that I'd rather not make public (I can make send them in private
upon request, tho).  Is there a chance this performance behavior is the
result of a performance bug, or is the algorithm really that costly?


        Stefan




Information forwarded to bug-diffutils <at> gnu.org:
bug#32993; Package diffutils. (Mon, 08 Oct 2018 22:50:02 GMT) Full text and rfc822 format available.

Message #8 received at 32993 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stefan Monnier <monnier <at> IRO.UMontreal.CA>, 32993 <at> debbugs.gnu.org
Subject: Re: [bug-diffutils] bug#32993: Pathologically slow operation
Date: Mon, 8 Oct 2018 15:49:09 -0700
On 10/8/18 2:34 PM, Stefan Monnier wrote:
> Is there a chance this performance behavior is the
> result of a performance bug, or is the algorithm really that costly?

It's O(N*D), where N is the input length and D is the number of 
differences. In your case that might be about 10*11, which is getting up 
there.

> tries to do word-level diffs (by basically turning every word
> into N copies of this word, each one on its own line (where N is the
> number of chars in the word, used to indicate to `diff` that long words
> are "more costly" than short ones)

Ouch. That's a really inefficient way of mucking with the cost 
algorithm. Diff should be able to do taht directly, without your having 
to repeat the words. (Just a simple matter of programming....)





Information forwarded to bug-diffutils <at> gnu.org:
bug#32993; Package diffutils. (Tue, 09 Oct 2018 00:16:02 GMT) Full text and rfc822 format available.

Message #11 received at 32993 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 32993 <at> debbugs.gnu.org
Subject: Re: [bug-diffutils] bug#32993: Pathologically slow operation
Date: Mon, 08 Oct 2018 20:14:27 -0400
>> tries to do word-level diffs (by basically turning every word
>> into N copies of this word, each one on its own line (where N is the
>> number of chars in the word, used to indicate to `diff` that long words
>> are "more costly" than short ones)
> Ouch.  That's a really inefficient way of mucking with the cost
> algorithm.

What other way is there?

> Diff should be able to do taht directly, without your having to
> repeat the words. (Just a simple matter of programming....)

"man diff" doesn't seem to offer much room for "programming".


        Stefan




Information forwarded to bug-diffutils <at> gnu.org:
bug#32993; Package diffutils. (Tue, 09 Oct 2018 00:48:01 GMT) Full text and rfc822 format available.

Message #14 received at 32993 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
Cc: 32993 <at> debbugs.gnu.org
Subject: Re: [bug-diffutils] bug#32993: Pathologically slow operation
Date: Mon, 8 Oct 2018 17:46:57 -0700
On 10/8/18 5:14 PM, Stefan Monnier wrote:
> What other way is there?

It'd require hacking the diff source code, alas. Also, understanding the 
algorithm. Although the best guy to do it would probably be the 
algorithm's co-inventor Eugene Myers, he's kind of busy these days 
<https://en.wikipedia.org/wiki/Eugene_Myers>.





This bug report was last modified 6 years and 251 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.