GNU bug report logs -
#32993
Pathologically slow operation
Previous Next
To reply to this bug, email your comments to 32993 AT debbugs.gnu.org.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-diffutils <at> gnu.org
:
bug#32993
; Package
diffutils
.
(Mon, 08 Oct 2018 21:35:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Stefan Monnier <monnier <at> IRO.UMontreal.CA>
:
New bug report received and forwarded. Copy sent to
bug-diffutils <at> gnu.org
.
(Mon, 08 Oct 2018 21:35:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
I recently bumped into a `diff` operation that I killed after several
minutes while diffing two files (on 3.7GHz core i3, which is the fastest
machine I have).
These files were generated as part of Emacs's "refine-hunk" processing
which tries to do word-level diffs (by basically turning every word
into N copies of this word, each one on its own line (where N is the
number of chars in the word, used to indicate to `diff` that long words
are "more costly" than short ones)).
So the files's sizes were:
% wc tmp/diff-bug-*
1038026 851160 4963190 tmp/diff-bug-1
65041 54877 314788 tmp/diff-bug-2
1103067 906037 5277978 total
%
With --speed-large-files, diff still took almost a minute to return an
answer (which is 973026 lines long).
Those file aren't exactly security sensitive, but they contain personal
info that I'd rather not make public (I can make send them in private
upon request, tho). Is there a chance this performance behavior is the
result of a performance bug, or is the algorithm really that costly?
Stefan
Information forwarded
to
bug-diffutils <at> gnu.org
:
bug#32993
; Package
diffutils
.
(Mon, 08 Oct 2018 22:50:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 32993 <at> debbugs.gnu.org (full text, mbox):
On 10/8/18 2:34 PM, Stefan Monnier wrote:
> Is there a chance this performance behavior is the
> result of a performance bug, or is the algorithm really that costly?
It's O(N*D), where N is the input length and D is the number of
differences. In your case that might be about 10*11, which is getting up
there.
> tries to do word-level diffs (by basically turning every word
> into N copies of this word, each one on its own line (where N is the
> number of chars in the word, used to indicate to `diff` that long words
> are "more costly" than short ones)
Ouch. That's a really inefficient way of mucking with the cost
algorithm. Diff should be able to do taht directly, without your having
to repeat the words. (Just a simple matter of programming....)
Information forwarded
to
bug-diffutils <at> gnu.org
:
bug#32993
; Package
diffutils
.
(Tue, 09 Oct 2018 00:16:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 32993 <at> debbugs.gnu.org (full text, mbox):
>> tries to do word-level diffs (by basically turning every word
>> into N copies of this word, each one on its own line (where N is the
>> number of chars in the word, used to indicate to `diff` that long words
>> are "more costly" than short ones)
> Ouch. That's a really inefficient way of mucking with the cost
> algorithm.
What other way is there?
> Diff should be able to do taht directly, without your having to
> repeat the words. (Just a simple matter of programming....)
"man diff" doesn't seem to offer much room for "programming".
Stefan
Information forwarded
to
bug-diffutils <at> gnu.org
:
bug#32993
; Package
diffutils
.
(Tue, 09 Oct 2018 00:48:01 GMT)
Full text and
rfc822 format available.
Message #14 received at 32993 <at> debbugs.gnu.org (full text, mbox):
On 10/8/18 5:14 PM, Stefan Monnier wrote:
> What other way is there?
It'd require hacking the diff source code, alas. Also, understanding the
algorithm. Although the best guy to do it would probably be the
algorithm's co-inventor Eugene Myers, he's kind of busy these days
<https://en.wikipedia.org/wiki/Eugene_Myers>.
This bug report was last modified 6 years and 251 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.