GNU bug report logs - #17492
diff handles moved text poorly; should not need --minimal

Previous Next

Package: diffutils;

Reported by: Scott McPeak <smcpeak <at> coverity.com>

Date: Wed, 14 May 2014 15:32:03 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 17492 in the body.
You can then email your comments to 17492 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-diffutils <at> gnu.org:
bug#17492; Package diffutils. (Wed, 14 May 2014 15:32:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Scott McPeak <smcpeak <at> coverity.com>:
New bug report received and forwarded. Copy sent to bug-diffutils <at> gnu.org. (Wed, 14 May 2014 15:32:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Scott McPeak <smcpeak <at> coverity.com>
To: <bug-diffutils <at> gnu.org>
Subject: diff handles moved text poorly; should not need --minimal
Date: Wed, 14 May 2014 02:08:02 -0700
GNU diff appears to handle moved text poorly using its default diff 
algorithm.  At the end, I show a script that will reproduce the problem. 
 The essence is you have a large file that contains many blank lines, 
then move a large-ish block of text (~10% of file) from near the top to 
near the bottom.  The default diff algorithm treats all the blank lines 
as unchanged, thereby treating virtually all of the non-blank lines in 
the file as changed.  This makes the diff useless for comprehending the 
change and basically destroys the history shown by a command like "git 
annotate".

Now, after spending considerable time investigating this flaw, I have 
discovered the --minimal flag, and see that it will get the right 
answer, and that git accepts this flag in many (all?) cases where it 
matters.  But it is burdensome to add this flag to every command that 
uses diff (as I would not know in advance that any particular change 
will be affected by this bug), and few people know about the flag, so 
even if I use it, I still can't realistically move text in my files and 
expect others to be able to understand the resulting diff.  Furthermore, 
the manual warns that --minimal can be very slow; "git annotate" is 
already quite slow (10s+) on realistically sized repos, so compounding 
that with --minimal on every invocation is not appealing.

I think diff should, by default, handle the common case of moved text 
better.

Naively, it does not seem like detecting moved text should be very 
difficult or expensive.  Since the presence of the blank lines is a 
required element, a simple hack of avoiding anchoring on blank lines 
might go a long way here.  Better would be to avoid anchoring on any 
line content that is common.

I tested with diff 2.8.1 and diff 3.3 (latest as of writing).

-Scott

#!/bin/sh
# diff-bug-repro.sh: demonstrate diff bug with moving text

# Print lines with contents [$1,$2], separated by blank lines.
#
# The blank lines are important to this bug, because they seem
# to be treated by diff as anchor points: lines that haven't
# changed and hence the diff should try to work around them.
# Without the blank lines, diff works correctly.
genlines() {
  n=$1;
  while [ $n -le $2 ]; do
    echo $n;
    echo;
    n=`expr $n + 1`;
  done
}

# Simulates some original file that happens to have a lot of blank
# lines, as one might expect in source code, prose, HTML, etc.
echo "creating orig.txt: file with [1,5000], separated by blank lines..."
genlines 1 5000 > orig.txt

# Simulates a modification of orig.txt where a large block of text,
# represented by [11,499], is moved from near the beginning of the
# file to near the end of the file.
echo "creating new.txt: file with [1,10][500,4990][11,499][4991,5000], 
separated by blank lines..."
genlines 1 10 > new.txt
genlines 500 4990 >> new.txt
genlines 11 499 >> new.txt
genlines 4991 5000 >> new.txt

# The diff output contains 14948 lines, which is the bug.  It should
# only need about 2000 lines: 1000 to describe deleting the block from
# the start and 1000 lines to describe adding it at the end.
echo "diff -u orig.txt new.txt | wc -l"
diff -u orig.txt new.txt | wc -l

# EOF




Information forwarded to bug-diffutils <at> gnu.org:
bug#17492; Package diffutils. (Thu, 15 May 2014 05:13:02 GMT) Full text and rfc822 format available.

Message #8 received at 17492 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Scott McPeak <smcpeak <at> coverity.com>, 17492 <at> debbugs.gnu.org
Subject: Re: [bug-diffutils] bug#17492: diff handles moved text poorly; should
 not need --minimal
Date: Wed, 14 May 2014 22:12:21 -0700
Thanks.  This sounds like Bug#16848, which was fixed in February after 
the latest diffutils release.  Can you build a copy of diffutils from 
the git repository and try it out?  Or perhaps you can cherry-pick the 
patches:

http://bugs.gnu.org/16848




Information forwarded to bug-diffutils <at> gnu.org:
bug#17492; Package diffutils. (Thu, 15 May 2014 18:30:02 GMT) Full text and rfc822 format available.

Message #11 received at 17492 <at> debbugs.gnu.org (full text, mbox):

From: Scott McPeak <smcpeak <at> coverity.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, <17492 <at> debbugs.gnu.org>
Subject: Re: [bug-diffutils] bug#17492: diff handles moved text poorly; should
 not need --minimal
Date: Thu, 15 May 2014 11:29:29 -0700
On 05/14/2014 10:12 PM, Paul Eggert wrote:
> Thanks.  This sounds like Bug#16848, which was fixed in February after
> the latest diffutils release.

It looks different to me.  In the case I reported, diff never shows 
inserting and removing the same line next to itself.

> Can you build a copy of diffutils from
> the git repository and try it out?

After about 20 minutes of chasing build dependencies, I timed out with a 
compile error about set_binary_mode.  To reproduce what I reported, one 
only has to run the short shell script I included, so perhaps someone 
with a working dev build could try that?

-Scott




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Fri, 16 May 2014 02:24:02 GMT) Full text and rfc822 format available.

Notification sent to Scott McPeak <smcpeak <at> coverity.com>:
bug acknowledged by developer. (Fri, 16 May 2014 02:24:02 GMT) Full text and rfc822 format available.

Message #16 received at 17492-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Scott McPeak <smcpeak <at> coverity.com>, 17492-done <at> debbugs.gnu.org
Subject: Re: [bug-diffutils] bug#17492: diff handles moved text poorly; should
 not need --minimal
Date: Thu, 15 May 2014 19:23:26 -0700
Scott McPeak wrote:
> To reproduce what I reported, one only has to run the short shell script
> I included, so perhaps someone with a working dev build could try that?

I tried it and 'diff' worked for me:

$ diff -u orig.txt new.txt | wc -l
1972

So I'll mark the bug as fixed.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 13 Jun 2014 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 65 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.