From unknown Sat Aug 16 19:20:00 2025
Subject: bug#17492: diff handles moved text poorly; should not need --minimal
From: Scott McPeak
To: 17492@debbugs.gnu.org
Date: Wed, 14 May 2014 02:08:02 -0700
Message-ID: <53733272.4000904@coverity.com>
X-GNU-PR-Message: report 17492
X-GNU-PR-Package: diffutils

GNU diff appears to handle moved text poorly using its default diff
algorithm.  At the end, I show a script that will reproduce the problem.

The essence is you have a large file that contains many blank lines,
then move a large-ish block of text (~10% of file) from near the top to
near the bottom.  The default diff algorithm treats all the blank lines
as unchanged, thereby treating virtually all of the non-blank lines in
the file as changed.  This makes the diff useless for comprehending the
change and basically destroys the history shown by a command like
"git annotate".

Now, after spending considerable time investigating this flaw, I have
discovered the --minimal flag, and see that it will get the right
answer, and that git accepts this flag in many (all?) cases where it
matters.  But it is burdensome to add this flag to every command that
uses diff (as I would not know in advance that any particular change
will be affected by this bug), and few people know about the flag, so
even if I use it, I still can't realistically move text in my files and
expect others to be able to understand the resulting diff.  Furthermore,
the manual warns that --minimal can be very slow; "git annotate" is
already quite slow (10s+) on realistically sized repos, so compounding
that with --minimal on every invocation is not appealing.

I think diff should, by default, handle the common case of moved text
better.  Naively, it does not seem like detecting moved text should be
very difficult or expensive.
Since the presence of the blank lines is a required element, a simple
hack of avoiding anchoring on blank lines might go a long way here.
Better would be to avoid anchoring on any line content that is common.

I tested with diff 2.8.1 and diff 3.3 (latest as of writing).

-Scott

#!/bin/sh
# diff-bug-repro.sh: demonstrate diff bug with moving text

# Print lines with contents [$1,$2], separated by blank lines.
#
# The blank lines are important to this bug, because they seem
# to be treated by diff as anchor points: lines that haven't
# changed and hence the diff should try to work around them.
# Without the blank lines, diff works correctly.
genlines() {
  n=$1;
  while [ $n -le $2 ]; do
    echo $n;
    echo;
    n=`expr $n + 1`;
  done
}

# Simulates some original file that happens to have a lot of blank
# lines, as one might expect in source code, prose, HTML, etc.
echo "creating orig.txt: file with [1,5000], separated by blank lines..."
genlines 1 5000 > orig.txt

# Simulates a modification of orig.txt where a large block of text,
# represented by [11,499], is moved from near the beginning of the
# file to near the end of the file.
echo "creating new.txt: file with [1,10][500,4990][11,499][4991,5000], separated by blank lines..."
genlines 1 10 > new.txt
genlines 500 4990 >> new.txt
genlines 11 499 >> new.txt
genlines 4991 5000 >> new.txt

# The diff output contains 14948 lines, which is the bug.  It should
# only need about 2000 lines: 1000 to describe deleting the block from
# the start and 1000 lines to describe adding it at the end.
echo "diff -u orig.txt new.txt | wc -l"
diff -u orig.txt new.txt | wc -l

# EOF

From unknown Sat Aug 16 19:20:00 2025
Subject: bug#17492: [bug-diffutils] bug#17492: diff handles moved text poorly; should not need --minimal
From: Paul Eggert
Organization: UCLA Computer Science Department
To: Scott McPeak, 17492@debbugs.gnu.org
Date: Wed, 14 May 2014 22:12:21 -0700
Message-ID: <53744CB5.4030208@cs.ucla.edu>
References: <53733272.4000904@coverity.com>
In-Reply-To: <53733272.4000904@coverity.com>
X-GNU-PR-Message: followup 17492
X-GNU-PR-Package: diffutils

Thanks.  This sounds like Bug#16848, which was fixed in February after
the latest diffutils release.  Can you build a copy of diffutils from
the git repository and try it out?  Or perhaps you can cherry-pick the
patches:

http://bugs.gnu.org/16848

From unknown Sat Aug 16 19:20:00 2025
Subject: bug#17492: [bug-diffutils] bug#17492: diff handles moved text poorly; should not need --minimal
From: Scott McPeak
To: Paul Eggert, 17492@debbugs.gnu.org
Date: Thu, 15 May 2014 11:29:29 -0700
Message-ID: <53750789.2090300@coverity.com>
References: <53733272.4000904@coverity.com> <53744CB5.4030208@cs.ucla.edu>
In-Reply-To: <53744CB5.4030208@cs.ucla.edu>
X-GNU-PR-Message: followup 17492
X-GNU-PR-Package: diffutils

On 05/14/2014 10:12 PM, Paul Eggert wrote:
> Thanks.  This sounds like Bug#16848, which was fixed in February after
> the latest diffutils release.

It looks different to me.  In the case I reported, diff never shows
inserting and removing the same line next to itself.

> Can you build a copy of diffutils from
> the git repository and try it out?

After about 20 minutes of chasing build dependencies, I timed out with a
compile error about set_binary_mode.

To reproduce what I reported, one only has to run the short shell script
I included, so perhaps someone with a working dev build could try that?

-Scott

From unknown Sat Aug 16 19:20:00 2025
MIME-Version: 1.0
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Scott McPeak
Subject: bug#17492: closed (Re: [bug-diffutils] bug#17492: diff handles moved text poorly; should not need --minimal)
References: <5375769E.6080108@cs.ucla.edu> <53733272.4000904@coverity.com>
X-Gnu-PR-Message: they-closed 17492
X-Gnu-PR-Package: diffutils
Reply-To: 17492@debbugs.gnu.org
Date: Fri, 16 May 2014 02:24:03 +0000
Content-Type: multipart/mixed; boundary="----------=_1400207043-10655-1"

This is a multi-part message in MIME format...
------------=_1400207043-10655-1
Content-Disposition: inline
Content-Type: text/plain; charset="utf-8"

Your bug report

#17492: diff handles moved text poorly; should not need --minimal

which was filed against the diffutils package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 17492@debbugs.gnu.org.

-- 
17492: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=17492
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

------------=_1400207043-10655-1
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

Message-ID: <5375769E.6080108@cs.ucla.edu>
Date: Thu, 15 May 2014 19:23:26 -0700
From: Paul Eggert
Organization: UCLA Computer Science Department
To: Scott McPeak, 17492-done@debbugs.gnu.org
Subject: Re: [bug-diffutils] bug#17492: diff handles moved text poorly; should not need --minimal
References: <53733272.4000904@coverity.com> <53744CB5.4030208@cs.ucla.edu> <53750789.2090300@coverity.com>
In-Reply-To: <53750789.2090300@coverity.com>

Scott McPeak wrote:
> To reproduce what I reported, one only has to run the short shell script
> I included, so perhaps someone with a working dev build could try that?

I tried it and 'diff' worked for me:

$ diff -u orig.txt new.txt | wc -l
1972

So I'll mark the bug as fixed.

------------=_1400207043-10655-1
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

Message-ID: <53733272.4000904@coverity.com>
Date: Wed, 14 May 2014 02:08:02 -0700
From: Scott McPeak
Subject: diff handles moved text poorly; should not need --minimal

------------=_1400207043-10655-1--
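
A minimal sketch for re-checking the report on a local build; it is not
part of the original thread.  It rebuilds the two files from the
reproduction script in a scratch directory and counts the unified-diff
lines with the default algorithm and with -d (the short form of
--minimal in GNU diff).  Assumptions: a diff that accepts -u and -d, a
mktemp that supports -d, and an awk-based genlines substituted for the
report's expr loop; the variable names are hypothetical.  The "expected"
figure is only the size of a two-hunk unified diff with three lines of
context for this particular move, and the counts your diff prints will
depend on its version.

```shell
#!/bin/sh
# Sketch only: compare default diff output size with -d (--minimal)
# on the files described in the bug report.

# genlines FIRST LAST: print FIRST..LAST, each followed by a blank line.
# (awk replaces the report's expr loop; an editorial substitution.)
genlines() {
    awk -v first="$1" -v last="$2" \
        'BEGIN { for (n = first; n <= last; n++) printf "%d\n\n", n }'
}

tmpdir=$(mktemp -d) || exit 1

# orig.txt: [1,5000]; new.txt: block [11,499] moved near the end.
genlines 1 5000 > "$tmpdir/orig.txt"
{
    genlines 1 10
    genlines 500 4990
    genlines 11 499
    genlines 4991 5000
} > "$tmpdir/new.txt"

# diff exits with status 1 when the files differ; only the line count
# from wc is used here.  tr strips padding some wc versions add.
default_lines=$(diff -u "$tmpdir/orig.txt" "$tmpdir/new.txt" | wc -l | tr -d ' ')
minimal_lines=$(diff -u -d "$tmpdir/orig.txt" "$tmpdir/new.txt" | wc -l | tr -d ' ')

# The moved block [11,499] is 489 numbers plus 489 blank lines.
moved=$(( 2 * (499 - 11 + 1) ))
# 2 file-header lines, then 2 hunks, each with 1 hunk header,
# 3 context lines, the moved block, and 3 more context lines.
expected=$(( 2 + 2 * (1 + 3 + moved + 3) ))

rm -rf "$tmpdir"

echo "default:           $default_lines lines"
echo "minimal:           $minimal_lines lines"
echo "two-hunk estimate: $expected lines"
```

In the thread, diff 3.3 printed 14948 lines for the default run, and the
maintainer's build printed 1972, which agrees with the two-hunk estimate
computed above.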