GNU bug report logs - #69748
Does diff not work on big enough files?

Package: diffutils;

Reported by: Robert Boyer <robertstephenboyer <at> gmail.com>

Date: Tue, 12 Mar 2024 15:20:02 UTC

Severity: normal

Tags: notabug

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 69748 in the body.
You can then email your comments to 69748 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-diffutils <at> gnu.org:
bug#69748; Package diffutils. (Tue, 12 Mar 2024 15:20:02 GMT)

Acknowledgement sent to Robert Boyer <robertstephenboyer <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-diffutils <at> gnu.org. (Tue, 12 Mar 2024 15:20:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Robert Boyer <robertstephenboyer <at> gmail.com>
To: bug-diffutils <at> gnu.org
Cc: rms <at> gnu.org
Subject: Does diff not work on big enough files?
Date: Tue, 12 Mar 2024 10:17:35 -0500
I am not sure whether to call this a bug, but it is a difficulty for me.
It is simply incredible to me that diff might not work!
If one cannot count on diff to work, is there anything one can count on? Does
diff just not work on big enough files? Apparently yes.

> diff the-primes-below-10000000000.lisp billion-primes.txt
diff: the-primes-below-10000000000.lisp: Cannot allocate memory
> ls -l the-primes-below-10000000000.lisp billion-primes.txt
-rw-r----- 1 bob chronos-access  501959790 Mar 10 14:08 billion-primes.txt
-rw-r----- 1 bob chronos-access 5403267048 Mar 12 09:55 the-primes-below-10000000000.lisp
>

> free
               total        used        free      shared  buff/cache   available
Mem:         6736088     1458180     5060628       16568      217280     5277908
Swap:              0           0           0
>

I am running on a $300 Lenovo Chromebook using their default Gnu Linux.

Bob

> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 156
model name : Intel(R) Celeron(R) N4500 @ 1.10GHz
stepping : 0
microcode : 0x1
cpu MHz : 1113.600
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc
arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni
pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave rdrand hypervisor lahf_lm 3dnowprefetch
cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority
ept vpid ept_ad fsgsbase tsc_adjust smep erms rdseed smap clflushopt clwb
sha_ni xsaveopt xsavec xgetbv1 xsaves arat umip gfni rdpid movdiri
movdir64b md_clear arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad
ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid
unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs srbds mmio_stale_data
bogomips : 2227.20
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 156
model name : Intel(R) Celeron(R) N4500 @ 1.10GHz
stepping : 0
microcode : 0x1
cpu MHz : 1113.600
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc
arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni
pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave rdrand hypervisor lahf_lm 3dnowprefetch
cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority
ept vpid ept_ad fsgsbase tsc_adjust smep erms rdseed smap clflushopt clwb
sha_ni xsaveopt xsavec xgetbv1 xsaves arat umip gfni rdpid movdiri
movdir64b md_clear arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad
ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid
unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs srbds mmio_stale_data
bogomips : 2227.20
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

>

Information forwarded to bug-diffutils <at> gnu.org:
bug#69748; Package diffutils. (Tue, 12 Mar 2024 20:00:03 GMT)

Message #8 received at 69748 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Robert Boyer <robertstephenboyer <at> gmail.com>, 69748 <at> debbugs.gnu.org
Cc: rms <at> gnu.org
Subject: Re: [bug-diffutils] bug#69748: Does diff not work on big enough files?
Date: Tue, 12 Mar 2024 12:58:39 -0700
On 3/12/24 08:17, Robert Boyer wrote:

> It is simply incredible to me that diff might not work!

Like any other program, 'diff' needs enough resources to run. You're 
trying to compare a 5 GiB file on a Chromebook that has (let me guess) 4 
GiB of RAM and 32 GB of flash, most of which is occupied by ChromeOS and 
other stuff. If so, there isn't enough room for 'diff' to do its job 
with its current algorithm and you'll have to either use a bigger 
machine or solve a smaller problem.

It's possible to imagine a different 'diff' algorithm that would take 
less RAM but a lot more time, presumably because it would do more I/O to 
a temporary file. But if the available flash is small enough, even that 
wouldn't work. I doubt whether it'd be worth the time to develop the 
code for this alternative approach.
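
As a rough check of the resource argument above, the sizes quoted in the original report can be compared directly. This is a minimal sketch, not a description of how diff actually allocates memory: it assumes that diff needs at least the combined size of both inputs in RAM, and that the `free` figures are in KiB (the default unit of `free`). The helper name `fits_in_memory` is made up for illustration.

```python
KIB = 1024

# Byte sizes taken from the `ls -l` output in the report.
big_file = 5_403_267_048      # the-primes-below-10000000000.lisp
small_file = 501_959_790      # billion-primes.txt

# "available" column from the `free` output, assumed to be in KiB.
mem_available = 5_277_908 * KIB


def fits_in_memory(sizes, available):
    """Assumption: the tool must hold every input fully in RAM at once."""
    return sum(sizes) <= available


combined = big_file + small_file
shortfall = combined - mem_available

print(f"combined input size: {combined:,} bytes")
print(f"available memory:    {mem_available:,} bytes")
print(f"fits in memory:      {fits_in_memory([big_file, small_file], mem_available)}")
print(f"shortfall:           {shortfall:,} bytes")
```

Under that assumption the two inputs alone exceed the available memory by about 500 MB, before counting any per-line bookkeeping, and the machine has no swap to fall back on.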




Added tag(s) notabug. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Tue, 12 Mar 2024 20:21:01 GMT)

bug closed, send any further explanations to 69748 <at> debbugs.gnu.org and Robert Boyer <robertstephenboyer <at> gmail.com> Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Tue, 12 Mar 2024 20:21:02 GMT)

Information forwarded to bug-diffutils <at> gnu.org:
bug#69748; Package diffutils. (Tue, 12 Mar 2024 20:25:02 GMT)

Message #15 received at 69748 <at> debbugs.gnu.org:

From: Robert Boyer <robertstephenboyer <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 69748 <at> debbugs.gnu.org, rms <at> gnu.org
Subject: Re: [bug-diffutils] bug#69748: Does diff not work on big enough files?
Date: Tue, 12 Mar 2024 15:22:12 -0500
Are you trying to be funny? Or are you simply stupid?  You are much too
brilliant and famous to be stupid, so I am assuming you were trying to be
funny, a parody of the overworked bug fixer.

In an almost immediate follow up message, I already solved the problem,
and it worked perfectly for me trying to compare an old file of the primes
below a billion with a new file of the primes below ten billion.
Fortunately, this little gem of a program helped me believe that I had
computed at least the primes below a billion correctly. What a relief!

> there isn't enough room for 'diff' to do its job with its current algorithm

Probably very sadly true, so you must improve your algorithm, and here is
how. It won't hurt, I promise.

From my previous message:

Here is a better version of diff, better only in the sense that it works on
all files.  But what do I know?  Nothing.

This is Common Lisp.  I was running in SBCL.

(defun my-diff (file1 file2)
  "Compare FILE1 and FILE2 byte by byte; report the first difference."
  ;; with-open-file closes both streams on every exit path.
  (with-open-file (s1 file1 :element-type '(unsigned-byte 8))
    (with-open-file (s2 file2 :element-type '(unsigned-byte 8))
      (loop
        ;; 256 is outside the byte range, so it serves as the EOF marker.
        (let ((c1 (read-byte s1 nil 256))
              (c2 (read-byte s2 nil 256)))
          (declare (fixnum c1 c2))
          (cond ((and (eql c1 256) (eql c2 256)) (return "no difference"))
                ((eql c1 256) (return "file1 hit eof first"))
                ((eql c2 256) (return "file2 hit eof first"))
                ((/= c1 c2)
                 (return (format nil
                                 "difference at position ~s; c1 = ~s, c2 = ~s."
                                 (file-position s1) c1 c2)))))))))
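
The Lisp function above trades a full diff for a constant-memory, byte-by-byte comparison that stops at the first mismatch. A minimal Python sketch of the same idea follows. The name `first_difference` and the `chunk_size` parameter are illustrative, and the sketch assumes buffered binary streams whose `read(n)` returns fewer than `n` bytes only at end of file. Unlike diff, it reports only the first differing offset, which is roughly what the `cmp` utility shipped with diffutils does.

```python
import io


def first_difference(stream_a, stream_b, chunk_size=1 << 16):
    """Return the 0-based offset of the first differing byte.

    Returns None when both streams hold identical bytes. When one stream
    is a strict prefix of the other, returns the length of the shorter one.
    Memory use is bounded by chunk_size, not by the size of the inputs.
    """
    offset = 0
    while True:
        a = stream_a.read(chunk_size)
        b = stream_b.read(chunk_size)
        if a == b:
            if not a:            # both streams reached EOF together
                return None
            offset += len(a)
            continue
        # The chunks differ: locate the first mismatching byte.
        for i, (x, y) in enumerate(zip(a, b)):
            if x != y:
                return offset + i
        # No mismatch in the common part, so one stream ended early.
        return offset + min(len(a), len(b))


# Usage with in-memory streams; real files would be opened with open(path, "rb").
same = first_difference(io.BytesIO(b"2 3 5 7"), io.BytesIO(b"2 3 5 7"))
diff_at = first_difference(io.BytesIO(b"2 3 5 7"), io.BytesIO(b"2 3 5 9"))
print(same, diff_at)
```

Reading in chunks rather than single bytes keeps the number of read calls low while still never holding more than two chunks in memory.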

On Tue, Mar 12, 2024 at 2:58 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 3/12/24 08:17, Robert Boyer wrote:
>
> > It is simply incredible to me that diff might not work!
>
> Like any other program, 'diff' needs enough resources to run. You're
> trying to compare a 5 GiB file on a Chromebook that has (let me guess) 4
> GiB of RAM and 32 GB of flash, most of which is occupied by ChromeOS and
> other stuff. If so, there isn't enough room for 'diff' to do its job
> with its current algorithm and you'll have to either use a bigger
> machine or solve a smaller problem.
>
> It's possible to imagine a different 'diff' algorithm that would take
> less RAM but a lot more time, presumably because it would do more I/O to
> a temporary file. But if the available flash is small enough, even that
> wouldn't work. I doubt whether it'd be worth the time to develop the
> code for this alternative approach.
>

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 10 Apr 2024 11:24:09 GMT)

This bug report was last modified 1 year and 149 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.