Package: coreutils
Reported by: TJ Luoma <luomat <at> gmail.com>
Date: Sun, 15 Dec 2019 08:42:02 UTC
Severity: normal
Tags: notabug
Done: Bernhard Voelker <mail <at> bernhard-voelker.de>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 38621 in the body.
You can then email your comments to 38621 AT debbugs.gnu.org in the normal way.
bug-coreutils <at> gnu.org: bug#38621; Package coreutils. (Sun, 15 Dec 2019 08:42:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
From: TJ Luoma <luomat <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: gdu showing different sizes
Date: Sun, 15 Dec 2019 00:15:05 -0500

I ended up with two versions of the same file,
'StreamDeck-4.4.2.12189.pkg' and 'Stream_Deck_4.4.2.12189.pkg', and
wanted to check to see if they were the same file.

I checked the size with `gdu` like so:

% /usr/local/bin/gdu --si -s *pkg
101M StreamDeck-4.4.2.12189.pkg
102M Stream_Deck_4.4.2.12189.pkg

Which led me to think they were different files / sizes. But when I
used `ls -l` I was surprised to see this:

% command ls -l *pkg
-rw-r--r--  1 tjluoma staff 88885047 Dec 15 00:00 StreamDeck-4.4.2.12189.pkg
-rw-r--r--@ 1 tjluoma staff 88885047 Dec 15 00:02 Stream_Deck_4.4.2.12189.pkg

So they _are_ the same size. Are they the same file? I used `md5` to check:

% command md5 -r *pkg
98ac563a36386ca3aa87f62893302b4f StreamDeck-4.4.2.12189.pkg
98ac563a36386ca3aa87f62893302b4f Stream_Deck_4.4.2.12189.pkg

OK, so these are exactly the same file. So… why did `gdu` tell me they
are different sizes?

% gdu --version
du (GNU coreutils) 8.31
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Torbjorn Granlund, David MacKenzie, Paul Eggert,
and Jim Meyering.

I'm using Mac OS X 10.14.6 (18G2022) with `coreutils` installed via `brew`.

Any help would be appreciated.

Tj
Message #8 received at 38621 <at> debbugs.gnu.org:
From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: TJ Luoma <luomat <at> gmail.com>, 38621 <at> debbugs.gnu.org
Subject: Re: bug#38621: gdu showing different sizes
Date: Sun, 15 Dec 2019 22:19:29 +0100

tag 38621 notabug
close 38621
stop

On 2019-12-15 06:15, TJ Luoma wrote:
> [...]
> OK, so these are exactly the same file. So… why did `gdu` tell me they
> are different sizes?
> [...]

This is a "sparse" file, i.e., a file with longer sequences of zeroes
somewhere in between, which can be stored more efficiently on disk.
Any application reading the data will get the correct number of zeroes,
while some disk space is saved.

E.g. the following creates a 300M file in which the first 100M and the
last 100M hold random data, and the 100M in between is a "hole":

  # Write the 1st 100M (as usual).
  $ dd bs=1M count=100 if=/dev/urandom of=f
  100+0 records in
  100+0 records out
  104857600 bytes (105 MB, 100 MiB) copied, 0.466356 s, 225 MB/s

  # Write another 100M, but starting at a position of 200M,
  # thus leaving zeroes in between.
  $ dd bs=1M seek=200 count=100 if=/dev/urandom of=f
  100+0 records in
  100+0 records out
  104857600 bytes (105 MB, 100 MiB) copied, 0.462072 s, 227 MB/s

  $ ls -logh f
  -rw-r--r-- 1 300M Dec 15 18:17 f

  $ du -h f                    # shows the space occupied on disk.
  200M    f

  $ du --apparent-size -h f    # shows the size applications would read.
  300M    f

See the documentation of 'cp' and 'du':
  https://www.gnu.org/software/coreutils/cp  (the --sparse option)
  https://www.gnu.org/software/coreutils/du  (the --apparent-size option)

As this is not a bug in du(1), I'm marking it as such and closing the
ticket in our bug tracker. The discussion can continue, of course.

Have a nice day,
Berny
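To check the two things the report mixes up (whether the content is the same, and how much disk each copy occupies), a minimal sketch along the following lines may help. The function names `content_bytes`, `disk_kib` and `compare_files` are made up for this illustration and are not part of coreutils; only `cmp`, `wc`, `du -k` and `awk` are assumed to be available.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; the names are not from coreutils.

# Length of the content in bytes, as any reading program would see it.
content_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Space allocated on disk, in 1024-byte units ('du -k' is specified by POSIX).
disk_kib() {
    du -k "$1" | awk '{print $1}'
}

# Say whether two files have identical content, then print both numbers
# for each file so that the two notions of "size" can be compared.
compare_files() {
    if cmp -s "$1" "$2"; then
        echo "content: identical"
    else
        echo "content: different"
    fi
    for f in "$1" "$2"; do
        echo "$f: $(content_bytes "$f") bytes of content, $(disk_kib "$f") KiB on disk"
    done
}
```

Called as `compare_files StreamDeck-4.4.2.12189.pkg Stream_Deck_4.4.2.12189.pkg`, this would report identical content when `cmp` finds no difference, while the "KiB on disk" figures are free to differ from file to file.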
Message #15 received at 38621 <at> debbugs.gnu.org:
From: TJ Luoma <luomat <at> gmail.com>
To: Bernhard Voelker <mail <at> bernhard-voelker.de>
Cc: 38621 <at> debbugs.gnu.org
Subject: Re: bug#38621: gdu showing different sizes
Date: Mon, 16 Dec 2019 01:25:40 -0500

I sort of followed most of the technical part of that but I still don’t
understand why it’s not a bug to show different information about two
identical files.

Which may indicate that I didn’t understand the technical part very well.

As an end user, it’s hard to understand how that inconsistency isn’t both
undesirable and a bug.

I could maybe see if they were two files with the same byte-count but
different composition that made the calculations off by 1, but this is an
identical file and it’s showing up with two different sizes, in a tool
meant to report sizes.

That just seems “obviously” wrong even if it’s somehow technically
explainable.

TjL
Message #18 received at 38621 <at> debbugs.gnu.org:
From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: TJ Luoma <luomat <at> gmail.com>
Cc: 38621 <at> debbugs.gnu.org
Subject: Re: bug#38621: gdu showing different sizes
Date: Mon, 16 Dec 2019 08:47:10 +0100

On 2019-12-16 07:25, TJ Luoma wrote:
> I sort of followed most of the technical part of that but I still don’t
> understand why it’s not a bug to show different information about two
> identical files.
>
> Which may indicate that I didn’t understand the technical part very well.
>
> As an end user, it’s hard to understand how that inconsistency isn’t both
> undesirable and a bug.
>
> I could maybe see if they were two files with the same byte-count but
> different composition that made the calculations off by 1, but this is an
> identical file and it’s showing up with two different sizes, in a tool
> meant to report sizes.
>
> That just seems “obviously” wrong even if it’s somehow technically
> explainable.

Thanks for following up on this for further clarification.

I think the problem is the word "size":
while 'ls' and 'du --apparent-size' show the length of the content of
a file, 'du' (without '--apparent-size') reports the space the file
needs on disk.

  $ du --help | sed 3q
  Usage: du [OPTION]... [FILE]...
    or:  du [OPTION]... --files0-from=F
  Summarize disk usage of the set of FILEs, recursively for directories.
____________^^^^^^^^^^

One reason for those sizes to differ is "holes". As an extreme case,
one can create a 4 Terabyte file (just NULs) on a filesystem which is
much smaller than that:

  # Filesystem size.
  $ df -h --out=size,target .
   Size Mounted on
   591G /mnt

  # Create a NUL-only file of size 4 Terabyte.
  $ truncate -s4T f2

  # 'ls' shows the 4T of file size.
  $ ls -logh f2
  -rw-r--r-- 1 4.0T Dec 16 08:36 f2

  # 'du' shows that the file does not even require any disk usage.
  $ du -h f2
  0       f2

  # ... but with '--apparent-size' reports the real (content) size.
  $ du -h --apparent-size f2
  4.0T    f2

  # Any program will see the 4T content transparently.
  $ wc -c < f2
  4398046511104

In your case, the file was a mixture of regular data and holes,
and 'cp' (without --sparse=always) tried to automatically determine
whether the target file should have holes or not (see 'man cp').
Therefore, your 2 files had a different disk usage, but the net length
of the content is identical, of course.

Have a nice day,
Berny
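The comparison between content length and disk usage described in this message can be turned into a rough heuristic. The sketch below is only an illustration: `looks_sparse` and `file_looks_sparse` are hypothetical names, and the test is approximate, because block rounding, filesystem metadata, compression or delayed allocation can all move the disk-usage figure independently of holes.

```shell
#!/bin/sh
# Hypothetical heuristic, for illustration only.

# looks_sparse CONTENT_BYTES DISK_KIB
# Succeeds when fewer bytes are allocated on disk than the content is long,
# which is what one would expect for a file containing holes.
looks_sparse() {
    [ $(( $2 * 1024 )) -lt "$1" ]
}

# Same check for a file on disk, using 'wc -c' for the content length
# and POSIX 'du -k' for the allocated space.
file_looks_sparse() {
    looks_sparse "$(wc -c < "$1" | tr -d ' ')" "$(du -k "$1" | awk '{print $1}')"
}
```

With the numbers from the thread, `looks_sparse 314572800 204800` (300M of content, 200M on disk) and `looks_sparse 4398046511104 0` (the 4T file) both succeed.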
Message #21 received at 38621 <at> debbugs.gnu.org:
From: TJ Luoma <luomat <at> gmail.com>
To: Bernhard Voelker <mail <at> bernhard-voelker.de>
Cc: 38621 <at> debbugs.gnu.org
Subject: Re: bug#38621: gdu showing different sizes
Date: Mon, 16 Dec 2019 14:43:37 -0500

AHA! Ok, now I understand a little better. I have seen the difference
between "size" and "size on disk" and did not realize that applied
here.

I'm still not 100% clear on _why_ two "identical" files would have
different results for "size on disk" (it _seems_ like those should be
identical) but I suspect that the answer is probably of a technical
nature that would be "over my head" so to speak, and truthfully, all I
really need to know is "sometimes that happens" rather than
understanding the technical details of why.

I appreciate you taking the time to educate me further about this.

Cheers

Tj
Message #24 received at 38621 <at> debbugs.gnu.org:
From: Bob Proulx <bob <at> proulx.com>
To: TJ Luoma <luomat <at> gmail.com>
Cc: 38621 <at> debbugs.gnu.org
Subject: Re: bug#38621: gdu showing different sizes
Date: Mon, 16 Dec 2019 13:51:38 -0700

TJ Luoma wrote:
> AHA! Ok, now I understand a little better. I have seen the difference
> between "size" and "size on disk" and did not realize that applied
> here.
>
> I'm still not 100% clear on _why_ two "identical" files would have
> different results for "size on disk" (it _seems_ like those should be
> identical) but I suspect that the answer is probably of a technical
> nature that would be "over my head" so to speak, and truthfully, all I
> really need to know is "sometimes that happens" rather than
> understanding the technical details of why.

I think the confusion began right at the start, because the commands
are named to show that they were intended to report different things.

  'du' is named for showing disk usage

  'ls' is named for listing files

And those are rather different things! Let's dig into the details.

The 'ls' documentation for the long format says:

  ‘-l’
  ‘--format=long’
  ‘--format=verbose’
       In addition to the name of each file, print the file type, file
       mode bits, number of hard links, owner name, group name, size, and
       timestamp (*note Formatting file timestamps::), normally the
       modification timestamp (the mtime, *note File timestamps::).  Print
       question marks for information that cannot be determined.

So we know that ls lists the size of the file. But let me specifically
say that this is tagged to the *file*. It's file centric. There is
also the -s option.

  ‘-s’
  ‘--size’
       Print the disk allocation of each file to the left of the file
       name.  This is the amount of disk space used by the file, which is
       usually a bit more than the file’s size, but it can be less if the
       file has holes.

This displays how much disk space the file consumes instead of the
size of the file. The two are different things.

And then the 'du' documentation says:

  ‘du’ reports the amount of disk space used by the set of specified files

And so du reports the disk space used by the file. But as we know, the
amount of disk used depends upon the file system holding the file.
Different file systems have different storage methods, and the amount
of disk space consumed by a file will differ between them and is only
loosely related to the size of the file.

Disk space consumed to hold the file could be larger or smaller than
the file size. In particular, if the file is sparse then there are
"holes" in the middle that are all zero data and do not need to be
stored, thereby saving the space. In that case it will be smaller. Or,
since files are stored in blocks, the final block will have some
fragment of space at the end that is past the end of the file but too
small to be used for other files. In that case it will be larger.

Therefore it is not surprising that the numbers displayed for disk
usage are not the same as the file content size. They would really only
line up exactly if the file content size is a multiple of the file
system storage block size and every block is fully represented on
disk. Otherwise they will always be at least somewhat different in
number.

As long as I am here I should mention 'df', which shows disk free space
information. One sometimes thinks that adding up the file content
sizes should add up to the du disk usage size, but it doesn't. And one
sometimes thinks that adding up all of the du disk usage sizes should
add up to the df disk free sizes, but it doesn't. That is due to a
similar reason. File systems reserve a min-free amount of space for
superuser level processes to ensure continued operation even if the
disk is filling up from non-privileged processes. Also, file system
efficiency and performance drop dramatically as the file system fills
up. Therefore the file system reports space with the min-free reserved
space in mind. And once again this is different on different file
systems.

But let me return to your first bit of information, the ls long
listing of the files. Your version of ls gave an indication that
something was different about the second file.

> % command ls -l *pkg
> -rw-r--r--  1 tjluoma staff 88885047 Dec 15 00:00 StreamDeck-4.4.2.12189.pkg
> -rw-r--r--@ 1 tjluoma staff 88885047 Dec 15 00:02 Stream_Deck_4.4.2.12189.pkg

See that '@' in that position? The GNU ls coreutils 8.30
documentation I am looking at says:

     Following the file mode bits is a single character that specifies
     whether an alternate access method such as an access control list
     applies to the file.  When the character following the file mode
     bits is a space, there is no alternate access method.  When it is a
     printing character, then there is such a method.

     GNU ‘ls’ uses a ‘.’ character to indicate a file with a security
     context, but no other alternate access method.

     A file with any other combination of alternate access methods is
     marked with a ‘+’ character.

I did not see anywhere that documented what an '@' means. Therefore
it is likely something applied in a downstream patch, probably a
software distribution specific modification. But I don't really know.
I live under a rock and don't get out much. It likely means that the
second file, listed with the file mode '@', is not stored on disk in a
typical way. That's probably the first clue that it is different. But
actually I do not know, as I do not see files listed that way here.

Bob
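The block-rounding effect described in this message (a file with no holes occupying a whole number of blocks) can be worked through with a little arithmetic. The following is a sketch under assumptions that are not taken from the thread: a block size of 4096 bytes, no holes, and no filesystem metadata counted. The function name `allocated_bytes` is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.

# allocated_bytes SIZE [BLOCK]
# Round SIZE up to the next multiple of BLOCK (default 4096, an assumed
# block size). This models a file without holes whose last block is only
# partly filled; real filesystems may allocate differently.
allocated_bytes() {
    size=$1
    block=${2:-4096}
    echo $(( (size + block - 1) / block * block ))
}
```

For the 88885047-byte files from the report, `allocated_bytes 88885047 4096` prints 88887296: 21701 blocks of 4096 bytes, with 2249 bytes of the last block unused. The two numbers only coincide when the size is an exact multiple of the block size.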
Message #27 received at 38621 <at> debbugs.gnu.org:
From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: TJ Luoma <luomat <at> gmail.com>
Cc: 38621 <at> debbugs.gnu.org
Subject: Re: bug#38621: gdu showing different sizes
Date: Tue, 17 Dec 2019 00:38:11 +0100

On 2019-12-16 20:43, TJ Luoma wrote:
> AHA! Ok, now I understand a little better. I have seen the difference
> between "size" and "size on disk" and did not realize that applied
> here.

Thanks for confirming.

> I'm still not 100% clear on _why_ two "identical" files would have
> different results for "size on disk" (it _seems_ like those should be
> identical) but I suspect that the answer is probably of a technical
> nature that would be "over my head" so to speak, and truthfully, all I
> really need to know is "sometimes that happens" rather than
> understanding the technical details of why.

Actually the difference is a matter of choice, i.e., how the user wants
to save the file (obviously, most programs come with a certain default
preference).

Suppose one writes a file with an "A" at the beginning, then e.g.
1,000,000 NUL characters, and then a "B".

Then the storing algorithm may decide to either explicitly write all
NULs separately (here displayed as '.') to disk; e.g. 'cp --sparse=never'
would do so:

  - write "A",
  - write 1,000,000 times a NUL,
  - write "B".

or to try to save some disk space by writing it as a "sparse" file;
e.g. 'cp --sparse=always' would (try to) do so:

  - write an "A",
  - then tell the filesystem that there are 1,000,000 NULs (which takes
    just a few bytes physically),
  - write a "B".

The latter method needs support from both the tool and the file system
where the file is stored.

Or with your words: "sometimes that happens". ;-)

> I appreciate you taking the time to educate me further about this.

No worries. If there's one user who got confused, then there is a
chance that others might fall into the same issue as well.
Therefore, if you think we could improve something, e.g. a clarifying
word in the documentation, then this would help us all.

Thanks & have a nice day,
Berny
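The two ways of storing "A", 1,000,000 NULs and "B" can be tried out with a short sketch like the one below. It uses `dd` rather than `cp`, and the function names `write_dense` and `write_with_gap` are invented for this illustration. Whether the skipped range really ends up as a hole depends on the file system; the sketch only guarantees that both files read back identically.

```shell
#!/bin/sh
# Hypothetical illustration of the two write strategies.

# Dense: every NUL byte is written explicitly.
write_dense() {
    {
        printf 'A'
        dd if=/dev/zero bs=1000 count=1000 2>/dev/null   # 1,000,000 NULs
        printf 'B'
    } > "$1"
}

# With a gap: write "A", seek over the 1,000,000 bytes without writing
# them, then write "B". The file system may or may not store this as a hole.
write_with_gap() {
    printf 'A' > "$1"
    printf 'B' | dd of="$1" bs=1 seek=1000001 conv=notrunc 2>/dev/null
}
```

After `write_dense d` and `write_with_gap g`, `cmp d g` should report no difference and `wc -c` should show 1000002 bytes for both, while `du -k d g` may show different disk usage on file systems that support holes.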
Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 14 Jan 2020 12:24:05 GMT)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.