GNU bug report logs - #34110
du: add dual-column showing apparent-size and disk-size


Package: coreutils;

Reported by: René J.V. Bertin <rjvbertin <at> gmail.com>

Date: Wed, 16 Jan 2019 22:04:02 UTC

Severity: wishlist

To reply to this bug, email your comments to 34110 AT debbugs.gnu.org.



Report forwarded to bug-coreutils <at> gnu.org:
bug#34110; Package coreutils. (Wed, 16 Jan 2019 22:04:02 GMT)

Acknowledgement sent to René J.V. Bertin <rjvbertin <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 16 Jan 2019 22:04:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: René J.V. Bertin <rjvbertin <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: feature request: dual-column du output,
 showing "real" and "on-disk" sizes (and about that "apparent-size"
 concept)
Date: Wed, 16 Jan 2019 21:13:15 +0100
Hi,

I hope feature requests are acceptable here.

Now that more and more filesystems support compression, it becomes more interesting to compare the actual file/directory (content) size with the corresponding on-disk size. Currently you have to call du twice to do that, which quickly becomes cumbersome in practice (command lines, parsing the output) and requires repeating the same I/O operations.

The code obtains both size values at the same time so it would make sense to do both calculations at the same time, and provide an option to display the regular and "apparent-size" values in column output. My guess would be that the cost of calculating both output values at the same time is negligible w.r.t. the cost of the stat() call (and thus that there's no need to complexify the code with "calculate this and/or that" conditionals).
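
As a rough illustration of the point that a single stat() result already carries both values, here is a minimal shell sketch. It is not how du is implemented and not a proposed patch: both_sizes is a hypothetical helper name, and the sketch assumes GNU stat, whose format sequences %s, %b and %B give the byte size, the number of allocated blocks, and the size of each such block. Unlike du it does not recurse into directories, and file names containing newlines would confuse it.

```shell
#!/bin/sh
# Sketch only: print "apparent bytes <TAB> allocated bytes <TAB> name" per file.
# Assumes GNU stat: %s = st_size, %b = allocated blocks, %B = bytes per %b block.
# both_sizes is a hypothetical helper name, not an existing du option.
both_sizes() {
  stat -c '%s %b %B %n' -- "$@" |
  while read -r size blocks bsize name; do
    # One stat() result yields both columns; no second pass over the file.
    printf '%s\t%s\t%s\n' "$size" "$((blocks * bsize))" "$name"
  done
}
```

Calling it as "both_sizes a b c" prints one line per file, with the content size in the first column and the allocated size in the second.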

The option could be called --both, --columns (-C) or --two (-T).

I'd also reconsider the "apparent-size" term as I think it is confusing and ambiguous. Consider this, taken from a ZFS dataset with gzip-9 compression (and copies=1; du v8.30):

%> du -hcs /Volumes/nif64/tmp/.npm/ ; du -hcs --apparent-size /Volumes/nif64/tmp/.npm/
340M    /Volumes/nif64/tmp/.npm/
180M    /Volumes/nif64/tmp/.npm/

Same folder on btrfs (mounted with compress=lzo):
%> du -hcs /mnt/.npm/ ; du -hcs --apparent-size  /mnt/.npm
198M    /mnt/.npm/
181M    /mnt/.npm

According to `du --help`, the apparent-size option reports a size that is not the actual disk usage. The numbers above seem to show the opposite.
If anything, I find the concept of "apparent size" more appropriate to the size a file occupies on the storage medium because ultimately that storage device will not give you more than "struct stat : st_size" bytes for uncompressed filesystems. 
Another way to say it: with "--apparent-size", du returns the actual file size; without, it returns how large the file appears to be (judging from its disk footprint).

For comparison; same folder,  on Mac with HFS+
%> du -hcs /Volumes/VMs/.npm ; du -hcs --apparent-size /Volumes/VMs/.npm
198M    /Volumes/VMs/.npm
181M    /Volumes/VMs/.npm

Idem, with HFS+ compression (zip-9)
%> du -hcs /Volumes/VMs/.npm ; du -hcs --apparent-size /Volumes/VMs/.npm
115M    /Volumes/VMs/.npm
148M    /Volumes/VMs/.npm

Thoughts?

Thanks,
R.





Information forwarded to bug-coreutils <at> gnu.org:
bug#34110; Package coreutils. (Wed, 16 Jan 2019 23:08:01 GMT)

Message #8 received at 34110 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: René J.V. Bertin <rjvbertin <at> gmail.com>,
 34110 <at> debbugs.gnu.org
Subject: Re: bug#34110: feature request: dual-column du output, showing "real"
 and "on-disk" sizes (and about that "apparent-size" concept)
Date: Wed, 16 Jan 2019 16:06:50 -0700
Hello,

I'll address only the "apparent-size" issue (not the two-columns, or 
compressed file-systems):

On 2019-01-16 1:13 p.m., René J.V. Bertin wrote:

> According to `du --help`, the apparent-size option reports a size that is not the actual disk usage. The numbers above seem to show the opposite.
> If anything, I find the concept of "apparent size" more appropriate to the size a file occupies on the storage medium because ultimately that storage device will not give you more than "struct stat : st_size" bytes for uncompressed filesystems.
> Another way to say it: with "--apparent-size", du returns the actual file size; without, it returns how large the file appears to be (judging from its disk footprint).

"apparent-size" shows how much content/data the file has.
without "apparent-size" du shows the amount of storage consumed (or
"wasted"?) on the storage medium (accounting for sparse file holes, though
I'm not sure about compression).

To illustrate, create three files with specific sizes:

  $ head --bytes=1700 /dev/zero > a
  $ head --bytes=4097 /dev/zero > b
  $ truncate --size=1050000 c        # will be a sparse file

These are their sizes, as in the amount of bytes they contain:

  $ ls -log
  total 12
  -rw-r--r-- 1    1700 Jan 16 15:36 a
  -rw-r--r-- 1    4097 Jan 16 15:36 b
  -rw-r--r-- 1 1050000 Jan 16 15:37 c


These are their "apparent-sizes", rounded up to the nearest
1K block:

  $ du --apparent-size a b c
  2     a
  5     b
  1026  c

e.g. file "a" is 1700 bytes, rounded-up to 2K, and "du --apparent-size"
shows "2".

Using "--apparent-size --block-size=1" (and its equivalent, "--bytes")
will show the exact sizes:

  $ du --apparent-size --block-size=1 a b c
  1700     a
  4097     b
  1050000  c

Without "--apparent-size", du shows how much storage space is actually 
used/wasted/consumed on the storage medium by the files:

  $ du a b c
  4    a
  8    b
  0    c

How are these numbers calculated?

The simplest case is file "c" - it is completely sparse - so despite
logically containing 1,050,000 zeros, on the actual storage medium it 
consumes zero data blocks (ignoring inode blocks and the like).

File "a" has 1,700 bytes of data.
On my filesystem the basic block size is 4096, as shown by "stat -f":

  $ stat -f /
    File: "/"
      ID: 5a2cade519bada6a Namelen: 255     Type: ext2/ext3
->Block size: 4096       Fundamental block size: 4096    <-----
  Blocks: Total: 27559017   Free: 18845977   Available: 17435289
  Inodes: Total: 7036928    Free: 6496730

Therefore, any file from size 1 to size 4096 will consume exactly one
disk block. On most common filesystems, disk blocks cannot be shared
between files, meaning that this block is fully consumed.

That's why for file "a" du shows "4" - meaning 4K bytes (exactly one
block) is consumed on the storage medium by this file.

Similarly for file "b" - its size is 4097, which is 1 byte more than one
filesystem block. Hence, file "b" consumes 2 blocks, coming up to 8K.
du then shows "8" for file "b".
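
The two rounding rules described above (display rounding to 1K units, and allocation in whole filesystem blocks) can be sketched with plain shell arithmetic. This is only an illustration: ceil_units and alloc_kib are hypothetical helper names, the 4096-byte block size is the one from the "stat -f" output above, and the model knows nothing about sparse files, so it does not reproduce the 0 that du shows for file "c".

```shell
#!/bin/sh
# ceil_units BYTES UNIT: number of UNIT-sized blocks needed to hold BYTES.
ceil_units() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# alloc_kib BYTES FS_BLOCK: KiB consumed when only whole FS_BLOCK-byte
# blocks can be allocated (assumes no sparse holes and no block sharing).
alloc_kib() {
  echo $(( $(ceil_units "$1" "$2") * $2 / 1024 ))
}

ceil_units 1700 1024      # prints 2    (du --apparent-size a)
ceil_units 4097 1024      # prints 5    (du --apparent-size b)
ceil_units 1050000 1024   # prints 1026 (du --apparent-size c)
alloc_kib 1700 4096       # prints 4    (du a)
alloc_kib 4097 4096       # prints 8    (du b)
```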


Now to your examples:

> %> du -hcs /Volumes/nif64/tmp/.npm/ ; du -hcs --apparent-size /Volumes/nif64/tmp/.npm/
> 340M    /Volumes/nif64/tmp/.npm/
> 180M    /Volumes/nif64/tmp/.npm/
> Same folder on btrfs (mounted with compress=lzo):
> %> du -hcs /mnt/.npm/ ; du -hcs --apparent-size  /mnt/.npm
> 198M    /mnt/.npm/
> 181M    /mnt/.npm

In both cases, "du --apparent-size" shows about 180MB of actual data 
(181MB in the second example). That is the amount of actual content
(number of total bytes in these files).

In the first case, these files consume 340MB of space on your disk.
In the second case, these files consume 198MB of space on your disk.
The reason they consume MORE than their actual data is explained above
with the file-system blocks.

This suggests to me that compression is not accounted for in these
values. If it were, then the consumed size (without "--apparent-size")
should've been less than the actual size (with "--apparent-size").

A quick on-line search shows that btrfs's default block size is 16K,
while ZFS's default record-size is 128KB. That might explain
why a similar amount of data (and I assume, a similar number of files and
sizes) consumes more disk space on ZFS (Could be wrong, though, comments
are welcomed).


I hope this helps to clarify "apparent-size".

I'll leave it to others to comment on how compressed file systems
come into play with du.

regards,
 - assaf







Information forwarded to bug-coreutils <at> gnu.org:
bug#34110; Package coreutils. (Thu, 17 Jan 2019 00:29:01 GMT)

Message #11 received at 34110 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: René J.V. Bertin <rjvbertin <at> gmail.com>,
 34110 <at> debbugs.gnu.org
Subject: Re: bug#34110: feature request: dual-column du output, showing "real"
 and "on-disk" sizes (and about that "apparent-size" concept)
Date: Wed, 16 Jan 2019 16:28:14 -0800
I like the idea of two columns at once.

> with "--apparent-size", du returns the actual file size; without, it returns how large the file appears to be (judging from its disk footprint).

The "apparent" size is the size that "ls -l" outputs, and is the size 
that traditional I/O operations like 'read' and 'write' deal with, 
regardless of the underlying implementation (where the size might be 
smaller or larger than the "apparent" size). In contrast the "disk 
usage" size is whatever the filesystem tells us it is. I wouldn't call 
either size the "actual" size these days, as even the disk usage (or 
"disk footprint") might be virtual blocks stored in a lower-level 
compressed device, and there's no way "du" can find out how much of the 
lower-level device is being used.





Information forwarded to bug-coreutils <at> gnu.org:
bug#34110; Package coreutils. (Thu, 17 Jan 2019 10:14:02 GMT)

Message #14 received at 34110 <at> debbugs.gnu.org:

From: René J.V. Bertin <rjvbertin <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 34110 <at> debbugs.gnu.org
Subject: Re: bug#34110: feature request: dual-column du output,
 showing "real" and "on-disk" sizes (and about that "apparent-size"
 concept)
Date: Thu, 17 Jan 2019 11:13:11 +0100
On Wednesday January 16 2019 16:06:50 Assaf Gordon wrote:

Hello,

Yes, I used the exact same directory in all comparisons. It's a nodejs cache (or whatever) directory as you may have guessed; I picked it because it's a good example of the sort of directory found these days which can create considerable overhead. Small enough that it'd tend to get dismissed as insignificant, but containing a large number of files (almost 8000 in my case), most of them tiny.

>I hope this helps to clarify "apparent-size".

Yes and no :) I understand what "apparent-size" does (and have dug through the code looking for ideas how to do similar things in one of my own apps).

My whole point is that there might be a better name. I know one should distinguish every-day language and technical terms but if the latter start to appear (pun intended) like the former (and lack a shorthand) then they'd best be chosen such that they don't require thinking about their interpretation.

Paul's comment about not being able to know what happens underneath only makes this argument stronger IMHO. On the one hand, du can only report how big an item would appear to be on disk (based on what stat() reports). In addition, how would it handle knowledge about the number of disks that a given file is written to? On the other hand, the actual content size is a given that shouldn't change and that is not subject to any existential questions. (Though as my examples show, this isn't necessarily true when du'ing directories, and esp. so for HFS+ with compression.)

I realise that you cannot really call the content size observable "real size" when reporting from a disk-usage viewpoint, but "content size" (--content-size, -C) should be clear enough? "Estimated on-disk size" would be good enough as a header for the other observable (an estimate can be 100% accurate after all).

Cheers,
R.




Information forwarded to bug-coreutils <at> gnu.org:
bug#34110; Package coreutils. (Fri, 18 Jan 2019 06:44:01 GMT)

Message #17 received at 34110 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: René J.V. Bertin <rjvbertin <at> gmail.com>
Cc: 34110 <at> debbugs.gnu.org
Subject: Re: bug#34110: feature request: dual-column du output, showing "real"
 and "on-disk" sizes (and about that "apparent-size" concept)
Date: Thu, 17 Jan 2019 23:43:39 -0700
severity 34110 wishlist
retitle 34110 du: add dual-column showing apparent-size and disk-size
stop

Hello,

On 2019-01-17 3:13 a.m., René J.V. Bertin wrote:
> On Wednesday January 16 2019 16:06:50 Assaf Gordon wrote:
> 
>> I hope this helps to clarify "apparent-size".
> 
> Yes and no :) I understand what "apparent-size" does [....] 
> My whole point is that there might be a better name. 

The parameter name "--apparent-size" is not likely to be changed.
It has been named so for about 16 years (since 'fileutils 4.5.8'
which is even before 'coreutils' was created as a unified package).

Changing it would break existing scripts and user expectations.

> I realise that you cannot really call the content size observable "real size" when reporting from a disk-usage viewpoint, but "content size" (--content-size, -C) should be clear enough?

Creating a second alias to "--apparent-size" is possible, but I'm not
sure it's warranted.

---

I think the discussion about "--apparent-size" is mostly concluded,
but the idea to have two-columns is an interesting feature request.

I'm marking this as a "wish list" item.
Concrete patches are welcomed.

regards,
 - assaf








Changed bug title to 'du: add dual-column showing apparent-size and disk-size' from 'feature request: dual-column du output, showing "real" and "on-disk" sizes (and about that "apparent-size" concept)' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Fri, 18 Jan 2019 06:44:02 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#34110; Package coreutils. (Fri, 18 Jan 2019 09:57:02 GMT)

Message #22 received at 34110 <at> debbugs.gnu.org:

From: René J.V. Bertin <rjvbertin <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 34110 <at> debbugs.gnu.org
Subject: Re: bug#34110: feature request: dual-column du output,
 showing "real" and "on-disk" sizes (and about that "apparent-size"
 concept)
Date: Fri, 18 Jan 2019 10:56:07 +0100
On Thursday January 17 2019 23:43:39 Assaf Gordon wrote:
>The parameter name "--apparent-size" is not likely to be changed.

I was thinking of making it an alias for a more aptly named parameter (long) before (possibly) phasing out the current name. 

>It has been named so for about 16 years (since 'fileutils 4.5.8'

A lot has happened on the filesystem front since that time. Just saying :)

>Concrete patches are welcomed.

I bet. We'll see who finds the time for that (the code isn't the most welcoming to dive into I've ever seen ;))

Cheers,
R.




Information forwarded to bug-coreutils <at> gnu.org:
bug#34110; Package coreutils. (Fri, 18 Jan 2019 10:02:02 GMT)

Message #25 received at 34110 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: René J.V. Bertin <rjvbertin <at> gmail.com>
Cc: 34110 <at> debbugs.gnu.org
Subject: Re: bug#34110: feature request: dual-column du output, showing "real"
 and "on-disk" sizes (and about that "apparent-size" concept)
Date: Fri, 18 Jan 2019 03:01:45 -0700
Hello,

On 2019-01-18 2:56 a.m., René J.V. Bertin wrote:
> 
> the code isn't the most welcoming to dive into I've ever seen ;)

Two online resources that might help in exploring the code:

  http://www.maizure.org/projects/decoded-gnu-coreutils/

  https://opengrok.housegordon.com/source/xref/coreutils/

regards,
 - assaf




Information forwarded to bug-coreutils <at> gnu.org:
bug#34110; Package coreutils. (Tue, 22 Jan 2019 22:41:02 GMT)

Message #28 received at 34110 <at> debbugs.gnu.org:

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: René J.V. Bertin <rjvbertin <at> gmail.com>,
 Assaf Gordon <assafgordon <at> gmail.com>
Cc: 34110 <at> debbugs.gnu.org
Subject: Re: bug#34110: feature request: dual-column du output, showing "real"
 and "on-disk" sizes (and about that "apparent-size" concept)
Date: Tue, 22 Jan 2019 23:40:29 +0100
On 1/17/19 11:13 AM, René J.V. Bertin wrote:
> I realise that you cannot really call the content size observable "real size"
> when reporting from a disk-usage viewpoint, but "content size"
> (--content-size, -C) should be clear enough?

This sounds to me as if you wanted 'du' to read() the content of each file
to get the 'correct' statistics.  That is more in the domain of wc(1).

Have a nice day,
Berny




Information forwarded to bug-coreutils <at> gnu.org:
bug#34110; Package coreutils. (Tue, 22 Jan 2019 23:21:02 GMT)

Message #31 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#34110: feature request: dual-column du output, showing "real"
 and "on-disk" sizes (and about that "apparent-size" concept)
Date: Tue, 22 Jan 2019 15:20:43 -0800
On 1/22/19 2:40 PM, Bernhard Voelker wrote:
> This sounds to me as if you wanted 'du' to read() the content of each file
> to get the 'correct' statistics.  That is more in the domain of wc(1).

du already has an --apparent-size option that gives the same size that 
'read' would give. As I understand it, this part of the request was to 
change the (arguably confusing) name of this option to a different (and 
also arguably confusing :-) name. As the option name has been that way 
for quite some time and the proposed name is not that much less 
confusing than the old, I think we'll stand pat.





This bug report was last modified 6 years and 150 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.