This topic came up again on the Austin Group call today, with no good
resolution yet.

On 12/18/2011 03:03 PM, Paul Eggert wrote:
> Eric Blake's Option 1 does not appear to be tenable, as du
> traditionally preserved hashes of duplicate files across all
> of its operands.  7th Edition Unix 'du' did that, and (as
> Jilles Tjoelker pointed out) so do at least two current 'du'
> implementations, namely, FreeBSD and GNU.
>
> The idea behind Eric's Option 2 is better, but its wording
> is unclear partly because of another issue Jilles raised:
> whether a file's disk space should be counted multiple times
> if the file occurs multiple times and its link count is 1.
> For example:
>
>    mkdir d
>    cd d
>    cp /bin/sh w
>    cp w y
>    ln y ../y
>    ln -s w x
>    ln -s y z
>    du -aL
>
> This analyzes a directory with two regular files, 'w' and
> 'y'.  GNU and Solaris du count these files once each, with
> an accurate sum of non-symlink disk usage under the current
> directory.  But w's link count is 1 so FreeBSD counts 'w'
> twice, thus overcounting disk usage.
>
> The current POSIX wording does not say what to do for this
> example, but the intent is to avoid overcounting disk usage,
> and the GNU and Solaris behavior supports this intent better.
> (The 7th Edition Unix behavior agrees with FreeBSD, but this
> predates symbolic links so the behavior is now dubious.)

One of the points made is that the standard currently requires elision
only for files with link counts > 1.  An interesting example with
FreeBSD du:

$ echo > a
$ du -a a a
2       a
2       a
$ ln a b
$ du -a a a
2       a
$

That is, the second argument was elided when the inode for 'a' is found
in the hash, which means the hash is preserved across arguments; but the
inode for 'a' is only put in the hash if the link count is > 1.

>
> Given all the above, the standard's wording could be
> improved in several different ways, all elaborations of
> Option 2.
> Here are two possibilities:
>
>    Option 2A - require that files be hashed among all
>    operands, and that disk usage be counted at most once.
>
>       Change line 84170 [du DESCRIPTION] from:
>
>          Files with multiple links shall be counted and written
>          for only one entry.
>
>       to:
>
>          A file that occurs multiple times shall be counted and
>          written for only one entry, even if the occurrences
>          are under different file operands.
>
>    Option 2B - leave unspecified whether files are hashed
>    among all operands, and leave unspecified whether disk
>    usage is counted multiple times for files whose link
>    count does not exceed 1.  From the user's point of view,
>    this means du's output is a reliable count of disk usage
>    only if du is invoked without -L and with -x and with at
>    most one operand.
>
>       Change line 84170 [du DESCRIPTION] from:
>
>          Files with multiple links shall be counted and written
>          for only one entry.
>
>       to:
>
>          A file that occurs multiple times under one file
>          operand and that has a link count greater than 1 shall
>          be counted and written for only one entry.  It is
>          implementation-defined whether a file that has a link
>          count no greater than 1 is counted and written just
>          once, or is counted and written for each occurrence.
>          It is implementation-defined whether a file that
>          occurs under one file operand is counted for other
>          file operands.
>
> Option 2A is simpler and clearer, but it invalidates many
> existing implementations.  Option 2B modifies the standard
> to describe how existing implementations actually work, but
> is more complicated and more of a hassle to use reliably.
>
> Eric raised one other issue: the description of the -a
> option implies that "du A B" must always list B.  This
> implication is incorrect for 7th edition Unix du, GNU du,
> and (I expect) FreeBSD du, so it should be fixed as well.
> Here's one possible fix, which is independent of the
> abovementioned changes.
>
>       Change line ?????
>       [du OPTIONS] from:
>
>          Regardless of the presence of the -a option,
>          non-directories given as file operands shall always
>          be listed.
>
>       to:
>
>          The -a option does not affect whether
>          non-directories given as file operands are listed.
>
>    (Sorry, I don't know the line number here; I don't have a
>    PDF copy of the current standard and don't know offhand how
>    to get one.)

It boils down to a decision between two paths.  Either we standardize a
useful behavior that avoids over-counting but possibly invalidates
existing implementations (in which case it is better targeted to
Issue 8), or we give up and declare things unspecified when a file with
a link count of 1 is encountered through multiple locations (in which
case we could make the changes in TC2 of Issue 7, and still make
recommendations on the underlying goal of avoiding over-counting).

The call today also mentioned that cpio may have a similar issue with
overcounting.

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org