GNU bug report logs - #6524
du now uses less than half as much memory, sometimes

Previous Next

Package: coreutils;

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Sun, 27 Jun 2010 22:10:02 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Jim Meyering <jim <at> meyering.net>
To: Eric Blake <eblake <at> redhat.com>
Cc: Paul Eggert <eggert <at> CS.UCLA.EDU>, bug-gnulib <at> gnu.org, Jos, 6524 <at> debbugs.gnu.org, Bruno Haible <bruno <at> clisp.org>, é Marchesi <jemarch <at> gnu.org>
Subject: bug#6524: RFC: modules for generic unordered sets and mappings
Date: Fri, 02 Jul 2010 10:21:25 +0200
Eric Blake wrote:

> On 07/01/2010 06:44 PM, Paul Eggert wrote:
>> I've just gone through "du", and it's fresh in my
>> mind, so I thought I'd run through what I needed there, to see how
>> well it maps to the proposal.
>>
>> I needed three kinds of hash tables.
>>
>> * Table A is indexed by device number (a dev_t value) and the value is a
>>   secondary hash table.  Typically the number of devices is relatively
>>   small.
>>
>> * Table B is indexed by inode number (an ino_t value) and has no value.
>>   All that matters is whether the key is present.  So it is a set,
>>   not a general mapping.  Often there are many, many inode numbers.
>>
>> * The gnulib hash package uses void * keys, but having a void * value
>>   that exists only to point to an ino_t value wastes memory.  So, if
>>   an inode number is small (less than M, say), it maps to itself.  If
>>   it is large, it is a key into a third table (table C) whose values
>>   are mapped inode numbers.  (Values in table C are always M or
>>   greater.)  The mapped inode number (whether taken from table C, or
>>   used directly) is cast to void * and then used as the index into
>>   table B.  In practice, most inode numbers are less than M so this is
>>   a storage win.
>
> We need to be careful on cygwin.
>
> $ ls -i | head
>  1125899907178954 ABOUT-NLS
> 28147497671067760 AUTHORS
>  1970324837308495 COPYING
>  9851624184873962 ChangeLog
>  4785074604197847 ChangeLog-2005
>  1970324837180710 ChangeLog-2006
>  1970324837078885 ChangeLog-2007
>  2251799813904421 ChangeLog-2008
>   281474977732635 GNUmakefile
>  1125899907781258 GNUmakefile~
>
> sizeof(void*) == sizeof(size_t) == 4, but sizeof(ino_t) == 8, and most
> inodes are quite randomly dispersed but definitely larger than 4 bytes.
>  Does your scheme work well at mapping cygwin's 8-byte inodes into
> 4-byte pointers?

It's better not to modify algorithms in order to optimize
for systems with 4-byte pointers.

We can find/use a set/hash package that can efficiently
handle keys of type uintmax_t regardless of pointer size.

Combine that with a slight re-design suggested by Paul
and (before him, privately) by Tim Waugh -- to map device
numbers to inode hash tables rather than to small integers
that need to be encoded -- and you get all the benefits
with much less complexity.




This bug report was last modified 14 years and 100 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.