GNU bug report logs - #13127
[PATCH] cut: use only one data strucutre

Previous Next

Package: coreutils;

Reported by: xojoc <at> gmx.com

Date: Sun, 9 Dec 2012 10:29:01 UTC

Severity: normal

Tags: patch

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

Full log


Message #8 received at 13127 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Cojocaru Alexandru <xojoc <at> gmx.com>
Cc: 13127 <at> debbugs.gnu.org
Subject: Re: bug#13127: [PATCH] cut: use only one data strucutre
Date: Sun, 09 Dec 2012 21:45:03 +0100
Cojocaru Alexandru wrote:
> Subject: [PATCH] cut: use only one data structure
>
> The current implementation of cut, uses a bit array,
> an array of `struct range_pair's, and (when --output-delimiter
> is specified) a hash_table. The new implementation will use
> only an array of `struct range_pair's.
> The old implementation is inefficient for the following reasons:
>  1. When -b with a big num is specified, it allocates a lot of useless
>     memory for `printable_field'.
>  2. When --output-delimiter is specified, it will allocate 31 buckets.
>     Even if a few ranges are specified.
...
> -/* Given the list of field or byte range specifications FIELDSTR, set
> -   MAX_RANGE_ENDPOINT and allocate and initialize the PRINTABLE_FIELD
> -   array.  If there is a right-open-ended range, set EOL_RANGE_START
> -   to its starting index.  FIELDSTR should be composed of one or more
> -   numbers or ranges of numbers, separated by blanks or commas.
> -   Incomplete ranges may be given: '-m' means '1-m'; 'n-' means 'n'
> -   through end of line.  Return true if FIELDSTR contains at least
> -   one field specification, false otherwise.  */
> -
> -/* FIXME-someday:  What if the user wants to cut out the 1,000,000-th
> -   field of some huge input file?  This function shouldn't have to
> -   allocate a table of a million bits just so we can test every
> -   field < 10^6 with an array dereference.  Instead, consider using
> -   an adaptive approach: if the range of selected fields is too large,
> -   but only a few fields/byte-offsets are actually selected, use a
> -   hash table.  If the range of selected fields is too large, and
> -   too many are selected, then resort to using the range-pairs (the
> -   'rp' array) directly.  */

Thanks for the patch.
This is large enough that you'll have to file a copyright assignment.
For details, see the "Copyright assignment" section in the file
named HACKING.

Have you considered performance in the common case?
I suspect that a byte or field number larger than 1000 is
not common.  That is why, in the FIXME comment above,
I suggested to use an adaptive approach.  I had the feeling
(don't remember if I profiled it) that testing a bit per
input field would be more efficient than an in-range test.

If you construct test cases and gather timing data, please do so
in a reproducible manner and include details when you report back,
so we can compare on different types of systems.  Cache matters a
lot, these days.




This bug report was last modified 12 years and 76 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.