Package: emacs;
Reported by: Mattias EngdegÄrd <mattias.engdegard <at> gmail.com>
Date: Thu, 4 Jan 2024 16:29:02 UTC
Severity: wishlist
View this message in rfc822 format
From: Mattias EngdegÄrd <mattias.engdegard <at> gmail.com> To: Stefan Monnier <monnier <at> iro.umontreal.ca> Cc: Dmitry Gutov <dmitry <at> gutov.dev>, Eli Zaretskii <eliz <at> gnu.org>, 68244 <at> debbugs.gnu.org Subject: bug#68244: hash-table improvements Date: Mon, 8 Jan 2024 19:26:21 +0100
7 jan. 2024 kl. 20.10 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>: > The change gives good results for small tables but less so for big ones. > I don't have a good intuition for why that would be: none of the > operations directly involved seem to be more costly for large tables, so > my best guess is that it leads to more collisions somehow, tho I don't > have a good idea about why that would be. Yes, I wondered about that too. It could simply be bad sampling. The benchmarks with bad results used sequential numbers (0, 1, ...) as keys so let's start with that: | count | size idxsiz 1st% avg max | size idxsiz 1st% avg max |--------+------------------------------+----------------------------- | 10000 | 12466 15343 100.0 1.00 1 | 12288 16384 100.0 1.00 1 | 20000 | 28048 34523 100.0 1.00 1 | 24576 32768 98.8 1.01 2 | 30000 | 42072 51781 100.0 1.00 1 | 49152 65536 99.1 1.01 2 | 40000 | 42072 51781 100.0 1.00 1 | 49152 65536 94.5 1.06 2 | 50000 | 63108 77671 100.0 1.00 1 | 98304 131072 100.0 1.00 1 | 60000 | 63108 77671 100.0 1.00 1 | 98304 131072 100.0 1.00 1 | 70000 | 94662 116507 100.0 1.00 1 | 98304 131072 100.0 1.00 1 | 80000 | 94662 116507 100.0 1.00 1 | 98304 131072 96.0 1.04 2 | 90000 | 94662 116507 100.0 1.00 1 | 98304 131072 89.3 1.11 2 | 100000 | 141993 174761 100.0 1.00 1 | 196608 262144 92.9 1.07 2 | 110000 | 141993 174761 100.0 1.00 1 | 196608 262144 90.9 1.09 2 | 120000 | 141993 174761 100.0 1.00 1 | 196608 262144 89.3 1.11 2 | 130000 | 141993 174761 100.0 1.00 1 | 196608 262144 87.9 1.12 2 | 140000 | 141993 174761 100.0 1.00 1 | 196608 262144 86.7 1.13 2 | 150000 | 212989 262141 100.0 1.00 1 | 196608 262144 85.7 1.14 2 | 160000 | 212989 262141 100.0 1.00 1 | 196608 262144 84.8 1.15 2 | 170000 | 212989 262141 100.0 1.00 1 | 196608 262144 84.0 1.16 2 | 180000 | 212989 262141 100.0 1.00 1 | 196608 262144 83.3 1.17 2 | 190000 | 212989 262141 100.0 1.00 1 | 196608 262144 82.7 1.17 2 | 200000 | 212989 262141 100.0 1.00 1 | 393216 524288 100.0 1.00 1 The left columns are for the standard hash tables with remainder-based range reduction, the right ones with Knuth reduction. 'size' is the table size, 'idxsiz' the index vector size, '1st%' the portion of entries accessed with a single lookup, 'avg' the average number of accesses and 'max' the maximum number of accesses required. The old code looks perfect (no collisions!), but the new shiny code is a disappointment. Then again, sequential numbers are best-case for remainder-based indexing: Emacs hashes (smallish) fixnums to themselves so we are guaranteed a minimum number of collisions, actually zero since the index vector is always larger than the number of entries. But sequential keys would be a somewhat unusual use of hash tables; a vector is a lot more efficient. Let's try with random fixnums instead, using the same seed for each table: | count | size idxsiz 1st% avg max | size idxsiz 1st% avg max |--------+-----------------------------+---------------------------- | 10000 | 12466 15343 73.5 1.32 6 | 12288 16384 75.4 1.30 6 | 20000 | 28048 34523 75.7 1.29 6 | 24576 32768 75.2 1.30 6 | 30000 | 42072 51781 75.8 1.29 5 | 49152 65536 80.6 1.22 5 | 40000 | 42072 51781 69.6 1.39 7 | 49152 65536 75.0 1.30 6 | 50000 | 63108 77671 73.6 1.32 6 | 98304 131072 83.5 1.19 6 | 60000 | 63108 77671 69.6 1.39 7 | 98304 131072 80.5 1.23 6 | 70000 | 94662 116507 75.2 1.30 6 | 98304 131072 77.5 1.27 6 | 80000 | 94662 116507 72.4 1.34 7 | 98304 131072 75.0 1.30 7 | 90000 | 94662 116507 69.8 1.38 7 | 98304 131072 72.6 1.34 7 | 100000 | 141993 174761 76.2 1.29 6 | 196608 262144 83.3 1.19 6 | 110000 | 141993 174761 74.2 1.32 6 | 196608 262144 81.8 1.21 6 | 120000 | 141993 174761 72.4 1.34 7 | 196608 262144 80.3 1.23 6 | 130000 | 141993 174761 70.6 1.37 7 | 196608 262144 78.8 1.25 6 | 140000 | 141993 174761 68.8 1.40 7 | 196608 262144 77.6 1.27 6 | 150000 | 212989 262141 76.1 1.29 6 | 196608 262144 76.2 1.29 6 | 160000 | 212989 262141 74.8 1.31 6 | 196608 262144 74.9 1.30 6 | 170000 | 212989 262141 73.5 1.33 6 | 196608 262144 73.6 1.32 6 | 180000 | 212989 262141 72.3 1.35 7 | 196608 262144 72.3 1.34 6 | 190000 | 212989 262141 71.1 1.36 7 | 196608 262144 71.1 1.36 7 | 200000 | 212989 262141 69.9 1.38 7 | 393216 524288 83.1 1.19 5 Here the new code looks a bit better, but it could just be the bigger index vectors. Not sure what to make of this. We could try switching to a high-quality hash function (or family thereof), like Murmur or Jenkins. Then range reduction is just a matter of masking off the required number of bits.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.