GNU bug report logs - #72445
shuf with both input-range and head-count biased

Previous Next

Package: coreutils;

Reported by: Daniel Carpenter <dansebpub <at> gmail.com>

Date: Sat, 3 Aug 2024 17:24:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Daniel Carpenter <dansebpub <at> gmail.com>
Subject: bug#72445: closed (Re: bug#72445: shuf with both input-range and
 head-count biased)
Date: Sun, 04 Aug 2024 07:12:02 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#72445: shuf with both input-range and head-count biased

which was filed against the coreutils package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 72445 <at> debbugs.gnu.org.

-- 
72445: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=72445
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Daniel Carpenter <dansebpub <at> gmail.com>
Cc: 72445-done <at> debbugs.gnu.org
Subject: Re: bug#72445: shuf with both input-range and head-count biased
Date: Sun, 4 Aug 2024 00:10:56 -0700
[Message part 3 (text/plain, inline)]
Thanks for the bug report. The bug appears to be due to a weakness in 
ISAAC that was reported in 2006 by Jean-Philippe Aumasson of the 
University of Applied Sciences Northwestern Switzerland. Although 
Aumasson wrote a paper about it, nobody seems to have connected the 
paper with coreutils.

I installed the attached patch, which fixed things for me, and am 
marking the bug as fixed.
[0001-shuf-fix-randomness-bug.patch (text/x-patch, attachment)]
[Message part 5 (message/rfc822, inline)]
From: Daniel Carpenter <dansebpub <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: shuf with both input-range and head-count biased
Date: Sat, 3 Aug 2024 10:19:09 +0200
[Message part 6 (text/plain, inline)]
The above options allow me to use shuf to efficiently simulate a dice roll,
but there is a clear bias when I do so, for example:

$ for i in {1..10000}; do shuf --input-range=1-6 --head-count=1; done |
sort | uniq --count
   1730 1
   1411 2
   1882 3
   1809 4
   1520 5
   1648 6

Using seq instead of input-range does not appear biased:

$ for i in {1..10000}; do seq 6 |  shuf --head-count=1; done | sort | uniq
--count
   1652 1
   1696 2
   1674 3
   1638 4
   1713 5
   1627 6

Same for head:

$ for i in {1..10000}; do shuf --input-range=1-6 | head --lines=1; done |
sort | uniq --count
   1639 1
   1674 2
   1655 3
   1669 4
   1688 5
   1675 6

It seems that somehow combining both options affects the distribution. I
assume there's some performance optimization in that case since shuf
doesn't need to permute the entire input range.
[Message part 7 (text/html, inline)]

This bug report was last modified 288 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.