GNU bug report logs -
#58472
[PATCH] Make `message-unique-id' less prone to collisions
Previous Next
Reported by: Stefan Kangas <stefankangas <at> gmail.com>
Date: Wed, 12 Oct 2022 16:09:01 UTC
Severity: wishlist
Tags: patch
Done: Stefan Kangas <stefankangas <at> gmail.com>
Bug is archived. No further changes may be made.
Full log
Message #31 received at 58472 <at> debbugs.gnu.org (full text, mbox):
Stefan Kangas <stefankangas <at> gmail.com> writes:
> Matt Armstrong <matt <at> rfc20.org> writes:
>
>> Most email I get today to use a UUID or UUID-like Message-ID, like this:
>>
>> Message-ID: <736d10a6-001f-4a29-a1d4-554f58733b69 <at> dfw1s10mta1086.xt.local>
>> Message-ID: <1815053947.8446619.1665544925708 <at> lor1-app45123.prod.linkedin.com>
>> Message-ID: <01000183b9eaa6f8-411d1f4c-b573-472d-b45f-47b0c4eb6ace-000000 <at> email.amazonses.com>
>> Message-ID: <CABqZ1wa8MxrieVKZ11adZUV2qB_CnpMJoFEn-U3d5CQ7z7smWw <at> mail.gmail.com>
>
> Those are 30-51 characters in length. I also note that Gmail uses both
> lower case and upper case characters.
Yes, the Google example is probably a crypto secure hash base64url
encoded.
>>> If we limit the length of the time string to 12 characters, and the
>>> total length to 25 characters (including the ".gnu" part), we still have
>>> a guaranteed 9 characters of random data, or 46 bits of entropy.
>>
>> I suspect that most mailers use more randomness than that.
>
> So I guess we might as well bump this up to 30 characters in total,
> which gives us 72 bits. The Message-IDs would look like:
>
> cnkrs75yamag1k7x8rnt3y50za.gnu <at> stefankangas.se
> cnkrifkirauwuwfkzs3rcit8cq.gnu <at> stefankangas.se
You said you like the timestamp prefix. How many times would that
actually be useful, and for what?
I am not seeing the utility myself, but I might be missing something.
> We could go longer, but it's also nice to have something which is not an
> absolute abomination to look at.
>
> If we add in upper case characters too, we can encode the time with one
> less character. So we end up with 89 bits of randomness and this:
>
> 1Z2KnqE1t2bSgUWkcu53M34Y4y.gnu <at> stefankangas.se
> 1Z2KbUgleGoe0WRJ3jbiM0mE7W.gnu <at> stefankangas.se
>
> If we don't want to always start the Message-ID with the same characters
> (which makes them more distinct, at a glance), we could just reverse the
> time string:
>
> QlRXPpmK2Z1kUklxIpMNZpChOu.gnu <at> stefankangas.se
> Z59YikmK2Z1FSmYj172SAdPpuX.gnu <at> stefankangas.se
Maybe use (base64url-encode-string str t) is an option?
>> Some of the SHA hash algorithms are in the public domain. Could they be
>> added to Emacs and used for UUID generation here?
>
> We have `secure-hash'. Is that what you mean? Or do you mean to use a
> proper RFC 4122 UUID?
>
> All we need is for the Message-ID to be unique though, so some ad hoc
> solution is probably fine.
I am no security expert, so I'll end with this and then be quiet on this
issue. :-)
I agree that globally unique (with very high probability) is pretty easy
to design in ad hoc ways within a private namespace. But Message-ID is
not that. The Message-ID space consists of all Message-ID generated by
all programs past, present and future. An impossible goal, given that
the RFC basically says "go make something up, have fun with that", so it
is a free for all.
Given that, timestamp + short-rng + ".gnu" doesn't feel good enough to
me, but I won't lose sleep if you use that solution.
In these situations I usually feel better going with what seems like an
existing practice: encode a crypto hash sourced from (possibly security
sensitive) entropy available on the local machine.
Inspired by https://nullprogram.com/blog/2010/05/11/ and the code linked
there (https://nullprogram.com/download/uuid.el) I came up with this:
;; Written while discussing Emacs bug#58472
(defun my-message-unique-id()
(let* ((object (format "%S|%s|%s|%s"
(time-convert nil t)
(random (expt 2 512))
(emacs-uptime)
(emacs-pid)))
(hash (secure-hash 'sha512 object nil nil t))
(encoded-hash (base64url-encode-string hash t))
(id (concat encoded-hash ".gnu")))
;; Replace leading and trailing non-alnum with ?x. This is not
;; necessary to create strictly conforming Message-Id, but it makes
;; them less confusing in practice.
(dolist (index `(0 ,(1- (length id))))
(let ((char (aref id index)))
(when (or (eq char ?-)
(eq char ?_))
(aset id index ?x))))
id))
This gives IDs like:
lzD9K4s2GkiGmZhmP46M4ezv37sGEQO2tU2GX2mx-rs.gnu <at> stefankangas.se
wEUY87gazgDLnsARlJZ5MEY4pgFZmDhxy9IZhbDLpfA.gnu <at> stefankangas.se
VbUzrNkjJiCnZYiDzMdcqz2BqibzismlaR8ZkcU6O7E.gnu <at> stefankangas.se
XX8f7gOrT1Iv-NLdZBjHgz8s4u4EF2uE7o-UttFgF0U.gnu <at> stefankangas.se
tp9f3Up7ilsrH8XqtmQT4_evPkOO-EteN127d9bt85s.gnu <at> stefankangas.se
One advantage of this approach is that it is possible to add more
sources of entropy, or grow the hash function, without necessarily
requiring a "format" version change, because we're relying on the hash
for uniqueness in a sort of brute force way, not a schema and a hope
that only Emacs uses it.
One question: I'm not sure that (random (exp 2 512)) actually gives more
entropy than a simple call to (random). It depends on Emacs' RNG. Last
I checked there was code still calling something like rand(), which
generally not intended for this purpose.
P.S. if you go with your original approach, I think you want (expt 2 X)
instead of (exp X).
--
matt (sent from an Emacs running the feature/noverlay branch)
This bug report was last modified 2 years and 174 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.