GNU bug report logs -
#52051
[core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)
Previous Next
Full log
View this message in rfc822 format
[Message part 1 (text/plain, inline)]
Hello Ludovic!
Ludovic Courtès <ludo <at> gnu.org> writes:
> Hello!
>
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>
>> 374 connect(11, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0
>
> [...]
>
>> 374 epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, u64=24802800}}], 20, -1) = 1
>> 374 sendmsg(11, {msg_name=NULL, msg_namelen=0,
>> msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0",
>> iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
>> MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
>> 374 gettid() = 374
>> 374 epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, u64=24802800}}) = 0
>> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8
>> 374 gettid() = 374
>> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8
>> 374 gettid() = 374
>> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8
>> 374 epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0
>> 374 close(11) = 0
>> 374 gettid() = 374
>> 374 epoll_wait(5, <unfinished ...>
>> 391 <... close resumed>) = 0
>> 391 madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0
>> 391 exit(0) = ?
>> 391 +++ exited with 0 +++
>> 374 <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, u64=24768000}}], 17, -1) = 1
>> 374 lseek(7, 0, SEEK_SET) = 0
>> 374 read(7, "tty7\n", 63) = 5
>
> As you pointed out on IRC, the initially ‘Hello’ method call above leads
> to EPIPE, and we can see that elogind eventually closes its socket to
> dbus-daemon *but* keeps doing its thing.
>
> Some interesting things to note…
Indeed, very interesting :-). Thanks for diving in!
> First, to my surprise, elogind does not use the client library of the
> ‘dbus’ package:
>
> $ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus
> $ echo $?
> 1
>
>
> (This is already the case in ‘master’ with v243.7.) Instead, it has its
> own implementation of the DBus protocol, in C, from systemd—we can’t
> have enough sources of bugs and vulnerabilities.
Oh! Marius found this interesting issue [0] that they shared on IRC
today; I wonder if it may be related. sd-bus apparently pipeline things
aggressively, which is not always handled well by other D-bus
participants.
[0] https://github.com/systemd/systemd/issues/16610
> Anyway, the “Hello” message is sent to the system bus asynchronously in
> ‘sd-bus.c’:
>
> static int bus_send_hello(sd_bus *bus) {
> _cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL;
> int r;
>
> assert(bus);
>
> if (!bus->bus_client)
> return 0;
>
> r = sd_bus_message_new_method_call(
> bus,
> &m,
> "org.freedesktop.DBus",
> "/org/freedesktop/DBus",
> "org.freedesktop.DBus",
> "Hello");
> if (r < 0)
> return r;
>
> return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0);
> }
>
>
> A callback is called when a reply is received or an error arises:
>
> static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {
>
> [...]
>
> fail:
> /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
> * which is the propagated up towards the event loop. Let's also invalidate the connection, so that
> * if the user then calls back into us again we won't wait any longer. */
>
> bus_set_state(bus, BUS_CLOSING);
> return r;
> }
>
>
> It’s not clear from that whether the authors intended for the thing to
> keep going in case of failure. In our case it’s not helpful.
If we picture this in the systemd use case, I believe sd-bus must be
used as *the* d-bus daemon, perhaps? So it should never die, and expect
users to call back into it to retry things? In our case, it acts as a
D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
right for our use case.
> But why does dbus-daemon drop the connection in the first place?
>
> To know that, we could change ‘dbus-root-service-type’ to run
> dbus-daemon from a ‘--enable-verbose-mode’ build, and with the
> ‘DBUS_VERBOSE’ environment set to 1.
>
> Looking at ‘dbus-server-socket.c’ it would seem that the only sensible
> way this can happen is if dbus-daemon doesn’t yet have a
> ‘new_connection_function’ set at the time it accepts the incoming
> connection:
Interesting that you mention this; that's what I worked on yesterday
(see attached patches). I managed to get elogind panicking the kernel
during a guix system reconfigure, which corrupted GRUB, so had to chroot
and reconfigure from there [1]. Not sure what happened, but it seems
that killing and restarting elogind is susceptible to cause hard locks.
[1] https://lists.gnu.org/archive/html/guix-devel/2018-06/msg00202.html
[0001-gnu-Add-dbus-verbose.patch (text/x-patch, attachment)]
[0002-services-dbus-Add-a-VERBOSE-configuration-option.patch (text/x-patch, attachment)]
[0003-build-Move-D-Bus-parser-to-its-own-build-module.patch (text/x-patch, attachment)]
[0004-services-telephony-Factorize-bus-polling-logic.patch (text/x-patch, attachment)]
[0005-services-dbus-Add-synchronization-primitives-to-dbus.patch (text/x-patch, attachment)]
[Message part 7 (text/plain, inline)]
> static dbus_bool_t
> handle_new_client_fd_and_unlock (DBusServer *server,
> DBusSocket client_fd)
> {
>
> [...]
>
> /* See if someone wants to handle this new connection, self-referencing
> * for paranoia.
> */
> new_connection_function = server->new_connection_function;
> new_connection_data = server->new_connection_data;
>
> _dbus_server_ref_unlocked (server);
> SERVER_UNLOCK (server);
>
> if (new_connection_function)
> {
> (* new_connection_function) (server, connection,
> new_connection_data);
> }
> dbus_server_unref (server);
>
> /* If no one grabbed a reference, the connection will die. */
> _dbus_connection_close_if_only_one_ref (connection);
> dbus_connection_unref (connection);
>
> We know that elogind is started after dbus-daemon has written its PID
> file (there’s a Shepherd service dependency). Is there a possibility
> that dbus-daemon writes its PID file before it has set
> ‘new_connection_function’?
Are PID files conventionally agreed to be synchronization primitives?
In the case of systemd, I feel this may be more of an afterthought,
given the system is designed to rely on socket-based activation. Still
just guessing though, and your analysis above seems to suggest
otherwise.
> It would seem that ‘bus_context_new’ writes the PID file after it has
> called ‘dbus_server_set_new_connection_function’ via ‘setup_server’ via
> ‘process_config_first_time_only’.
>
> So not sure what happens.
>
> That’s it for today. Thoughts?
Thank you!
I'm sorry for my lack of deep analysis here, I've spent most of my
evening attempting to fix my system just to boot ^^'. I've at least
managed to collect the following verbose D-Bus log (54 MiB uncompressed)
which hopefully can shed some light onto how this failure came to be.
[dbus-daemon.log.xz (application/octet-stream, attachment)]
[Message part 9 (text/plain, inline)]
Happy Friday!
Maxim
This bug report was last modified 3 years and 152 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.