GNU bug report logs - #58320
Hurd VM fails to boot on AMD EPYC (kvm-amd)

Previous Next

Package: guix;

Reported by: Ludovic Courtès <ludo <at> gnu.org>

Date: Wed, 5 Oct 2022 21:02:01 UTC

Severity: normal

Tags: wontfix

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Ludovic Courtès <ludo <at> gnu.org>
To: 58320 <at> debbugs.gnu.org
Cc: bug-hurd <at> gnu.org
Subject: bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)
Date: Mon, 10 Oct 2022 23:14:15 +0200
Ludovic Courtès <ludo <at> gnu.org> skribis:

> Through a dichotomy I tried to see how far it goes.  The info I have so
> far is that ld.so errors out from elf/rtld.c:563 (line 565 is not
> reached):
>
> 558:  if (bootstrap_map.l_addr || ! bootstrap_map.l_info[VALIDX(DT_GNU_PRELINKED)])
> 559:    {
> 560:      /* Relocate ourselves so we can do normal function calls and
> 561:         data access using the global offset table.  */
> 562:
> 563:      ELF_DYNAMIC_RELOCATE (&bootstrap_map, 0, 0, 0);
> 564:    }
> 565:  bootstrap_map.l_relocated = 1;
> ...
> 578:  __rtld_malloc_init_stubs ();

Via brute force¹, I found that ‘__assert_fail’ is hit, with its first
argument in $eax being:

--8<---------------cut here---------------start------------->8---
db> x/c 0x28604,80                                                                                  
                ELF32_R_TYPE (reloc->r_info) == R_386_RELATIVE\000\000map->l_in                     
                fo[VERSYMIDX (DT_VERSYM)] != NULL\000\000Fatal glibc error: Too                     
                 many audit mo                                                                      
--8<---------------cut here---------------end--------------->8---

This comes from i386/dl-machine.h:

--8<---------------cut here---------------start------------->8---
auto inline void
__attribute ((always_inline))
elf_machine_rel_relative (Elf32_Addr l_addr, const Elf32_Rel *reloc,
			  void *const reloc_addr_arg)
{
  Elf32_Addr *const reloc_addr = reloc_addr_arg;
  assert (ELF32_R_TYPE (reloc->r_info) == R_386_RELATIVE);
  *reloc_addr += l_addr;
}
--8<---------------cut here---------------end--------------->8---

How can we get there?  Looking at ‘_dl_start’, it could be that
‘elf_machine_load_address’ returns a bogus value and we end up reading
wrong ELF data?  Or it could be memory corruption somewhere.  Or…?

Thing is, it’s not fully deterministic (happens 9 times out of 10 with
KVM, never happens without KVM).

Ideas?  :-)

Ludo’.

¹ Building with ‘-fno-optimize-sibling-calls’ didn’t help get nicer
  backtraces, but that’s prolly because all that early relocation code
  is inlined.




This bug report was last modified 270 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.