GNU bug report logs - #50672
nnpack is not reproducible


Package: guix

Reported by: Ludovic Courtès <ludovic.courtes <at> inria.fr>

Date: Sun, 19 Sep 2021 09:58:01 UTC

Severity: normal

Done: Ludovic Courtès <ludovic.courtes <at> inria.fr>

Bug is archived. No further changes may be made.


From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Ludovic Courtès <ludovic.courtes <at> inria.fr>
Subject: bug#50672: closed (Re: bug#50672: nnpack is not reproducible)
Date: Mon, 25 Oct 2021 12:55:01 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#50672: nnpack is not reproducible

which was filed against the guix package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 50672 <at> debbugs.gnu.org.

-- 
50672: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=50672
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Ludovic Courtès <ludovic.courtes <at> inria.fr>
To: Kyle Meyer <kyle <at> kyleam.com>
Cc: 50672-done <at> debbugs.gnu.org, zimoun <zimon.toutoune <at> gmail.com>
Subject: Re: bug#50672: nnpack is not reproducible
Date: Mon, 25 Oct 2021 14:54:28 +0200
Hi!

Kyle Meyer <kyle <at> kyleam.com> writes:

> Ludovic Courtès writes:
>
>> For the record, I tried the attached patch in an attempt to sort things
>> as discussed in the issue above, but it doesn’t have the intended
>> effect.  There must be other unsorted dictionaries elsewhere.
>
> Hmm, I don't think dictionaries are a likely culprit here because
> Python's dict implementation preserves the insertion order as of Python
> v3.6 (and that behavior is declared as part of the language spec with
> v3.7).

Ah, silly me.

> In cases where the order of the keys isn't specified (i.e. Python 3.5
> and below), I think the end result after your change is the same: it
> creates a new dictionary for sorted _input_, but things won't
> necessarily come out in the same order.

Noted, thanks for explaining.

> I'm not familiar with PeachPy, but taking a peek at name.py, the sets
> used for the values of the prenames dictionary could be the problem.
> And if that's the case, one solution would be switching those values
> from sets to dictionaries.
>
> With the change below (on top of PeachPy's 257881e), nnpack builds
> reliably for me across a couple of attempts:
>
>   $ guix-dev build --with-git-url=python-peachpy=$local --no-grafts --check nnpack
>   successfully built /gnu/store/7z4nl55gssrf9na7wsvmw1dsqgawnj2p-nnpack-0.0-1.c07e3a0.drv
>   successfully built /gnu/store/7z4nl55gssrf9na7wsvmw1dsqgawnj2p-nnpack-0.0-1.c07e3a0.drv
>   /gnu/store/4ihjil42fbk53q73gpvdakynbv9q5q09-nnpack-0.0-1.c07e3a0

Your patch does the trick, indeed.  I went ahead and pushed it as
b87fe805aa66851f17f56078cb0e94f7cc4525df.

Thank you!

Ludo’.
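
For readers following the set-versus-dictionary point above, here is a minimal sketch of the difference. It is not PeachPy code: the `CHILD` snippet, the `orders_for_seed` helper and the `__localN` strings are made up for illustration. It assumes string hash randomization (controlled by `PYTHONHASHSEED`) as the source of variation; the thread does not say which hashing mechanism is actually at play inside PeachPy.

```python
import os
import subprocess
import sys

# Hypothetical child program: build the same names, then print them once in
# set iteration order and once in dict (insertion) order.
CHILD = (
    "names = ['__local%d' % i for i in range(8)]\n"
    "print(','.join(set(names)))\n"
    "print(','.join(dict.fromkeys(names)))\n"
)


def orders_for_seed(seed):
    """Run CHILD in a fresh interpreter with the given string hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    set_order, dict_order = result.stdout.splitlines()
    return set_order, dict_order


runs = [orders_for_seed(seed) for seed in (1, 2, 3)]
set_orders = {set_order for set_order, _ in runs}
dict_orders = {dict_order for _, dict_order in runs}

# The dict order is identical in every run; the set order usually is not.
print("distinct set orders: ", len(set_orders))
print("distinct dict orders:", len(dict_orders))
```

The dictionary order is the same for every seed because dictionaries preserve insertion order (Python 3.7 and later), which is why replacing a set with a dictionary used as an ordered set can make generated names stable.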

[Message part 3 (message/rfc822, inline)]
From: Ludovic Courtès <ludovic.courtes <at> inria.fr>
To: bug-guix <at> gnu.org
Subject: python-pytorch is not reproducible
Date: Sun, 19 Sep 2021 11:57:14 +0200
[Message part 4 (text/plain, inline)]
Bad news!

--8<---------------cut here---------------start------------->8---
$ guix challenge python-pytorch
/gnu/store/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0 contents differ:
  no local build for '/gnu/store/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0'
  https://ci.guix.gnu.org/nar/lzip/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0: 0i55iwy3z4da4lhn93dnrmz775s9ga5kyfli6cmrchacacf9xfpq
  https://bordeaux.guix.gnu.org/nar/lzip/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0: 1fl2v4pd0gcw7wp5k662q0zd4lvvzsggcm5ii8b4kq4v6synhkic
  differing file:
    /lib/python3.8/site-packages/torch/lib/libtorch_cpu.so

1 store items were analyzed:
  - 0 (0.0%) were identical
  - 1 (100.0%) differed
  - 0 (0.0%) were inconclusive
$ guix describe 
Generation 189   Aug 30 2021 12:09:27    (current)
  guix f91ae94
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: f91ae9425bb385b60396a544afe27933896b8fa3
--8<---------------cut here---------------end--------------->8---

The file is 165 MiB and Diffoscope (which reads the output of ‘objdump’)
takes forever on it.

However, by comparing the output of ‘strings’ on each file, we get a
hint:

[Message part 5 (text/x-patch, inline)]
diff -ubBr --show-c-function /tmp/str2 /tmp/str1
--- /tmp/str2	2021-09-19 11:14:47.806798779 +0200
+++ /tmp/str1	2021-09-19 11:14:41.962761127 +0200
@@ -1100584,472 +1100584,472 @@ compute_fast_convolution_input_gradient
 compute_grad_kernel_transform
 compute_fast_convolution_kernel_gradient.isra.0
 compute_fast_convolution_output
-nnp_fft8x8_with_offset_and_stream__avx2.__local0
-nnp_fft8x8_with_offset_and_stream__avx2.__local13
-nnp_fft8x8_with_offset_and_stream__avx2.__local18
-nnp_fft8x8_with_offset_and_stream__avx2.__local1
+nnp_fft8x8_with_offset_and_stream__avx2.__local5
 nnp_fft8x8_with_offset_and_stream__avx2.__local16
+nnp_fft8x8_with_offset_and_stream__avx2.__local6
+nnp_fft8x8_with_offset_and_stream__avx2.__local11
+nnp_fft8x8_with_offset_and_stream__avx2.__local0
 nnp_fft8x8_with_offset_and_stream__avx2.__local2
 nnp_fft8x8_with_offset_and_stream__avx2.__local7
-nnp_fft8x8_with_offset_and_stream__avx2.__local17
-nnp_fft8x8_with_offset_and_stream__avx2.__local10
-nnp_fft8x8_with_offset_and_stream__avx2.__local8
 nnp_fft8x8_with_offset_and_stream__avx2.__local15
+nnp_fft8x8_with_offset_and_stream__avx2.__local8
 nnp_fft8x8_with_offset_and_stream__avx2.__local3
-nnp_fft8x8_with_offset_and_stream__avx2.__local6
-nnp_fft8x8_with_offset_and_stream__avx2.__local14
-nnp_fft8x8_with_offset_and_stream__avx2.__local9
+nnp_fft8x8_with_offset_and_stream__avx2.__local1
 nnp_fft8x8_with_offset_and_stream__avx2.__local4
[…]
 nnp_shdotxf8__avx2.__local13
-nnp_shdotxf8__avx2.__local15
 nnp_shdotxf8__avx2.__local0
+nnp_shdotxf8__avx2.__local9
+nnp_shdotxf8__avx2.__local10
+nnp_shdotxf8__avx2.__local11
+nnp_shdotxf8__avx2.__local12
+nnp_shdotxf8__avx2.__local2
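
The comparison above can be approximated without external tools. The following is a rough sketch, not the command that produced the diff: `extract_strings` only imitates the default behaviour of `strings` (runs of at least four printable ASCII characters), and the two byte strings are made-up stand-ins for the two builds of the library.

```python
import difflib
import re

# Runs of four or more printable ASCII bytes, roughly what `strings` reports.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data):
    """Return the printable strings embedded in a bytes object, in file order."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def diff_strings(old, new):
    """Unified diff between the printable strings of two binaries."""
    return list(difflib.unified_diff(
        extract_strings(old), extract_strings(new),
        "old", "new", lineterm="",
    ))


# Made-up stand-ins: same symbols, emitted in a different order.
build_a = b"\x00\x01nnp_kernel.__local0\x00nnp_kernel.__local1\x00\x7f"
build_b = b"\x00\x01nnp_kernel.__local1\x00nnp_kernel.__local0\x00\x7f"

for line in diff_strings(build_a, build_b):
    print(line)
```

For real files, the bytes would come from `open(path, "rb").read()`. On a 165 MiB library this holds everything in memory, so it is only meant to show the idea.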
[Message part 6 (text/plain, inline)]
This appears to come from NNPACK, one of the libraries that are still
bundled.  These functions seem to be generated by Python scripts that
use PeachPy, such as NNPACK/src/x86_64-fma/2d-fourier-8x8.py:

--8<---------------cut here---------------start------------->8---
for post_operation in ["stream", "store"]:
    fft8x8_arguments = (arg_t_pointer, arg_f_pointer, arg_t_stride, arg_f_stride, arg_row_count, arg_column_count, arg_row_offset, arg_column_offset)
    with Function("nnp_fft8x8_with_offset_and_{post_operation}__avx2".format(post_operation=post_operation),
        fft8x8_arguments, target=uarch.default + isa.fma3 + isa.avx2):
[…]
--8<---------------cut here---------------end--------------->8---


The ‘__local’ bit in the name comes from PeachPy, in peachpy/name.py:

--8<---------------cut here---------------start------------->8---
            suffixed_name = "__local" + str(suffix)
            for name_object in iter(unnamed_objects):
                # Generate a non-conflicting name by appending a suffix
                while suffixed_name in self.names:
                    suffix += 1
                    suffixed_name = "__local" + str(suffix)
--8<---------------cut here---------------end--------------->8---

So the problem may be that these things get generated in parallel, and
thus numbering is non-deterministic.
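
To make the role of ordering concrete, here is a simplified sketch modelled on the excerpt above. It is not the actual PeachPy implementation: `assign_local_names`, its parameters and the object names are hypothetical. It shows that the suffix an object receives depends only on the order in which the unnamed objects are visited, whatever causes that order to vary.

```python
def assign_local_names(unnamed_objects, taken_names):
    """Give each object a non-conflicting "__localN" name, in visiting order."""
    names = set(taken_names)
    assigned = {}
    suffix = 0
    for obj in unnamed_objects:
        suffixed_name = "__local" + str(suffix)
        # Generate a non-conflicting name by bumping the suffix.
        while suffixed_name in names:
            suffix += 1
            suffixed_name = "__local" + str(suffix)
        names.add(suffixed_name)
        assigned[obj] = suffixed_name
    return assigned


# The same three (made-up) objects, visited in two different orders.
order_a = ["load", "twiddle", "store"]
order_b = ["store", "load", "twiddle"]

names_a = assign_local_names(order_a, {"__local1"})
names_b = assign_local_names(order_b, {"__local1"})

print(names_a)
print(names_b)
```

Both runs hand out the same set of names, but to different objects, which is the kind of reshuffling visible in the ‘strings’ diff.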

NNPACK/CMakeLists.txt has this bit to generate targets to build all
that:

--8<---------------cut here---------------start------------->8---
      ADD_CUSTOM_COMMAND(
        OUTPUT ${obj}
        COMMAND "PYTHONPATH=${PEACHPY_PYTHONPATH}"
          ${PYTHON_EXECUTABLE} -m peachpy.x86_64
            -mabi=sysv -g4 -mimage-format=${PEACHPY_IMAGE_FORMAT}
            "-I${PROJECT_SOURCE_DIR}/src" "-I${PROJECT_SOURCE_DIR}/src/x86_64-fma" "-I${FP16_SOURCE_DIR}/include"
            -o ${obj} "${PROJECT_SOURCE_DIR}/${src}"
        DEPENDS ${NNPACK_BACKEND_PEACHPY_OBJS})
--8<---------------cut here---------------end--------------->8---

It might be that building just those targets sequentially would solve
the problem.

To be continued…

Ludo’.

This bug report was last modified 3 years and 302 days ago.
