GNU bug report logs - #73266
[PATCH 0/9] Add python-spacy-curated-transformers


Package: guix-patches;

Reported by: Nicolas Graves <ngraves <at> ngraves.fr>

Date: Sun, 15 Sep 2024 08:16:02 UTC

Severity: normal

Tags: patch


From: Nicolas Graves <ngraves <at> ngraves.fr>
To: 73266 <at> debbugs.gnu.org
Cc: ngraves <at> ngraves.fr
Subject: [bug#73266] [PATCH 8/9] gnu: Add python-curated-tokenizers.
Date: Sun, 15 Sep 2024 10:57:13 +0200

* gnu/packages/machine-learning.scm (python-curated-tokenizers): New variable.

Change-Id: I719d2ffd499c86e6bb2f9215ed979e47c0e32484
---
 gnu/packages/machine-learning.scm | 41 +++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index d1b282fea8..e80412ed41 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -2480,6 +2480,47 @@ (define-public python-cutlery
 @end itemize")
     (license license:expat)))
 
+(define-public python-curated-tokenizers
+  (package
+    (name "python-curated-tokenizers")
+    (version "0.0.9")
+    ;; This source bundles a third_party protobuf whose version is not
+    ;; currently packaged in Guix (3.6 < version <= 3.19.5).
+    ;; Try using Guix's protobuf when updating.
+    (source
+     (origin
+       (method url-fetch)
+       (uri (pypi-uri "curated-tokenizers" version))
+       (sha256
+        (base32 "09ffs2qjlli35wnf8wf64s14xm75vi5ynvkrn9nqllmk9bjlfgf9"))))
+    (build-system pyproject-build-system)
+    (arguments
+     (list
+      #:phases
+      #~(modify-phases %standard-phases
+          ;; When both the local and the installed package exist, the
+          ;; local one is imported, but it lacks the built shared libraries.
+          ;; Move the tests out, delete it, and test the installed version.
+          (add-before 'check 'pre-check
+            (lambda* (#:key tests? inputs outputs #:allow-other-keys)
+              (when tests?
+                (copy-recursively "curated_tokenizers/tests" "tests")
+                (delete-file-recursively "curated_tokenizers")
+                (add-installed-pythonpath inputs outputs)))))))
+    (propagated-inputs (list python-regex))
+    (native-inputs (list python-cython python-pytest))
+    (home-page "https://github.com/explosion/curated-tokenizers")
+    (synopsis "Lightweight piece tokenization library")
+    (description "This package provides a lightweight wordpiece and
+sentencepiece tokenization library.  It supports multiple tokenizers:
+@itemize
+@item BPE
+@item Byte BPE
+@item Unigram
+@item Wordpiece
+@end itemize")
+    (license license:expat)))
+
 (define-public python-curated-transformers
   (package
     (name "python-curated-transformers")
-- 
2.46.0





This bug report was last modified 242 days ago.


