From unknown Mon Aug 18 02:36:12 2025
Subject: [bug#57151] [PATCH 0/2] *** Add trained data models for Tesseract OCR ***
To: 57151@debbugs.gnu.org
Cc: Maxim Cournoyer
From: Maxim Cournoyer
Date: Fri, 12 Aug 2022 01:05:43 -0400
Message-Id: <20220812050543.3923-1-maxim.cournoyer@gmail.com>
X-Mailer: git-send-email 2.36.1
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hello Guix,

This makes our tesseract-ocr package usable.

Here's a small experiment comparing GNU Ocrad vs Tesseract on a LightDM
login screendump from QEMU:

--8<---------------cut here---------------start------------->8---
$ time ocrad -i -s 10 /tmp/dump.ppm
komput�lo _ O Tht_, _l_.__ �

real 0m9.616s
user 0m9.397s
sys 0m0.157s
$ time tesseract -l eng /tmp/dump.ppm out && cat out.txt
Estimating resolution as 133

real 0m0.389s
user 0m0.602s
sys 0m0.053s
komputilo QR @ Thu, 21:32 © Log In
--8<---------------cut here---------------end--------------->8---

Maxim Cournoyer (2):
  gnu: Add tesseract-ocr-tessdata-fast.
  gnu: tesseract-ocr: Make the default install minimally useful.

 gnu/packages/ocr.scm | 60 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 57 insertions(+), 3 deletions(-)

-- 
2.36.1

From unknown Mon Aug 18 02:36:12 2025
Subject: [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
References: <20220812050543.3923-1-maxim.cournoyer@gmail.com>
In-Reply-To: <20220812050543.3923-1-maxim.cournoyer@gmail.com>
To: 57151@debbugs.gnu.org
Cc: Maxim Cournoyer
From: Maxim Cournoyer
Date: Fri, 12 Aug 2022 01:07:51 -0400
Message-Id: <20220812050752.3980-1-maxim.cournoyer@gmail.com>
X-Mailer: git-send-email 2.36.1
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

* gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
---
 gnu/packages/ocr.scm | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e28bd17668..e2c9f561cc 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -29,6 +29,7 @@ (define-module (gnu packages ocr)
   #:use-module (guix gexp)
   #:use-module (guix git-download)
   #:use-module (guix build-system cmake)
+  #:use-module (guix build-system copy)
   #:use-module (guix build-system gnu)
   #:use-module (guix build-system python)
   #:use-module (gnu packages)
@@ -74,6 +75,32 @@ (define-public ocrad
 it produces text in 8-bit or UTF-8 formats.")
     (license license:gpl3+)))
 
+(define-public tesseract-ocr-tessdata-fast
+  (package
+    (name "tesseract-ocr-tessdata-fast")
+    (version "4.1.0")
+    (source (origin
+              (method git-fetch)
+              (uri (git-reference
+                    (url "https://github.com/tesseract-ocr/tessdata_fast")
+                    (commit version)))
+              (file-name (git-file-name name version))
+              (sha256
+               (base32
+                "1m310cpb87xx8l8q7jy9fvzf6a0m8rm0dmjpbiwhc2mi6w4gn084"))))
+    (build-system copy-build-system)
+    (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))
+                     #:phases #~(modify-phases %standard-phases
+                                  (add-after 'unpack 'delete-broken-links
+                                    (lambda _
+                                      (delete-file "configs")
+                                      (delete-file "pdf.ttf"))))))
+    (home-page "https://github.com/tesseract-ocr/tessdata_fast")
+    (synopsis "Fast integer versions of trained LSTM models")
+    (description "This repository contains fast integer versions of trained
+models for the Tesseract OCR Engine.")
+    (license license:asl2.0)))
+
 (define-public tesseract-ocr
   (package
     (name "tesseract-ocr")
-- 
2.36.1

From unknown Mon Aug 18 02:36:12 2025
Subject: [bug#57151] [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally useful.
To: 57151@debbugs.gnu.org
Cc: Maxim Cournoyer
From: Maxim Cournoyer
Date: Fri, 12 Aug 2022 01:07:52 -0400
Message-Id: <20220812050752.3980-2-maxim.cournoyer@gmail.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220812050752.3980-1-maxim.cournoyer@gmail.com>
References: <20220812050752.3980-1-maxim.cournoyer@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

* gnu/packages/ocr.scm (tesseract-ocr)
[phases]{adjust-TESSDATA_PREFIX-macro}: New phase.
{install-minimal-tessdata}: New phase.
[native-inputs]: Add tesseract-ocr-tessdata-fast.
[search-paths]: New field.
[description]: Mention how to add support for more languages.
---
 gnu/packages/ocr.scm | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e2c9f561cc..21d257ef24 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -132,6 +132,15 @@ (define-public tesseract-ocr
               (substitute* "configure.ac"
                 (("AC_SUBST\\(\\[XML_CATALOG_FILES])")
                  ""))))
+          (add-after 'unpack 'adjust-TESSDATA_PREFIX-macro
+            (lambda _
+              ;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
+              ;; specific search-path than '/share' can be specified. The
+              ;; build system uses CPPFLAGS for itself, so we can't simply set
+              ;; a make flag.
+              (substitute* "Makefile.am"
+                (("-DTESSDATA_PREFIX='\"@datadir@\"'")
+                 "-DTESSDATA_PREFIX='\"@datadir@/tesseract-ocr\"'"))))
           (add-after 'build 'build-training
             (lambda* (#:key parallel-build? #:allow-other-keys)
               (define n (if parallel-build? (number->string
@@ -140,7 +149,18 @@ (define n (if parallel-build? (number->string
               (invoke "make" "-j" n "training")))
           (add-after 'install 'install-training
             (lambda _
-              (invoke "make" "training-install"))))))
+              (invoke "make" "training-install")))
+          (add-after 'install 'install-minimal-tessdata
+            ;; tesseract-ocr cannot be used without its trained models data;
+            ;; install the English language as a minimal base which can be
+            ;; extended via TESSDATA_PREFIX.
+            (lambda* (#:key native-inputs inputs #:allow-other-keys)
+              (define eng.traineddata
+                "/share/tesseract-ocr/tessdata/eng.traineddata")
+              (install-file (search-input-file (or native-inputs inputs)
+                                               eng.traineddata)
+                            (dirname (string-append #$output
+                                                    eng.traineddata))))))))
     (native-inputs
      (list asciidoc
            autoconf
@@ -152,13 +172,18 @@ (define n (if parallel-build? (number->string
            libtool
            libxml2 ;for XML_CATALOG_FILES
            libxslt
-           pkg-config))
+           pkg-config
+           tesseract-ocr-tessdata-fast))
     (inputs
      (list cairo
            icu4c
            leptonica
            pango
            python-wrapper))
+    (native-search-paths (list (search-path-specification
+                                (variable "TESSDATA_PREFIX")
+                                (files (list "share/tesseract-ocr/tessdata"))
+                                (separator #f)))) ;single value
     (home-page "https://github.com/tesseract-ocr/tesseract")
     (synopsis "Optical character recognition engine")
     (description
@@ -166,7 +191,9 @@ (define n (if parallel-build? (number->string
 high accuracy. It supports many languages, output text formatting, hOCR
 positional information and page layout analysis. Several image formats are
 supported through the Leptonica library. It can also detect whether text is
-monospaced or proportional.")
+monospaced or proportional. Support for the English language is included by
+default. To add support for more languages, the
+@code{tesseract-ocr-tessdata-fast} package should be installed.")
     (license license:asl2.0)))
 
 (define-public gimagereader
-- 
2.36.1

From unknown Mon Aug 18 02:36:12 2025
Subject: [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
To: Maxim Cournoyer
Cc: 57151@debbugs.gnu.org
From: Simon South
References: <20220812050543.3923-1-maxim.cournoyer@gmail.com>
 <20220812050752.3980-1-maxim.cournoyer@gmail.com>
Date: Fri, 12 Aug 2022 07:27:35 -0400
In-Reply-To: <20220812050752.3980-1-maxim.cournoyer@gmail.com> (Maxim
 Cournoyer's message of "Fri, 12 Aug 2022 01:07:51 -0400")
Message-ID: <87czd57lco.fsf@simonsouth.net>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain

Maxim Cournoyer writes:
> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.

Maxim,

Would it not be better to generate a separate package for each of the
languages and scripts this data covers, as is done by Debian for
instance? The entire dataset is about a gigabyte in size and supports
more than a hundred languages yet I imagine most people would be using
only one or two.

This would mean tesseract-ocr could simply propagate the
"tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
specific file, and would establish a convention that would be necessary
for packaging the "best" dataset as well, if that's desired.

(Thanks for working on this; it's been on my to-do list for a while as
well.)

-- 
Simon South
simon@simonsouth.net

From unknown Mon Aug 18 02:36:12 2025
Subject: [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
To: Simon South
Cc: 57151@debbugs.gnu.org
From: Maxim Cournoyer
References: <20220812050543.3923-1-maxim.cournoyer@gmail.com>
 <20220812050752.3980-1-maxim.cournoyer@gmail.com>
 <87czd57lco.fsf@simonsouth.net>
Date: Fri, 12 Aug 2022 08:52:25 -0400
In-Reply-To: <87czd57lco.fsf@simonsouth.net> (Simon South's message of
 "Fri, 12 Aug 2022 07:27:35 -0400")
Message-ID: <87k07dlj3q.fsf@gmail.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain

Hi Simon,

Simon South writes:

> Maxim Cournoyer writes:
>> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
>
> Maxim,
>
> Would it not be better to generate a separate package for each of the
> languages and scripts this data covers, as is done by Debian for
> instance? The entire dataset is about a gigabyte in size and supports
> more than a hundred languages yet I imagine most people would be using
> only one or two.
>
> This would mean tesseract-ocr could simply propagate the
> "tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
> specific file, and would establish a convention that would be necessary
> for packaging the "best" dataset as well, if that's desired.

That's a good idea! I think we could have both, like Debian also has a
'tesseract-ocr-all' package for all the languages/scripts.

Which means the individual variants could be added in at a later time
by those interested, eh :-). A procedure returning a language-specific
package variant would make sense for that.

Thanks,

Maxim

From unknown Mon Aug 18 02:36:12 2025
MIME-Version: 1.0
X-Mailer: MIME-tools 5.505 (Entity 5.505)
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Maxim Cournoyer
Subject: bug#57151: closed (Re: [bug#57151] [PATCH 1/2] gnu: Add
 tesseract-ocr-tessdata-fast.)
References: <877d3ddy37.fsf@gmail.com>
 <20220812050543.3923-1-maxim.cournoyer@gmail.com>
Reply-To: 57151@debbugs.gnu.org
Date: Fri, 12 Aug 2022 20:09:02 +0000
Content-Type: multipart/mixed; boundary="----------=_1660334942-20040-1"

This is a multi-part message in MIME format...

------------=_1660334942-20040-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Your bug report

#57151: [PATCH 0/2] *** Add trained data models for Tesseract OCR ***

which was filed against the guix-patches package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 57151@debbugs.gnu.org.

--=20 57151: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D57151 GNU Bug Tracking System Contact help-debbugs@gnu.org with problems ------------=_1660334942-20040-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at 57151-done) by debbugs.gnu.org; 12 Aug 2022 20:08:24 +0000 Received: from localhost ([127.0.0.1]:58983 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1oMaxD-0005C6-VT for submit@debbugs.gnu.org; Fri, 12 Aug 2022 16:08:24 -0400 Received: from mail-qk1-f174.google.com ([209.85.222.174]:37553) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1oMax9-0005Bs-Qv for 57151-done@debbugs.gnu.org; Fri, 12 Aug 2022 16:08:22 -0400 Received: by mail-qk1-f174.google.com with SMTP id a15so1184366qko.4 for <57151-done@debbugs.gnu.org>; Fri, 12 Aug 2022 13:08:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:user-agent:message-id:in-reply-to:date:references :subject:to:from:from:to:cc; bh=mIsD0XV1wzkSV7DNbMtPC1HrgJe8MMcgGMieeOTfLXQ=; b=OC/bZsUwwSQPkEI/pweOHm85Vw/CBuRm0I/nmfIplB8mjrguCRDoN2D2qnTakaNQMY 3822Zu5PopfdhAwiGzbDLxsEKu6KNEEOeca5iRg8dTX52CHHT7U9zuSOdTH3ITwb/Kmo nV/zWADhhem+48cBy/wHSjICJpPdn+JPWve8mcG1MjmdO8nSEkXhnq8QcbLBvAnzRkjr kMmzXjS6p+eF+JdZYahSVdSAXgYWk6GQJMxJBIChr7y/SpMJlH1zWhl7mmPAYMTuE0Ch /ttyffeniq73LarNn1DxfK1tmShz7NWBI0b9yNS/1rnRf19XzUqUokUnBcSb+vHAJa4E XHEg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=mime-version:user-agent:message-id:in-reply-to:date:references :subject:to:from:x-gm-message-state:from:to:cc; bh=mIsD0XV1wzkSV7DNbMtPC1HrgJe8MMcgGMieeOTfLXQ=; b=l3yqUCn/54nymqKpiuJk9CQsWGDkIb2MMGkK63gcYfxkXVXPaxiwAVafGmAe/9LHl2 MlhZ2NeULaCFPCciumygwuOtzQIvXjLoBDgM7shPQdvKZRydpK991uLorlQM5/MWex8g jeItCZ6wRUDIYm4Uswls8uOWIAhJDgyq7I+cgxfazbJac+bTusqstOh6PPW2Bh2M+8Xb Ggxit4J2OAaPqf9eUQ5Sk03TlXX/xQO3NX/aJhHD4aF40c/VHbUPaN8LiUtmX1JFCr48 
Kgh3riUwMAdU+2KAD7ErPcYxyXsJhs4dOq1R2reUrfivXC9TmG8U2YXBmsR0E/+C7lUb jH/A== X-Gm-Message-State: ACgBeo0WZ+xQWR9f6GOBwwP9A03AH1GrfYYrV3AyYe/db0cJwxIMKtXZ 6hvUtCdkCr1qvQLDe0ESFOjNGKrwwHI= X-Google-Smtp-Source: AA6agR7HCS/Q4NxoFiFOTmAmJExPfZJSb6cxuw0OVEimvlRO51yuVc6V+87qkFXTl7A+XLrHKYcXKg== X-Received: by 2002:a05:620a:8083:b0:6ba:bc3d:bc42 with SMTP id ef3-20020a05620a808300b006babc3dbc42mr4051067qkb.662.1660334894148; Fri, 12 Aug 2022 13:08:14 -0700 (PDT) Received: from hurd (dsl-205-233-125-72.b2b2c.ca. [205.233.125.72]) by smtp.gmail.com with ESMTPSA id b16-20020ac87550000000b0031ee3449f34sm2316529qtr.86.2022.08.12.13.08.13 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 12 Aug 2022 13:08:13 -0700 (PDT) From: Maxim Cournoyer To: Simon South , 57151-done@debbugs.gnu.org Subject: Re: [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast. References: <20220812050543.3923-1-maxim.cournoyer@gmail.com> <20220812050752.3980-1-maxim.cournoyer@gmail.com> <87czd57lco.fsf@simonsouth.net> <87k07dlj3q.fsf@gmail.com> <87bksp61wn.fsf@simonsouth.net> Date: Fri, 12 Aug 2022 16:08:12 -0400 In-Reply-To: <87bksp61wn.fsf@simonsouth.net> (Simon South's message of "Fri, 12 Aug 2022 09:12:56 -0400") Message-ID: <877d3ddy37.fsf@gmail.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 57151-done X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) Hi Simon, Simon South writes: > Maxim Cournoyer writes: >> Which means the individual variants could be added in at a later time >> by those interested, eh :-). > > Subtext noted. 
> > One last thing, in case you weren't already aware: Issue 47536 was > opened a while ago regarding the missing tessdata package, so you may > want to link it to your own issue 57151 and/or close it once your > changes are committed: > > https://issues.guix.gnu.org/47536 Thanks for pointing that to me. Pushed as ff0600c5ef. I'll now close the issue linked above. Thanks! Closing. Maxim ------------=_1660334942-20040-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at submit) by debbugs.gnu.org; 12 Aug 2022 05:05:54 +0000 Received: from localhost ([127.0.0.1]:55352 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1oMMrq-0005jg-69 for submit@debbugs.gnu.org; Fri, 12 Aug 2022 01:05:54 -0400 Received: from lists.gnu.org ([209.51.188.17]:41954) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1oMMro-0005jY-U0 for submit@debbugs.gnu.org; Fri, 12 Aug 2022 01:05:53 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:55656) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oMMro-00016r-M7 for guix-patches@gnu.org; Fri, 12 Aug 2022 01:05:52 -0400 Received: from mail-qt1-x82c.google.com ([2607:f8b0:4864:20::82c]:45898) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1oMMrn-0000HS-8x for guix-patches@gnu.org; Fri, 12 Aug 2022 01:05:52 -0400 Received: by mail-qt1-x82c.google.com with SMTP id j17so37909qtp.12 for ; Thu, 11 Aug 2022 22:05:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc; bh=Wx0+DhIwF7F0pWORcgfhRIEjVeUCa5W2ckCcGtyzRNg=; b=LHoSbOreldsVOgjpu19nyvzINVtfMWux5iqCiCm7LaG3zTd4zvch47GR9nsrOZbERF 8cnlRjUv9QbzbN+waXCehA/b6vuMvmiWKsjHz3kOksK6Hw+VHvnE/5ysSXGYA0bYwrtg 
vnZnLW+BvdJAN1BGVr0MNZd4lrGp8FkHBXOV/rzxjs1PGJTkRffd0tnJswebZRhvGTBM 0Vu6LpQS3qLngwsourJS7EeGYFK7cFCR71U5th4HoMdRZk7eY3deAOKrTOJHIgW2+9HD uZVyQPmV7R/0LqpQzSCyrniL2QiOO2IXbm69Rx0vzQJ5YZFYRnclPgOnjiYUGzZXdD4d ZfBQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc; bh=Wx0+DhIwF7F0pWORcgfhRIEjVeUCa5W2ckCcGtyzRNg=; b=fg7vTzIEssBtdXSwYzpNoLWRj2dPwNZXuOus+8Igch4kbF2rNsrkD0kin2PHnl/jdV 97vRk6IsdLEw3My0A3qcTZTuyzVXB+700eSEgAvnbGGXT+6zoHCunsIMx93XZaHQGuP3 /jCKfppDY3XMR2J3iDtEx26jaeFRMrVR6SWfpRZINK7P33ne2K1F0HWpd9vrxsoK1N82 lUpMw+XY3CsZKE4cYormexP7Zg42gopgQBOw2XZa9057B8DCqs9goCac6zWlWCXeW0B3 0z+pAP8rK31ZANe0XLm4aGNp7HyEXd/3sR20VaJ+qDE1kncY/iU3DGvYTFrxKJ/7BbfS x1MA== X-Gm-Message-State: ACgBeo3t23pNRwdoUlaVHaVIE6xmQ4lz6ouc/G0R6Qfn6ReHX/bho2p9 laHWPIaMcvxUlj4o7TVGFVIWO0ZBDrY= X-Google-Smtp-Source: AA6agR5U07sOpVVi68ZBb9UofBo/VzngWxSSh60WE0iC4mUnhbyARG0ki0TBW9gvB/HAEnGbWRRquQ== X-Received: by 2002:a05:622a:53:b0:31f:1fb6:8d3a with SMTP id y19-20020a05622a005300b0031f1fb68d3amr2200002qtw.386.1660280749699; Thu, 11 Aug 2022 22:05:49 -0700 (PDT) Received: from localhost.localdomain (dsl-10-148-207.b2b2c.ca. 
[72.10.148.207]) by smtp.gmail.com with ESMTPSA id l17-20020a05620a28d100b006b998b5191esm956039qkp.87.2022.08.11.22.05.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 11 Aug 2022 22:05:49 -0700 (PDT) From: Maxim Cournoyer To: guix-patches@gnu.org Subject: [PATCH 0/2] *** Add trained data models for Tesseract OCR *** Date: Fri, 12 Aug 2022 01:05:43 -0400 Message-Id: <20220812050543.3923-1-maxim.cournoyer@gmail.com> X-Mailer: git-send-email 2.36.1 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2607:f8b0:4864:20::82c; envelope-from=maxim.cournoyer@gmail.com; helo=mail-qt1-x82c.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -1.3 (-) X-Debbugs-Envelope-To: submit Cc: Maxim Cournoyer X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

Hello Guix,

This makes our tesseract-ocr package usable. Here's a small experiment
comparing GNU Ocrad vs Tesseract on a LightDM login screendump from QEMU:

--8<---------------cut here---------------start------------->8---
$ time ocrad -i -s 10 /tmp/dump.ppm
komput�lo _ O Tht_, _l_.__ �

real 0m9.616s
user 0m9.397s
sys 0m0.157s
$ time tesseract -l eng /tmp/dump.ppm out && cat out.txt
Estimating resolution as 133

real 0m0.389s
user 0m0.602s
sys 0m0.053s
komputilo QR @ Thu, 21:32 © Log In
--8<---------------cut here---------------end--------------->8---

Maxim Cournoyer (2):
  gnu: Add tesseract-ocr-tessdata-fast.
  gnu: tesseract-ocr: Make the default install minimally useful.

 gnu/packages/ocr.scm | 60 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 57 insertions(+), 3 deletions(-)

--
2.36.1

------------=_1660334942-20040-1--