GNU bug report logs - #57151
[PATCH 0/2] *** Add trained data models for Tesseract OCR ***

Reported by: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Date: Fri, 12 Aug 2022 05:06:02 UTC

Severity: normal

Tags: patch

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.

Report forwarded to guix-patches <at> gnu.org:
bug#57151; Package guix-patches. (Fri, 12 Aug 2022 05:06:02 GMT)

Acknowledgement sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
New bug report received and forwarded. Copy sent to guix-patches <at> gnu.org. (Fri, 12 Aug 2022 05:06:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: guix-patches <at> gnu.org
Cc: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: [PATCH 0/2] *** Add trained data models for Tesseract OCR  ***
Date: Fri, 12 Aug 2022 01:05:43 -0400

Hello Guix,

This makes our tesseract-ocr package usable.  Here's a small experiment
comparing GNU Ocrad vs Tesseract on a LightDM login screendump from QEMU:

--8<---------------cut here---------------start------------->8---
$ time ocrad -i -s 10 /tmp/dump.ppm
komput�lo _ O Tht_, _l_.__ �

real    0m9.616s
user    0m9.397s
sys     0m0.157s

$ time tesseract -l eng /tmp/dump.ppm out && cat out.txt
Estimating resolution as 133

real    0m0.389s
user    0m0.602s
sys     0m0.053s
komputilo QR @ Thu, 21:32 ©

Log In
--8<---------------cut here---------------end--------------->8---

Maxim Cournoyer (2):
  gnu: Add tesseract-ocr-tessdata-fast.
  gnu: tesseract-ocr: Make the default install minimally useful.

 gnu/packages/ocr.scm | 60 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 57 insertions(+), 3 deletions(-)

-- 
2.36.1

Information forwarded to guix-patches <at> gnu.org:
bug#57151; Package guix-patches. (Fri, 12 Aug 2022 05:09:02 GMT)

Message #8 received at 57151 <at> debbugs.gnu.org:

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: 57151 <at> debbugs.gnu.org
Cc: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
Date: Fri, 12 Aug 2022 01:07:51 -0400

* gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
---
 gnu/packages/ocr.scm | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e28bd17668..e2c9f561cc 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -29,6 +29,7 @@ (define-module (gnu packages ocr)
   #:use-module (guix gexp)
   #:use-module (guix git-download)
   #:use-module (guix build-system cmake)
+  #:use-module (guix build-system copy)
   #:use-module (guix build-system gnu)
   #:use-module (guix build-system python)
   #:use-module (gnu packages)
@@ -74,6 +75,32 @@ (define-public ocrad
 it produces text in 8-bit or UTF-8 formats.")
     (license license:gpl3+)))
 
+(define-public tesseract-ocr-tessdata-fast
+  (package
+    (name "tesseract-ocr-tessdata-fast")
+    (version "4.1.0")
+    (source (origin
+              (method git-fetch)
+              (uri (git-reference
+                    (url "https://github.com/tesseract-ocr/tessdata_fast")
+                    (commit version)))
+              (file-name (git-file-name name version))
+              (sha256
+               (base32
+                "1m310cpb87xx8l8q7jy9fvzf6a0m8rm0dmjpbiwhc2mi6w4gn084"))))
+    (build-system copy-build-system)
+    (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))
+                     #:phases #~(modify-phases %standard-phases
+                                  (add-after 'unpack 'delete-broken-links
+                                    (lambda _
+                                      (delete-file "configs")
+                                      (delete-file "pdf.ttf"))))))
+    (home-page "https://github.com/tesseract-ocr/tessdata_fast")
+    (synopsis "Fast integer versions of trained LSTM models")
+    (description "This repository contains fast integer versions of trained
+models for the Tesseract OCR Engine.")
+    (license license:asl2.0)))
+
 (define-public tesseract-ocr
   (package
     (name "tesseract-ocr")
-- 
2.36.1

Information forwarded to guix-patches <at> gnu.org:
bug#57151; Package guix-patches. (Fri, 12 Aug 2022 05:09:02 GMT)

Message #11 received at 57151 <at> debbugs.gnu.org:

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: 57151 <at> debbugs.gnu.org
Cc: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: [PATCH 2/2] gnu: tesseract-ocr: Make the default install minimally
 useful.
Date: Fri, 12 Aug 2022 01:07:52 -0400

* gnu/packages/ocr.scm (tesseract-ocr)
[phases]{adjust-TESSDATA_PREFIX-macro}: New phase.
{install-minimal-tessdata}: New phase.
[native-inputs]: Add tesseract-ocr-tessdata-fast.
[native-search-paths]: New field.
[description]: Mention how to add support for more languages.
---
 gnu/packages/ocr.scm | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e2c9f561cc..21d257ef24 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -132,6 +132,15 @@ (define-public tesseract-ocr
               (substitute* "configure.ac"
                 (("AC_SUBST\\(\\[XML_CATALOG_FILES])")
                  ""))))
+          (add-after 'unpack 'adjust-TESSDATA_PREFIX-macro
+            (lambda _
+              ;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
+              ;; specific search-path than '/share' can be specified.  The
+              ;; build system uses CPPFLAGS for itself, so we can't simply set
+              ;; a make flag.
+              (substitute* "Makefile.am"
+                (("-DTESSDATA_PREFIX='\"@datadir@\"'")
+                 "-DTESSDATA_PREFIX='\"@datadir@/tesseract-ocr\"'"))))
           (add-after 'build 'build-training
             (lambda* (#:key parallel-build? #:allow-other-keys)
               (define n (if parallel-build? (number->string
@@ -140,7 +149,18 @@ (define n (if parallel-build? (number->string
               (invoke "make" "-j" n "training")))
           (add-after 'install 'install-training
             (lambda _
-              (invoke "make" "training-install"))))))
+              (invoke "make" "training-install")))
+          (add-after 'install 'install-minimal-tessdata
+            ;; tesseract-ocr cannot be used without its trained models data;
+            ;; install the English language as a minimal base which can be
+            ;; extended via TESSDATA_PREFIX.
+            (lambda* (#:key native-inputs inputs #:allow-other-keys)
+              (define eng.traineddata
+                "/share/tesseract-ocr/tessdata/eng.traineddata")
+              (install-file (search-input-file (or native-inputs inputs)
+                                               eng.traineddata)
+                            (dirname (string-append #$output
+                                                    eng.traineddata))))))))
     (native-inputs
      (list asciidoc
            autoconf
@@ -152,13 +172,18 @@ (define n (if parallel-build? (number->string
            libtool
            libxml2                      ;for XML_CATALOG_FILES
            libxslt
-           pkg-config))
+           pkg-config
+           tesseract-ocr-tessdata-fast))
     (inputs
      (list cairo
            icu4c
            leptonica
            pango
            python-wrapper))
+    (native-search-paths (list (search-path-specification
+                                (variable "TESSDATA_PREFIX")
+                                (files (list "share/tesseract-ocr/tessdata"))
+                                (separator #f)))) ;single value
     (home-page "https://github.com/tesseract-ocr/tesseract")
     (synopsis "Optical character recognition engine")
     (description
@@ -166,7 +191,9 @@ (define n (if parallel-build? (number->string
 high accuracy.  It supports many languages, output text formatting, hOCR
 positional information and page layout analysis.  Several image formats are
 supported through the Leptonica library.  It can also detect whether text is
-monospaced or proportional.")
+monospaced or proportional.  Support for the English language is included by
+default.  To add support for more languages, the
+@code{tesseract-ocr-tessdata-fast} package should be installed.")
     (license license:asl2.0)))
 
 (define-public gimagereader
-- 
2.36.1
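
For illustration only (this is not part of the patch, and "manifest.scm" is
a hypothetical file name): assuming the TESSDATA_PREFIX search path added
above behaves like other single-valued search paths in Guix, a manifest
listing both packages should yield a profile whose TESSDATA_PREFIX points at
the profile's share/tesseract-ocr/tessdata directory, which would then hold
every model shipped by tesseract-ocr-tessdata-fast.

```scheme
;; manifest.scm -- hypothetical example, not taken from the thread.
;; Puts the OCR engine and the full set of "fast" models in one profile.
(specifications->manifest
 (list "tesseract-ocr"
       "tesseract-ocr-tessdata-fast"))
```

If that assumption holds, running "guix shell -m manifest.scm -- tesseract
--list-langs" should list more languages than just "eng".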

Information forwarded to guix-patches <at> gnu.org:
bug#57151; Package guix-patches. (Fri, 12 Aug 2022 11:28:01 GMT)

Message #14 received at 57151 <at> debbugs.gnu.org:

From: Simon South <simon <at> simonsouth.net>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: 57151 <at> debbugs.gnu.org
Subject: Re: [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
Date: Fri, 12 Aug 2022 07:27:35 -0400

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.

Maxim,

Would it not be better to generate a separate package for each of the
languages and scripts this data covers, as is done by Debian for
instance?  The entire dataset is about a gigabyte in size and supports
more than a hundred languages yet I imagine most people would be using
only one or two.

This would mean tesseract-ocr could simply propagate the
"tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
specific file, and would establish a convention that would be necessary
for packaging the "best" dataset as well, if that's desired.

(Thanks for working on this; it's been on my to-do list for a while as
well.)

-- 
Simon South
simon <at> simonsouth.net

Information forwarded to guix-patches <at> gnu.org:
bug#57151; Package guix-patches. (Fri, 12 Aug 2022 12:53:02 GMT)

Message #17 received at 57151 <at> debbugs.gnu.org:

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Simon South <simon <at> simonsouth.net>
Cc: 57151 <at> debbugs.gnu.org
Subject: Re: [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
Date: Fri, 12 Aug 2022 08:52:25 -0400

Hi Simon,

Simon South <simon <at> simonsouth.net> writes:

> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
>
> Maxim,
>
> Would it not be better to generate a separate package for each of the
> languages and scripts this data covers, as is done by Debian for
> instance?  The entire dataset is about a gigabyte in size and supports
> more than a hundred languages yet I imagine most people would be using
> only one or two.
>
> This would mean tesseract-ocr could simply propagate the
> "tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
> specific file, and would establish a convention that would be necessary
> for packaging the "best" dataset as well, if that's desired.

That's a good idea!  I think we could have both, like Debian also has a
'tesseract-ocr-all' package for all the languages/scripts.  Which means
the individual variants could be added in at a later time by those
interested, eh :-).

A procedure returning a language-specific package variant would make
sense for that.

Thanks,

Maxim
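
A minimal sketch of the kind of procedure suggested in the message above
might look as follows.  It is not part of the posted patches: the procedure
name, the "fra" example, and the resulting package name are all hypothetical.
It assumes the per-language files sit at the top level of the tessdata_fast
checkout (as eng.traineddata does in patch 2/2) and it does not handle the
per-script models.  If the code were added to gnu/packages/ocr.scm itself,
the use-modules form would be dropped.

```scheme
;; Hypothetical sketch; neither this procedure nor the package it defines
;; appears in the patches of this thread.
(use-modules (guix packages)
             (guix gexp)
             (gnu packages ocr))

(define (tesseract-ocr-tessdata-fast-for-language lang)
  "Return a variant of TESSERACT-OCR-TESSDATA-FAST that installs only the
trained data for LANG, a Tesseract language code such as \"fra\"."
  (package
    (inherit tesseract-ocr-tessdata-fast)
    (name (string-append "tesseract-ocr-tessdata-fast-" lang))
    (arguments
     ;; Replace the inherited arguments: only one file is copied, so the
     ;; phase deleting the broken links should no longer be needed.
     (list #:install-plan
           #~'((#$(string-append lang ".traineddata")
                "share/tesseract-ocr/tessdata/"))))))

;; Example use: a French-only data package.
(define-public tesseract-ocr-tessdata-fast-fra
  (tesseract-ocr-tessdata-fast-for-language "fra"))
```

The trailing slash in the install-plan target makes copy-build-system place
the file beneath that directory under its own base name, so the variant
installs into the same share/tesseract-ocr/tessdata directory that the
TESSDATA_PREFIX search path from patch 2/2 looks for.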

Reply sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
You have taken responsibility. (Fri, 12 Aug 2022 20:09:02 GMT)

Notification sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
bug acknowledged by developer. (Fri, 12 Aug 2022 20:09:02 GMT)

Message #22 received at 57151-done <at> debbugs.gnu.org:

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Simon South <simon <at> simonsouth.net>, 57151-done <at> debbugs.gnu.org
Subject: Re: [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
Date: Fri, 12 Aug 2022 16:08:12 -0400

Hi Simon,

Simon South <simon <at> simonsouth.net> writes:

> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>> Which means the individual variants could be added in at a later time
>> by those interested, eh :-).
>
> Subtext noted.
>
> One last thing, in case you weren't already aware: Issue 47536 was
> opened a while ago regarding the missing tessdata package, so you may
> want to link it to your own issue 57151 and/or close it once your
> changes are committed:
>
> https://issues.guix.gnu.org/47536

Thanks for pointing that out to me.  Pushed as ff0600c5ef.  I'll now close
the issue linked above.

Thanks!

Closing.

Maxim

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 10 Sep 2022 11:24:07 GMT)
