GNU bug report logs - #57151
[PATCH 0/2] *** Add trained data models for Tesseract OCR ***
Report forwarded to guix-patches <at> gnu.org: bug#57151; Package guix-patches. (Fri, 12 Aug 2022 05:06:02 GMT)
Acknowledgement sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>: New bug report received and forwarded. Copy sent to guix-patches <at> gnu.org. (Fri, 12 Aug 2022 05:06:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hello Guix,
This makes our tesseract-ocr package usable. Here's a small experiment
comparing GNU Ocrad vs Tesseract on a LightDM login screendump from QEMU:
--8<---------------cut here---------------start------------->8---
$ time ocrad -i -s 10 /tmp/dump.ppm
komput�lo _ O Tht_, _l_.__ �
real 0m9.616s
user 0m9.397s
sys 0m0.157s
$ time tesseract -l eng /tmp/dump.ppm out && cat out.txt
Estimating resolution as 133
real 0m0.389s
user 0m0.602s
sys 0m0.053s
komputilo QR @ Thu, 21:32 ©
Log In
--8<---------------cut here---------------end--------------->8---
Maxim Cournoyer (2):
gnu: Add tesseract-ocr-tessdata-fast.
gnu: tesseract-ocr: Make the default install minimally useful.
gnu/packages/ocr.scm | 60 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 57 insertions(+), 3 deletions(-)
--
2.36.1
Information forwarded to guix-patches <at> gnu.org: bug#57151; Package guix-patches. (Fri, 12 Aug 2022 05:09:02 GMT)
Message #8 received at 57151 <at> debbugs.gnu.org:
* gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
---
gnu/packages/ocr.scm | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e28bd17668..e2c9f561cc 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -29,6 +29,7 @@ (define-module (gnu packages ocr)
   #:use-module (guix gexp)
   #:use-module (guix git-download)
   #:use-module (guix build-system cmake)
+  #:use-module (guix build-system copy)
   #:use-module (guix build-system gnu)
   #:use-module (guix build-system python)
   #:use-module (gnu packages)
@@ -74,6 +75,32 @@ (define-public ocrad
 it produces text in 8-bit or UTF-8 formats.")
     (license license:gpl3+)))

+(define-public tesseract-ocr-tessdata-fast
+  (package
+    (name "tesseract-ocr-tessdata-fast")
+    (version "4.1.0")
+    (source (origin
+              (method git-fetch)
+              (uri (git-reference
+                    (url "https://github.com/tesseract-ocr/tessdata_fast")
+                    (commit version)))
+              (file-name (git-file-name name version))
+              (sha256
+               (base32
+                "1m310cpb87xx8l8q7jy9fvzf6a0m8rm0dmjpbiwhc2mi6w4gn084"))))
+    (build-system copy-build-system)
+    (arguments (list #:install-plan #~'(("." "share/tesseract-ocr/tessdata"))
+                     #:phases #~(modify-phases %standard-phases
+                                  (add-after 'unpack 'delete-broken-links
+                                    (lambda _
+                                      (delete-file "configs")
+                                      (delete-file "pdf.ttf"))))))
+    (home-page "https://github.com/tesseract-ocr/tessdata_fast")
+    (synopsis "Fast integer versions of trained LSTM models")
+    (description "This repository contains fast integer versions of trained
+models for the Tesseract OCR Engine.")
+    (license license:asl2.0)))
+
 (define-public tesseract-ocr
   (package
     (name "tesseract-ocr")
--
2.36.1
Information forwarded to guix-patches <at> gnu.org: bug#57151; Package guix-patches. (Fri, 12 Aug 2022 05:09:02 GMT)
Message #11 received at 57151 <at> debbugs.gnu.org:
* gnu/packages/ocr.scm (tesseract-ocr)
[phases]{adjust-TESSDATA_PREFIX-macro}: New phase.
{install-minimal-tessdata}: New phase.
[native-inputs]: Add tesseract-ocr-tessdata-fast.
[search-paths]: New field.
[description]: Mention how to add support for more languages.
---
gnu/packages/ocr.scm | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/gnu/packages/ocr.scm b/gnu/packages/ocr.scm
index e2c9f561cc..21d257ef24 100644
--- a/gnu/packages/ocr.scm
+++ b/gnu/packages/ocr.scm
@@ -132,6 +132,15 @@ (define-public tesseract-ocr
               (substitute* "configure.ac"
                 (("AC_SUBST\\(\\[XML_CATALOG_FILES])")
                  ""))))
+          (add-after 'unpack 'adjust-TESSDATA_PREFIX-macro
+            (lambda _
+              ;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
+              ;; specific search-path than '/share' can be specified. The
+              ;; build system uses CPPFLAGS for itself, so we can't simply set
+              ;; a make flag.
+              (substitute* "Makefile.am"
+                (("-DTESSDATA_PREFIX='\"@datadir@\"'")
+                 "-DTESSDATA_PREFIX='\"@datadir@/tesseract-ocr\"'"))))
           (add-after 'build 'build-training
             (lambda* (#:key parallel-build? #:allow-other-keys)
               (define n (if parallel-build? (number->string
@@ -140,7 +149,18 @@ (define n (if parallel-build? (number->string
               (invoke "make" "-j" n "training")))
           (add-after 'install 'install-training
             (lambda _
-              (invoke "make" "training-install"))))))
+              (invoke "make" "training-install")))
+          (add-after 'install 'install-minimal-tessdata
+            ;; tesseract-ocr cannot be used without its trained models data;
+            ;; install the English language as a minimal base which can be
+            ;; extended via TESSDATA_PREFIX.
+            (lambda* (#:key native-inputs inputs #:allow-other-keys)
+              (define eng.traineddata
+                "/share/tesseract-ocr/tessdata/eng.traineddata")
+              (install-file (search-input-file (or native-inputs inputs)
+                                               eng.traineddata)
+                            (dirname (string-append #$output
+                                                    eng.traineddata))))))))
     (native-inputs
      (list asciidoc
            autoconf
@@ -152,13 +172,18 @@ (define n (if parallel-build? (number->string
            libtool
            libxml2 ;for XML_CATALOG_FILES
            libxslt
-           pkg-config))
+           pkg-config
+           tesseract-ocr-tessdata-fast))
     (inputs
      (list cairo
            icu4c
            leptonica
            pango
            python-wrapper))
+    (native-search-paths (list (search-path-specification
+                                (variable "TESSDATA_PREFIX")
+                                (files (list "share/tesseract-ocr/tessdata"))
+                                (separator #f)))) ;single value
     (home-page "https://github.com/tesseract-ocr/tesseract")
     (synopsis "Optical character recognition engine")
     (description
@@ -166,7 +191,9 @@ (define n (if parallel-build? (number->string
 high accuracy. It supports many languages, output text formatting, hOCR
 positional information and page layout analysis. Several image formats are
 supported through the Leptonica library. It can also detect whether text is
-monospaced or proportional.")
+monospaced or proportional. Support for the English language is included by
+default. To add support for more languages, the
+@code{tesseract-ocr-tessdata-fast} package should be installed.")
     (license license:asl2.0)))

 (define-public gimagereader
--
2.36.1
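A note on the `(separator #f)` in the search-path specification above: it marks TESSDATA_PREFIX as single-valued, so the variable ends up naming one directory rather than a colon-joined list. The following is only a rough model of that behaviour in plain Guile, assuming the first matching directory wins. The procedure name `single-valued-search-path` and the example profile path are made up for illustration, and this is not Guix's actual implementation.

```scheme
(use-modules (srfi srfi-1))             ;for 'find'

;; Illustrative model only: 'single-valued-search-path' is a hypothetical
;; name, not a Guix procedure.  Return the first DIRECTORY/SUBDIR that exists
;; as a directory, or #f when none of the candidates does.
(define (single-valued-search-path directories subdir)
  (find (lambda (candidate)
          ;; Check existence first: 'file-is-directory?' may raise an error
          ;; on a missing file.
          (and (file-exists? candidate)
               (file-is-directory? candidate)))
        (map (lambda (directory)
               (string-append directory "/" subdir))
             directories)))

;; Example: which tessdata directory would TESSDATA_PREFIX name for a user
;; profile?  The profile location is an assumption made for the example.
(single-valued-search-path
 (list (string-append (or (getenv "HOME") "") "/.guix-profile"))
 "share/tesseract-ocr/tessdata")
```

Under this model, installing tesseract-ocr-tessdata-fast alongside tesseract-ocr would make the profile's own tessdata directory the single value, which matches the patch's description of English as a base that can be extended via TESSDATA_PREFIX.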
Information forwarded to guix-patches <at> gnu.org: bug#57151; Package guix-patches. (Fri, 12 Aug 2022 11:28:01 GMT)
Message #14 received at 57151 <at> debbugs.gnu.org:
Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
Maxim,
Would it not be better to generate a separate package for each of the
languages and scripts this data covers, as is done by Debian for
instance? The entire dataset is about a gigabyte in size and supports
more than a hundred languages yet I imagine most people would be using
only one or two.
This would mean tesseract-ocr could simply propagate the
"tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
specific file, and would establish a convention that would be necessary
for packaging the "best" dataset as well, if that's desired.
(Thanks for working on this; it's been on my to-do list for a while as
well.)
--
Simon South
simon <at> simonsouth.net
Information forwarded to guix-patches <at> gnu.org: bug#57151; Package guix-patches. (Fri, 12 Aug 2022 12:53:02 GMT)
Message #17 received at 57151 <at> debbugs.gnu.org:
Hi Simon,
Simon South <simon <at> simonsouth.net> writes:
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
>
> Maxim,
>
> Would it not be better to generate a separate package for each of the
> languages and scripts this data covers, as is done by Debian for
> instance? The entire dataset is about a gigabyte in size and supports
> more than a hundred languages yet I imagine most people would be using
> only one or two.
>
> This would mean tesseract-ocr could simply propagate the
> "tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
> specific file, and would establish a convention that would be necessary
> for packaging the "best" dataset as well, if that's desired.
That's a good idea! I think we could have both, as Debian also has a
'tesseract-ocr-all' package for all the languages/scripts. Which means
the individual variants could be added in at a later time by those
interested, eh :-).
A procedure returning a language-specific package variant would make
sense for that.
Thanks,
Maxim
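A procedure of the kind suggested above might look roughly like the following. This is an untested sketch, not part of the patch series: the name `tessdata-fast-for-language` is hypothetical, and it assumes each language's model sits at the top of the tessdata_fast checkout as `LANGUAGE.traineddata`, as `eng.traineddata` does in the patch. It has to be evaluated in a Guix environment where `(gnu packages ocr)` already provides `tesseract-ocr-tessdata-fast`.

```scheme
(use-modules (guix packages)
             (guix gexp)
             (guix utils)               ;for 'substitute-keyword-arguments'
             (gnu packages ocr))

;; Hypothetical helper: return a variant of tesseract-ocr-tessdata-fast that
;; installs only the model for LANGUAGE, a Tesseract language code such as
;; "eng" or "fra".
(define (tessdata-fast-for-language language)
  (let ((file (string-append language ".traineddata")))
    (package
      (inherit tesseract-ocr-tessdata-fast)
      (name (string-append "tesseract-ocr-tessdata-fast-" language))
      (arguments
       ;; Keep the inherited phases and only narrow the install plan.  The
       ;; trailing slash asks copy-build-system to install FILE beneath the
       ;; target directory.
       (substitute-keyword-arguments
           (package-arguments tesseract-ocr-tessdata-fast)
         ((#:install-plan _)
          #~'((#$file "share/tesseract-ocr/tessdata/")))))
      (synopsis (string-append "Fast integer Tesseract model for the '"
                               language "' language code")))))

;; Example variant, using the package name Simon proposed.
(define-public tesseract-ocr-tessdata-fast-eng
  (tessdata-fast-for-language "eng"))
```

Inheriting from the full package keeps the source, home page and license in one place, so a convention for the "best" dataset could reuse the same shape with a different parent package.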
Reply sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>: You have taken responsibility. (Fri, 12 Aug 2022 20:09:02 GMT)
Notification sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>: bug acknowledged by developer. (Fri, 12 Aug 2022 20:09:02 GMT)
Message #22 received at 57151-done <at> debbugs.gnu.org:
Hi Simon,
Simon South <simon <at> simonsouth.net> writes:
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>> Which means the individual variants could be added in at a later time
>> by those interested, eh :-).
>
> Subtext noted.
>
> One last thing, in case you weren't already aware: Issue 47536 was
> opened a while ago regarding the missing tessdata package, so you may
> want to link it to your own issue 57151 and/or close it once your
> changes are committed:
>
> https://issues.guix.gnu.org/47536
Thanks for pointing that out to me. Pushed as ff0600c5ef. I'll now close
the issue linked above.
Thanks!
Closing.
Maxim
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 10 Sep 2022 11:24:07 GMT)