GNU bug report logs - #61851
[PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.


Package: guix-patches;

Reported by: jlicht <at> fsfe.org

Date: Mon, 27 Feb 2023 20:56:02 UTC

Severity: normal

Tags: patch

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.


From: Simon South <simon <at> simonsouth.net>
To: jlicht <at> fsfe.org
Cc: 61851 <at> debbugs.gnu.org
Subject: [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
Date: Mon, 27 Feb 2023 17:43:43 -0500

Jelle,

Respectfully, and speaking only as an interested observer, I think this
may not be the right fix.

Guix's Tesseract is indeed missing its config files, causing (among
other things) the examples in the online documentation[0] to fail,
e.g.:

  ssouth <at> hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
  read_params_file: Can't open hocr
  The (quick) [brown] {fox} jumps!
  Over the $43,456.78 <lazy> #90 dog
  (...)

But the root issue appears to be a misconfiguration of the
TESSDATA_PREFIX search path in the tesseract-ocr package, which causes
Tesseract's own config files to be installed in a directory other than
the one it's configured to search.
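
One way to confirm this diagnosis is to check whether the directory
named by TESSDATA_PREFIX actually holds both the config files and the
trained-data files.  The following is only a rough sketch, not part of
any patch: the function name check_tessdata is hypothetical, and it
assumes that config files such as "hocr" sit in a "configs"
subdirectory and that trained-data files end in ".traineddata".

```shell
# Hypothetical helper: report what a tessdata directory contains.
# Assumption: config files (e.g. "hocr") live under "configs/" and
# trained-data files are named "*.traineddata".
check_tessdata() {
  dir=$1

  # Is the "hocr" config file where Tesseract would look for it?
  if [ -f "$dir/configs/hocr" ]; then
    echo "configs: found"
  else
    echo "configs: missing"
  fi

  # Is at least one trained-data file present in the same directory?
  if ls "$dir"/*.traineddata >/dev/null 2>&1; then
    echo "traineddata: found"
  else
    echo "traineddata: missing"
  fi
}
```

Running check_tessdata "$TESSDATA_PREFIX" inside the environment should
report the config files as missing if the search path is the culprit.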

Fixing this places Tesseract's config files and the trained-data files
together beneath /usr/share/tessdata, allowing Tesseract to work as
expected:

  ssouth <at> hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  (...)

This approach has the advantage of keeping the
tesseract-ocr-tessdata-fast package "pure" and focused only on
trained-data files, which will be important for the patch I'm working on
that will split it into multiple packages, one for each language and
script, to allow greater flexibility.

I'll respond to this email with a draft (!) patch to tesseract-ocr that
should achieve the same result as yours, making the config files
available for use.  Does this also fix the problem for you?  If so,
would you consider submitting this change instead?

-- 
Simon South
simon <at> simonsouth.net

[0] https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html




This bug report was last modified 2 years and 66 days ago.
