GNU bug report logs - #61851
[PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
Reported by: jlicht <at> fsfe.org
Date: Mon, 27 Feb 2023 20:56:02 UTC
Severity: normal
Tags: patch
Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Bug is archived. No further changes may be made.
Hi Simon,
Simon South <simon <at> simonsouth.net> writes:
> Jelle,
>
> Respectfully, and speaking only as an interested observer, I think this
> may not be the right fix.
Cunningham's law strikes again :) [1].
>
> Guix's Tesseract is indeed missing its config files, causing (among
> other things) the examples in the online documentation[0] to not work,
> e.g.:
>
> ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
> read_params_file: Can't open hocr
> The (quick) [brown] {fox} jumps!
> Over the $43,456.78 <lazy> #90 dog
> (...)
>
> But the root issue appears to be a misconfiguration of the
> TESSDATA_PREFIX search path in the tessdata-ocr package, which causes
> Tesseract's own config files to be installed in a folder other than the
> one it's configured to search.
>
> Fixing this places Tesseract's config files and the trained-data files
> together beneath /usr/share/tessdata, allowing Tesseract to work as
> expected:
>
> ssouth@hamlet ~/tesseract-ocr-test [env]$ tesseract images/eurotext.png - -l eng hocr
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> (...)
I will believe you without any doubt, but there's this spooky comment
left in the tesseract-ocr 'adjust-TESSDATA_PREFIX-macro phase:
--8<---------------cut here---------------start------------->8---
;; Use a deeper TESSDATA_PREFIX hierarchy so that a more
;; specific search-path than '/share' can be specified. The
;; build system uses CPPFLAGS for itself, so we can't simply set
;; a make flag.
--8<---------------cut here---------------end--------------->8---
This makes me believe the current situation was a deliberate choice, but
I personally don't understand what the original problem was/is.
> This approach has the advantage of keeping the
> tesseract-ocr-tessdata-fast package "pure" and focused only on
> trained-data files, which will be important for the patch I'm working on
> that will split it into multiple packages, one for each language and
> script, to allow greater flexibility.
>
> I'll respond to this email with a draft (!) patch to tesseract-ocr that
> should achieve the same result as yours, making the config files
> available for use. Does this also fix the problem for you? If so,
> would you consider submitting this change instead?
It seems to work for my stuff! I'm bringing in Maxim to weigh in on this, as
they are the (un?)lucky expert according to my git-fu.
Thanks for paying attention!
- Jelle
[1] https://meta.wikimedia.org/wiki/Cunningham%27s_Law