GNU bug report logs - #57151
[PATCH 0/2] *** Add trained data models for Tesseract OCR ***


Package: guix-patches

Reported by: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Date: Fri, 12 Aug 2022 05:06:02 UTC

Severity: normal

Tags: patch

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.



Message #14 received at 57151 <at> debbugs.gnu.org:

From: Simon South <simon <at> simonsouth.net>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: 57151 <at> debbugs.gnu.org
Subject: Re: [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
Date: Fri, 12 Aug 2022 07:27:35 -0400

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.

Maxim,

Would it not be better to generate a separate package for each of the
languages and scripts this data covers, as is done by Debian for
instance?  The entire dataset is about a gigabyte in size and supports
more than a hundred languages yet I imagine most people would be using
only one or two.

This would mean tesseract-ocr could simply propagate the
"tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
specific file, and would establish a convention that would be necessary
for packaging the "best" dataset as well, if that's desired.

(Thanks for working on this; it's been on my to-do list for a while as
well.)

-- 
Simon South
simon <at> simonsouth.net




