GNU bug report logs - #57151
[PATCH 0/2] *** Add trained data models for Tesseract OCR ***

Package: guix-patches

Reported by: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Date: Fri, 12 Aug 2022 05:06:02 UTC

Severity: normal

Tags: patch

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Simon South <simon <at> simonsouth.net>
Cc: 57151 <at> debbugs.gnu.org
Subject: [bug#57151] [PATCH 1/2] gnu: Add tesseract-ocr-tessdata-fast.
Date: Fri, 12 Aug 2022 08:52:25 -0400

Hi Simon,

Simon South <simon <at> simonsouth.net> writes:

> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>> * gnu/packages/ocr.scm (tesseract-ocr-tessdata-fast): New variable.
>
> Maxim,
>
> Would it not be better to generate a separate package for each of the
> languages and scripts this data covers, as is done by Debian for
> instance?  The entire dataset is about a gigabyte in size and supports
> more than a hundred languages yet I imagine most people would be using
> only one or two.
>
> This would mean tesseract-ocr could simply propagate the
> "tesseract-ocr-tessdata-fast-eng" package rather than cherry-picking a
> specific file, and would establish a convention that would be necessary
> for packaging the "best" dataset as well, if that's desired.

That's a good idea!  I think we could have both, just as Debian also has
a 'tesseract-ocr-all' package for all the languages/scripts.  That means
the individual variants could be added at a later time by those
interested, eh :-).

A procedure returning a language-specific package variant would make
sense for that.

Thanks,

Maxim




