Tesseract Add a Language
Issue:
A user is running Tesseract for OCR and needs to use a language other than English.
Purpose:
This procedure will teach you how to obtain, install and configure another language pack for the Tesseract OCR engine.
Prerequisites:
Note: this procedure was written for Ephesoft version 3.0.3.4 SP2. That version uses Tesseract 3.00 by default; Tesseract 3.01 is also installed but not used. This example shows how to reconfigure Ephesoft to use Tesseract 3.01, install an Arabic language pack, and configure Ephesoft to use that language pack.
Procedure:
- Stop the Ephesoft server.
- Rename the folder “[path]\Ephesoft\Application\native\Tesseract-OCR” to “[path]\Ephesoft\Application\native\Tesseract-OCR-3.00”
- Rename the folder “[path]\Ephesoft\Application\native\Tesseract-OCR-3.01” to “[path]\Ephesoft\Application\native\Tesseract-OCR”
- Download the Arabic language file from Google (ensure it is version 3.01 and not 3.02):
https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.01.ara.tar.gz&can=2&q=arabic
- Extract the contents of the tessdata folder from the compressed file to “[path]\Ephesoft\Application\native\Tesseract-OCR\tessdata”
- Start the Ephesoft server
- Go to the Batch Class Management screen and open the batch class which needs to be updated.
- Go to the Page Process Module.
- If the Recostar_HOCR and Create_OCR_Input plugins are present, remove them; they are not needed with Tesseract.
- Under the Tesseract_HOCR plugin, ensure the Tesseract Switch is set to On and the Tesseract Language is set to ara.
- Be sure to Save (and Validate/Deploy workflow if plugins were removed).
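The folder renames and the extraction step above can also be scripted. The following is a minimal Python sketch, not part of the Ephesoft tooling: the function names swap_tesseract_dirs and extract_tessdata are hypothetical, the Ephesoft server is assumed to be stopped already, and the downloaded archive is assumed to keep its language files directly inside a folder named tessdata.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath


def swap_tesseract_dirs(native_dir: Path) -> None:
    """Make the bundled 3.01 folder the active Tesseract-OCR folder."""
    active = native_dir / "Tesseract-OCR"
    old = native_dir / "Tesseract-OCR-3.00"
    new = native_dir / "Tesseract-OCR-3.01"
    if not new.is_dir():
        raise FileNotFoundError(f"expected folder not found: {new}")
    if old.exists():
        raise FileExistsError(f"already present, not overwriting: {old}")
    active.rename(old)   # Tesseract-OCR      -> Tesseract-OCR-3.00
    new.rename(active)   # Tesseract-OCR-3.01 -> Tesseract-OCR


def extract_tessdata(archive: Path, tessdata_dir: Path) -> list[str]:
    """Copy files that sit directly inside a 'tessdata' folder of the archive.

    Assumption: the archive is a .tar.gz whose language files live in a
    folder named 'tessdata'; everything else in the archive is skipped.
    """
    copied = []
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            path = PurePosixPath(member.name)
            if not member.isfile() or path.parent.name != "tessdata":
                continue
            # Write under the file name only, so nothing lands outside tessdata_dir.
            target = tessdata_dir / path.name
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            copied.append(path.name)
    return sorted(copied)
```

To use it, pass the “[path]\Ephesoft\Application\native” folder to swap_tesseract_dirs, then pass the downloaded archive and the “Tesseract-OCR\tessdata” folder beneath it to extract_tessdata. The rename refuses to run if a Tesseract-OCR-3.00 folder already exists, so running it twice does not swap the folders back by accident.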
Post Procedure:
- You may now run batches in the newly configured language.
- Other language packs are available at the following site:
https://code.google.com/p/tesseract-ocr/downloads/list
Pay close attention to the Tesseract version number and download only files for one of the two versions available in Ephesoft (3.00 or 3.01).
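After adding a language pack, it can help to confirm that the matching file is in the tessdata folder before entering its code as the Tesseract Language. The sketch below is an illustration only: the helper name installed_languages is hypothetical, and it assumes each language is stored as a file named after its code with a .traineddata extension (for example ara.traineddata).

```python
from pathlib import Path


def installed_languages(tessdata_dir: Path) -> list[str]:
    """Return the language codes that have a <code>.traineddata file."""
    return sorted(p.stem for p in tessdata_dir.glob("*.traineddata"))
```

If the code you plan to enter (such as ara) is missing from the returned list, the pack was probably extracted to the wrong folder.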