OCR in Ghostscript-9.54

Introduction

OCR (Optical Character Recognition) is a new feature in Ghostscript-9.54, based on the Tesseract and Leptonica libraries. OCR can read a scanned document and output a pdf file containing the individual characters and words in the document. This makes it possible to select words and sentences and copy them. So in a sense, the text on the scanned page becomes usable again.

Machine Learning

OCR depends on machine learning to understand text written in a particular language. The results of many machine learning sessions for a particular language are stored in a single file. This file should be placed in /usr/local/share/tessdata. You will have to create the tessdata folder yourself.

Many such files have been constructed over the years for various languages. We will use such a file for the English language, named "eng.traineddata". This file can be downloaded from https://github.com/tesseract-ocr/tessdata. Download this file and place it in /usr/local/share/tessdata.
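The setup just described can be sketched as a pair of shell functions. Treat this as a sketch under assumptions rather than an official installer: the download address is my guess at the direct link for the file on the repository's main branch, the function names are invented for illustration, and writing to /usr/local/share usually requires administrator rights.

```shell
# Folder named in the text; override by setting TESSDATA_DIR first.
TESSDATA_DIR="${TESSDATA_DIR:-/usr/local/share/tessdata}"
# Assumption: direct-download address for the English file on the main branch.
ENG_URL="https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata"

# Hypothetical helper: create the folder and download eng.traineddata into it.
install_eng_traineddata() {
    target_dir="${1:-$TESSDATA_DIR}"
    mkdir -p "$target_dir" || return 1
    # -L follows GitHub's redirect; --fail avoids saving an error page as the file.
    curl -L --fail -o "$target_dir/eng.traineddata" "$ENG_URL"
}

# Hypothetical helper: report whether the file is present and non-empty.
check_traineddata() {
    target_dir="${1:-$TESSDATA_DIR}"
    if [ -s "$target_dir/eng.traineddata" ]; then
        echo "found: $target_dir/eng.traineddata"
    else
        echo "missing: $target_dir/eng.traineddata"
        return 1
    fi
}
```

After loading these definitions in Terminal, "install_eng_traineddata" followed by "check_traineddata" should report the file as found. If the first step fails with a permission error, run the mkdir and curl lines by hand with sudo instead.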

An Example of OCR

Construct a folder in which OCR experiments can be performed. I named my folder "OCRTest". All files from now on should be placed in this folder.

Now scan a page of a book. I scanned page 58 of Lamport's book on LaTeX. Page 58 is mostly standard English, but it has one two-column display of code, with the output on the left and the input, in bold, on the right. My scanner opened the result in Apple's Preview, and I selected "Export as PDF" to save the resulting bitmap in a pdf file. I named my copy "Lamport-58.pdf".

Next we apply OCR to this file. Many commands are available, but we will concentrate on two:

/usr/local/bin/gs -sDEVICE=pdfocr8 -o Lamport-58-out.pdf -r600 -dDownScaleFactor=3 Lamport-58.pdf
/usr/local/bin/gs -sDEVICE=ocr -o Lamport-58.txt -r200 Lamport-58.pdf

The first command outputs an OCR version of the original pdf file, and the second outputs only the recognized text as a plain text file. We will concentrate on the pdf output. So in Terminal, cd to OCRTest and execute the first command. The result will be a new file named Lamport-58-out.pdf.
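If you plan to process more than one scan, the two commands can be wrapped so that the output names are derived from the input name. The following is a minimal sketch: the function names and the GS variable are my own inventions, while the Ghostscript flags are exactly the ones shown above.

```shell
# Path to Ghostscript used in the text; override by setting GS first.
GS="${GS:-/usr/local/bin/gs}"

# Hypothetical helper: NAME.pdf -> NAME-out.pdf, a pdf with selectable text.
ocr_to_pdf() {
    src="$1"
    base="${src%.pdf}"    # strip the .pdf extension
    "$GS" -sDEVICE=pdfocr8 -o "${base}-out.pdf" -r600 -dDownScaleFactor=3 "$src"
}

# Hypothetical helper: NAME.pdf -> NAME.txt, the recognized text only.
ocr_to_text() {
    src="$1"
    base="${src%.pdf}"
    "$GS" -sDEVICE=ocr -o "${base}.txt" -r200 "$src"
}
```

With these definitions loaded, "ocr_to_pdf Lamport-58.pdf" runs the first command above and "ocr_to_text Lamport-58.pdf" runs the second.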

Below are the two samples: first the scanned image, Lamport-58.pdf, and then the OCR output, Lamport-58-out.pdf. Compare them and notice that they look almost identical. The difference is that text cannot be selected and copied in the original, but can be selected and copied in the OCR version.

Copy portions of text from "Lamport-58-out.pdf" to your favorite text editor. You will discover that OCR makes mistakes, and in particular sometimes omits the space between words. This is the first release of the software in Ghostscript, so its capabilities should improve. Moreover, changing the flags used to call OCR, for instance the "-dDownScaleFactor=3" flag, can decrease the error rate. If you intend to use OCR, read the extensive documentation on the Ghostscript site.
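One simple way to experiment with flags is to run the text device at several resolutions and compare the results. The sketch below does this for the "-r" flag only. The function name and the output naming scheme are hypothetical, and which resolutions are worth trying is an assumption you will need to test on your own scans.

```shell
# Path to Ghostscript used in the text; override by setting GS first.
GS="${GS:-/usr/local/bin/gs}"

# Hypothetical helper: run the ocr text device once per resolution,
# writing NAME-r200.txt, NAME-r300.txt, and so on.
try_resolutions() {
    src="$1"
    shift                  # the remaining arguments are resolutions in dpi
    base="${src%.pdf}"
    for res in "$@"; do
        "$GS" -sDEVICE=ocr -o "${base}-r${res}.txt" "-r${res}" "$src"
    done
}

# Example:
#   try_resolutions Lamport-58.pdf 200 300 400
#   diff Lamport-58-r200.txt Lamport-58-r300.txt
```

Comparing the text files side by side, or with diff, gives a rough idea of which setting makes the fewest mistakes on a particular scan.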

Richard Koch
2740 Washington St
Eugene OR 97405

Phone: (541)686-8466
Email: <koch@uoregon.edu>