Many such files have been constructed over the years for various languages. We will use such a file for the English language, named "eng.traineddata". This file can be downloaded from https://github.com/tesseract-ocr/tessdata. Download this file and place it in /usr/local/share/tessdata.
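If you would rather script the download than do it by hand, a minimal sketch in Python follows. The raw-file URL (including the branch name "main") is an assumption, since only the repository page is given above, and the helper names are hypothetical. Writing into /usr/local/share usually requires administrator rights.

```python
import os
import shutil
import urllib.request

# Assumed direct-download URL; the text above only names the repository page.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata"
# Destination directory named in the text above.
TESSDATA_DIR = "/usr/local/share/tessdata"


def traineddata_path(lang="eng", directory=TESSDATA_DIR):
    """Return the path where the trained-data file for `lang` should live."""
    return os.path.join(directory, lang + ".traineddata")


def install_traineddata(url=TESSDATA_URL, directory=TESSDATA_DIR, lang="eng"):
    """Download the trained-data file and save it into `directory`."""
    os.makedirs(directory, exist_ok=True)
    target = traineddata_path(lang, directory)
    # Stream the response to disk rather than holding the whole file in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling install_traineddata() with no arguments fetches the English file. Nothing is downloaded until that function is called.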
Now scan a page of a book. I scanned page 58 of Lamport's book on LaTeX. Page 58 is mostly standard English, but it has one two-column display of code, with the output on the left and the input, in bold, on the right. My scanner opened the result in Apple's Preview, and I selected "Export as PDF" to save the resulting bitmap in a PDF file. I named my copy "Lamport-58.pdf".
Next we apply OCR to this file. Many commands are available, but we will concentrate on two:
Below are the two samples: first the scanned image, Lamport-58.pdf, then the OCR output, Lamport-58-out.pdf. Compare them and notice that they look almost identical. The difference is that text cannot be selected and copied in the original, but it can be in the OCR version.
Copy portions of text from "Lamport-58-out.pdf" into your favorite text editor. You will discover that OCR makes mistakes; in particular, it sometimes omits the space between words. This is the first release of the OCR software in Ghostscript, so its capabilities should improve. Moreover, changing the flags used to call OCR, for instance the "-dDownScaleFactor=3" flag, can decrease the error rate. If you intend to use OCR, read the extensive documentation on the Ghostscript site.
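To experiment with different flag values, it can help to build the Ghostscript command in a small script. The sketch below is only an illustration in Python. Of the options shown, only "-dDownScaleFactor=3" appears in the text above. The device name "pdfocr8", the "-sOCRLanguage" option, and the resolution of 300 are assumptions about the Ghostscript OCR devices and should be checked against the Ghostscript documentation. The function names are hypothetical.

```python
import subprocess


def ocr_command(src, dst, downscale=3, resolution=300,
                language="eng", device="pdfocr8"):
    """Build a Ghostscript command line for OCR as a list of arguments.

    Only the DownScaleFactor flag comes from the text; the device name,
    language option, and resolution are assumptions to verify.
    """
    return [
        "gs",
        "-sDEVICE=" + device,              # assumed OCR output device
        "-sOCRLanguage=" + language,       # assumed; matches eng.traineddata
        "-r" + str(resolution),            # assumed rendering resolution
        "-dDownScaleFactor=" + str(downscale),
        "-o", dst,                         # output file
        src,                               # scanned input file
    ]


def run_ocr(src, dst, **options):
    """Run Ghostscript and raise an error if it exits with a failure code."""
    subprocess.run(ocr_command(src, dst, **options), check=True)
```

For example, run_ocr("Lamport-58.pdf", "Lamport-58-out.pdf", downscale=2) would try a different downscale factor. Copying text from each result is then a quick way to compare error rates.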