Trying out Tesseract OCR

It’s been a long time since I took a look at the state of the art in optical character recognition (OCR): the last time I really paid attention was when Delrina’s WinFax program gained OCR capabilities in 1994 (I used to do DIY OCR by faxing myself things).

Man, has the state of the art ever advanced. And the shiny object attracting my eye tonight was Tesseract, an open source OCR engine that was originally developed at HP Labs.

Tesseract has the benefit of being dead simple to install on a Mac with Homebrew; you just:

brew install tesseract

And, blamo, about 8 minutes later your Mac is a powerful OCR machine.

To take Tesseract out for a short ride, I used Robertson Library’s Plustek OpticBook A300 scanner (which is awesomely fast) to scan the 1924 book by D.B. Updike, In the Day’s Work, into 44 TIFF files (each 330ppi, and about 8MB in size). And then, as proof positive of how easy it is to use Tesseract, I did:

tesseract printing0008.tif page8

And, about 3 seconds later (yes, it is fast), I had:

On the
Planning if Printing

,T must of necessity be,” said Sir
Ioshua Reynolds, “ that even works
of genius, like every other effect, as
they must have their cause, must also have their
rules; it cannot be by chance that excellen-
cies are produced with any constancy or any
certainty, for this is not the nature of chance:
but the rules by which men of extraordinary
parts—-and such as are called men of genius—-
work, are either such as they discover by their
own peculiar observations, or of such a nice
texture as not easily to admit being expressed in
words. Unsubstantial, however, as these rules
may seem, and difficult as it may be to convey
them in writing, they are still seen and felt in the
mind of the artist; and he works from them
with as much certainty as if they were embod-
ied upon paper. It is true these refined princi-
ples cannot always be made palpable, as the


from this:

Page from In the Day's Work by D.B. Updike, 1924

By my count, there were only 3 errors: “if” instead of “of” in the italic title, an understandable issue with pulling the “I” out of the ornament at the beginning of the paragraph, and Joshua being read as Ioshua.
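
With one page working, the obvious next step is the other 43. Here is a minimal sketch in Python of how that might go, not something I ran for this post. It assumes every scan sits in one directory with a .tif extension, that the names sort in page order (I'm guessing the rest look like printing0008.tif), and that tesseract is on the PATH. The function names build_commands and ocr_all are made up for illustration. Unlike my page8 above, it names each output after its scan.

```python
import subprocess
from pathlib import Path


def build_commands(scan_dir):
    """Return one tesseract argument list per .tif file in scan_dir, in name order."""
    commands = []
    for tif in sorted(Path(scan_dir).glob("*.tif")):
        # tesseract adds ".txt" to the output base on its own,
        # so printing0008.tif ends up as printing0008.txt.
        out_base = tif.with_suffix("")
        commands.append(["tesseract", str(tif), str(out_base)])
    return commands


def ocr_all(scan_dir):
    """Run tesseract on every scan, stopping at the first one that fails."""
    for argv in build_commands(scan_dir):
        subprocess.run(argv, check=True)
```

Calling ocr_all("scans") (where "scans" is a hypothetical directory name) would leave one text file beside each TIFF. Building the commands separately from running them means you can print the list first and check it before committing to the whole batch.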