It’s been a long time since I took a look at the state of the art in optical character recognition (OCR): the last time I really paid attention was when Delrina’s WinFax program gained OCR capabilities in 1994 (I used to do DIY OCR by faxing myself things).
Man, has the state of the art ever advanced. And the shiny object attracting my eye tonight was Tesseract, an open source OCR engine that was originally developed at HP Labs.
Tesseract has the benefit of being dead simple to install on a Mac with Homebrew; you just:
brew install tesseract
And, blamo, about 8 minutes later your Mac is a powerful OCR machine.
To take Tesseract out for a short ride, I used Robertson Library’s Plustek OpticBook A300 scanner (which is awesomely fast) to scan the 1924 book by D.B. Updike, In the Day’s Work, into 44 TIFF files (each 330ppi, and about 8MB in size). And then, as proof positive of how easy it is to use Tesseract, I did:
tesseract printing0008.tif page8
And, about 3 seconds later (yes, it is fast), I had:
On the Planning if Printing ,T must of necessity be,” said Sir Ioshua Reynolds, “ that even works of genius, like every other effect, as they must have their cause, must also have their rules; it cannot be by chance that excellen- cies are produced with any constancy or any certainty, for this is not the nature of chance: but the rules by which men of extraordinary parts—-and such as are called men of genius—- work, are either such as they discover by their own peculiar observations, or of such a nice texture as not easily to admit being expressed in words. Unsubstantial, however, as these rules may seem, and difficult as it may be to convey them in writing, they are still seen and felt in the mind of the artist; and he works from them with as much certainty as if they were embod- ied upon paper. It is true these refined princi- ples cannot always be made palpable, as the [3]
from this:
By my count, there were only 3 errors: “if” instead of “of” in the italic title, an understandable issue with pulling the “I” out of the ornament at the beginning of the paragraph, and Joshua being read as Ioshua.
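The same one-liner extends naturally to the whole set of 44 scans. What follows is only a rough sketch: the helper name page_base is made up for illustration, and it assumes every scan follows the printing0008.tif naming seen above and should land in a matching page8-style output file.

```shell
#!/bin/sh
# Hypothetical helper: map a scan name such as printing0008.tif to an
# output base such as page8. The naming scheme is assumed from the single
# example in the post, not confirmed for the other scans.
page_base() {
  n=${1%.tif}                      # drop the extension: printing0008
  n=${n#printing}                  # drop the prefix: 0008
  n=$(echo "$n" | sed 's/^0*//')   # strip leading zeros: 8
  echo "page${n:-0}"               # fall back to page0 if nothing is left
}

# Run tesseract over every matching scan in the current directory,
# but only if tesseract is actually installed.
if command -v tesseract >/dev/null 2>&1; then
  for f in printing*.tif; do
    [ -e "$f" ] || continue                  # no matches: skip the literal pattern
    tesseract "$f" "$(page_base "$f")"       # writes page8.txt, page9.txt, and so on
  done
fi
```

Tesseract appends .txt to whatever output base it is given, so the loop leaves one text file per page next to the scans.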