Monthly Archives: February 2017

Building an OCR Pipeline with Tesseract

This post is for future software engineers who want to get the most out of Tesseract OCR.  My company, Zorbu, is currently doing small business software automation.  Last year, I needed to process faxes for automatic data entry.  Before looking at commercial systems, I always investigate open source first.  What follows is the pipeline I first used in production.

First off, Tesseract is very temperamental software.  It took me about a month of work to get it processing well enough for production.  The key to improving results is to take over the tasks Tesseract claims to handle and do them yourself with better, more modern techniques.

1. Remove Vertical Lines (Optional)

If you’re processing faxes, you might be amazed to find out how often they have vertical black lines through them.  The lines are caused by dirty or damaged sensors.  These lines wreak havoc on Tesseract’s recognition.  Tesseract does better with gaps in text than with extra lines.  Figure out a way to filter out the black lines and you’ll see your results improve.  This step is easiest and fastest to do first, before any other processing.
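One possible filter, sketched here in plain Python, whitens any narrow column that is almost entirely black.  The function name, the 0/1 pixel layout, and the `min_fill` and `max_width` values are assumptions made for illustration, not the filter used in the original pipeline.

```python
def remove_vertical_lines(pixels, min_fill=0.9, max_width=3):
    """Whiten narrow columns that are almost entirely black.

    pixels: list of rows, each a list of 0 (white) or 1 (black).
    min_fill and max_width are illustrative guesses, not tuned values.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    # Fraction of black pixels in each column.
    fill = [sum(row[x] for row in pixels) / height for x in range(width)]
    suspect = [f >= min_fill for f in fill]

    cleaned = [row[:] for row in pixels]  # leave the input untouched
    x = 0
    while x < width:
        if not suspect[x]:
            x += 1
            continue
        start = x
        while x < width and suspect[x]:
            x += 1
        # Only erase thin runs; a wide dark band is probably real content.
        if x - start <= max_width:
            for row in cleaned:
                for cx in range(start, x):
                    row[cx] = 0
    return cleaned
```

Erasing the whole column leaves small gaps where the line crossed text, which is the trade-off described above: gaps hurt recognition less than stray lines do.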

2. Double the Scale

This seems like it would add no new information, but it is by far the biggest improvement, especially with small fonts and low resolutions.  Even if you can read the text perfectly, Tesseract will struggle with it.  I found bilinear interpolation works, but bicubic may improve results further.  I believe this helps in two ways.  First, there are some fixed sizes in Tesseract that make character matching less accurate because it’s not granular enough.  Second, most modern faxes are color and Tesseract processes strictly black/white images.  By doubling the size, you are better able to represent the actual shape of the characters when the image is turned into black and white.
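To make the idea concrete, here is a minimal pure-Python sketch of 2x bilinear upscaling on a grayscale image stored as rows of 0-255 integers.  The function name and data layout are assumptions for illustration; in practice an imaging library's resize call would do the same job much faster.

```python
def upscale_bilinear(gray, factor=2):
    """Upscale a grayscale image (list of rows of 0-255 ints) by `factor`."""
    src_h, src_w = len(gray), len(gray[0])
    dst_h, dst_w = src_h * factor, src_w * factor
    out = []
    for dy in range(dst_h):
        # Map the centre of the destination pixel back into source coordinates,
        # clamping so that edge pixels are repeated rather than read out of range.
        sy = min(max((dy + 0.5) / factor - 0.5, 0.0), src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        wy = sy - y0
        row = []
        for dx in range(dst_w):
            sx = min(max((dx + 0.5) / factor - 0.5, 0.0), src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            wx = sx - x0
            # Blend horizontally on the two nearest rows, then vertically.
            top = gray[y0][x0] * (1 - wx) + gray[y0][x1] * wx
            bottom = gray[y1][x0] * (1 - wx) + gray[y1][x1] * wx
            row.append(round(top * (1 - wy) + bottom * wy))
        out.append(row)
    return out
```

The intermediate gray levels this produces along character edges are what give the later black-and-white conversion more shape to work with.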

3. Deskew the Document

Another task that Tesseract claims to handle is document deskewing.  I found modern techniques do a better job; at least, my accuracy went up.  A deskewed page also makes the later processing steps easier.
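The post does not say which deskewing technique was used.  As one illustration, the sketch below estimates the skew angle with a projection profile search: try a range of candidate angles and keep the one that packs the black pixels into the fewest, sharpest rows.  The function name, the angle range, and the step size are assumptions.

```python
import math


def estimate_skew(black_pixels, max_angle=5.0, step=0.25):
    """Estimate the skew in degrees from (x, y) coordinates of black pixels.

    With y growing downward, a positive result means text lines drift
    downward toward the right. max_angle and step are illustrative values.
    """
    best_angle, best_score = 0.0, -1.0
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        t = math.tan(math.radians(angle))
        # Shear each pixel back by the candidate angle and count pixels per row.
        rows = {}
        for x, y in black_pixels:
            r = int(round(y - x * t))
            rows[r] = rows.get(r, 0) + 1
        # Well-aligned text gives a peaky histogram, i.e. a large sum of squares.
        score = sum(c * c for c in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

Once the angle is known, rotating the page by the opposite angle straightens the text lines.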

4. Convert to Black and White

This needs to be done after deskewing and resizing, or you lose too much information.  The goal in handling it yourself is to produce a very consistent character size and thickness.  You may need to do some image analysis to get the results you want.
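A common starting point for this kind of image analysis is Otsu's method, which picks a single global threshold from the image histogram.  The sketch below is a plain-Python version; the function names and the 1 = black, 0 = white output convention are assumptions, and the original pipeline may well have used something more elaborate.

```python
def otsu_threshold(gray):
    """Pick a global threshold (0-255) for a grayscale image given as rows of ints."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * c for i, c in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their intensities
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance: large when the two groups are well separated.
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(gray, threshold=None):
    """Return rows of 1 (black ink) and 0 (white paper)."""
    if threshold is None:
        threshold = otsu_threshold(gray)
    return [[1 if v <= threshold else 0 for v in row] for row in gray]
```

Nudging the threshold up or down from the Otsu value is one simple way to control stroke thickness.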

5. Train Tesseract on the Font

This was a very tedious and frustrating experience.  Tesseract is not modern, and most of its errors are cryptic and confusing.  I wrongly thought that giving Tesseract more examples was the way to go.  Nope, my results went down.  I found it’s best to train on single ideal examples, except you might not have them.  Even knowing the font, I found the rendered characters weren’t the same because of the resolution differences.  I ended up making super-resolution versions of the characters, converting them to black and white, and showing Tesseract a single example of each character.

6. Process Twice

I would first run Tesseract with the default English settings to get the text to classify.  I would then run it again with my custom language/font.

7. Use White Lists

This part is somewhat obvious, but you can set the “tessedit_char_whitelist” variable to just the characters that show up in the document.
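As a rough sketch, the variable can be passed on the command line with `-c`.  The helper below only builds the argument list and does not run anything; the helper name and file names are hypothetical.

```python
def tesseract_command(image_path, whitelist=None, lang="eng"):
    """Build an argument list for the tesseract CLI without running it."""
    # "stdout" as the output base asks tesseract to print the recognized text.
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if whitelist:
        # -c sets a config variable for this run.
        cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


# Example: a page that only contains digits and date/decimal punctuation.
digits_only = tesseract_command("fax_page.png", whitelist="0123456789/.-")
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True)`.  Passing a custom language name through `lang` would cover the second pass from step 6.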

8. Reprocess Characters/Words/Letters with Smaller White Lists

Even with all this work, I had a lot of trouble with Os and 0s with Tesseract.  The more you restrict Tesseract, the more likely it is to get the correct results.  If you know there is only a single character, let it know that via the page segmentation mode.  If you know it’s a single word, let it know.  If it’s just a number, restrict it.  I went so far as to restrict the tens place of a month to 0-3.
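A minimal sketch of per-field restriction might look like the following.  The field names, rule table, and crop file name are hypothetical.  Page segmentation modes 8 (single word) and 10 (single character) are Tesseract's documented values, and the `--psm` spelling assumes Tesseract 4 or later (older releases used `-psm`).

```python
import string

PSM_SINGLE_WORD = 8
PSM_SINGLE_CHAR = 10

# Hypothetical rules: each field kind gets a segmentation mode and a whitelist.
FIELD_RULES = {
    "tens_digit": (PSM_SINGLE_CHAR, "0123"),
    "number": (PSM_SINGLE_WORD, string.digits),
    # Letters only, so a round glyph can only come back as "O", never "0".
    "letter": (PSM_SINGLE_CHAR, string.ascii_uppercase),
}


def reprocess_command(crop_path, field_kind):
    """Build a tesseract command for re-reading one cropped field."""
    psm, whitelist = FIELD_RULES[field_kind]
    return [
        "tesseract", crop_path, "stdout",
        "--psm", str(psm),
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]
```

Each crop would be a small image cut from the page around one field, so the first full-page pass supplies the locations and the second pass settles the ambiguous characters.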

Tesseract is fickle software, but with some preprocessing you can get improved results.