Building an OCR Pipeline with Tesseract

I’m going to give future software engineers some details on how to get the most out of Tesseract OCR.  My company, Zorbu, is currently doing some small business software automation.  Last year, I needed to process faxes for automatic data entry.  Before looking at commercial systems, I always investigate open source.  I’m going to tell you what I first used in production.

First off, Tesseract is very temperamental software.  It took me about a month of work to get it processing well enough for production.  The key to improving results is to take over the tasks that Tesseract claims to handle and do them yourself with superior modern techniques.

1. Remove Vertical Lines (Optional)

If you’re processing faxes, you might be amazed to find out how often they have vertical black lines through them.  The lines are caused by dirty or damaged sensors.  These lines wreak havoc on Tesseract’s recognition.  Tesseract does better with gaps in text than with extra lines.  Figure out a way to filter out the black lines and you’ll see your results improve.  This is easiest and fastest to do first.
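
As a rough sketch of one way to filter the lines (not necessarily the filter used in production): treat the page as a grid of grayscale values and paint any column white if it is dark almost from top to bottom.  The function name, the list-of-rows image representation and both thresholds are assumptions for illustration; a real pipeline would apply the same idea through an image library.

```python
def remove_vertical_lines(image, dark_below=128, min_dark_fraction=0.9, white=255):
    """Return a copy of a grayscale image (list of rows, values 0-255)
    in which near-solid dark columns are painted white."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [list(row) for row in image]
    for x in range(width):
        # Count how many pixels in this column are dark.
        dark = sum(1 for y in range(height) if image[y][x] < dark_below)
        # Ordinary text only darkens part of a column; a sensor streak
        # darkens nearly all of it.
        if height and dark / height >= min_dark_fraction:
            for y in range(height):
                cleaned[y][x] = white
    return cleaned


page = [
    [255,   0, 0, 255, 255],
    [255, 255, 0, 255,   0],
    [255, 255, 0, 255, 255],
    [  0, 255, 0, 255, 255],
]
cleaned_page = remove_vertical_lines(page)
```

Painting the column white leaves a gap in any text the line crossed, which fits the observation above that gaps hurt less than extra lines.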

2. Double the Scale

This seems like it would add no new information, but it is by far the biggest improvement, especially with small fonts and low resolutions.  Even if you can read the text perfectly, Tesseract will struggle with it.  I found bilinear works, but bicubic may improve results further.  There are two ways that I believe this helps.  First, there are some fixed sizes in Tesseract that make the character matching less accurate because it’s not granular enough.  Second, most modern faxes are color and Tesseract processes strictly black/white images.  By doubling the size, you are better able to represent the actual shape of the characters when the image is turned into black and white.
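
To make the second point concrete, here is a minimal sketch of a 2x bilinear upscale on a grayscale image stored as a list of rows.  The function name and the image representation are assumptions for illustration; in practice an image library’s resize with bilinear or bicubic resampling does the same job.

```python
def upscale_2x_bilinear(image):
    """Double the width and height of a grayscale image (list of rows,
    values 0-255) using bilinear interpolation."""
    src_h, src_w = len(image), len(image[0])
    result = []
    for dy in range(src_h * 2):
        # Map the destination pixel centre back into source coordinates,
        # clamped so edge pixels do not read outside the image.
        sy = min(max((dy + 0.5) / 2 - 0.5, 0.0), src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for dx in range(src_w * 2):
            sx = min(max((dx + 0.5) / 2 - 0.5, 0.0), src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            # Blend horizontally on the two nearest rows, then vertically.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(round(top * (1 - fy) + bottom * fy))
        result.append(row)
    return result


edge = [[0, 255]]                    # a hard black-to-white edge
doubled = upscale_2x_bilinear(edge)  # the edge now passes through gray values
```

The in-between gray values along each edge are what give the later black and white conversion more room to place the character boundary.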

3. Deskew the Document

Another task that Tesseract claims to handle is document deskewing.  I found modern techniques do a better job; at least, my accuracy went up.  Deskewing yourself also makes it easier to do more processing afterwards.
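
The post does not say which deskewing technique was used, so the sketch below swaps in a common one, the projection profile: try a range of candidate angles and keep the one that packs the dark pixels into the fewest rows.  The function name, thresholds and angle range are assumptions for illustration.

```python
import math


def estimate_skew_degrees(image, dark_below=128, max_angle=5.0, step=0.5):
    """Estimate the slope of the text lines, in degrees, for a grayscale
    image (list of rows, values 0-255) using a projection profile."""
    dark_pixels = [
        (x, y)
        for y, row in enumerate(image)
        for x, value in enumerate(row)
        if value < dark_below
    ]
    best_angle, best_score = 0.0, -1.0
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        slope = math.tan(math.radians(angle))
        counts = {}
        for x, y in dark_pixels:
            # Row this pixel would fall in if lines at `angle` were flattened.
            row = round(y - x * slope)
            counts[row] = counts.get(row, 0) + 1
        # Sum of squares is highest when pixels pile into a few rows.
        score = sum(c * c for c in counts.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic page: two "text lines" drawn with a 2 degree slope.
true_slope = math.tan(math.radians(2.0))
skewed = [[255] * 200 for _ in range(40)]
for x in range(200):
    skewed[10 + round(x * true_slope)][x] = 0
    skewed[25 + round(x * true_slope)][x] = 0
estimated = estimate_skew_degrees(skewed)
```

Once the angle is known, rotating the page by the opposite amount levels the text lines.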

4. Convert to Black and White

This needs to be done after deskewing and resizing, or you lose too much information.  The goal in handling it yourself is producing a very consistent character size and thickness.  You may need to do some image analysis to get the results you want.
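
One simple form of that image analysis is picking the threshold from the page’s own histogram.  The sketch below uses Otsu’s method, which is a swapped-in standard technique rather than anything the post confirms; the function names and the list-of-rows image representation are assumptions.

```python
def otsu_threshold(image):
    """Pick the gray level that best separates dark ink from light paper
    (Otsu's method) for a grayscale image with values 0-255."""
    histogram = [0] * 256
    for row in image:
        for value in row:
            histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    count_dark, sum_dark = 0, 0
    for level in range(256):
        count_dark += histogram[level]
        sum_dark += level * histogram[level]
        count_light = total - count_dark
        if count_dark == 0 or count_light == 0:
            continue
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        # Between-class variance (up to a constant factor).
        variance = count_dark * count_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def to_black_and_white(image):
    """Map every pixel to 0 (ink) or 255 (paper) using the Otsu threshold."""
    threshold = otsu_threshold(image)
    return [[0 if value <= threshold else 255 for value in row] for row in image]


scan = [
    [10, 12, 200],
    [11, 210, 205],
]
binary = to_black_and_white(scan)
```

A global threshold like this assumes fairly even lighting across the page; character thickness can then be tuned by nudging the threshold up or down.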

5. Train Tesseract on the Font

This was a very tedious and frustrating experience.  Tesseract is not modern, and most of the errors are cryptic and confusing.  I wrongly thought that giving Tesseract more examples was the way to go.  Nope, my results went down.  I found it’s best to train on single ideal examples, except you might not have those.  Even knowing the font, I found the characters weren’t the same due to the resolution differences.  I ended up making super-resolution versions of the characters, converting them to black and white, and showing a single example of each character.

6. Process Twice

I would first use Tesseract with the default English settings to get the text to classify.  I would then run it again with my custom language/font.

7. Use White Lists

This part is somewhat obvious, but you can set the “tessedit_char_whitelist” variable to just the characters that show up in the document.
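
A minimal sketch of how that might look from a script, assuming the `tesseract` command-line tool and its `-l` and `-c` options.  The helper names, the sample text and the file name are hypothetical; the only Tesseract-specific name is the `tessedit_char_whitelist` variable.

```python
def build_whitelist(sample_text):
    """Collect the distinct non-whitespace characters found in a
    known-good sample of what this kind of document can contain."""
    return "".join(sorted({ch for ch in sample_text if not ch.isspace()}))


def whitelist_args(image_path, whitelist, language="eng"):
    """Build an argument list for the tesseract CLI that prints the text
    to stdout and only allows the characters in `whitelist`."""
    return [
        "tesseract", image_path, "stdout",
        "-l", language,
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


allowed = build_whitelist("INV 0042\nINV 0017")
args = whitelist_args("invoice.png", allowed)
# Run with, for example: subprocess.run(args, capture_output=True, text=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the whitelist contains punctuation.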

8. Reprocess Characters and Words with Smaller White Lists

Even with all this work, I had a lot of trouble with Os and 0s.  The more you restrict Tesseract, the more likely it is to get the correct result.  If you know there is only a single character, let it know via the page segmentation mode.  If you know it’s a single word, let it know.  If it’s just a number, restrict it to digits.  I went so far as to restrict the tens place of a month to 0-3.
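
As a sketch of what those restrictions look like on the command line: the page segmentation mode numbers below are the ones listed by `tesseract --help-psm`, and the `--psm` spelling assumes Tesseract 4 or later (3.x used `-psm`).  The helper names and the image file name are hypothetical.

```python
# Page segmentation modes, as listed by `tesseract --help-psm`.
PSM_SINGLE_LINE = 7
PSM_SINGLE_WORD = 8
PSM_SINGLE_CHAR = 10


def restricted_args(image_path, psm, whitelist):
    """Build a tesseract CLI argument list for a cropped region, telling
    it both the layout (psm) and the only characters it may return."""
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


def tens_digit_args(image_path):
    """Reprocess a crop holding one digit that can only be 0 to 3."""
    return restricted_args(image_path, PSM_SINGLE_CHAR, "0123")


digit_args = tens_digit_args("tens_digit.png")
```

With the letter O left out of the whitelist, a crop that is known to be numeric can no longer come back as a letter.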

Tesseract is fickle software, but with some preprocessing you can get improved results.

The Game Boy, Memory Mapping

I think older technology shouldn’t be discounted.  Older machines are excellent educators, with many of their principles still applied in modern development.  Learn about memory mapping, some assembly, gaming and more in this excellent video.


Inside the Game Boy Part 1: The CPU

Every once in a while I come across an article that I wish I had when I was learning programming.  This one is a fast introduction to the Game Boy.  It even includes some ASM that covers an iconic platform.

Could Fitts’s Law be used to detect Aim Bots?

I have been thinking about how we decide if a person is cheating in online games.  More than once my gaming experience has been ruined by a sniper who suddenly seems to hit my head out of nowhere.  In some games, you can spectate through their view.  Sometimes the snapping and tracking is obviously inhuman.  Other times it’s more subtle, as when a person triggers the aim bot only at certain times.  But how could we automate detection of at least the most glaring cheaters?

While we often think of games as defying physics, their actual inputs are tied to people who are confined to the physical world.  The mouse is moved by a human hand, which can’t accelerate or stop instantly.  If we assume this to be the case, we could use Fitts’s Law to determine whether a player’s movements conform to real-world physics.  This would need to be done by the game manufacturer behind the scenes, as the actual data isn’t provided to the end user.

In each first-person shooter there are hit boxes that a player wants to hit, usually the head or the body.  These can be considered the targets.  They are often defined by height and width, which are inputs into Fitts’s Law.  Smaller targets that are further away are thus harder to hit, and often take more time to aim at.  By observing known human players, you could determine what is possible within human parameters that conform to Fitts’s Law.  Rapid re-aiming to small spots could then be used to quantify the likelihood that a person is cheating.
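
To sketch the idea in code, the common Shannon formulation of Fitts’s Law predicts movement time as MT = a + b * log2(D / W + 1), where D is the distance to the target and W is its width.  The constants `a` and `b` below are made-up placeholders that would have to be fitted from known human players, and the function names and `tolerance` cutoff are assumptions for illustration.

```python
import math


def index_of_difficulty(distance, width):
    """Fitts's index of difficulty in bits (Shannon formulation).
    Distance and width must use the same units."""
    return math.log2(distance / width + 1)


def predicted_time(distance, width, a, b):
    """Predicted movement time in seconds: MT = a + b * ID."""
    return a + b * index_of_difficulty(distance, width)


def is_suspicious(distance, width, observed_time, a=0.05, b=0.1, tolerance=0.5):
    """Flag an aim movement that finished in less than `tolerance` times
    the time predicted for a human.  a, b and tolerance are placeholders."""
    return observed_time < tolerance * predicted_time(distance, width, a, b)


# A head-sized target 300 units away and 100 units wide: ID = log2(4) = 2 bits.
snap = is_suspicious(300, 100, observed_time=0.05)
normal = is_suspicious(300, 100, observed_time=0.30)
```

A single fast flick proves nothing, so a detector along these lines would look at how often a player beats the prediction over many aims.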

The hard part isn’t using Fitts’s Law; it is determining from the mouse movements when a person decides to move their mouse, and to which target.  Also, sometimes people just get lucky and hit somebody they weren’t even thinking of hitting.  But I think that through repeated observations and some analysis, aim botting would get less feasible, because at best it could turn you into the best human player, not the best machine player.

I used to think CPUs required massive numbers of transistors beyond most humans’ comprehension.  But the Nibbler 4 Bit CPU shows that the basics don’t require as many chips as you would think.

Most programmers should learn Regular Expressions

There was a time when I didn’t know regular expressions.  Lately I have had to parse a lot of text-based files, and I have found that even a small use of regular expressions speeds up my development.  However, if you try to do everything with them, you’re using them wrong.  I think this is mainly because it’s hard to comment a regular expression.  If you can’t quickly look at a regular expression and figure out what it means, it’s no better than code you can’t read.
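
Some regex engines soften the commenting problem with a verbose mode.  As a small sketch, Python’s `re.VERBOSE` flag ignores whitespace inside the pattern and allows `#` comments; the date pattern and the function name below are made-up examples.

```python
import re

# re.VERBOSE lets the pattern span several lines with a comment per part.
DATE_PATTERN = re.compile(
    r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<day>\d{2})    # two-digit day
    """,
    re.VERBOSE,
)


def find_dates(text):
    """Return (year, month, day) string tuples for each date in the text."""
    return [
        (m.group("year"), m.group("month"), m.group("day"))
        for m in DATE_PATTERN.finditer(text)
    ]


dates = find_dates("shipped 2013-04-09, returned 2013-05-01")
```

Named groups help for the same reason: the calling code reads `month` instead of a bare group number.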

There is one site that I have used a lot when writing regular expressions.  It makes writing and understanding regular expressions much easier.

How a Transistor Works

I have used computers for decades now, but I never fully understood the physics behind a semiconductor until I watched this short educational video.

Transistor Schematic Image


The $1 MP3 Player

I saw this referred to on Reddit.  You can apparently buy an MP3 player off eBay for $1, or about $1.50 shipped, although it lacks headphones, a memory card, or even a USB cable.  I remember buying an MP3 player for $200.  I’m tempted to buy one just to take it apart or to use as some sort of small mobile product case.  I find this a bit mind-blowing.

Mini Mirror Surface Clip

Linux-based clock radio

I found this really amazing as I had thoughts of doing something similar.  This gentleman overhauled his alarm clock to run Linux.

SpriteMod's Linux Alarm Clock


Faking out WebClient

I spent an hour or two finding the correct solution for unit testing System.Net.WebClient in Visual Studio 2012.  I found one page that had a minor typo in the implementation, which caused me quite a bit of aggravation.  I won’t link to it, because I don’t want to bump it up higher on Google.  Here’s what my System.Fakes looks like:

<Fakes xmlns="">
<Assembly Name="System" Version=""/>
<Add FullName="System.Net.WebClient"/>
</Fakes>

I’ve been experimenting with writing unit tests as I write my code.  I find it a very organic way of developing, and it makes the unit tests almost free.  My experience so far is that unit tests are best for back end testing.