Pythonic OCR: Fun with Convolutional Neural Nets


Convolutional neural networks attain state of the art performance in computer vision. They do this by analyzing the pixels in images in the same way as the human visual system. The network I built is made up of several simpler layers ‘stacked’ on top of each other: this means that it is a form of deep learning. I trained the network using Keras and then rewrote it in numpy so it could run as a demo.

Enter any alphanumeric character


To understand convolutional neural networks (CNNs), you must first understand convolution. Wikipedia gives a good mathematical definition here.

To get a physical intuition of discrete 2D convolution (the kind I use in this project), imagine

  1. an image
  2. a small patch of random pixels
When we convolve the two, we ‘smear’ the image’s pixels around using the patch. Different patch patterns will create different smears. Some of these smears will highlight particular features of the image such as horizontal lines or edges.

In a convolutional neural network, each patch represents a different type of neuron in the human visual system. A convolutional neural net, in fact, is just like a regular neural net but each neuron acts on the entire image space (more and even more). These nets were invented in 1989 by Yann LeCun.

In the past few years, computers and training methods have become powerful enough to enable anyone (even me) to build and train these networks; this has resulted in a massive resurgence of interest. I started with the Keras MNIST convolutional neural net example and made modifications to create this project.

How it works

Let’s take a closer look. Convolution is defined as $$ (f * g )(t) = \int_{0}^{t} f(\tau)\, g(t - \tau)\, d\tau\ \mathrm{for} \ \ f, g : [0, \infty) \to \mathbb{R} $$ where \( (f * g )(t) \) represents the convolution of \(f\) on \(g\). The discrete analog is $$ (f * g)[n]\ \stackrel{\mathrm{def}}{=}\ \sum_{m=-\infty}^\infty f[m]\, g[n - m] $$ At first, I implemented convolution using a \(\mathrm{for}\) loop, but for many convolutions, this approach is too slow. Fortunately, convolution can be performed much more efficiently if you first take a fast fourier transform (fft). Python’s numpy library has a built-in fft function, so we can actually perform convolution in just two lines:

freq_result = np.fft.fft2(image, target_dim) * np.fft.fft2(patch, target_dim)   #frequency domain 
spatial_result = np.fft.ifft2(fft_result).real   #go back to spatial domain

This method is at least an order of magnitude faster for my project!

The 2D convolution is only half of the picture. For every convolution layer, there is a pooling layer. The responsibility of the pooling layer is to combine different convolutions to create a 'image' of features that can be fed into the next convolution layer. I used max pooling for the demo.

The first part of my model has two convolution layers and two pooling layers. These layers extract high-level image features. On top of these layers are two regular old dense neural networks (with dropout) to handle classification. I used Keras to build and train the entire model.

Keras's backends compile everything in C++. I did not want my demo to do this, so I rewrote the network in numpy. While not especially scalable or modular, my reimplementation allows me to run my CNN as a demo on this page!

A non-trivial part of machine learning is building a good database. I created my own miniature dataset of 3,000 samples. I resized, desaturated, cropped, and centered every input around 0 with stdev \(\sigma=1\). I made a separate preprocessor class to handle these processes.

With a train=0.9 and valid+test=0.1 I was getting about 99% accuracy. There are a few extras: try drawing a 5-pointed star or a smiley face! Another of these extras is a category called ‘scribble.’ Basically, anything that did not fit well into the other categories is supposed to be classified as a scribble.


Convolutional neural nets are very powerful. This network performed well on a dataset without any hand-engineered features. Furthermore, the small ‘scribble detector’ innovation I added to the dataset worked quite well.

One problem is that I used an existing deep architecture along with the preset hyperparameters. Choosing a model and setting correct hyperparameters is half the challenge in deep learning, so in the future this is something I will focus on.

Fork me on GitHub