Published on

Authors
• Name
Qi Wang

# Introduction

Have you ever wondered how computers can recognize images? Image recognition, object detection, semantic segmentation, and neural transfer learning are just a few tasks that can be accomplished with Convolutional Neural Networks (CNNs). Many of these applications are utilized in autonomous vehicles as well as a variety of different applications. In this post, I will go through the intuition of CNNs and how they work. It might be useful if you already have some Neural Network knowledge coming in.

The link to the paper on CNNs is attached here.

# Why CNNs

For those that are familiar with normal Neural Networks, you might be wondering why we can't do image recognition tasks with Deep Neural Networks. The answer is you can, but there are some disadvantages. Take a $64 \times 64$ image for example. Assume there are three channels representing each pixel: R, G, and B. This means there are $64 \cdot 64 \cdot 3 = 12288$ input parameters. From first glance, this may seem like a fairly low number, but such an image has an extremely low resolution. If you were using higher resolution images such as $512 \times 512$, you find that you have $512 \cdot 512 \cdot 3 = 786432$ input parameters. With this many parameters in the input layer, the network will become way too big and result in extremely slow training time and low accuracy.

# Filters / Kernals

During learning, the computer must extract features of the image to accurately execute its tasks. Turns out we can use filters (might be referred to as kernals but I will be using the term filters) to extract these features out of the image. Filters are matrices that are layered on the image and takes the sum of each element-wise multiplied to get the resultant matrix with features. Below is an example to visualize the concept.

One common filter for finding vertical edges in images is:

$\begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \\ \end{bmatrix}$

Now lets take a $6 \times 6$ image matrix:

$\begin{bmatrix} 2 & 2 & 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 0 & 2 & 2 \end{bmatrix}$

We will start the filter at $(0, 0)$ on the image and shift it right one pixel each time. When the edge of the filter reaches the edge of the image, we shift the filter down one row and repeat the steps until the whole image matrix is covered. When a patch of the image matrix and the filter are multiplied and summed for an entry, the operation called a convolution. Hence, the name Convolutional Neural Networks. I've illustrated the convolution operations of the full above image below.

$\begin{bmatrix} 2 & 2 & 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 0 & 2 & 2 \\ 2 & 2 & 0 & 0 & 2 & 2 \end{bmatrix} \ast \begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \\ \end{bmatrix}$

Becomes:

$\begin{bmatrix} 6 & 6 & -6 \\ 6 & 6 & -6 \\ 6 & 6 & -6 \end{bmatrix}$

As shown, there is a clear indication of a vertical edge between the 2nd and 3rd column of the matrix. We refer to the distance we shift the filter every time as the stride. In the above example, we had a stride of $1$. Another common technique when applying filters is to pad the image. As seen above, the image size was reduced after the convolutions, and a more drastic reduction would be seen if a larger stride size was used. Therfore, you can choose to pad dimenions of the image to maintain the same size after applying the filters. If we wanted to maintain the same size we would need to have an additional padding of $P = \frac{W(S-1) + F - S}{2}$ where $W$ is the dimensions, $S$ is the stride, $F$ is the filter dimensions, and $P$ is the padding needed. For our above example, we would need to have a padding of $2$ for a same sized matrix after extracting the features. Below is the $6 \times 6$ image with $P = 2$.

$\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 2 & 2 & 0 & 0 & 2 & 2 & 0 & 0 \\ 0 & 0 & 2 & 2 & 0 & 0 & 2 & 2 & 0 & 0 \\ 0 & 0 & 2 & 2 & 0 & 0 & 2 & 2 & 0 & 0 \\ 0 & 0 & 2 & 2 & 0 & 0 & 2 & 2 & 0 & 0 \\ 0 & 0 & 2 & 2 & 0 & 0 & 2 & 2 & 0 & 0 \\ 0 & 0 & 2 & 2 & 0 & 0 & 2 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}$

The final size after all the convolution operations for an $N \times N$ image would be $\frac{N + 2P - F}{2} \times \frac{N + 2P - F}{2}$.

# Pooling

Another important building block of CNNs are pooling layers. Pooling layers are used to down-size the amount of parameters that need to be trained by summarizing the overall matrix. Two common types of pooling layers are Max Pooling and Average Pooling. These are quite intuitive to understand, max pooling takes the maximum value over the region you pool and average pooling takes the average value over the region you pool. Below is an example of what pooling would do to this $4 \times 4$ matrix.

$\begin{bmatrix} 1 & 2 & 6 & 7 \\ 4 & 5 & 4 & 8 \\ 1 & 3 & 2 & 6 \\ 1 & 5 & 1 & 5 \end{bmatrix}$

Max Pooling with a size of $2 \times 2$ and a stride of $2$:

$\begin{bmatrix} 5 & 8 \\ 5 & 6 \end{bmatrix}$

Average Pooling with a size of $2 \times 2$ and a stride of $2$:

$\begin{bmatrix} 3 & 6.25 \\ 2.5 & 3.5 \end{bmatrix}$

# What's Next?

Now that you know how the building blocks of a CNN works, we will cover methods to incorporate filters and pooling layers together into a full-blown Convolutional Neural Network. See you there!