- Qi Wang
In this part, I will cover how a CNN is structured. If you haven't read the Part 1, make sure to do so as I went over some of the crucial building blocks of a CNN. In this part, we will look at how filters and pooling layers come together to create a full network that has the potential to perform a variety of tasks such as image recognition.
Convolutions Over Multiple Pixel Channels
Given most RGB images have three channels, how do we apply filters or pooling to a image matrix? Turns out, all we need to do is to add another dimension to the filter size. Instead of an size, we will use a size, being however many channels your image matrix contains (in our case, 3). During the convolution step, you do an elementwise sum over all three dimensions to obtain the convoluted matrix. To illustrate this, imagine we have a image matrix with each channel represented below—each representing R, G, and B respectively.
Let's use our vertical edge detector example from part 1 again here. If you wanted to detected edges in only the red channel, you would use a filter as such:
However, if you wanted to detect an vertical edge in the all channels, you would instead use this filter:
For the sake of the example, our filter size and image size are equal, but realistically, image sizes are usually larger than filter sizes. Therefore, after the convolution operation, we will end up with a resultant matrix as shown below.
We obtain that by calculating the sum of each element in the filter multiplied by each element in the corresponding image matrix cell across all three channels:
Stride and padding for convolutions over multiple dimensions will work the same. We only apply stride and padding to the 2D dimensions of the image and not the pixel channel dimensions.
Pooling layers over multiple dimensions basically work the same way. However, you do individually pool each layer, so the depth dimension after pooling should stay the same(though the first two dimensions would change).
Single Layer CNN
We can now move on to building one layer! Obviously, if the CNN can only detect edges, it wouldn't be able to perform complex tasks that image classification, etc. requires. Usually, many filters applied to the matrix in one layer. The results of each filter will be stacked on top of each other into a block. For example, if we used three filters on the previous matrix, we would end up with a matrix, being the resulting dimension depending on the filter size, stride, and padding.
Dense Layers in CNNs
For image classification and similar tasks, you would actually have a few Convolution blocks and then at the end at some fully connected deep neural network layers with a softmax or relu activation on the last layer to reach our result. This basically takes all the attributes in the blocks that the convolutional layers have already found and applies deep learning to piece the image features together to reach a conclusion as to what type of picture it is.
Other Examples of CNNs
For tasks such as semantic segmentation, you would not apply dense layers at the end of the network because you want the result to be an image. Therefore, you will actually expand the blocks back up(their size decreases at first due to filters and padding), and you will link previous layers to the deeper layers, so it has more information about each individual pixels.
What's Next... again
In this part, you learned what the architecture behind one layer of a CNN is as well as how to use filters across multiple pixel channels. If you want to dive deeper into how to actually implement a CNN, I encourage you to follow the numerous tutorials avaliable online and practice with image data on Kaggle!