Reading about advances in Convolutional Network designs, you will surely stumble across the Inception network (also known as GoogLeNet). This is a historically significant CNN design, having won the ImageNet recognition competition in 2014. The Inception model is also significant because it uses 12 times fewer parameters than the AlexNet model presented in 2012. The Inception network’s performance amongst its peers is displayed below:
The first thing to understand about the Inception network is the source of its name. It is named after the famous Leonardo DiCaprio movie, in which the characters journey into dreams within dreams, and which is popularly coupled with the ‘We Need to Go Deeper’ meme. However, unlike the Inception movie, the Inception network does not focus on plunging deeper into the network in terms of having many layers. Rather, the Inception network presents a novel block which, in a sense, increases the width of each layer rather than the depth.
This network also draws on an earlier network named ‘Network-In-Network’, in which a 1x1 convolution is placed as a bottleneck before larger convolutions to reduce computational complexity. The Inception network builds on the ‘Network-in-Network’ idea with a more complex block consisting of a 1x1, a 3x3, a 5x5, and a max-pooling layer, all composed into one neural network block which we will discuss further in this article. Thus, the Inception name also reflects this nested, network-within-a-network structure: prying into the width of the layers rather than the depth of the network.
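To see why the 1x1 bottleneck helps, it is worth counting multiplications. The sketch below uses illustrative numbers (a 28x28 feature map, 192 input channels, 128 output channels — not values quoted from the paper) to compare a direct 3x3 convolution against the same 3x3 convolution preceded by a 1x1 squeeze:

```python
def conv_mults(h, w, in_ch, out_ch, k):
    """Multiplications for a k x k convolution on an h x w feature map
    (stride 1, 'same' padding): each output position performs
    k * k * in_ch multiplications per output channel."""
    return h * w * out_ch * k * k * in_ch

# Illustrative numbers, not taken from the paper:
# a 28x28 map with 192 input channels, producing 128 output channels.
direct = conv_mults(28, 28, 192, 128, 3)

# Same final output, but first squeezing 192 channels down to 96
# with a cheap 1x1 bottleneck, then applying the 3x3 convolution.
bottleneck = conv_mults(28, 28, 192, 96, 1) + conv_mults(28, 28, 96, 128, 3)

savings = direct / bottleneck  # the bottleneck version does ~40% fewer mults here
```

The savings grow as the bottleneck squeezes harder, which is what makes the wide, multi-branch Inception block affordable in the first place.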
Before further explaining how the Inception block works, we will quickly comment on the difference between Inception and other popular advanced CNN designs such as ResNet and DenseNet. ResNet and DenseNet increase the capacity of CNNs by increasing the depth of the network. This is done by adding shortcut connections that combat problems with vanishing gradients and feature redundancy. In contrast, the Inception network focuses on the ‘width’ of each layer, as opposed to the ‘depth’ of the overall network. In more concrete terms, the Inception network presented in this paper contains 22 parametric layers, whereas ResNets or Highway networks usually contain over 100 layers.
The unique structure of the Inception block (pictured below) should be one of the central takeaways after reading this article:
Looking at (a), we see this idea of stacking many different layers together clearly illustrated. This is in contrast with a traditional CNN, which would line these operations up sequentially such that the output of the 1x1 convolution is the input to the 3x3 convolution, and so on. The Inception module instead copies the original input and feeds this same input to each of the different operations. Now looking at (b), we see that the input is first passed through a 1x1 convolution before the 3x3, 5x5, and max-pool branches. This is done to reduce computational complexity: recall that a 1x1 convolution preserves the spatial dimensions of the feature maps but can decrease or increase the channel depth via its number-of-filters hyper-parameter.
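The (b) variant of the block can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper’s exact configuration — the channel counts passed to the constructor are up to the caller, and batch-norm/activations are omitted for brevity:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Sketch of the Inception (b) module: 1x1 bottlenecks precede the
    expensive 3x3 and 5x5 convolutions, and the four branch outputs are
    concatenated along the channel dimension."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1),          # bottleneck
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1),          # bottleneck
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),       # projection after pooling
        )

    def forward(self, x):
        # Every branch preserves spatial size, so the outputs can be
        # concatenated on the channel axis — this is the layer's 'width'.
        return torch.cat(
            [self.branch1(x), self.branch3(x),
             self.branch5(x), self.branch_pool(x)],
            dim=1,
        )
```

Note that the output channel count is simply the sum of the branch widths, so stacking these blocks widens the representation while the 1x1 bottlenecks keep the cost in check.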
One of the additional details that makes this paper so interesting is the use of intermediate classification networks to avoid vanishing gradient problems. The use of intermediate classifiers is pictured below:
In the picture above, the intermediate classifiers are highlighted by a branched stem with neural network layers colored red → blue → blue → blue → yellow → white. The internal operations of these classifiers don’t really matter, but this concept of using intermediate classifiers in a deep neural network is very interesting. Intuitively, it forces the intermediate features learned in this network to be discriminative for the task at hand. With respect to the mechanisms used for implementation, the loss from each intermediate classifier is down-weighted by a factor of 0.3 when updating the parameters via backpropagation.
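The training objective described above can be sketched as follows. This is a minimal illustration of the weighted auxiliary-loss idea, assuming the network returns the main logits plus a list of auxiliary logits; `aux_weight=0.3` matches the down-weighting factor mentioned in the paper, while the function name and signature are our own:

```python
import torch
import torch.nn.functional as F

def combined_loss(main_logits, aux_logits_list, targets, aux_weight=0.3):
    """Total loss = main classifier loss + 0.3 * (each auxiliary
    classifier's loss). The auxiliary terms inject gradient signal
    at intermediate depths of the network during training."""
    loss = F.cross_entropy(main_logits, targets)
    for aux_logits in aux_logits_list:
        loss = loss + aux_weight * F.cross_entropy(aux_logits, targets)
    return loss
```

At inference time the auxiliary branches are simply discarded; they exist only to shorten the gradient path during training.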
This idea is interesting but perplexing. Requiring intermediate features to be discriminative somewhat distorts the logic that CNNs learn a hierarchy of features such as edges → shapes → objects and so on. However, as a mechanism to combat vanishing gradient problems, it seems to make a lot of sense. Still, why do they need to branch off into a separate classification network for this? Can’t they just store the global loss during backpropagation and inject a down-weighted version of it into each partial derivative at certain checkpoints? With respect to Multi-Task Learning, which we discuss in a separate article, it is interesting to think of the intermediate classifiers as focused on tasks of incremental difficulty. One example of this is how a basketball player would learn to shoot a layup before a three-pointer.
Concluding thoughts from Henry AI Labs
We looked into the details behind the Inception module in our pursuit of understanding Neural Architecture Search and the trend in human-designed CNNs. The Inception block offers a unique perspective on CNN design, complementary to designs such as the skip-connections used in ResNets and DenseNets. Additionally, we are interested in seeing how blocks such as the Inception block are used in other computer vision tasks such as super-resolution and in GANs. It is interesting to think about the variance in CNN designs and the difficulty of designing a discrete search space such that an RL search algorithm can discover something like the Inception block or a Residual Block. Thanks for checking out Henry AI Labs — please check out the corresponding video for this article!