Maxout Networks are a neural network model that you may encounter in the wild, and this article summarizes the main ideas behind them. Maxout Networks are named as such because they replace traditional activation functions such as sigmoid, tanh, or ReLU with the maxout activation function.
What is the Maxout Activation Function?
Recall that each hidden layer computes a matrix multiplication of the weight matrix Wᵀ with the inputs x, plus a bias b. This Wᵀx + b makes up the pre-activation of the hidden unit. In traditional neural nets, this pre-activation is passed through a sigmoid activation, which squashes it into the range (0, 1). A maxout unit instead computes several pre-activations and outputs only the maximum of them.
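As a concrete sketch of the traditional design, here is a minimal hidden layer in numpy; the layer sizes and random initialization are illustrative assumptions, not from the paper:

```python
import numpy as np

def sigmoid(z):
    # Squashes each pre-activation into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input vector, 4 features
W = rng.standard_normal((4, 3))   # weights for 3 hidden units
b = np.zeros(3)                   # bias

z = W.T @ x + b                   # pre-activation: W^T x + b
h = sigmoid(z)                    # activation, each entry in (0, 1)
```

Note that every hidden unit here has exactly one pre-activation z; the maxout unit below changes exactly this part of the design.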
Next, we will view some slides from Ian Goodfellow’s PhD defense to visualize the ideas behind the maxout activation:
The image above depicts the traditional hidden layer design with an activation function such as sigmoid.
This image depicts the maxout unit. Already you can see that there are some significant differences in this model. For one, there are now multiple pre-activations (zs) in the hidden unit. Each of these zs has its own set of weights from the inputs (denoted v in this image).
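The maxout unit described above can be sketched in numpy as follows; the number of pieces k and the layer sizes are arbitrary choices for illustration:

```python
import numpy as np

def maxout(x, V, c):
    """Maxout layer: each output unit has k affine pre-activations,
    and the unit outputs the maximum of them.
    V has shape (k, d_in, d_out); c has shape (k, d_out)."""
    z = np.einsum('i,kio->ko', x, V) + c  # k pre-activations per output unit
    return z.max(axis=0)                  # elementwise max over the k pieces

rng = np.random.default_rng(1)
x = rng.standard_normal(5)               # 5-dimensional input
V = rng.standard_normal((3, 5, 2))       # k=3 pieces, 2 maxout units
c = rng.standard_normal((3, 2))

h = maxout(x, V, c)                      # shape (2,): one value per unit
```

Each of the k slices of V plays the role of the separate weight sets v in the slide: the unit learns k linear functions of the input and reports whichever is largest.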
Motivation Behind the Maxout Activation
Goodfellow highlights some problems with activations like sigmoid (pictured below):
This picture shows that there is a useful gradient when the input is roughly between -5 and 5; outside of this range the gradient ‘saturates’, meaning that it becomes very small and learning is very difficult.
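You can see this saturation numerically. The derivative of the sigmoid is σ(z)(1 − σ(z)), which peaks at 0.25 and collapses toward zero outside the useful range:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

grad_at_0 = sigmoid_grad(0.0)    # 0.25, the maximum possible gradient
grad_at_10 = sigmoid_grad(10.0)  # ~4.5e-5, effectively no learning signal
```

A gradient four orders of magnitude smaller than its peak means weight updates in the saturated regime are negligible, which is exactly the problem the slide illustrates.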
The maxout activation, being piecewise linear, avoids this problem: everywhere along the input axis, there is a useful gradient to learn with.
Ian Goodfellow et al. designed maxout to complement the dropout regularization technique. Dropout randomly zeros out the activations passed as inputs to the next layer.
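A minimal sketch of dropout is shown below. This uses the common “inverted dropout” formulation, which rescales the surviving activations at training time (a standard variant, not necessarily the exact formulation in the original paper):

```python
import numpy as np

def dropout(h, p_drop, rng):
    """Inverted dropout: zero each activation with probability p_drop,
    scaling the survivors by 1/(1 - p_drop) so the expected value
    of each activation is unchanged."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

rng = np.random.default_rng(42)
h = np.ones(10)                    # activations from some hidden layer
h_dropped = dropout(h, 0.5, rng)   # a random subset of entries zeroed,
                                   # survivors scaled up to 2.0
```

At test time the mask is simply omitted; the rescaling during training is what makes that possible without adjusting the weights.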
Goodfellow et al. highlight the similarity between dropout and traditional bagging techniques used in Machine Learning. Bagging is an ensemble technique that combines the predictions of multiple models to form a final prediction. Dropout is analogous to this because it learns a set of networks contained in a large parameter space that are combined in the final prediction. Dropout has been shown to be very effective for training Deep Neural Networks and avoiding overfitting.
So how do Maxout activations complement Dropout Regularization?
The authors highlight that Dropout works best when the model takes large update steps during training. This is counterintuitive compared to traditional Stochastic Gradient Descent, which aims to find a learning rate such that the model does not over- or under-step.
How do Maxout activations ensure that the model takes large update steps?
Because the maxout activation is piecewise linear, its gradient never saturates: large updates cannot push units into flat regions where learning stalls, the way they can with sigmoid or tanh. Thus, the complementary objective of maxout activations and dropout is clear. The maxout activation facilitates training with the large gradient updates that dropout benefits from.
Concluding Thoughts from Henry AI Labs
We are very interested in meta-learning activations such as what is presented in the paper “Searching for Activation Functions”. At Henry AI Labs, we investigated Goodfellow’s Maxout network paper to get a sense of the intuitions behind designing a novel activation function. In our experiments relating to the “Searching for Activation Functions” paper, the search space is too constrained to find an activation function such as maxout. This raises questions about how we define the discrete search space in our experiments. Thank you for reading and checking out Henry AI Labs!