Henry AI Labs

Searching for Activation Functions

Some hyperparameters of neural networks, such as the learning rate and batch size, are straightforward to define and search over. Meta-learning in Deep Neural Networks gets far more interesting when we search over more internal characteristics, such as the activation function of the hidden units. Activation functions are one of the central building blocks of the neural unit:

Image depicting a neuron in a Deep Neural Network

The activation function is an essential component of the Neural Network and has a massive impact on the training dynamics of the network.

The importance of activation functions is a recurrent theme in the recent history of advances in Deep Learning. For example, the AlexNet paper credits the use of the ReLU activation function, f(x) = max(x, 0), as one of the main contributions to its success in ImageNet classification with Deep Convolutional Networks. The authors note that this activation function saturates less than functions such as sigmoid and tanh, facilitating gradient flow during backpropagation.
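As a quick numerical illustration (our own sketch, not from the AlexNet paper), the gradients of sigmoid and tanh shrink toward zero for large inputs, while the ReLU gradient stays at 1 on the positive side:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gradient of each activation at a large pre-activation value.
x = 5.0
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))   # ~0.0066: nearly saturated
tanh_grad = 1.0 - np.tanh(x) ** 2                # ~0.00018: nearly saturated
relu_grad = 1.0 if x > 0 else 0.0                # 1.0: gradient passes through unchanged
print(sigmoid_grad, tanh_grad, relu_grad)
```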

One of the initial design decisions to make when constructing a Neural Network is the activation function. As a quick reminder, a standard multi-layer perceptron unit of a neural network has the form wᵀx + b, where w represents the weights of the neuron, x the inputs to the neuron, and b the bias term. The output of wᵀx + b is then transformed through an activation function. In the common case of the sigmoid activation, this results in a transformation from some scalar input to a value between 0 and 1.
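A minimal sketch of a single neuron with a sigmoid activation, using illustrative weight and input values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2, 0.3])   # weights of the neuron (illustrative values)
x = np.array([1.0, 2.0, -0.5])   # inputs to the neuron
b = 0.1                          # bias term

pre_activation = np.dot(w, x) + b    # w^T x + b, a scalar
output = sigmoid(pre_activation)     # squashed into the range (0, 1)
print(pre_activation, output)        # -1.95, ~0.125
```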

There are many activation functions to choose from when designing Neural Networks, including sigmoid, tanh, ReLU, Leaky ReLU, Maxout, and others. This article discusses a technique that uses Meta-Learning search algorithms to design a novel activation function for the task at hand.

This work builds on the successes of other Meta-Learning tasks such as Neural Architecture Search and AutoAugment, in which an auxiliary RNN controller network searches through the hyperparameters of a neural network. The activation function is no different from the intermediate layer blocks in the sense that its design space is large and offers a massive opportunity for improvement.

The picture above illustrates how the search space for an activation function is designed. As a quick reminder, a unary function takes in a single input x and returns a single output y, whereas a binary function takes in two inputs x1 and x2 and returns a single output y. The search algorithm sequentially selects a combination of unary and binary functions, constrained to fit the structure of the diagram above.
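The sketch below shows how a single core unit composes two unary functions with one binary function. The candidate sets here are a small illustrative subset of our own choosing, not the paper's full tables:

```python
import numpy as np

# core(x) = binary(unary_1(x), unary_2(x))
unary_candidates = {
    "identity": lambda x: x,
    "negate":   lambda x: -x,
    "square":   lambda x: x ** 2,
    "tanh":     np.tanh,
}
binary_candidates = {
    "add":      lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "maximum":  np.maximum,
}

def make_core_unit(unary_1, unary_2, binary):
    """Build one candidate activation from a (unary, unary, binary) choice."""
    u1, u2 = unary_candidates[unary_1], unary_candidates[unary_2]
    b = binary_candidates[binary]
    return lambda x: b(u1(x), u2(x))

candidate = make_core_unit("identity", "tanh", "multiply")   # f(x) = x * tanh(x)
print(candidate(np.array([-2.0, 0.0, 2.0])))
```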

Before diving further, let’s explain how the Swish activation function is derived from this search space. The Swish function is defined as f(x) = x * sigmoid(βx), where β is a constant or a learnable parameter. It can be produced by a single core unit (highlighted in gray): the first unary function is u1(x) = βx, the second is u2(x) = x, and the binary function is b(x1, x2) = sigmoid(x1) * x2. Composing these three choices gives b(u1(x), u2(x)) = x * sigmoid(βx), the Swish activation function.
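A minimal PyTorch sketch of Swish with an optionally learnable β (our own implementation, not the authors' code):

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish activation: f(x) = x * sigmoid(beta * x)."""

    def __init__(self, learnable_beta: bool = True):
        super().__init__()
        if learnable_beta:
            # beta is trained along with the rest of the network
            self.beta = nn.Parameter(torch.ones(1))
        else:
            # beta is fixed at 1, giving f(x) = x * sigmoid(x)
            self.register_buffer("beta", torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

# Drop-in replacement wherever a ReLU would normally go.
activation = Swish()
print(activation(torch.tensor([-2.0, 0.0, 2.0])))
```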

The search algorithm then uses each candidate combination as the activation function to train a ResNet-20 Deep CNN on the CIFAR-10 dataset, and uses the error rate from this child network to guide the search.
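As a simplified sketch of that outer loop (using random sampling in place of the paper's RNN controller trained with reinforcement learning), where `train_resnet20_on_cifar10` is a hypothetical placeholder that trains the child network with the candidate activation and returns its validation accuracy:

```python
import random

def search_activations(unary_set, binary_set, num_trials, train_resnet20_on_cifar10):
    """Randomly sample candidate activations and keep the best-performing one."""
    best_candidate, best_accuracy = None, 0.0
    for _ in range(num_trials):
        # One candidate = two core units: 4 unary and 2 binary function choices.
        candidate = {
            "unaries":  [random.choice(unary_set) for _ in range(4)],
            "binaries": [random.choice(binary_set) for _ in range(2)],
        }
        accuracy = train_resnet20_on_cifar10(candidate)   # reward that guides the search
        if accuracy > best_accuracy:
            best_candidate, best_accuracy = candidate, accuracy
    return best_candidate, best_accuracy
```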

Activation Function Search Space

The search algorithm chooses four unary functions and two binary functions from the following discrete sets:

In total there are 25 unary functions and 10 binary functions to search through. With two core units, this amounts to a search space of 25 x 25 x 25 x 25 x 10 x 10 = 39,062,500 possibilities.

An exhaustive search through a space of 39 million candidates is not as dramatic as some neural architecture search spaces end up being, but it is still quite computationally expensive. This highlights a fundamental problem with this kind of Meta-Learning search: training the child network is not an instant operation! Even if training the child network takes only 10 minutes on very high-end GPUs, an exhaustive search would take roughly 390 million minutes, which is over 700 years of sequential training. Thus, it is necessary to use intelligent search algorithms such as Reinforcement Learning to avoid searching the entire space.
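For concreteness, here is that back-of-the-envelope calculation (the 10-minute training time is an assumption, not a measured figure):

```python
num_unary, num_binary = 25, 10
search_space = num_unary ** 4 * num_binary ** 2      # 39,062,500 candidate activations

minutes_per_child = 10                               # assumed training time per candidate
total_minutes = search_space * minutes_per_child     # ~390 million minutes
total_years = total_minutes / (60 * 24 * 365)        # ~743 years of sequential training
print(search_space, total_minutes, round(total_years))
```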

Using Activation Function search, the authors discover the Swish activation:

Following is the complete list of activation functions found and their classification accuracy when transferred from the child network used during the search (ResNet-20) to RN (ResNet-164), WRN (WideResNet-28-10), and DN (DenseNet-100-12). This shows that although they were found with a small network, the discovered activation functions generalize to larger models.

Concluding Thoughts from Henry AI Labs

Using search algorithms to discover the internal characteristics of Deep Neural Networks is a very interesting idea. We are looking at how activation functions can be found with a range of search algorithms such as Reinforcement Learning, Evolutionary Algorithms, and Tree Search methods. It will be very interesting to see whether automated architecture search can have an impact on the design of Generative Adversarial Networks. Thanks for reading and checking out Henry AI Labs!
