Meta-Learning is something of a blanket term in Deep Learning research and can mean many things. In its most basic form, Meta-Learning describes random or grid searches through a set of hyperparameters for a Deep Neural Network. These hyperparameters are typically the learning rate, auxiliary optimizer parameters such as the beta terms of the Adam optimizer, and the batch size. A set of candidate values for these hyperparameters is crafted manually, and the search algorithm iterates through them.

The next level of Meta-Learning involves searching for internal characteristics of a Deep Neural Network. This includes Neural Architecture Search, Searching for Activation Functions, and AutoAugment. These methods all came out of the same research lab and follow the same approach: a Recurrent Neural Network controller designs the internal characteristics while being trained with the Proximal Policy Optimization algorithm (from Reinforcement Learning). Interestingly, this lab has yet to combine all three (activation functions, data augmentation, and neural network block/connectivity design) into one process, most likely because of the large computational cost of doing so.

Another high level of Meta-Learning includes tasks such as learning initializations and learning optimizers.

Meta-Learning is a very general subject of study that could include many different areas of focus. The papers and discussions in this survey will primarily focus on ideas relating to the design of internal characteristics of the network or to hyperparameter optimization techniques.

The most basic algorithms are random and grid search through a set of hyperparameters. For example, one of the most important hyperparameters for neural networks is the learning rate of the optimizer, which usually takes a value between 0.0001 and 0.01. A random search would define an interval such as 1e-4 to 1e-1, divide it into evenly spaced candidate values, and then randomly choose some of those values to test. A grid, or exhaustive, search would test all of them.
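To make the difference concrete, here is a minimal Python sketch of both searches over a learning-rate interval. The function names, the ten-point grid, and especially `toy_validation_loss` are assumptions made for illustration: in practice the evaluation step would train a network with the given learning rate and return its validation loss.

```python
import random


def make_candidates(low, high, count):
    """Return `count` evenly spaced values from `low` to `high`, inclusive."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


def grid_search(candidates, evaluate):
    """Exhaustive search: evaluate every candidate, return (best_value, best_loss)."""
    scored = [(value, evaluate(value)) for value in candidates]
    return min(scored, key=lambda pair: pair[1])


def random_search(candidates, evaluate, num_trials, seed=0):
    """Evaluate `num_trials` randomly chosen candidates, return (best_value, best_loss)."""
    rng = random.Random(seed)
    sampled = rng.sample(candidates, num_trials)  # sampling without replacement
    scored = [(value, evaluate(value)) for value in sampled]
    return min(scored, key=lambda pair: pair[1])


def toy_validation_loss(learning_rate):
    # Hypothetical stand-in for "train the network, measure validation loss".
    # The minimum at 0.01 is an arbitrary choice for this example.
    return (learning_rate - 0.01) ** 2


candidates = make_candidates(1e-4, 1e-1, count=10)
best_grid = grid_search(candidates, toy_validation_loss)
best_random = random_search(candidates, toy_validation_loss, num_trials=3)
```

Grid search always pays for all ten evaluations, while random search pays for only three here and may miss the best candidate. That trade-off is the reason random search is attractive when each evaluation means training a full network.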

Andrew Ng adds an interesting point to this discussion in his well-known lecture series on Deep Learning. The example we just presented uses a uniformly spaced interval; for example, a 0 to 50 interval containing the values 0, 10, 20, 30, 40, and 50 is uniformly spaced. However, the values in the parameter space could follow a distribution instead. Andrew Ng describes an exponential or power-law distribution of hyperparameters as the search space. It is interesting to think about the distribution of values in the hyperparameter search space and the impact this might have on the overall optimization process.
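One common way to realize an exponentially spaced search space is to sample the exponent uniformly and then raise 10 to that power, so that each order of magnitude is equally likely. The sketch below compares this with plain uniform sampling over the same 1e-4 to 1e-1 interval. The function names and the sample count are assumptions for illustration, and this is only one possible reading of the idea, not a reproduction of the lecture material.

```python
import math
import random


def sample_uniform(low, high, rng):
    """Draw a value uniformly on a linear scale."""
    return rng.uniform(low, high)


def sample_log_uniform(low, high, rng):
    """Draw a value uniformly on a log10 scale (assumes 0 < low < high)."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


def fraction_below(samples, threshold):
    """Fraction of samples strictly smaller than `threshold`."""
    return sum(s < threshold for s in samples) / len(samples)


rng = random.Random(0)
linear_samples = [sample_uniform(1e-4, 1e-1, rng) for _ in range(10000)]
log_samples = [sample_log_uniform(1e-4, 1e-1, rng) for _ in range(10000)]

# Share of draws that land in the smallest decade, [1e-4, 1e-3).
linear_share = fraction_below(linear_samples, 1e-3)
log_share = fraction_below(log_samples, 1e-3)
```

With linear sampling, roughly 1% of the draws fall below 1e-3, because that decade covers only about 1% of the interval's length. With log-scale sampling, about a third of the draws land there, since the interval spans three decades and each is sampled equally often.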