Cutout regularization is another strategy to fight the ever-looming problem of overfitting. Overfitting refers to the phenomenon in which a statistical model fits its training data too closely and generalizes poorly to new data. This is often described in terms of the bias-variance tradeoff, in which a model with high bias lacks the flexibility to fit its training data and a model with high variance is overly flexible. Deep Neural Networks are especially prone to overfitting, a consequence of their enormous representational capacity.
Many regularization methods have been proposed, such as dropout and weight penalties. Cutout regularization differs from these and is especially interesting because, unlike methods that regularize the model in its own function space, it operates in the data space.
The idea behind Cutout Regularization is appealing for its simplicity and ease of implementation. The algorithm works by simply masking out contiguous squares of an image (depicted below):
Generally, Cutout Regularization is closely related to other data augmentation methods such as rotating and translating images. These kinds of data augmentations are popular because they are easy to implement, especially with Keras’ ImageDataGenerator class. Unlike many classical data augmentations, Cutout Regularization can be stacked on top of any other augmentation. In this sense, Cutout Regularization serves as an intermediate pre-processing step between the original or augmented data and the training batch.
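To make the stacking order concrete, here is a minimal NumPy sketch in which classical augmentations run first and the Cutout step runs last, just before the batch is assembled. The helper names (random_flip, random_shift, mask_square, augment_batch), the 4-pixel shift, and the 16-pixel mask are illustrative assumptions, not code from the paper or from Keras.

```python
import numpy as np

def random_flip(image, rng):
    """Mirror the image left-right with probability 0.5."""
    return image[:, ::-1] if rng.random() < 0.5 else image

def random_shift(image, rng, max_shift=4):
    """Translate by up to `max_shift` pixels via zero-padding and cropping."""
    h, w = image.shape[:2]
    pad = [(max_shift, max_shift)] * 2 + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad, mode="constant")
    dy, dx = rng.integers(0, 2 * max_shift + 1, size=2)
    return padded[dy:dy + h, dx:dx + w]

def mask_square(image, rng, size=16):
    """Zero out one square patch centered on a random pixel (the Cutout step)."""
    h, w = image.shape[:2]
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    half = size // 2
    out = image.copy()
    # Slices that run past the image edge are clipped automatically by NumPy.
    out[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half] = 0
    return out

def augment_batch(images, rng):
    """Apply flip, then shift, then Cutout to every image in the batch."""
    return np.stack([
        mask_square(random_shift(random_flip(img, rng), rng), rng)
        for img in images
    ])
```

Because the masking step only needs an image array as input, it can follow whatever augmentations are already in the pipeline without changing them.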
Intuitively, Cutout Regularization should force a CNN model to develop a more diverse set of features for classifying images. Instead of just focusing on the wheels of a car, Cutout Regularization should force the model to look at other details of the image.
Proof of Intuition with Feature Activation Distribution
The authors support this intuition by comparing the distribution of feature activations with and without Cutout Regularization. With Cutout, the largest feature activations are concentrated towards the initial, or shallow, layers of the network, indicating that the network is using more information from the data space.
With respect to implementation details, the authors found that the size of the zeroed mask is more important than its shape. Also interestingly, they found that a 16x16 mask worked best on CIFAR-10, but an 8x8 mask worked best on CIFAR-100. This follows our intuition that Cutout forces the learning of more general features: in 100-class classification, the model needs more fine-grained features to separate the classes, so a smaller mask that leaves more of that detail visible works better.
The plot above shows the search for patch length across the different datasets. On CIFAR-100, optimal performance is achieved with a patch length of 8 pixels; on CIFAR-10, it is achieved with 16 pixels.
Cutout is implemented by randomly selecting a center point in the image and masking out a square around this point. You may be tempted to constrain the center point so that the mask never crosses the borders of the image. However, the authors tested this and report that letting the mask extend beyond the frame of the image was necessary for achieving maximum accuracy. One explanation is that it is good for the model to see some images in which most of the content remains visible. A similar effect can be achieved by applying Cutout with a probability p, so that each image is either fed to the network as is or passed through the Cutout augmentation.
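The procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function name cutout, its parameters, and the choice of zero as the fill value are assumptions made for the example.

```python
import numpy as np

def cutout(image, mask_size, p=1.0, rng=None):
    """Return a copy of `image` (H, W) or (H, W, C) with one square zeroed out.

    With probability 1 - p the image is returned unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() >= p:
        return image.copy()

    h, w = image.shape[:2]
    # The center may be any pixel, so the square can hang over the border.
    cy = int(rng.integers(0, h))
    cx = int(rng.integers(0, w))
    half = mask_size // 2

    # Clip the square to the image; the part outside the frame is discarded.
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)

    out = image.copy()
    out[y0:y1, x0:x1] = 0  # assumed fill value: zero
    return out
```

Because the center is drawn over the whole image, a 16-pixel mask on a 32x32 image covers anywhere from 8x8 pixels (center in a corner) to the full 16x16, so the amount of occlusion varies from sample to sample.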
Results of using Cutout Regularization with different CNN architectures. The “+” denotes the use of additional data augmentations such as mirroring and cropping. We see that adding data augmentation results in a huge improvement in all models.
Cutout Regularization vs. Dropout
Another very popular regularization technique is Dropout. Dropout works by randomly zeroing out the activations of hidden units. Cutout is fundamentally different from Dropout in that it works in the data space and masks out a contiguous region. Dropout in the data space would be more similar to the input corruption used to prepare data for a denoising auto-encoder. For image classification, Cutout and Dropout can be used as complements to each other.
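For contrast with Cutout's single contiguous mask, here is a minimal sketch of Dropout applied to a layer's activations. It uses the common "inverted dropout" formulation, which rescales the surviving units by 1 / (1 - rate); that scaling, the function name, and the parameter names are assumptions for illustration and are not taken from the Cutout paper.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each activation independently with probability `rate`.

    Surviving activations are scaled by 1 / (1 - rate) so that the
    expected value of each unit is unchanged (inverted dropout).
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)
```

Each unit is dropped independently here, so the zeros are scattered across the layer, whereas Cutout removes one spatially connected block of input pixels.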
Targeted vs. Random Cutout
Another interesting question with Cutout is whether it is better to target the region to cut out rather than randomly choosing a center point. One strategy for this could be to take an image, exhaustively or randomly mask out regions, and check the impact on feature activations. Coming up with an algorithm for targeted cutout is somewhat challenging: do you try to maximize or minimize a certain activation, or try to have as much variance between activations as possible? The authors report that they do not see any improvement with targeted cutout, but this still seems like a promising area of future work.
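As a rough illustration of the exhaustive strategy, the sketch below slides a zeroed square over a grid of positions and records how much a score changes at each one. The score_fn argument is a hypothetical stand-in for whatever activation or class score you choose to monitor, and picking the position with the largest drop is only one of several possible targeting criteria; none of this is taken from the paper.

```python
import numpy as np

def occlusion_sensitivity(image, score_fn, size=8, stride=8):
    """Return a grid of score drops, one per occluded square position.

    `score_fn` is a hypothetical callable mapping an image to a scalar,
    for example a chosen feature activation or class probability.
    """
    base = score_fn(image)
    h, w = image.shape[:2]
    rows = range(0, h - size + 1, stride)
    cols = range(0, w - size + 1, stride)
    heat = np.zeros((len(rows), len(cols)))
    for i, y in enumerate(rows):
        for j, x in enumerate(cols):
            occluded = image.copy()
            occluded[y:y + size, x:x + size] = 0
            heat[i, j] = base - score_fn(occluded)
    return heat

def most_sensitive_square(heat, stride=8):
    """Top-left corner (y, x) of the square whose removal lowers the score most."""
    i, j = np.unravel_index(np.argmax(heat), heat.shape)
    return int(i) * stride, int(j) * stride
```

Note that this requires one forward pass per candidate position, which makes it far more expensive than sampling a random center point.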
Concluding thoughts from Henry AI Labs
At Henry AI Labs, we are very interested in how Multi-Task Learning facilitates representational learning. We are looking at using Cutout Regularization for Multi-Task Learning where one network classifies the image and another works as a context encoder to reconstruct the missing region in the image. There are many additional challenges with this such as how to scale the gradients in the Multi-Task Learning sense. Additionally, we are interested in seeing if we can define a discrete search space along this framework and use a Meta-Learner to find an optimal geometry, mask size, and mask fill to improve on this. Thanks for reading and checking out Henry AI Labs!