StackGAN is one of the most popular GAN variants and, at the time of writing, holds the state-of-the-art title for text-to-image synthesis. StackGAN builds on the ideas of Reed et al. in their paper “Learning Deep Representations of Fine-Grained Visual Descriptions”, which presents a method for aligning text embeddings, such as those derived from Word2Vec, with visual features from images. StackGAN extends this work with two major contributions. Firstly, a multi-scale architecture that synthesizes images incrementally, first at 64x64 and then at 256x256. Secondly, a Conditioning Augmentation trick that ensures better smoothness in the text-embedding space. Both of these mechanisms are explained in this blog post.
Alignment of Visual and Text embeddings proposed by Reed et al. used in StackGAN
StackGAN focuses on the task of text-to-image synthesis. The authors list photo-editing and computer-aided design as some applications of this technology. It is very interesting to consider the potential of this kind of natural-language interface to AI-generated image creation.
The only modification control available for state-of-the-art image synthesis models such as BigGAN is interpolation along the latent space. The StyleGAN model allows more flexibility in the generated images; however, the user can only turn pre-defined knobs to modify them. It will therefore be interesting to see text-to-image synthesis intersect with state-of-the-art latent-space-control models, so that users can manipulate generated images with language alone.
Bridging the advances in text embeddings and image synthesis starts with the text embeddings. Classically, text embeddings are very high-dimensional, often with more than 100 dimensions. Additionally, these embeddings contain a certain level of discontinuity due to the limited training data available.
This motivates the Conditioning Augmentation trick presented in StackGAN.
In order to improve the diversity of generated images as the input z vector changes, the discontinuities in the text-embedding space need to be smoothed out. This is done by sampling the conditioning vector from a multivariate Gaussian distribution rather than using the text embedding directly. This Gaussian is parameterized by a mean and diagonal covariance computed from the text embedding, and these parameters are updated during training through a learned embedding layer.
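To make this concrete, here is a minimal NumPy sketch of the Conditioning Augmentation sampling step. The dimensions, the random stand-in weights, and the function names are hypothetical; in StackGAN the projection is a trained fully-connected layer, and the model is built in a deep learning framework rather than raw NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 1024   # hypothetical text-embedding size
COND_DIM = 128   # hypothetical conditioning-vector size

# Stand-in weights for the learned layer that maps the text embedding to a
# mean and log-variance; in StackGAN these are trained with the generator.
W_mu = rng.normal(scale=0.02, size=(EMB_DIM, COND_DIM))
W_logvar = rng.normal(scale=0.02, size=(EMB_DIM, COND_DIM))

def conditioning_augmentation(text_embedding):
    """Sample c ~ N(mu(e), diag(sigma(e)^2)) via the reparameterization trick."""
    mu = text_embedding @ W_mu
    logvar = text_embedding @ W_logvar
    eps = rng.standard_normal(COND_DIM)
    c = mu + np.exp(0.5 * logvar) * eps
    # A KL-divergence term against N(0, I) is added to the generator loss,
    # which keeps the conditioning manifold smooth.
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return c, kl

e = rng.standard_normal(EMB_DIM)  # stand-in for a real caption embedding
c, kl = conditioning_augmentation(e)
```

Because the conditioning vector is sampled, the same caption can map to many nearby points, which is exactly the smoothing behavior described above.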
The image below shows how the conditioning augmentation trick enables the model to generate different images with changes in the input vector z. In the top row (without the conditioning augmentation trick), the model collapses to the same image due to the discontinuity in the text embedding space previously mentioned.
In addition to the Conditioning Augmentation, the StackGAN model demonstrates the application of a multi-scale architecture. The Stage-I model upsamples a random vector and the augmented text embedding into a 64x64 image. The Stage-II model is conditioned on both the Stage-I image and the text embedding, and upsamples to 256x256. The full architecture is pictured below:
There is a lot going on in this diagram; following is a list of takeaways from this picture:
- Note the parametric embedding layers in the Conditioning Augmentation block; the mean and diagonal covariance parameters are updated during training.
- Note the difference in architecture between the Stage-I and Stage-II generators. The Stage-II generator is a more complex encoder-decoder model, whereas the Stage-I generator simply upsamples its input.
- Note the progressive design, in which the 64x64 image is fed into the Stage-II GAN alongside the augmented text embedding.
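The two-stage data flow can be sketched at the level of tensor shapes alone. This NumPy toy replaces the learned convolutions with nearest-neighbor upsampling and a crude fusion step, so it only illustrates how resolutions grow through the pipeline; all dimensions and function names are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

Z_DIM, COND_DIM = 100, 128  # hypothetical noise / conditioning sizes

def stage1_generator(z, c):
    """Stand-in Stage-I: noise + conditioning -> 64x64 image (shapes only)."""
    x = np.concatenate([z, c])                      # (228,)
    x = np.resize(x, (4, 4, 3))                     # stand-in learned projection
    for _ in range(4):                              # 4 -> 8 -> 16 -> 32 -> 64
        x = x.repeat(2, axis=0).repeat(2, axis=1)   # nearest-neighbor upsample
    return x                                        # (64, 64, 3)

def stage2_generator(img64, c):
    """Stand-in Stage-II: downsample, fuse with text conditioning, upsample."""
    x = img64[::4, ::4]            # stand-in encoder: 64 -> 16
    x = x + c.mean()               # crude stand-in for fusing in the embedding
    for _ in range(4):             # 16 -> 32 -> 64 -> 128 -> 256
        x = x.repeat(2, axis=0).repeat(2, axis=1)
    return x                       # (256, 256, 3)

z = rng.standard_normal(Z_DIM)
c = rng.standard_normal(COND_DIM)
img64 = stage1_generator(z, c)
img256 = stage2_generator(img64, c)
```

Note how Stage-II first encodes the 64x64 input down before upsampling, mirroring the encoder-decoder structure called out above, while Stage-I only ever upsamples.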
This multi-scale architecture is one of the dominant trends in high-resolution image synthesis with GANs. The technique is also employed in NVIDIA’s famous Progressively Growing GAN model. The basic idea of breaking high-resolution image synthesis into more tractable sub-problems is very intuitive. The authors of BigGAN find that a multi-scale model is unnecessary for their experiments on ImageNet, but they also use enormous amounts of compute along with advanced techniques such as spectral normalization.
It is additionally interesting to contrast work on super-resolution with the multi-scale architectures used in image synthesis models such as StackGAN. The authors comment that super-resolution techniques can only add limited detail, whereas the Stage-II GAN dramatically alters the Stage-I images. Below is a series of images showing how the outputs change from Stage-I to Stage-II in the StackGAN:
Below is an individual example for further analysis:
The example above demonstrates the authors’ vision of a sketch-refinement process across the Stage-I to Stage-II model. The image on top (produced by Stage-I) is very coarse and doesn’t contain nearly the detail of the image below it. Additionally, we can see the Stage-II model incorporate text information not evident in the Stage-I image, such as ‘with a brown beak’.
There are two more demonstrations in the paper of the kind that should necessarily accompany novel GAN results. These are amongst the evaluation practices discussed in well-known GAN papers such as “Are GANs Created Equal?” and “A Note on the Evaluation of Generative Models”. The first is a nearest-neighbor analysis to ensure that the GAN model is not overfitting.
Below is a generated sample (left) and its five nearest neighbors in the training data-set.
The generated bird is quite similar to its third nearest neighbor; however, the comparison does not show dramatic overfitting.
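A nearest-neighbor check like this is easy to sketch. In the snippet below the feature vectors are random stand-ins; in practice they would come from a pretrained image encoder applied to the training set and to the generated sample, and the feature dimension and function name are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in features; in practice these come from a pretrained image
# encoder applied to the training set and to one generated sample.
train_feats = rng.standard_normal((1000, 256))
sample_feat = rng.standard_normal(256)

def nearest_neighbors(query, bank, k=5):
    """Indices of the k training images closest to the query in L2 distance."""
    dists = np.linalg.norm(bank - query, axis=1)
    return np.argsort(dists)[:k]

neighbor_idx = nearest_neighbors(sample_feat, train_feats)
# If the generated sample were (near-)identical to its top neighbor,
# that would suggest memorization rather than generalization.
```

Retrieving the five closest training images and inspecting them by eye is exactly the qualitative overfitting check shown in the figure above.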
Following is an image demonstrating the capability of latent-space interpolation in the text-embedding space, shown as incremental transitions from one description to another:
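The interpolation itself is just a linear walk between two caption embeddings. In this sketch the embeddings are random stand-ins and the dimension is hypothetical; each interpolated row would then be passed through Conditioning Augmentation and into the generator.

```python
import numpy as np

rng = np.random.default_rng(3)

EMB_DIM = 1024  # hypothetical text-embedding size

# Stand-ins for the embeddings of two captions,
# e.g. "a small blue bird" and "a large red bird".
emb_a = rng.standard_normal(EMB_DIM)
emb_b = rng.standard_normal(EMB_DIM)

def interpolate(a, b, steps=5):
    """Linearly interpolate between two text embeddings."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1 - t) * a + t * b for t in ts])

path = interpolate(emb_a, emb_b)
# Generating an image from each row of `path` produces the kind of
# incremental visual transition shown in the figure.
```

Smooth visual transitions along this path are evidence that the conditioning space is continuous, which is precisely what the Conditioning Augmentation trick is designed to encourage.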
Concluding Thoughts from Henry AI Labs
With the recent release of image synthesis models such as NVIDIA’s GauGAN, which can convert rough sketches to photo-realistic images, it is becoming increasingly useful to develop natural-language interfaces to these models. We are also interested in the engineering dichotomy in GAN research between multi-scale / progressively-growing architectures and models that generate a high-resolution image directly from a noise vector (albeit typically alongside many conditioning variables). StackGAN is one of the most interesting papers in GAN research, and the Conditioning Augmentation trick is a great example of how multivariate Gaussians can be used to smooth a high-dimensional, discontinuous space. We are very interested in seeing the progression of text-to-image synthesis and hope this post helped improve your understanding of the StackGAN model. Thanks for reading and checking out Henry AI Labs!
Special thanks to Dr. Praveen Narayanan and Jeremy List for their contributions to the Quora discussion on this topic (Quora Post).