Introduction:

While reading novels, I have always been curious about how the characters in them would look in reality. Imagining an overall persona is still viable, but pinning the description down to the finest details is quite challenging and often varies in interpretation from person to person. Many times, I end up imagining a very blurry face for the character until the very end of the story. It is only when the book gets adapted into a movie that the blurry face gets filled in with details. For instance, I could never imagine the exact face of Rachel from the book ‘The Girl on the Train’. But when the movie came out, I could relate to Emily Blunt’s face as being the face of Rachel. Casting professionals must put in a lot of effort to get the characters from the script right.

This problem inspired and incentivized me to find a solution for it. Thus began a search through the deep learning research literature for something similar. Fortunately, there is abundant research on synthesizing images from text. The following are some of the papers I referred to:

“Generative Adversarial Text to Image Synthesis” — https://arxiv.org/abs/1605.05396

“StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks” — https://arxiv.org/abs/1612.03242

“StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks” — https://arxiv.org/abs/1710.10916

After the literature study, I came up with an architecture that is simpler than StackGAN++ and quite apt for the problem being solved. In the subsequent sections, I will explain the work done and share the preliminary results obtained so far. I will also mention some of the coding and training details that took me some time to figure out.