Any sufficiently advanced technology is indistinguishable from magic

— Arthur C. Clarke

Fast-paced fields such as machine learning and extended reality are growing at a truly astounding rate, at times leaving science fiction in the dust. Many big-budget films have stretched the limits of our imagination and used the state of the art during production. Looking back, however, I have noticed examples where today’s tech has progressed miles ahead of its fictional counterparts from previous decades.

Shankar is well known for elaborate visual effects and futuristic technology in his movies. One of my favorite Shankar classics is Jeans, whose production started all the way back in 1996. What is truly amazing is that the amount of VFX time in this movie was more than in Jurassic Park!

Movie Poster (Source: Wikipedia)

What caught my attention, even back in the day, was this song in particular, which showcases some badass real-time motion capture and augmented reality to create another copy of Aishwarya Rai. Do check out the video below to see how awesome it was for its time.

Inspired by the song from Jeans (1998) — also a trip down nostalgia lane

But to someone in 2020, this seems very reproducible with a laptop and a GPU. Even better, there is no need to wire yourself up for motion capture when you have 3D pose estimation linked with Unity!

Let’s start with a sub-problem. Can we capture the facial expressions first?

I mean, facial expressions are a key aspect of dance, particularly Indian dance, with emphasis on the Navarasas or the 9 emotions.

The 9 Emotions of Indian Dance (Source: Pinterest)

Keypoint detection for facial expressions is very much in vogue and is a standard illustrative example of deep learning and computer vision techniques. We also need an estimate of the bounding box and head tilt to map the face to a 3D character. I tried that out on a snippet from the video.
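To make the bounding box and head tilt a bit more concrete, here is a minimal sketch that derives both from 2D key points. The function names and the sample coordinates are made up for illustration, and it assumes the detector returns (x, y) pixel pairs with y growing downwards. It only covers the in-plane tilt (roll); a full 3D head pose needs more than this.

```python
import math

def bounding_box(points):
    """Return (x_min, y_min, x_max, y_max) enclosing all (x, y) key points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def head_roll_degrees(left_eye, right_eye):
    """In-plane head tilt from the line joining the two eye centres.

    left_eye is the eye that appears on the left of the image.
    0 means the eyes are level; with image coordinates (y pointing down),
    a positive value means right_eye sits lower in the image.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

# Hypothetical key points: two eye centres, nose tip, two mouth corners
keypoints = [(120, 100), (180, 110), (150, 140), (130, 170), (175, 175)]
box = bounding_box(keypoints)
roll = head_roll_degrees(keypoints[0], keypoints[1])
```

With these sample points the box spans the eyes down to the mouth corners, and the roll comes out at a little under 10 degrees because the second eye sits slightly lower than the first.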

The key points can then be combined with a 3D avatar in Unity to create a real-time mapping of expressions from our video. Do have a look at the awesome VTuber_Unity repository. We will work with the equally talented Unity Chan as a placeholder for Aishwarya Rai.
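How the detector talks to Unity is not spelled out here, so the following is only a sketch of one plausible approach: smooth the per-frame values (say roll, pitch, yaw and mouth openness) to damp jitter, then push them over a local TCP socket to a script listening on the Unity side. The port, the message format, and all the names are my own assumptions, not taken from VTuber_Unity.

```python
import socket

class Smoother:
    """Exponential moving average to damp frame-to-frame jitter."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha   # weight given to the newest frame
        self.state = None    # last smoothed values, None before the first frame

    def update(self, values):
        if self.state is None:
            self.state = list(values)
        else:
            self.state = [self.alpha * v + (1 - self.alpha) * s
                          for v, s in zip(values, self.state)]
        return self.state

def encode_frame(values):
    """Pack one frame of parameters as a space-separated text line (assumed format)."""
    return (" ".join(f"{v:.4f}" for v in values) + "\n").encode("utf-8")

def stream_to_unity(frames, host="127.0.0.1", port=5066):
    """Send each smoothed frame to a TCP listener; host and port are placeholders."""
    smoother = Smoother(alpha=0.5)
    with socket.create_connection((host, port)) as conn:
        for values in frames:
            conn.sendall(encode_frame(smoother.update(values)))
```

A lower alpha gives a calmer avatar at the cost of some lag, so it is worth tuning against the frame rate you actually get.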

Connecting live stream to Unity character (Source: VTuber_Unity)

Similarly, we can replace the face-only keypoint detection with human pose estimation networks. Given an image of a person, these output the key points for the entire body.
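Body key points on their own are just coordinates; to drive a character you typically turn them into something like joint angles. Here is a minimal sketch of that step for a single joint, assuming 2D (x, y) key points. The function name and the sample coordinates are hypothetical and not tied to any particular pose estimation network.

```python
import math

def joint_angle_degrees(a, b, c):
    """Angle at joint b (in degrees) formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("key points coincide; angle is undefined")
    # Clamp to guard against tiny floating point overshoots outside [-1, 1]
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

# Hypothetical 2D key points for one arm, in pixels
shoulder, elbow, wrist = (100, 100), (100, 200), (200, 200)
elbow_angle = joint_angle_degrees(shoulder, elbow, wrist)
```

Here the upper arm points straight up from the elbow and the forearm points sideways, so the elbow is bent at a right angle. The same function works for knees, shoulders, or any other triple of connected key points.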

This repository* provides a working version of this functionality and achieves a decent framerate when using a GPU for inference.

Demo of Pose Estimation + Unity (Source: Jacob’s Tech)

And welcome to the future!

If you have seen any cool tech in movies and are wondering whether it can be replicated today, please drop a comment.