For decades machines have been able to understand simple musical features like beats per minute. Now AI is boosting their abilities to the point that they can work out not only what genre of music is playing, but also how to dance to it appropriately.

It’s obvious that the dancing style in an EDM club is very different from the way people waltz in a hotel ballroom. And even if you’re no country music fan, your foot may tap and your head softly sway when you hear the nostalgic “Country Roads” chorus. How our bodies respond to diverse musical stimuli seems almost instinctual. So how can that be taught to a machine?

Researchers from the University of California, Merced and NVIDIA have introduced a synthesis-by-analysis learning framework, Music2Dance, which can generate “style-consistent, and beat-matching dances” for different musical genres. This work will be presented at NeurIPS 2019 next month in Vancouver.

The researchers introduce a novel decomposition-to-composition framework that can transform basic body movements into complex dances conditioned on music. The decomposition phase learns how to perform basic dance moves by defining and normalizing dance units, which a kinematic beat detector segments from videos of real dance sequences. In the composition phase, a music-to-movement generative adversarial network (GAN) generates music-conditioned dance moves. The framework extracts style and beats from the input music, synthesizes dance units in a recurrent manner, and applies a “beat warper” to the generated dance unit sequence to render the final output dance.

Schematic overview of the decomposition-to-composition framework.
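
To make the ideas of a kinematic beat detector and a beat warper more concrete, here is a minimal Python sketch. It is not the authors’ implementation. It assumes that a kinematic beat is a frame where the average joint motion hits a local minimum, and that warping can be approximated by piecewise-linear retiming of frames so that each kinematic beat lands on a music beat. The function names, the `min_drop` parameter, and the 2D pose format are all hypothetical.

```python
import math


def motion_magnitude(poses):
    """Mean per-joint displacement between consecutive frames.

    poses: list of frames, each frame a list of (x, y) joint positions.
    Returns a list with len(poses) - 1 entries.
    """
    mags = []
    for prev, cur in zip(poses, poses[1:]):
        total = sum(math.dist(p, c) for p, c in zip(prev, cur))
        mags.append(total / len(cur))
    return mags


def kinematic_beats(poses, min_drop=0.0):
    """Frame indices where motion slows to a local minimum (assumed beat cue)."""
    mags = motion_magnitude(poses)
    beats = []
    for i in range(1, len(mags) - 1):
        is_minimum = mags[i] < mags[i - 1] and mags[i] <= mags[i + 1]
        if is_minimum and mags[i - 1] - mags[i] >= min_drop:
            # mags[i] measures the motion from frame i to frame i + 1,
            # so the slowdown is reported at frame i + 1.
            beats.append(i + 1)
    return beats


def warp_frame_times(n_frames, kin_beats, music_beats):
    """Give each frame a timestamp so the k-th kinematic beat falls on the
    k-th music beat, interpolating linearly between (and beyond) beats.

    kin_beats must be strictly increasing frame indices.
    """
    anchors = list(zip(kin_beats, music_beats))
    if len(anchors) < 2:
        raise ValueError("need at least two beat pairs to warp")
    times = []
    seg = 0
    for f in range(n_frames):
        # Move to the next segment once the frame passes its right anchor.
        while seg < len(anchors) - 2 and f > anchors[seg + 1][0]:
            seg += 1
        (f0, t0), (f1, t1) = anchors[seg], anchors[seg + 1]
        times.append(t0 + (f - f0) * (t1 - t0) / (f1 - f0))
    return times


# Toy example: one joint that alternates between fast and slow steps.
toy_poses = [[(x, 0.0)] for x in (0.0, 3.0, 4.0, 7.0, 8.0, 11.0)]
toy_beats = kinematic_beats(toy_poses)
toy_times = warp_frame_times(len(toy_poses), toy_beats, [1.0, 2.0])
```

In the toy example the joint slows down on every second step, so the detector marks those frames, and the warper then stretches the frame timeline so that those frames coincide with music beats at 1.0 and 2.0 seconds. A real system would work on full skeletons and use a more robust detector than a plain local-minimum test.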

Researchers compared their decomposition-to-composition framework with baselines such as LSTM and Aud-MoCoGAN on metrics that included motion realism, style consistency, diversity, multimodality, beat coverage and hit rate. The researchers’ proposed framework produced dances that were more realistic, diverse, and better synchronized with the music.
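
The two beat-related metrics can be illustrated with a short Python sketch. It assumes that beat coverage is the ratio of kinematic beats to music beats, and that hit rate is the share of kinematic beats that fall close to some music beat. These definitions are a plausible reading rather than a quotation from the paper, and the `tolerance` window and function name are hypothetical.

```python
def beat_metrics(kinematic_beats, music_beats, tolerance=0.1):
    """Return (beat_coverage, beat_hit_rate) for beat times given in seconds.

    beat_coverage: number of kinematic beats / number of music beats.
    beat_hit_rate: fraction of kinematic beats within `tolerance` seconds
                   of at least one music beat (assumed alignment rule).
    """
    if not kinematic_beats or not music_beats:
        return 0.0, 0.0
    aligned = sum(
        1
        for k in kinematic_beats
        if min(abs(k - m) for m in music_beats) <= tolerance
    )
    coverage = len(kinematic_beats) / len(music_beats)
    hit_rate = aligned / len(kinematic_beats)
    return coverage, hit_rate


# Toy example: four music beats, three kinematic beats, two of them on time.
coverage, hit_rate = beat_metrics([0.52, 1.3, 2.05], [0.5, 1.0, 1.5, 2.0])
```

Under these assumed definitions, a dance with few detected beats scores low on coverage even if every beat is on time, while a dance with many off-beat movements scores low on hit rate. That is why the two numbers are reported together.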

The researchers plan to collect and incorporate additional dancing styles such as pop-dance and partner dances in the future.

Comparison of the generated dances (left) and examples of multimodal generation (right)

Earlier this year, MIT CSAIL also conducted interesting research on cross-modal learning between audio and video. In their paper Speech2Face: Learning the Face Behind a Voice, researchers design and train a deep neural network to reconstruct facial images of people based on their short speech audio recordings.

Although AI researchers are not generally known as party animals, they do seem to have a passion for dance. In October Synced published the story Shake Your Booty: AI Deepfakes Dance Moves From a Single Picture, which reports on ShanghaiTech University and Tencent AI Lab researchers’ 3D body mesh recovery module Liquid Warping GAN, which can deepfake dance moves from a single picture. The study is presented in the paper Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis.

The August Synced story Silicon Night Fever: Berkeley AI Gets Down meanwhile introduces the UC Berkeley paper Everybody Dance Now, which proposes a video-to-video translation approach for dance moves.

The paper Dancing to Music is on arXiv. There is also a project GitHub.