From algorithms that can automatically tag you in photos, to face recognition embedded in city surveillance networks, to voice generators that can put words in people's mouths, AI is dismantling privacy. A new tool is peeling back the curtain a little more, with a method to figure out what your face looks like from your voice.

In research published on arXiv, a repository for papers that have not yet been peer reviewed, MIT researchers describe a way to reconstruct a rough likeness of a speaker from a short audio clip. The paper, "Speech2Face: Learning the Face Behind a Voice," explains how they took a dataset of millions of clips from YouTube and trained a neural network-based model to learn the vocal attributes associated with facial features in those videos. When the system hears a new sound bite, it can use what it has learned to guess what the speaker's face might look like.
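The pipeline described above can be sketched in miniature. The snippet below is only a shape-level illustration with made-up dimensions and random stand-in weights, not the authors' model: the real Speech2Face system uses a convolutional voice encoder and a pretrained face decoder, both learned from the YouTube data. The two-stage idea, though, is simply spectrogram in, face-feature vector in the middle, image out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: mel-spectrogram bins, face-feature size,
# and output image width. These numbers are illustrative only.
N_MEL, FACE_DIM, IMG = 64, 512, 32

# Random stand-in "weights" for the two learned stages. In the real
# system these would be trained on millions of YouTube clips.
W_encoder = rng.normal(size=(N_MEL, FACE_DIM)) * 0.01          # voice -> face features
W_decoder = rng.normal(size=(FACE_DIM, IMG * IMG * 3)) * 0.01  # face features -> pixels

def speech_to_face(spectrogram: np.ndarray) -> np.ndarray:
    """Map a (time-frames x mel-bins) spectrogram to an RGB image guess."""
    pooled = spectrogram.mean(axis=0)             # average over time frames
    face_features = np.tanh(pooled @ W_encoder)   # the learned voice embedding
    pixels = 1.0 / (1.0 + np.exp(-(face_features @ W_decoder)))  # squash to [0, 1]
    return pixels.reshape(IMG, IMG, 3)

clip = rng.random((300, N_MEL))  # a few seconds of fake audio features
face = speech_to_face(clip)
print(face.shape)
```

With random weights the output is noise; the point is only that the face prediction is a deterministic function of the voice features, which is what makes the privacy question concrete.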

The researchers, led by MIT postdoctoral researcher Tae-Hyun Oh, briefly acknowledge the privacy concerns in the paper, explaining in an "Ethical Consideration" section that Speech2Face was trained to capture common visual attributes like gender and age, and only when there was enough evidence in the voice to support them. In other words, the system is not trying, and not able, to produce images of specific people.

Still, the researchers speculate, the AI “may support useful applications, such as attaching a representative face to phone/video calls based on the speaker’s voice.”

The resulting images are certainly rough. They are nowhere near the quality of the computer-generated composites that police departments release to find missing children or crime suspects, but many of them land in the right ballpark for age, ethnicity, and gender. Previous research has explored methods for predicting age and gender from speech; in this case, the researchers claim they have also detected correlations with certain facial patterns. "Beyond these dominant features, our reconstructions reveal non-negligible correlations between craniofacial features (e.g., nose structure) and voice," they write.

The system struggled with people of certain identities, however. In the ethics section, the researchers acknowledge cases where attributes like spoken language or voice pitch led the model to make highly erroneous associations and approximations of what the speaker looks like. This reflects the limits of machine learning, and the limits of the premise that a voice can predict a face beyond basic stereotypes. With enough data, AI can find insignificant patterns anywhere.