Driving in a busy city, you have to get good at scrutinizing the body language of pedestrians. Your foot hovers somewhere between the gas and the brake, waiting for your brain to triangulate their intent: Is that one trying to cross the street, or just waiting for the bus? Still, a whole lot of the time you hit the brakes for nothing, ending up in a kind of dance with the pedestrian (you go, no you go, no YOU go).

If you think that’s frustrating, then you’ve never been a self-driving car. As human drivers slowly go extinct (and human pedestrians don’t), autonomous vehicles will have to get better at decoding those unspoken intersection interactions. So a startup called Perceptive Automata is tackling that looming problem. The company says its computer vision system can scrutinize a pedestrian to determine not only their awareness of an oncoming car but also their intent—that is, it uses body language to predict behavior.

Typically, if you want a machine to recognize something like trees, you first have humans label tens of thousands of pictures: trees or not trees. It’s a nice, neat binary, and it gives machine-learning algorithms a base level of knowledge. But detecting human body language is more complex.
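That conventional pipeline boils each image down to a single yes/no target. A minimal sketch of what that labeled data looks like (all file names and field names here are hypothetical, for illustration only):

```python
# Hypothetical sketch of conventional binary labeling: each image gets
# one yes/no target, which a classifier can learn directly.

labeled_images = [
    {"file": "img_001.jpg", "is_tree": 1},  # human labeled: tree
    {"file": "img_002.jpg", "is_tree": 0},  # human labeled: not a tree
]

def to_training_pair(example):
    """Reduce a labeled example to (input, binary target)."""
    return example["file"], example["is_tree"]

pairs = [to_training_pair(e) for e in labeled_images]
```

The hard 0/1 target is exactly what breaks down for body language, where the honest answer is often “not crossing, but clearly wants to.”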

“In the case of a pedestrian, it's not, this person is crossing the road and this person isn't crossing the road. It's, this person isn't crossing the road but they clearly want to,” says Sam Anthony, co-founder of Perceptive Automata. Is the person looking down the road at oncoming traffic? If they’ve got grocery bags, have they set them down to wait, or are they mid-hoist, getting ready to cross?

Perceptive trains its models to look at those kinds of behaviors. They begin with human trainers, who watch and analyze videos of different pedestrians. Perceptive will take a clip of, say, a human looking down the street to cross the road, and manipulate it hundreds of ways—obscuring portions of it, for instance. Maybe sometimes the head is easier to see, maybe sometimes it’s harder. Then they depart from the tree-not-tree binary by asking the trainers a range of questions, such as, "Is that pedestrian hoping to eventually cross the street?" or “If you were that cyclist, would you be trying to stop the car from passing?”
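One common way to represent that kind of graded question, rather than a hard binary, is to aggregate many trainers’ answers into a soft label per clip. The article doesn’t describe Perceptive’s actual encoding, so this is a sketch under assumed conventions (a hypothetical 1–5 rating scale and made-up clip names):

```python
from statistics import mean

# Hypothetical sketch: trainers rate each manipulated clip on a 1-5 scale
# for a question like "Is that pedestrian hoping to eventually cross?"
# Instead of a 0/1 label, each clip gets a graded target in [0, 1].

ratings = {
    "clip_017_head_obscured": [4, 5, 3, 4],
    "clip_017_full_view":     [5, 5, 4, 5],
}

def soft_label(scores, lo=1, hi=5):
    """Map the mean trainer rating onto [0, 1] as a graded training target."""
    return (mean(scores) - lo) / (hi - lo)

targets = {clip: soft_label(scores) for clip, scores in ratings.items()}
```

A model trained against these soft targets can output “this person isn’t crossing but clearly wants to,” which a hard binary cannot express.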

When different parts of the image are harder to see, the human trainers have to think harder about their judgments of body language, which Perceptive can measure by tracking eye movement and hesitation. Maybe the head is harder to make out, for example, and the trainer has to put more thought into it. “This tells us that there's information about the appearance of the person's head in this particular slice that's an important part of how people judge whether that person in that training video is going to cross the street,” Anthony says.
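The hesitation signal can be turned into a per-region importance score: if trainers deliberate longer when a region is obscured, that region probably carries the cue they rely on. A sketch of that idea, with assumed baseline and timing numbers (not Perceptive’s actual measurements):

```python
from statistics import mean

# Hypothetical sketch: compare trainer response times on clips with a
# region obscured against a full-view baseline. Extra deliberation
# suggests the obscured region was informative.

BASELINE_MS = 650  # assumed mean response time with nothing obscured

response_times_ms = {
    "head_obscured": [980, 1100, 1050],
    "feet_obscured": [700, 680, 720],
}

def region_importance(times, baseline=BASELINE_MS):
    """Extra deliberation in ms vs. baseline; larger = more informative region."""
    return max(mean(times) - baseline, 0.0)

importance = {region: region_importance(t) for region, t in response_times_ms.items()}
```

Under these made-up numbers, obscuring the head costs trainers far more thinking time than obscuring the feet, matching Anthony’s point that the head carries crossing-intent information.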

The head is clearly an important clue for human observers, so it’s also an important clue for the machines. “So when the model saw a novel image where the head was important,” Anthony says, “it would be primed based on the training data to believe that people would likely really care about the pixels around the head area, and would produce an output that captured that human intuition.”

By considering cues like where the pedestrian is looking, Perceptive can quantify awareness and intent. A person walking down the sidewalk with their back to the car, for example, isn’t anything to worry about—both unaware and not intending to cross the street. But someone standing at a crosswalk peering down the street is another story. This insight would give a self-driving car extra time to slow down in case the pedestrian does decide to make a run for it.
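The two quantities the article describes, awareness and intent, would feed a downstream driving decision. The examples in the paragraph above can be sketched as a toy decision rule; the thresholds and labels here are illustrative assumptions, not Perceptive’s actual planner logic:

```python
# Hypothetical sketch: fold two model scores in [0, 1] into a simple
# caution signal for the vehicle. Thresholds are illustrative only.

def caution_level(awareness: float, intent: float) -> str:
    """Back turned, no intent -> proceed; clear intent to cross -> slow early."""
    if intent > 0.7:
        return "slow_down"          # likely to cross; buy braking time now
    if intent > 0.3 and awareness < 0.5:
        return "cover_brake"        # may step out without having seen the car
    return "proceed"

# Pedestrian walking down the sidewalk, back to the car:
caution_level(awareness=0.1, intent=0.05)  # -> "proceed"
# Pedestrian at a crosswalk, peering down the street:
caution_level(awareness=0.9, intent=0.8)   # -> "slow_down"
```

The payoff is exactly the one the article names: acting on intent early gives the car extra time to slow down before the pedestrian commits.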