Deb Roy and Rupal Patel pulled into their driveway on a fine July day in 2005 with the beaming smiles and sleep-deprived glow common to all first-time parents. Pausing in the hallway of their Boston home for Grandpa to snap a photo, they chattered happily over the precious newborn son swaddled between them.

This normal-looking suburban couple weren’t exactly like other parents. Roy was an AI and robotics expert at MIT, Patel an eminent speech and language specialist at nearby Northeastern University. For years, they had been planning to amass the most extensive home-video collection ever.

From the ceiling in the hallway blinked two discreet black dots, each the size of a coin. Further dots were located over the open-plan living area and the dining room. There were 25 in total throughout the house – 14 microphones and 11 fish-eye cameras, part of a system primed to launch on their return from hospital, intended to record the newborn’s every move.

It had begun a decade earlier in Canada – but in fact Roy had built his first robots when he was just six years old, back in Winnipeg in the 1970s, and he’d never really stopped. As his interest turned into a career, he wondered about android brains. What would it take for the machines he made to think and talk? “I thought I could just read the literature on how kids do it, and that would give me a blueprint for building my language and learning robots,” Roy told me.

Over dinner one night, he boasted to Patel, who was then completing her PhD in human speech pathology, that he had already created a robot that was learning the same way kids learn. He was convinced that if it got the sort of input children get, the robot could learn from it.

Toco was little more than a camera and microphone mounted on a Meccano frame, and given character with ping-pong-ball eyes, a red feather quiff and crooked yellow bill. But it was smart. Using voice recognition and pattern-analysing algorithms, Roy had painstakingly taught Toco to distinguish words and concepts within the maelstrom of everyday speech. Where previously computers learned language digitally, understanding words in relation to other words, Roy’s breakthrough was to create a machine that understood their relationship to objects. Asked to pick out the red ball among a range of physical items, Toco could do it.

Patel ran an infant lab in Toronto and Roy flew up there to see what he could learn. Observing the mothers and babies at play, he realised he’d been teaching Toco badly. “I hadn’t structured my learning algorithm correctly,” he explained to Wired magazine in 2007. “Every parent knows that when you’re talking to an 11-month-old, you stay on a very tight subject. If you’re talking about a cup, you stick to a cup and you interact with the cup until the baby gets bored and then the cup goes away.”

His robot had been searching through every phoneme it had ever heard when it was learning a new object, but Roy tweaked its algorithm to give extra weight to its most recent experiences, and began to feed it audio from Patel’s baby lab recordings. Suddenly Toco began to build a basic vocabulary at a rate never seen before in AI research. His dream of “a robot that can learn by listening and seeing objects” felt closer than ever. But it needed to feed on recordings, and these were hard to find.

An image from Deb Roy and Rupal Patel’s project to record their infant son’s first years. Photograph: MIT Media Lab

No one had ever truly studied “in the wild” what happens to a child in those first crucial years. The norm for researchers was weekly hour-long observation sessions – that was how Patel studied mothers and infants in her lab. If you were going to study the way a baby learned to talk, you’d need someone eccentric enough to rig up a house with hidden recording devices.

I first heard about Patel and Roy’s experiment while working as a teacher at a London comprehensive. Most of the children I taught arrived at school aged 11 far behind where they were expected to be with their language, and as a novice I struggled to help them catch up. Whereas everything I tried seemed outdated, Roy’s approach was scientific. I hoped his findings would unlock a secret that could help kids to realise their full potential. If we could create machines that learned like humans, could we also develop ones that could help us perfect human learning?

Before pressing record, Roy and Patel agreed some ground rules. The recordings would be available only to their most trusted inner circle of researchers. If at any time they felt uncomfortable with the filming, they would junk the footage. When privacy was required, the system could be temporarily shut down. It was a leap of faith, but they agreed it was worth it. Their experiment had the power to unlock new insight into the workings of the infant mind.

Toco was Pinocchio to Roy’s Geppetto. But whereas he was wondering what real kids could teach robots, I wanted to know if those home videos might hint at how to enhance learning for the youngest humans.

In 1995, two researchers, Betty Hart and Todd Risley, published the results of a study in which they trailed 42 Kansas City families to compare the experiences of preschoolers from poor families with their richer peers. Starting when the infants were nine months old, they observed them regularly over a two-and-a-half-year period, recording and transcribing all parent-and-child speech during their hour-long visits. The findings were stark. The number of words a child heard by their third birthday strongly predicted academic success aged nine. The difference was barely fathomable. They estimated that, at the age of four, the richest kids had heard 30m more words than the poorest.

“The problem of skill differences among children at the time of school entry is bigger, more intractable and more important than we thought,” Hart and Risley said. Their research showed it was worth intervening as early as possible. “The longer the effort is put off, the less possible change becomes.”

If the problem was stark, the solution seemed simple. There was a gap, and it had to be filled with words. Hart and Risley’s findings fuelled a word-rush that endures today. Across the English-speaking world, parents flocked to buy flashcards and brain trainers for their tots.

But my experience in the classroom suggested that the interpretation was a little simplistic, equating the development of the human mind with the inputs and outputs of computers. I suspected that there was more to infant learning than the quantity of words you heard.

A professor of early-childhood development at Temple University in Pennsylvania, Kathy Hirsh-Pasek, seemed to agree. She had written that “just as the fast food industry fills us with empty calories, what we call the ‘learning industry’ has convinced many among us that the memorisation of content is all that is needed for learning success and joyful lives”. She had also written an influential book that laid out her reservations about the word-rush: Einstein Never Used Flashcards: How Children Really Learn and Why They Need to Play More and Memorize Less. I thought she might have some answers.

Hirsh-Pasek is legendary in the field of early child development. The author of 12 books and hundreds of academic articles, she is a distinguished faculty fellow who runs Temple’s Infant and Child Laboratory, whose slogan is “Where Children Teach Adults”.

At the lab, scientists were putting tiny humans through their paces. Researchers had developed ingenious experiments that measured changes in heart rate to show some of the things that eight-month-olds already knew. “They know the mobile won’t fall on them,” said Hirsh-Pasek. “They know that if I drop this plate on the table, the plate won’t go through the table. That’s amazing. They know that if I’m sitting across from you, and you can’t see the bottom part of my body, I still have one.”

Until recently, scientists had tended to think of infants as irrational, illogical and egocentric. In his Principles of Psychology in 1890, William James had described babies’ experience of sensory overload: “The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion.” This understanding had contributed to a mechanistic view of learning, and the idea that the sheer repetition of words was what mattered most. But it wasn’t true.

Even in utero, babies are learning. At that stage, they pick up sounds. One-hour-olds can distinguish their mother’s voice from another person’s. They arrive in the world with a brain primed to learn through sensory stimulation. We are natural-born explorers, ready made for scientific inquiry. We have to understand this if we are to realise our learning potential.

“We enter the world ready to ‘read the perfect cues out of the environment’,” said Hirsh-Pasek. I thought back to Toco. He read the environment, too – or at least what his eye cameras saw and ear microphones heard. But robots can only reach out in ways they have been programmed to, can only learn from stimuli they were instructed to pay attention to. It limits them to a small range of experiences that would shape their behaviours. There is no meaning in their methods. Babies, on the other hand, are social learners.

“We arrive ready to interact with other humans and our culture,” said Hirsh-Pasek. The real genius of human babies is not simply that they learn from the environment – other animals can do that. Human babies can understand the people around them and, specifically, interpret their intentions.

As we evolved, social and cultural transmission became possible. Language was our starting point – the possibility of two beings ascribing a shared meaning to an otherwise abstract concept or symbol. Couldn’t we see the beginnings of this in babies’ behaviour? Infants under a year engaged in proto-conversations with carers. They babbled away, held eye contact, exchanged things, mimicked their expressions or actions. They also experimented with tools, sticking them in their mouths, bashing them on things.

At the Max Planck Institute for Evolutionary Anthropology in Leipzig, Prof Michael Tomasello wrote that our young learn “in an environment of ever-new artefacts and social practices, which, at any one time, represent something resembling the entire collective wisdom of the entire social group throughout its entire cultural history”.

If all of us are to achieve our potential as learners, the question we have to answer is how we ought to shape this environment. Human brains have specially adapted to learn. Our long period of immaturity is a risky evolutionary strategy, making us vulnerable early on to predators or sickness, and delaying for many years our capacity to reproduce, but the payoff is immense. We can actively incorporate enormous amounts of the latest information from our environment and social group into our cognitive development.

Scientists have long recognised the nature-v-nurture debate as fallacy. A huge amount of our brain development takes place in the first three years. In those years, the brain grows in relation to the environment, forming itself in interaction with sensory experience. As Hart and Risley showed in their study of the word gap, that experience can have a huge effect on who that person becomes.

We have evolved to be a species of teachers and learners. Our ability to understand other people arrives around the ninth month, at a moment in their development at which babies begin to check the attention of others by holding or pointing at objects. At a year, they can follow another’s attention, gazing at, touching or listening to the same thing. At 15 months they can direct it. Listen to that! Look over there! Shared attention is the starting point of conscious human learning. It is why infants don’t learn to talk from video, audio or overhearing parental conversations. We haven’t evolved to. That’s why it matters that we talk to our children. It’s also why we can’t learn from robots – yet.

The implication for understanding how we learn sounds like common sense: each generation ought to ensure the next is steeped in their earliest years in the tools, symbols and social practices of the current culture.

In search of the kind of learning environment that might best cultivate our natural abilities, I visited Pen Green Early Childhood Centre, a specialist centre in early child development in the Northamptonshire town of Corby. The outdoor space was cold and overcast, but that wasn’t deterring the children. By a bamboo bush, two small boys splashed at an ever-running tap. “Don’t get me wet!” they squeaked with delight. A teacher bent down to comfort a toddler in a “Be Fast or Come Last” T-shirt. Four small girls were deep in a serious conversation while absent-mindedly digging sand into colourful buckets.

Pen Green had a global reputation for excellence in early child development and family support, a prototype that had inspired successive early-years interventions by government, including Sure Start and Early Excellence. I spoke to the director, Angela Prodger. She had just taken over from the legendary Margy Whalley, who set up the centre in 1983. In the 1980s, Corby was among the UK’s poorest towns, its population of Scottish migrant workers unmoored by the closure of the steelworks for which they had moved south – 11,000 people had been made redundant. The centre was intended as a lifeline for the next generation. Today it serves 1,400 of the UK’s least well-off households.

I asked her about language learning. We knew words mattered, but I’d not heard much talk at playtime. “If we’re not addressing personal, social, emotional development first, you’re not ready to learn,” said Prodger. She explained that before children could acquire the tools of speech and language, you had to ensure they felt a sense of “being and belonging”. Too frequently, she thought, our approaches to early learning skipped these steps. It sounded to me like a nice-to-have, not an essential, but research showed otherwise.

In the 1950s, British psychoanalyst John Bowlby proposed a theory of “attachment”. He hypothesised that infants, unable to regulate their own feelings, were prone to get upset when they were hungry, sad or lonely. A carer was needed to help them “co-regulate” their feelings, which over time would teach the child to self-regulate, provided their early experiences helped them do so. If negative experiences weren’t alleviated with love from a parental figure, they could become established.

The implications for children growing up in poverty-stricken or traumatic environments were significant. This was why Pen Green took care to put the being and belonging of its children first. It also explained some of the behaviour at the school where I had taught. Where I’d missed the signs of kids responding to the stress of the environment in which they were growing up, at Pen Green they worked closely with carers to ensure children built strong, nurturing relationships that would help them thrive in the nursery and ultimately at school. I’d always believed children wanted to wreak havoc. It had never occurred to me that they might simply have been conditioned by their environment to act in a certain way. “Behaviour is always just a sign of children trying to tell you something,” said Prodger.

As we toured the building, Prodger told me that the skill of the practitioners at Pen Green was in learning to attend to what was going on in the minds of the kids, and interpreting it as evidence of what the youngsters were signalling, even before they were able to verbalise it themselves. Children were constantly communicating with us, she told me. We just had to learn to understand.

“It’s about looking,” Prodger said. “What are the children trying to explore? What are they trying to find out?”

Creative play is the foundation on which creativity, language, maths and science are built. If you start too early with flashcards, you lose this developmental stage. “It’s about being free,” Prodger said. “It’s about risk-taking.”

They take the kids out to the forest a few days a week, light fires, let them experiment with scissors and ride BMX bikes. If they want to be outside, they go outside. If they fancy returning to the snug, where the youngest infants roll around, that’s where they go. The environment dictates the learning. The adults aim only to connect and share attention with the children. Reading and writing can wait. Nurseries ought to be as social as possible, and follow kids’ lead in their play. Before kids can get on with learning, we have to ensure they belong.

The children seemed happy here, learning to belong and laying down foundations for their future success through play. And yet I wondered if we couldn’t do still more to accelerate early learning. The implication of Deb Roy’s robot experiment was that every moment counted. Could we afford to leave so much to chance?

“The accident of birth is the greatest source of inequality in the US,” wrote economist James Heckman. It’s equally true in the UK today, where the strongest predictor of academic achievement is how much your parents earn. Though two-thirds of our kids attain a C or above in English and maths GCSEs each year, that number falls to just over a third of kids on free school meals. Heckman has also shown that the best way to tackle this inequality is to invest in children’s development as early as possible in their lives. It isn’t enough to transform schools – we have to start much earlier than that.

At Temple University, Hirsh-Pasek told me that we can’t simply drop kids in front of iPads and expect them to catch up – but that doesn’t mean we should give up entirely on intelligent machines. Some of her lab’s experiments are aimed at closing developmental gaps between rich and poor kids. Others cover topics such as language development and spatial awareness, and all use technology in different ways. “What the machine can’t do is be a partner,” Hirsh-Pasek told me. “It isn’t social. It’s interactive without being adaptive.”

Hirsh-Pasek’s mission was to change the way we thought about learning, especially for the poorest kids. “We had this vision that it was so important to get the basics into poor kids,” she told me. “We thought we should drop recess – even though we know being physical helps kids learn, helps build better brains. And we thought we should just do reading and maths, and cut out the arts and all this superfluous stuff like social studies.”

It weighed heavily on her. Policymakers and laymen had twisted the science to fit their own ends. No scientist thought flashcards worked. No scientist believed you should start learning to read and write at an ever younger age. It was a fantasy of governments. More recent research has added depth to the language lessons of Hart and Risley’s Kansas Study. In 2003, the psychologist Patricia Kuhl experimented with teaching American infants Mandarin. Split into three groups (video, audio and flesh-and-blood teacher) only those with a human tutor learned anything at all. In 2010 a study of the wildly popular Baby Einstein vocabulary-building DVDs (Time called them “Crack for Babies”) revealed that infants who watched them “showed no greater understanding of words from the program than kids who never saw it”. Nor did babies learn words by eavesdropping on parental conversations or listening to In Our Time on Radio 4, however soothing the mellifluous tones of Melvyn Bragg. More than words, it took a human being for a baby to learn language. They could not learn from screens.

Schools are still guilty of ignoring these insights into infant learning. Erika Christakis, early-childhood expert and author of The Importance of Being Little, charts the slow descent in preschool learning from a multidimensional, ideas-based approach to a two-dimensional naming-and-labelling curriculum. Daphna Bassok at the University of Virginia asks if kindergarten is really the new first grade. The expectation that kindergarteners – aged five or six – can read is now commonplace. Yet this is counter to all the evidence. A Cambridge study comparing groups of children who started formal literacy lessons at five and seven found that starting two years earlier made no difference at all to a child’s reading ability aged 11, “but the children who started at five developed less-positive attitudes to reading, and showed poorer text comprehension than those who started later”.

These findings are clear: if you start on the decoding before you have an underlying understanding of story, experience, sensation and emotion, then you become a worse reader. And you like it less. Treat kids like robots during early learning and you put them off for life.

Instead, Hirsh-Pasek wanted kids to embrace the joy in learning and growing up. Apart from kids, her other great love was music. She often used to break into song, especially on the phone to her granddaughter.

In her book, she suggested six Cs for modern learning: collaboration, communication, content, critical thinking, creative innovation and confidence. Truisms, I had thought, but unlike much education policy, drawn from scientific evidence. If I was to take away one thing, she said, it should be that “from the earliest ages, we learn from people”.

It was the same insight that had prompted a pair of suburban scientists to hit the “record” button.

Deb Roy was dressed in black and still looked youthful when we met at MIT. A few flecks of grey in his hair were the only evidence of 11 years of parenthood. Looking back, the Human Speechome Project – as his and Patel’s home-recording experiment had been named – seemed a quirk of turn-of-the-millennium enthusiasm about artificial intelligence. In all, they had captured 90,000 hours of video and 140,000 hours of audio. The 200 terabytes of data covered 85% of the first three years of their son’s life (and 18 months of his little sister’s). But now the footage had been gathering dust. “I still have the whole collection,” he said. “I’m waiting for his wedding day, just to bore the hell out of everyone.”

In a way, it was also a great lost home video. With his team at MIT, Roy had developed new approaches to visualising and studying the data they had captured: “Social Hotspots” showed two tightly knotted lines, visual traces of tender moments in which parent and child came together to chat, learn or explore; “Wordscapes” were snow-capped mountains ranged throughout the living room and kitchen, the highest peaks rising where particular words were most often heard. The tools had turned out to be fantastically lucrative as a means for analysing talk on Twitter. Roy and a graduate student had spent the decade building a new media company.

Roy was now back at MIT. His new group was called the Laboratory for Social Machines. He had given up building robots that would compete with humans and instead turned his attention to the augmentation of human learning. What had changed his mind was the process of actually raising a child.

The first time his son uttered something that wasn’t just babble, Roy was sitting with him looking at pictures. “He said ‘fah’,” Roy explained, “but he was actually clearly referring to a fish on the wall that we were both looking at. The way I knew it was not just coincidence was that right after he looked at it and said it, he turned to me. And he had this kind of look, like a cartoon lightbulb going off – an ‘Ah, now I get it’ kind of look. He’s not even a year old, but there’s a conscious being, in the sense of being self-reflective.”

“I guess, putting on my AI hat, it was a humbling lesson,” he continued. “A lesson of like, holy shit, there’s a lot more here.”

Roy was no longer sure you could bring a robot up like a real human – or that we should even try. It didn’t seem there was much to gain by developing robots that took exactly one human childhood to become exactly like one young adult human. That’s what people did. And that was before you got into imagination or emotions, identity or love – things that were impossible for Toco. Watching his son, Roy had been blown away by “the incredible sophistication of what a language learner in the flesh actually looks like and does”. Infant humans didn’t only regurgitate; they created, made new meaning, shared feelings.

The learning process wasn’t decoding, as he had originally thought, but something infinitely more continuous, complex and social. He was reading Helen Keller’s autobiography to his kids, and had been struck by her epiphany at understanding language for the first time. Deaf and blind after an illness in infancy, Keller was seven years old when she got it. “Suddenly I felt a misty consciousness as of something forgotten,” she wrote, “a thrill of returning thought; and somehow the mystery of language was revealed to me. I knew then that ‘w-a-t-e-r’ meant the wonderful cool something that was flowing over my hand. That living word awakened my soul, gave it light, hope, joy, set it free! Everything had a name, and each name gave birth to a new thought. As we returned to the house every object which I touched seemed to quiver with life.”

Roy had recently started working with Hirsh-Pasek, following her insight that machines might augment learning between humans, but would never replace it.

He had discovered that human learning was communal and interactive. For a robot, the acquisition of language was abstract and formulaic. For us, it was embodied, emotive, subjective, quivering with life. The future of intelligence wouldn’t be found in our machines, but in the development of our own minds.

Natural Born Learners by Alex Beard will be published by Weidenfeld & Nicolson on 12 April. To order a copy for £16.14, go to guardianbookshop.com
