How much can your tweets reveal about you? Judging by the last nine hundred and seventy-two words that I used on Twitter, I’m about average when it comes to feeling upbeat and being personable, and I’m less likely than most people to be depressed or angry. That, at least, is the snapshot provided by AnalyzeWords, one of the latest creations from James Pennebaker, a psychologist at the University of Texas who studies how language relates to well-being and personality. One of Pennebaker’s most famous projects is a computer program called Linguistic Inquiry and Word Count (L.I.W.C.), which looks at the words we use, and in what frequency and context, and uses this information to gauge our psychological states and various aspects of our personality.

Since the creation of the L.I.W.C., in 1993, studies utilizing the program have suggested a close connection between our language, our state of mind, and our behavior. They have shown, for instance, that words used while speed dating can predict mutual romantic interest and desired future contact; that a person’s word choices can reveal her place in a social or professional hierarchy; and that the use of different filler words (“I mean”; “You know”) can suggest whether a speaker is male or female, younger or older, and more or less conscientious. Even the ways in which we use words like “and,” “under,” or “the” can be linked to depression, reactions to stress, social status, cultural norms, gender, and age. “The words we use in natural language reflect our thoughts and feelings in often unpredictable ways,” Pennebaker and his colleague Cindy Chung have written.

Social media seems tailor-made to take this kind of language analysis to the next level. You don’t have to ask for writing samples or diary entries. It’s all already online: tweets, Tumblr posts, and even Instagram captions give researchers access to the language that individuals use on an unprecedented scale. But the world of social-media language analysis is also fraught with difficulties. “The biggest problem with this approach is establishing causality,” Pennebaker said, when I spoke to him last week.

Take a study, out last month, from a group of researchers based at the University of Pennsylvania. The psychologist Johannes Eichstaedt and his colleagues analyzed eight hundred and twenty-six million tweets across fourteen hundred American counties. (The counties contained close to ninety per cent of the U.S. population.) Then, using lists of words—some developed by Pennebaker, others by Eichstaedt’s team—that can be reliably associated with anger, anxiety, social engagement, and positive and negative emotions, they gave each county an emotional profile. Finally, they asked a simple question: Could those profiles help determine which counties were likely to have more deaths from heart disease?

The answer, it turned out, was yes. Counties where residents’ tweets included words related to hostility, aggression, hate, and fatigue—words such as “asshole,” “jealous,” and “bored”—had significantly higher rates of death from atherosclerotic heart disease, including heart attacks and strokes. Conversely, where people’s tweets reflected more positive emotions and engagement, heart disease was less common. The tweet-based model even had more predictive power than other models based on traditional demographic, socioeconomic, and health-risk factors.
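For readers curious about what a word-list profile looks like in practice, here is a minimal sketch in Python. It is not the researchers' code or the L.I.W.C. dictionary: the tiny `LEXICON`, the category names, and the `county_profile` function are all assumptions made for illustration, and only "jealous" and "bored" are taken from the examples above. The idea is simply to count how often words from each list turn up in a county's tweets, as a share of all the words tweeted there.

```python
import re
from collections import Counter

# Hypothetical mini word lists, for illustration only. The real lists
# used in the study are far larger and were built by the researchers.
LEXICON = {
    "negative": {"jealous", "bored", "hate"},
    "positive": {"great", "wonderful", "friends"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def county_profile(tweets):
    """Return, for each category, the share of all words in `tweets`
    that appear on that category's word list."""
    counts = Counter()
    total = 0
    for tweet in tweets:
        for word in tokenize(tweet):
            total += 1
            for category, words in LEXICON.items():
                if word in words:
                    counts[category] += 1
    if total == 0:
        # No words at all: report zero for every category.
        return {category: 0.0 for category in LEXICON}
    return {category: counts[category] / total for category in LEXICON}


profile = county_profile(["So bored today", "I hate Mondays"])
print(profile)
```

A profile like this says nothing about any one person. It is an aggregate, which is why the comparison in the study is between counties rather than between individual tweeters.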

It’s long been known that stress, anger, and loneliness increase the risk of heart attacks and other, often fatal, heart conditions. But that doesn’t make the results of this study any less bizarre. Even the researchers sound a cautionary note: “The people tweeting are not the people dying,” they point out. An individual’s tweets weren’t shown to predict her risk of heart disease; instead, collective negative tweeting in certain parts of the country corresponded to higher mortality rates in those areas. That correlation is especially strange because people who tweet are, on the whole, younger than people who die of heart disease. According to the most recent statistics from the Pew Research Center, around nineteen per cent of American adults use Twitter; of those users, only twenty-two per cent are older than fifty. The risk of heart attacks, on the other hand, increases with age, rising sharply in one’s sixties and continuing to increase through one’s eighties. How can the negative tweeting habits of some young people reveal that unrelated but nearby older people are at risk?

The researchers have a theory: they suggest that “the language of Twitter may be a window into the aggregated and powerful effects of the community context.” They point to other epidemiological studies which have shown that general facts about a community, such as its “social cohesion and social capital,” have consequences for the health of individuals. Broadly speaking, people who live in poorer, more fragmented communities are less healthy than people living in richer, integrated ones. “When we do a sub-analysis, we find that the power that Twitter has is in large part accounted for by community and socioeconomic variables,” Eichstaedt told me when we spoke over Skype. In short, a young person’s negative, angry, and stressed-out tweets might reflect his or her stress-inducing environment—and that same environment may have negative health repercussions for other, older members of the same community.

And yet that story is just speculation: nothing in the study directly examines how stress levels vary from county to county or links the feelings of the Twitter users with the health of their elders. Last week, when I spoke with Pennebaker about these findings, he, too, urged me to be cautious about drawing causal conclusions from the study. (He was not involved in the research and is not affiliated with anyone on the team.) “To say that eighteen-year-olds tweeting hostile messages is associated with the sudden death of their great-grandparents is a fairly big leap of logic,” he said. The relationship could be both statistically significant and something of a fluke. That possibility, however, doesn’t necessarily make the work less valuable: “Even if it ends up that there’s actually no real connection, it forces you to think. What is the causality?” Pennebaker said. Large-scale language analysis might be interesting precisely because it raises questions—not because it answers them.

In the meantime, Eichstaedt’s team is refining its work. The researchers are now collaborating with a group that conducts longitudinal epidemiological research; the plan is to track communities and individuals over time, instead of looking at a high-altitude snapshot. (The tweets in the heart-disease study were all part of a ten-per-cent random sample that Twitter made available for researchers between June, 2009, and March, 2010; ideally, the research would follow individual users for many months, if not years.) Eichstaedt is also in the process of looking at Facebook profiles: Twitter data, he says, casts a wide net, but it isn’t as expressive, deep, and individual as the information on Facebook. Not all big data is created equal.

Eichstaedt’s research is typical of today’s big-data psychology: it’s fascinating, but a work in progress. On the one hand, it’s based on correlation rather than causation; on the other, it may offer a quicker, cheaper window into existing causal models. And, for psychologists, such work is a way of shedding light on bigger cultural and social trends that are difficult to capture through ordinary laboratory research. Pennebaker, for example, is currently using data from Twitter to identify and track how certain values, such as family cohesion and religious faith, shift over time.