I “needed” a backup of my chat conversations, and (of course) I didn’t trust Facebook. After being disappointed by the available solutions and tools for such a task, I just decided to rely on the good old DIY technique.

Once I got a simple working scraper, I felt the need for at least a basic parser, in order to get my precious conversations into a nicely readable format.

I am sure you will agree with me when I say that clean and tidy data is wasted if you don’t try to extract all the info and stats you can from it, so here I am.

The content of this article is mostly based on my conversation-analyzer python project. There you can find specific info about the implementation, how to set it up, and how to actually run it on your conversations (the code can be used for any kind of conversation once the text content is properly parsed). On the other hand, what I will discuss here is a generic overview of various aspects and methods related to the task of conversation analysis.

In this scope, a conversation is simply a textual interaction between two or more participants (or senders).

Here I will include views and considerations belonging to different areas, from natural language processing and text analysis, to data modeling and visualization, as well as sociological interpretations (that is, personal and debatable speculations) of some analytical results. As the title suggests, it is just an introduction: it contains information that is probably obvious to many readers, but with time I will try to embed or expand it with results from more specific areas, covering more complex techniques.

Overall, I hope to provide some insight or inspiration, and I more than welcome all kinds of comments, critiques, suggestions and — obviously — corrections.

Basic Length Stats

The first basic set of statistics for a conversation consists of length measures, like the total number of messages, the total length of all messages, and the average message length. These measures can refer to the overall conversation or to a specific sender, and can be used as basic building blocks for more complex and interesting stats. For example, grouping by other parameters (like date or time) gives access to additional views, and constitutes the base for sender-activity comparison.
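As a minimal sketch, these basic length stats can be computed directly from a list of parsed messages. The message format here (dicts with "sender" and "text" keys) is an illustrative assumption, not the project's actual schema:

```python
from collections import defaultdict

# Hypothetical message format: the "sender"/"text" field names
# and the sample data are assumptions for illustration only.
messages = [
    {"sender": "alice", "text": "Hey, are you around?"},
    {"sender": "bob", "text": "Yep!"},
    {"sender": "alice", "text": "Lunch at noon?"},
]

def length_stats(msgs):
    """Total number of messages, total text length, average message length."""
    total_msgs = len(msgs)
    total_len = sum(len(m["text"]) for m in msgs)
    return {
        "messages": total_msgs,
        "total_length": total_len,
        "avg_length": total_len / total_msgs if total_msgs else 0.0,
    }

def stats_by_sender(msgs):
    """The same stats, grouped per sender."""
    grouped = defaultdict(list)
    for m in msgs:
        grouped[m["sender"]].append(m)
    return {sender: length_stats(ms) for sender, ms in grouped.items()}

overall = length_stats(messages)
per_sender = stats_by_sender(messages)
```

Grouping by sender first, then reusing the same stats function, is what makes these measures composable building blocks for the comparisons discussed above.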

A heatmap can provide an immediate view of a conversation’s features through time. Here we can see the total length of messages over a 10-month period.
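The data behind such a heatmap is just a 2-D grid of aggregated values. A sketch of how it might be built with the standard library (the message fields and sample data are assumptions; the resulting grid is what a plotting routine such as seaborn's heatmap would then render):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical messages carrying a datetime; field names are assumptions.
messages = [
    {"datetime": datetime(2016, 1, 4, 9, 30), "text": "Morning!"},
    {"datetime": datetime(2016, 1, 5, 22, 10), "text": "Still awake?"},
    {"datetime": datetime(2016, 2, 4, 9, 45), "text": "Coffee?"},
]

def monthly_length_grid(msgs):
    """Total message length bucketed by (month, day-of-month).

    Each cell of this 2-D mapping corresponds to one cell of the heatmap.
    """
    grid = defaultdict(int)
    for m in msgs:
        key = (m["datetime"].month, m["datetime"].day)
        grid[key] += len(m["text"])
    return dict(grid)

grid = monthly_length_grid(messages)
```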

Interval Stats

Intuitively, an important aspect of a conversation is its duration (or interval).

With start date being the datetime of the first message, and end date the datetime of the last one, the overall duration of the conversation can simply be defined as end date - start date. A useful piece of information for this interval is the list of days on which no participant sent any message. The complement of the ratio between the length of this list and the total number of conversation days (that is, the fraction of days with at least one message) constitutes the “density” of the conversation. From this, other minor and more specific stats can be derived, like the maximum number of consecutive days without messages, or the density distribution across different time frames.
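These interval stats reduce to simple set arithmetic on dates. A sketch, assuming the per-message dates have already been extracted (the sample data is hypothetical):

```python
from datetime import date, timedelta

# Hypothetical per-message dates; the dataset is an assumption.
message_dates = [
    date(2016, 3, 1), date(2016, 3, 1), date(2016, 3, 2),
    date(2016, 3, 5), date(2016, 3, 8),
]

start, end = min(message_dates), max(message_dates)
duration_days = (end - start).days + 1  # inclusive of both endpoints

active_days = set(message_dates)
all_days = {start + timedelta(days=i) for i in range(duration_days)}
silent_days = sorted(all_days - active_days)

# Density: fraction of conversation days with at least one message.
density = len(active_days) / duration_days

# Longest silent streak: max number of consecutive days without messages.
longest, streak = 0, 0
for i in range(duration_days):
    if start + timedelta(days=i) in active_days:
        streak = 0
    else:
        streak += 1
        longest = max(longest, streak)
```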

The density of a conversation can provide insight into the relationship between the participants. Higher density should imply a stronger relationship (at least in the analyzed digital context), especially if we consider a one-to-one conversation. That said, the amount of information shared during each day should also be considered before jumping to conclusions.

Aggregation

Given this characterization of a conversation, we are dealing, one way or another, with time series, and for those a really useful and powerful operation is aggregation. By aggregation we refer to the operation of grouping a set of messages by a specific feature, and collapsing the resulting multiple values into a single one by means of a function (e.g. sum, average, count). Multiple values arise whenever messages share the same value for the feature we are grouping by.

Take for example the hour feature. We can aggregate all messages by it, summing together all multiple values (since a given hour will most likely appear again across different days, months, years, etc.). By doing this, we can observe the message-length trend of the conversation and derive each sender’s hourly activity pattern.
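A minimal sketch of this hour aggregation using only the standard library (the message format and sample data are assumptions; a pandas groupby would achieve the same in one line):

```python
from collections import Counter
from datetime import datetime

# Hypothetical messages; the "datetime"/"text" fields are assumptions.
messages = [
    {"datetime": datetime(2016, 5, 1, 23, 15), "text": "good night"},
    {"datetime": datetime(2016, 5, 2, 23, 40), "text": "same time again"},
    {"datetime": datetime(2016, 5, 2, 8, 5), "text": "morning"},
]

# Group by hour of day and collapse with sum/count: two messages sent
# at 23:xx on different days fall into the same bucket.
length_by_hour = Counter()
count_by_hour = Counter()
for m in messages:
    hour = m["datetime"].hour
    length_by_hour[hour] += len(m["text"])
    count_by_hour[hour] += 1

avg_length_by_hour = {h: length_by_hour[h] / count_by_hour[h]
                      for h in length_by_hour}
```

Swapping the grouping key (weekday, month, sender) and the collapsing function (sum, count, average) yields the other views described above.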

You can spot sender-specific routine patterns simply by looking at the resulting diagrams. Apart from the obvious sleeping cycles, you can see when you are more prone to write, when you write more per message, or when you simply write more messages. Moreover, if the graphs are observed over different periods, it is possible to recognize how habits changed over time.

Lexical Stats

Lexical stats lean toward a linguistic point of view of the conversation, considering its words and vocabulary. We consider a word (or token) to be a sequence of characters meaningful to us. The richness (or lexical diversity) of a conversation is then simply defined as the ratio between the distinct word count (the size of the vocabulary) and the total word count.
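A sketch of this definition, with a deliberately simple tokenizer (real tokenization, e.g. via NLTK, would handle punctuation, apostrophes, and non-Latin scripts far better):

```python
import re

def tokens(text):
    """A crude tokenizer: lowercase alphabetic runs only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def lexical_richness(text):
    """Distinct word count over total word count (type-token ratio)."""
    words = tokens(text)
    return len(set(words)) / len(words) if words else 0.0

sample = "the cat saw the other cat"
richness = lexical_richness(sample)  # 4 distinct words out of 6 total
```

Note that this ratio is sensitive to text length (longer texts tend to score lower), so comparisons are most meaningful between samples of similar size.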

It is interesting to observe how the richness of a conversation changes over time. I would say that a decrease is more likely, for various reasons. One is that we might end up narrowing down the discussed topics; another is that we might start adapting to the other participants and rely on an implicitly agreed-upon vocabulary.

Lexical richness variation by year, aggregated by month

Word Frequency

It all starts by simply counting words: how many times each token appears in the conversation. We are not interested in the meaning or context of a word at this point.

Overall stats simply consider the counts for the conversation as a whole, with basic info like the top N words (the N words that occur most often in the conversation). We can then start considering one or a few specific words, grouping by sender, and aggregating word counts by features.
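The counting itself fits naturally on `collections.Counter`. A sketch, reusing the same crude tokenizer idea (the `(sender, text)` message shape and the sample data are assumptions):

```python
import re
from collections import Counter

# Hypothetical parsed messages as (sender, text) pairs.
messages = [
    ("alice", "see you tomorrow"),
    ("bob", "tomorrow works, see you"),
    ("alice", "ok see you"),
]

def word_counts(texts):
    """Token frequencies across an iterable of texts."""
    counts = Counter()
    for t in texts:
        counts.update(re.findall(r"[a-z]+", t.lower()))
    return counts

# Overall stats: counts for the conversation as a whole, plus the top N.
overall = word_counts(t for _, t in messages)
top_n = overall.most_common(2)

# Grouped by sender, for per-participant comparison.
by_sender = {}
for sender, text in messages:
    by_sender.setdefault(sender, Counter()).update(
        re.findall(r"[a-z]+", text.lower()))
```

From here, aggregating a single word's count by hour or month is the same grouping operation described in the Aggregation section, just applied to token counts instead of message lengths.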