Time as Your Friend





Access to historical information can be vastly informative (and useful) if utilized in the proper way. Up until this point in our discussion, we have described the collection of data as it appears but haven't really mentioned the use of a data store or the collection of historical information.

We talk more about this topic in later chapters, but for now the question to ask is: How could we take advantage of a historical data collection? One answer is that we could look for a baseline so that, as we take measurements in the present, we can better understand whether those measurements have any real meaning.



For example, in the earlier discussion of positive versus negative sentiment, the month shown averaged a ratio of 1:8 negative to positive comments (for every negative comment, we saw eight positive comments).



The question is: Is that good or bad? The answer depends on the specifics of the situation and on the goal of the social analytics project we are working on. Looking back in time, we were able to compute that the ratio had remained relatively constant in previous months, so there was no need for alarm at the negative statements. (This is not to say we should ignore the negative things being said, far from it, but there didn't appear to be any increase in negative press or feelings.)
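The baseline idea above can be sketched as a simple check: compute the negative-to-positive ratio for the current month and compare it against the average of prior months. The month labels and counts below are invented illustration data, not figures from the text, and the 50% drift threshold is an arbitrary choice.

```python
# Hypothetical monthly counts of (negative, positive) comments.
history = {
    "month-1": (110, 880),
    "month-2": (95, 760),
    "month-3": (120, 930),
}
current = (105, 840)  # this month's (negative, positive) counts


def neg_pos_ratio(negative, positive):
    """Return negatives per positive comment, e.g. 0.125 for a 1:8 ratio."""
    return negative / positive


# Historical baseline: average ratio across prior months.
baseline = sum(neg_pos_ratio(n, p) for n, p in history.values()) / len(history)
now = neg_pos_ratio(*current)

# Flag only if the current ratio drifts well above the baseline.
if now > baseline * 1.5:
    print(f"Alert: negative ratio {now:.3f} vs baseline {baseline:.3f}")
else:
    print(f"Within normal range: {now:.3f} vs baseline {baseline:.3f}")
```

With these made-up counts the current month sits right at the historical baseline, so nothing is flagged; the same few lines would raise an alert if the ratio drifted well above its history.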



But there is another interesting use case for data stores: validating our models. In one engagement, we were asked to monitor social media channels in an attempt to identify a hacker trying to break into a specific website. The idea sounds a bit absurd: What person or group would attempt to break into a website and publish their progress on social media?



While we didn't have high hopes for success, we followed our iterative approach and built an initial model that would describe how people might “talk” when (or if) they were attempting a server breach. It was a fairly complex model and required a security specialist to spend time looking over whatever information we were able to find. Over the months of running our model, watching as many real-time social media venues as we could, we ended up returning a large number of false positives.



This inevitably leads to the question: Is the model working, or is it just that nobody is trying to break in? It's a valid question. How do we validate that we can detect a specific event if we don't know whether the event has occurred?



Well, we took the data model and changed it to reflect IBM. We wanted to run the model against a site that had been hacked in the past, and a widely reported incident around IBM's DeveloperWorks site defacement was a perfect candidate.

One of the more popular descriptive metrics that people like to use is sentiment. Sentiment analysis is usually associated with the emotion (positive or negative) that an individual feels about the topic being discussed. In this section, which is focused on time, we discuss sentiment as it changes over time.

Sentiment analytics involves analyzing the comments or words made by individuals to quantify the thoughts or feelings those words are intended to convey. Basically, it's an attempt to understand the positive or negative feelings individuals have toward a brand, company, individual, or any other entity. In our experience, most of the sentiment collected around a topic tends to be “neutral” (that is, it conveys no positive or negative feeling).

It's easiest to think about sentiment analytics when we look at Twitter data (or any other social site where people express a single thought or make a single statement). We can compute the sentiment of a longer document (such as a wiki post or blog entry) by looking at the overall scoring of the sentiment words it contains. For example, if a document contains 2,000 words that are considered negative versus 300 words that are considered positive, we may choose to classify that document as negative overall. If the numbers are closer together (say, 3,000 negative words versus 2,700 positive words, an almost equal distribution), we may choose to say the document is neutral in sentiment.

Consider this simple message from LinkedIn:

Hot off the press! Check out this week's enlightening edition of the #companyname Newsletter http://bit.ly/xxxx

A sentiment analysis of this message would indicate that it's positive in tone. Sentiment analysis software is usually based on a sentiment dictionary for the language in question. The basic package comes with a predefined list of words considered positive and a similarly long list of words considered negative.
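A minimal sketch of the dictionary approach just described: count how many of a document's words appear in positive and negative word lists and classify by the margin between the counts. The tiny word lists and the neutral margin below are illustrative assumptions, not a real sentiment lexicon.

```python
# Toy sentiment dictionaries; real packages ship far larger lists.
POSITIVE = {"enlightening", "great", "love", "positive", "win"}
NEGATIVE = {"disaster", "hate", "breach", "fail", "angry"}


def score_document(text, neutral_margin=0.1):
    """Classify text as 'positive', 'negative', or 'neutral' by word counts."""
    words = [w.strip(".,!?#").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    if total == 0:
        return "neutral"
    # Near-equal counts (e.g. 3,000 vs 2,700) fall inside the neutral margin.
    if abs(pos - neg) / total <= neutral_margin:
        return "neutral"
    return "positive" if pos > neg else "negative"


print(score_document("Check out this week's enlightening edition"))  # positive
```

The neutral margin mirrors the 3,000-versus-2,700 example above: when positive and negative counts nearly cancel, the document is treated as neutral rather than forced into one camp.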
For many projects, the standard dictionary can be used as-is to determine sentiment. In some special cases, you may have to modify the dictionary to include domain-specific positive and negative words. For example, the word disaster carries negative sentiment in most contexts, except when it refers to a category of system such as “disaster recovery systems.”

Understanding the general tone of a dataset can be an interesting metric if there is an overwhelming skew toward a particular tone in the messages. Consider the descriptive sentiment metrics shown in Figure 4.6, taken from an analysis we did for a customer in the financial industry over a one-month period. The figure represents the tone of the messages posted in social media comments about this particular company.

On the surface, this looks like a good picture. The amount of positive conversation is clearly greater than the amount of negative, and the neutral sentiment (which is neither bad nor good) overwhelms both. In summary, this appears quite acceptable. However, if we take the negative sentiment and look at it over time, a different picture emerges, as illustrated in Figure 4.7. While cumulatively the negative sentiment was much smaller than the positive, there was one particular date range (from approximately the 16th to the 18th of the month) when there was a large spike in negative messaging centered on our client. While just an isolated spike in traffic, the event could have lingering effects if not addressed.
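A spike like the one in Figure 4.7 is the kind of event a simple threshold rule can surface automatically: flag any day whose negative-message count sits far above the month's typical level. The daily counts below are invented for illustration, and the two-standard-deviation threshold is an arbitrary but common choice.

```python
import statistics

# Hypothetical daily counts of negative messages over one month.
daily_negatives = [12, 9, 11, 10, 13, 8, 12, 10, 11, 9,
                   12, 10, 11, 13, 9, 55, 61, 48, 12, 10,
                   11, 9, 10, 12, 11, 10, 9, 13, 12, 11]

mean = statistics.mean(daily_negatives)
stdev = statistics.stdev(daily_negatives)

# Flag any day more than two standard deviations above the monthly mean.
spikes = [day + 1 for day, count in enumerate(daily_negatives)
          if count > mean + 2 * stdev]

print("Spike days:", spikes)  # [16, 17, 18] for this made-up series
```

Because the cumulative totals in this series would still look healthy, only the day-by-day view exposes the three-day spike, which is exactly the point the figure makes.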