I Learned How to Hack Twitter Using Data Science

How to spread a message to millions: an introduction to R

I met him at a hack-and-tell event in Washington, DC. He had five minutes to present an idea. In his excited Russian accent, he presented Hacking Twitter: Spreading a Message.

Slide by slide he showed the small group how to hack Twitter (see below). My jaw dropped a little. How is this possible, I thought. After the event, I asked if he would sit down with me and show me more. He agreed.

Turns out, data scientist Dmitri Adler formerly worked as an analyst for a $10 billion investment fund at global bank Macquarie Group. He used data science to deploy regression models and data forecasting to place bets in the market on all types of securities, commodities, and investment vehicles.

His fascination with data science grew out of his time at the University of Virginia, where he managed the school’s student investment fund and fell in love with math. To him, math was the cleanest explanation of how people, objects, and places relate to one another.

He explained how Google’s search engine used data science to determine the best results.

“It’s how you quantify trust,” he said. “Google’s PageRank algorithm uses raw data to measure importance and influence.”

At first I was skeptical. How do you use cold data to measure something as intangible as human trust?

He showed me how presidential candidates hire data scientists to proliferate their campaign messages at the right time, to the right people, with the right words.

“Are politicians using this technology right now?” I asked.

“I don’t know how they could not,” he said. “President Obama did.”

Data science is used heavily in political campaigning.

To measure human trust, Dmitri used terms like text mining, network analysis, sentiment analysis, and eigenvector centrality. It was all mullukmulluk to me. If it were not for the graphics and visuals, I’d still be lost.

Apparently, he’s not the only one using data science to hack things. According to Predictive Analytics by Eric Siegel, a book Stein Kretsinger dubbed the “‘Freakonomics’ of big data,” large companies deploy data science to save millions of dollars in business each year.

Allstate predicts bodily injury liability based on vehicle type, saving roughly $40 million annually.

Google Flu Trends has predicted outbreaks of influenza 7–10 days earlier than the CDC, allowing hospitals and pharmacies to stock supplies.

Researchers use machine learning to predict which screenplays will be Hollywood blockbusters and which songs will top the charts.

UPS optimizes truck delivery routes, eliminating left turns to decrease travel.

Hewlett Packard predicts outcomes for 92% of sales efforts with 95% accuracy.

Data itself is pointless if it does not result in action. That’s why Dmitri paired his mathematical knowledge of how data works with an extremely powerful computer program called R. He recalled the day it was introduced to him while at Macquarie:

“One day, a Ph.D. friend asked me, ‘Why are you using Excel when you should be using R?’ I told him I had never heard of R before. So I looked into it.”

Dmitri researched R and cried. “I had spent weeks of lost sleep using Excel to run regression analyses,” he said. “With R, I could run the same processes in mere hours.”

What is R?

Just like a carpenter has a woodshop and a chef has a kitchen, think of R as an ideal environment in which a data scientist works.

R is an open-source programming language for statistical computing. It’s like a customizable Excel but with a command prompt. Users can build in functions for machine learning and nonlinear modeling, and can even produce dynamic, publication-ready graphics.

In the early 1990s, statisticians said they needed something simple that could build dynamic algorithms and process multiple series of data clusters. C would work, but it would require exhaustive lines of code. Statisticians wanted something fast that could process data dynamically, which experts now call machine learning.

In 1993, Ross Ihaka and Robert Gentleman, statisticians at the University of Auckland, New Zealand, created R, a robust yet lean programming language with a code interface for manipulating statistics.

At the time of this writing, R has 6,500 libraries, each containing 5–100 functions. Not only can it mine and process data rapidly, it can also render visualizations, formats, and dashboards for easy consumption, using technologies such as JavaScript.

R is widely regarded as the best data mining software. An annual poll conducted by KDnuggets asked, “What analytics/data mining software did you use in the past 12 months for a real project (not just evaluation)?” In 2015, R was the top-ranked data mining solution.

R is the most popular data mining tool in the industry.

Because of its open-source DNA, R is free. But you need to know how to use it.

Dmitri pointed out that the reason why many people have never heard about R is that it’s difficult to find expert teachers. “Deep experts can’t teach and educators don’t have deep knowledge,” he said.

He noticed this when his hedge fund was hiring. He saw students from top schools, even MBAs and computer science grads, graduate with mountains of debt and no readily applicable skills in his trade.

“I used maybe five percent of what I learned in school,” he said. “There’s an enormous disconnect between the high-level jobs we were hiring for and the skills of the candidates graduating from four-year schools.”

Upon seeing what Dmitri could do with data science, I asked him the question that he gets all the time, “Why haven’t you taken this knowledge and skill and applied it to make yourself independently wealthy?”

“That’s not what business is about,” he said. “Business is about helping as many people as possible by providing a solution to a problem. The problem that I see is a lack of education. People do not understand how to use data science to benefit themselves and their society. This is why I have partnered with Merav Yuravlivker, a nationally ranked Kaplan instructor in the United States, to launch Data Society.”

Data Society is an online education platform with the mission to make data science accessible to everyone.

Dmitri, Merav, and their team of data scientists are ramping up fast. They are hiring, looking for investors, and preparing for growth, because they know the university model of education is about to change.

“The student loan bubble is unsustainable; the cost of education will have to come down,” said Dmitri. “Students will be looking for concrete career paths and when they catch wind of the profitable use cases of data science, we want to be prepared. In our era, data is not leaving anytime soon. It is only going to grow.”

Example: How to Hack Twitter

For those of you who got this far, here’s an example of how data science can be used to hack a social network. This is an excerpt from a course offered by Data Society.

In this exercise, the goal is to send a message to millions of people. But not just that. The tweets will be timed, worded, and targeted with optimum effectiveness.

First, plug into the Twitter API using R.

Input a date range to view all tweets within it (note: there is a 7-day limit with the free version of the Twitter API; paid versions let you access tweets further back).

Download tweet data to a csv.

You will be able to view the spreadsheet with information fields visible for each user, such as user ID, number of followers, number of tweets, location, etc.
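To make the spreadsheet step concrete, here is a minimal sketch of what that table might look like in code. The course works in R; Python is used here purely for illustration. The column names, the helper functions `to_csv` and `from_csv`, and the three sample rows are all made up for this example and do not come from the Twitter API or from Data Society's material.

```python
import csv
import io

# Made-up sample rows; real values would come from the downloaded tweet data.
SAMPLE_USERS = [
    {"user_id": "u1", "followers": 1200, "tweets": 340, "location": "Washington, DC"},
    {"user_id": "u2", "followers": 56000, "tweets": 9100, "location": "New York"},
    {"user_id": "u3", "followers": 87, "tweets": 12, "location": ""},
]
# Hypothetical column names mirroring the fields listed in the article.
FIELDS = ["user_id", "followers", "tweets", "location"]


def to_csv(rows):
    """Serialize user rows to CSV text (in memory, so nothing touches disk)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into rows, converting the count columns to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["followers"] = int(row["followers"])
        row["tweets"] = int(row["tweets"])
        rows.append(row)
    return rows


csv_text = to_csv(SAMPLE_USERS)
users = from_csv(csv_text)

# A first, crude look at reach: sort users by follower count.
by_reach = sorted(users, key=lambda u: u["followers"], reverse=True)
for user in by_reach:
    print(user["user_id"], user["followers"], user["location"] or "(no location)")
```

Follower count alone is only a rough proxy for influence, which is why the later steps look at how users are connected rather than just how large their audiences are.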

Use R to graphically map out the directed information flow of the network. This will show you hierarchical clusters of how users are connected.

Target users with the greatest eigenvector centrality (who is most important in the clusters).
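Eigenvector centrality was the term that lost me, so here is a small sketch of the idea: a user is important if important users pay attention to them. The course does this in R; the Python below is only an illustration. The function name `eigenvector_centrality`, the toy edge list, and the convention that an edge `(source, target)` means "source retweets or mentions target" are assumptions for this example. It uses plain power iteration, not any particular library's implementation.

```python
import math


def eigenvector_centrality(edges, iterations=200, tol=1e-10):
    """Power iteration: a node's score is fed by the scores of nodes pointing at it."""
    nodes = sorted({node for edge in edges for node in edge})
    incoming = {node: [] for node in nodes}
    for source, target in edges:
        incoming[target].append(source)

    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Keeping the old score in the sum (iterating with A + I instead of A)
        # leaves the leading eigenvector unchanged but damps oscillation.
        new = {n: score[n] + sum(score[s] for s in incoming[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score


# Toy network (made up): everyone pays attention to "hub".
EDGES = [
    ("a", "hub"), ("b", "hub"), ("c", "hub"),
    ("hub", "a"), ("a", "c"), ("c", "b"),
]

scores = eigenvector_centrality(EDGES)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # most central user first
```

In this toy network "hub" ranks first because three accounts point at it, and "a" ranks second because the hub itself points back at "a". That second effect is what separates this measure from simply counting followers.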

Use text mining and sentiment analysis to quantitatively determine when and what the best message should be to send to these accounts.
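As a rough illustration of this step, the sketch below scores tweets with a tiny word list and finds the hour with the best average engagement. It is written in Python for illustration only (the course uses R). The word lists, the sample tweets, and the function names `sentiment` and `best_hour` are invented for this example; a real analysis would use a proper sentiment lexicon and far more data.

```python
import re
from collections import defaultdict

# Tiny made-up lexicon; real sentiment analysis uses much larger word lists.
POSITIVE = {"great", "love", "win", "good"}
NEGATIVE = {"bad", "hate", "lose", "awful"}


def sentiment(text):
    """Count positive words minus negative words in a tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def best_hour(tweets):
    """Return the posting hour with the highest average retweet count."""
    by_hour = defaultdict(list)
    for tweet in tweets:
        by_hour[tweet["hour"]].append(tweet["retweets"])
    return max(by_hour, key=lambda h: sum(by_hour[h]) / len(by_hour[h]))


# Made-up sample tweets with the hour they were posted and their retweets.
SAMPLE_TWEETS = [
    {"text": "Love this great idea", "hour": 9, "retweets": 40},
    {"text": "This is bad, I hate it", "hour": 9, "retweets": 5},
    {"text": "Good win for the team", "hour": 18, "retweets": 60},
    {"text": "Awful timing", "hour": 23, "retweets": 2},
]

for tweet in SAMPLE_TWEETS:
    print(sentiment(tweet["text"]), tweet["retweets"], tweet["text"])
print("best hour:", best_hour(SAMPLE_TWEETS))
```

Putting the sentiment score next to the retweet count is the simplest way to see which tone travels, and the hourly average gives a first guess at when to post.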

Use R to programmatically generate and send tweets through the API account from the right location, with the right wording and the right delivery (e.g., as a reply tweet), to the right accounts.

Run the network simulation, test your results, and then visualize results.
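The article does not say which simulation the course runs, so the sketch below assumes one common choice, an independent cascade: each user who sees the message gets one chance to pass it to each follower, with a fixed probability. It is in Python for illustration (the course uses R), and the follower map, the sharing probability, and the function names `simulate_cascade` and `average_reach` are all hypothetical.

```python
import random


def simulate_cascade(followers, seeds, share_prob, rng):
    """Run one independent cascade and return the set of users reached."""
    reached = set(seeds)
    frontier = list(seeds)
    while frontier:
        user = frontier.pop()
        for follower in followers.get(user, []):
            # Each newly reached user gets one chance per follower to spread it.
            if follower not in reached and rng.random() < share_prob:
                reached.add(follower)
                frontier.append(follower)
    return reached


def average_reach(followers, seeds, share_prob, trials=1000, seed=0):
    """Average number of users reached over many simulated cascades."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = sum(
        len(simulate_cascade(followers, seeds, share_prob, rng))
        for _ in range(trials)
    )
    return total / trials


# Made-up follower map: user -> accounts that see that user's tweets.
FOLLOWERS = {
    "hub": ["a", "b", "c"],
    "a": ["d"],
    "b": ["d", "e"],
    "c": [],
}

print("seed at hub:", average_reach(FOLLOWERS, ["hub"], share_prob=0.5))
print("seed at c:  ", average_reach(FOLLOWERS, ["c"], share_prob=0.5))
```

Comparing the average reach for different starting accounts is one way to test, before sending anything, whether targeting the most central users actually pays off.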

Execute campaign.

Sit back and watch as your audience interacts with your data-driven social campaign.