Andrew Brook is the CTO of Selerity, a provider of real-time news, data, and content analytics. His expertise lies in applying distributed, real-time systems technology and data science to real-world business problems.

Selerity reported Twitter's Q1 2015 quarterly earnings results on April 28.

Besides the news itself (Twitter’s revenues were disappointing to some investors), the event was noteworthy because it occurred at 3:07pm New York time—almost an hour before the close of trading. While it’s rare for companies to release results during market hours, there is no official policy prohibiting it, and early announcements do happen occasionally. In the case of Twitter’s earnings, the early release was apparently the result of an accident by NASDAQ’s investor relations subsidiary, Shareholder.com.

Most of the media coverage to date has focused on the process by which Selerity obtained the earnings press release so quickly. Some of that coverage has been speculative or inaccurate. In particular, it’s important to understand that this was not a “hack.” That term implies circumventing laws or privacy protections, something Selerity would never do. Nor was it a “leak”: the press release had already been published in the expected manner and in the expected location. It was just early. We did not “guess” the URL that contained Twitter’s quarterly earnings results. Anyone with a Web browser and an Internet connection could have followed the links from the main investor relations page to the same PDF file that Selerity found.

Today’s $TWTR earnings release was sourced from Twitter’s Investor Relations website https://t.co/QD6138euja. No leak. No hack. — Selerity (@Selerity) April 28, 2015

The focus on Web scraping also misses the more interesting half of the story: what we do with the press release once we have it. Below I’ll explain exactly how we obtained the press release, what we did with it, and finally why that matters more broadly than just Tuesday’s event.

Domain-specific setup

Ahead of the event (way back in 2014), Selerity analysts reviewed the Twitter investor relations site [https://investor.twitterinc.com/] for the locations most likely to be used for publication of the release, based on where prior releases were published. The location where Tuesday’s release was first published was found this way:

1. Start from Twitter’s investor relations website: [https://investor.twitterinc.com/]
2. Follow the link for “Quarterly results” under “Financial information” in the left sidebar. That takes you here: [https://investor.twitterinc.com/results.cfm/]
3. Select the current quarter and year from the drop-down filter at the top of the page. For example, selecting “First quarter” and “2015” takes you here: [https://investor.twitterinc.com/results.cfm?Quarter=1&Year=2015]
4. Ahead of the release, that page always contains a short message saying the results aren’t available yet; at the time of publication, an “Earnings Press Release” link appears, pointing to a PDF document.

That URL [https://investor.twitterinc.com/results.cfm], along with instructions for setting the quarter and year, was then handed off to the dev team, which configured our Web scraper to poll it ahead of each scheduled release and look for the “Earnings Press Release” link. It has worked reliably for several quarters now.
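For readers curious what such a poller looks like, here is a minimal sketch using the requests and BeautifulSoup libraries. The URL, parameters, and link text come from the steps above; the polling interval and the choice of libraries are illustrative assumptions, not a description of Selerity’s production scraper.

```python
import time
import requests
from bs4 import BeautifulSoup

RESULTS_URL = "https://investor.twitterinc.com/results.cfm"
PARAMS = {"Quarter": 1, "Year": 2015}    # set ahead of each scheduled release
LINK_TEXT = "Earnings Press Release"     # the link the scraper watches for

def poll_for_release(interval_seconds=5):
    """Poll the quarterly-results page until the press-release link appears."""
    while True:
        page = requests.get(RESULTS_URL, params=PARAMS, timeout=10)
        soup = BeautifulSoup(page.text, "html.parser")
        link = soup.find("a", string=LINK_TEXT)
        if link is not None:
            return link["href"]          # URL of the PDF press release
        time.sleep(interval_seconds)     # no change yet; try again shortly

if __name__ == "__main__":
    print("Press release published at:", poll_for_release())
```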

Real-time processing

Tuesday, April 28, 2015 (US Eastern Daylight Time)

14:00:00 - Web scraper starts polling the ‘Quarterly results’ page on investor.twitterinc.com.

… (Time passes, no results available yet) …

15:07:48 - Web scraper detects no change.

15:07:54 - Web scraper detects no change.

15:07:56 - Web scraper’s query to https://investor.twitterinc.com/results.cfm?Quarter=1&Year=2015 returns a page with a link called “Earnings Press Release,” which points to https://investor.twitterinc.com/common/download/download.cfm?companyid=AMDA-2F526X&fileid=824316&filekey=887C2A59-9344-4C95-8AA6-59FD63944321&filename=2015%20Q1%20Earnings%20Release%20FINAL%20-%20WOTB.pdf

15:07:57 - The Web scraper downloads the linked PDF (titled “2015 Q1 Earnings Release FINAL - WOTB.pdf”) and extracts the text from it.

15:07:57 - Several different NLP algorithms work in parallel to parse the text, determine that it is an earnings release, and extract key factual data. The algorithms work independently, targeting different parts of the document and using different techniques.

15:07:59 - A proprietary headline-generating engine consumes the raw output of the different algorithms. If and when the algorithms reach a consensus, headlines are automatically generated and published to Selerity’s Twitter feed (@Selerity) and Selerity’s Notifications API. (A simplified sketch of this pipeline follows the timeline.)

15:08:03 - Four seconds later, according to data from Nasdaq (see chart below), trading activity increases sharply, presumably in reaction to the news. The timing of the trades indicates that most, if not all, of the trading was conducted by humans rather than machines.
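The timeline boils down to: detect the link, download the PDF, extract its text, and publish only when independent extractors agree. Below is a heavily simplified sketch of that flow; the pdfminer.six library and the voting rule are illustrative choices, and Selerity’s production NLP and headline-generation engines are far more involved.

```python
from io import BytesIO
from collections import Counter

import requests
from pdfminer.high_level import extract_text  # one common PDF-to-text library

def fetch_release_text(pdf_url):
    """Download the linked PDF and pull out its plain text."""
    pdf_bytes = requests.get(pdf_url, timeout=10).content
    return extract_text(BytesIO(pdf_bytes))

def extract_with_consensus(text, extractors, min_agreement=2):
    """Run several independent extractors and accept a value only on consensus.

    Each extractor is a function text -> value (or None if it finds nothing).
    A value is published only if at least `min_agreement` extractors agree.
    """
    votes = Counter(v for v in (fn(text) for fn in extractors) if v is not None)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value if count >= min_agreement else None
```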

Domain context and relevance

The process of obtaining earnings press releases immediately after publication isn’t very hard—which is a good thing, since these are intended to be easily accessible to everyone. Good open-source Web scraping libraries and third-party Web scraping services abound, and they don’t require especially powerful computers to run on. Selerity’s scraper that downloaded the Twitter earnings report runs on a two-year-old commodity Dell server. We’ve routinely run the same scraper on Amazon EC2 instances that anyone can rent for around $100/mo, so it’s clearly not a big barrier to entry. And of course, if you’re willing to spend the time, you can do it all by hand with a Web browser—just keep clicking “refresh.”

If Web scraping is so easy, why was Selerity the first to report Twitter's earnings? The reason is that reliably extracting financial data from text is easy for humans—but hard for machines. And that has some interesting wider implications.

Let's look at a couple of examples from the press release (PDF) to understand why this seemingly easy task is complicated.

Example 1

Early in the document we have the following sentence:

Q1 revenue of $436 million, up 74% year-over-year, slightly below the previously forecast range of $440 million to $450 million.

In order to determine that $436 million was the actual revenue while the $440 million and $450 million figures are previous forecasts, a computer needs to understand general English grammar as well as a large set of specific phrases that are commonly used in financial statements like “[period] revenue of [currency amount]”, “[up/down] [percentage] year-over-year” and “range of [low currency amount] to [high currency amount]”. A general purpose English language parser or part-of-speech tagger is insufficient.
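To make that concrete, here is a toy illustration of the kind of domain-specific phrase pattern involved. The regular expressions below handle only this one sentence shape; real extraction grammars are far broader and more robust, and these patterns are illustrative rather than Selerity’s actual rules.

```python
import re

SENTENCE = ("Q1 revenue of $436 million, up 74% year-over-year, slightly below "
            "the previously forecast range of $440 million to $450 million.")

# "[period] revenue of [currency amount]" -- this is the actual reported figure.
ACTUAL = re.compile(r"Q[1-4] revenue of \$(?P<actual>[\d.]+) (million|billion)")

# "range of [low] to [high]" -- prior guidance, which must not be reported as actual revenue.
FORECAST = re.compile(r"range of \$(?P<low>[\d.]+) million to \$(?P<high>[\d.]+) million")

print(ACTUAL.search(SENTENCE).group("actual"))         # 436
print(FORECAST.search(SENTENCE).group("low", "high"))  # ('440', '450')
```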

Example 2

It’s a bit trickier to look at this table and understand that, since Twitter’s fiscal calendar ends on Dec. 31, “Three Months Ended March 31” implies the first quarter; that the phrase “in thousands, except per share data” means that revenue of 435,939 is really 435,939,000; but that a “Diluted net loss per share” of (0.25) means -0.25, not -250. This requires prior knowledge of how fiscal calendars work, the common convention of reporting certain financial metrics in thousands (or sometimes millions), the conventions of tables (looking to row and column headers to understand the meaning of each cell), and so on. Even humans, if they aren’t familiar with the conventions of financial tables, might not be able to interpret the information correctly.
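The conventions described above can be encoded as small normalization rules. Here is a sketch, assuming the cell text and the appropriate scale (from the “in thousands, except per share data” header) have already been identified; the function is illustrative, not Selerity’s actual table parser.

```python
def parse_table_cell(cell_text, scale=1_000):
    """Normalize a financial-table cell: strip thousands separators, apply the
    'in thousands' scale, and treat parentheses as a negative sign."""
    text = cell_text.strip().replace(",", "")
    negative = text.startswith("(") and text.endswith(")")
    value = float(text.strip("()")) * scale
    return -value if negative else value

print(parse_table_cell("435,939"))          # 435939000.0 (revenue, reported in thousands)
print(parse_table_cell("(0.25)", scale=1))  # -0.25       (per-share data is not scaled)
```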

Example 3

An important input into any fundamental analysis of a publicly traded company’s stock is the computation of “earnings per share.” However, there are actually several different variations on the definition of earnings per share, which might be called “GAAP,” “Non-GAAP,” “Adjusted,” etc. Understanding which variant of earnings per share is most indicative of a firm’s underlying performance requires a more sophisticated understanding of equity markets and specific companies. In the case of Twitter, investors will care most about the row labeled “Non-GAAP diluted net income per share” and will want to compare it with estimates for “earnings per share” to decide whether the company is more or less profitable than expected. This information must be explicitly supplied by human domain experts (e.g., Selerity analysts).
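In practice, this kind of expert knowledge often ends up as per-company configuration that tells the extraction engine which reported line item to compare against consensus estimates. The mapping below is a hypothetical sketch, not Selerity’s actual configuration.

```python
# Hypothetical analyst-supplied mapping: which reported line item should be
# compared against the consensus "earnings per share" estimate for each company.
PREFERRED_EPS_LINE = {
    "TWTR": "Non-GAAP diluted net income per share",
    # Other tickers might map to "Diluted EPS (GAAP)", "Adjusted EPS", etc.
}

def eps_for_comparison(ticker, extracted_line_items):
    """Pick the EPS variant that investors actually compare against estimates."""
    return extracted_line_items[PREFERRED_EPS_LINE[ticker]]
```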

Example 4

Finally, on page two we find this passage: "Average Monthly Active Users (MAUs) were 302 million for the first quarter, up 18% year-over-year and compared to 288 million in the previous quarter. Average Mobile MAUs represented approximately 80% of total MAUs."

Investors who pay close attention to Twitter will know that mobile users are an important subset of users of the Twitter platform, but the number of monthly active mobile users isn’t explicitly stated in the press release. A person can compute that mobile monthly active users are about 302 million x 80% = 241.6 million, but doing so automatically requires a parser that can reliably determine that total MAUs were 302 million (not 288 million) and that mobile MAUs were 80% of the total (not 18%).
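Once the parser has attached each number to the right concept, the derived figure is simple arithmetic. A minimal sketch, assuming the extraction step has already produced the two fields below:

```python
# Values the parser must attach correctly: total MAUs (302 million, not the prior
# quarter's 288 million) and the mobile share (80%, not the 18% growth rate).
total_maus = 302_000_000
mobile_share = 0.80

mobile_maus = total_maus * mobile_share
print(f"{mobile_maus / 1e6:.1f} million mobile MAUs")  # 241.6 million mobile MAUs
```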

From context to relevance

Selerity’s technology platform enables us to develop algorithms that integrate heuristics supplied by human experts, contextually aware natural language processing, and statistical machine learning techniques (from workhorse support vector machines to deep learning networks). This hybrid approach lets us optimize for a problem domain (and achieve high performance) while benefiting from the efficiencies of fully automated systems.
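As a rough illustration of what “hybrid” can mean in practice, the sketch below feeds hand-written heuristic cues into a scikit-learn support vector machine that decides whether a sentence states an actual revenue figure. The features, labels, and training sentences are invented for illustration and are not Selerity’s models or data.

```python
import re
from sklearn.svm import SVC

def heuristic_features(sentence):
    """Expert-supplied cues, expressed as simple binary features."""
    s = sentence.lower()
    return [
        int(bool(re.search(r"\$[\d.]+ (million|billion)", s))),  # currency amount present
        int("revenue" in s),                                      # revenue mentioned
        int("forecast" in s or "guidance" in s),                  # forward-looking cue
        int("year-over-year" in s),                               # comparison phrasing
    ]

# Tiny illustrative training set: 1 = sentence reports actual revenue, 0 = it does not.
sentences = [
    "Q1 revenue of $436 million, up 74% year-over-year.",
    "We forecast revenue in the range of $440 million to $450 million.",
    "Average Monthly Active Users (MAUs) were 302 million for the first quarter.",
    "Full-year revenue was $1.4 billion, up 111% year-over-year.",
]
labels = [1, 0, 0, 1]

model = SVC(kernel="linear").fit([heuristic_features(s) for s in sentences], labels)
print(model.predict([heuristic_features("Q2 revenue of $500 million, up 60% year-over-year.")]))
```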

When we launched the first generation of our platform in 2009, we offered the financial market’s first real-time, machine-readable data feed of information such as quarterly earnings extracted from the original, human-readable text, and we continue to be a leader in that market. Investors rely on our data to make real-time investment and risk-management decisions, so the edge in accuracy that we gained by incorporating deep context allowed us to compete with much larger, more established news and data firms.

The same platform, several generations later, is being applied to the much larger problem of information relevance and overload. Today, the Internet provides more news, research, social media and other information than any person can possibly consume. Finding the information that is needed—when it’s needed—is increasingly difficult. Understanding the context in which a user is operating, bringing them the most relevant information and presenting it to them in an engaging manner at exactly the moment they need it is what Selerity does at scale today.

Being the first to bring news of Twitter’s unexpectedly early financial results to the Twitter community is just a taste of what’s to come. Stay tuned.