Amazon Kinesis: Fast Analytics On Streaming Data

The AWS Kinesis service takes in thousands of data streams, processes them on an Amazon cluster, and offers results in near real time.




Kinesis, Amazon Web Services' new service for processing a high volume of real-time data, such as that pouring off a stock ticker, is open for business. The system was announced, but not made generally available, Nov. 14 during AWS's Re:Invent event in Las Vegas.

A customer can start out feeding kilobytes of data into Kinesis and move up to terabytes over the course of an hour, depending on the demands of the real-time data stream. Streams from hundreds or thousands of sources, such as social media, investment research services, or news services, can be added to an original stream, allowing Kinesis to show correlations between real-time events.

Breaking news items, such as a report that the anchovy harvest has failed off the coast of Chile, can have a big impact on trading at an exchange like the Chicago Board of Trade. Likewise, companies could track Twitter, Facebook, and Google+ traffic following business announcements, such as the close of a favorable quarter or a product line addition.

Kinesis is available through AWS's US East-1 complex in Ashburn, Va., but will be rolled out to other Amazon regional data centers in 2014.

Applications built to use Kinesis can produce near real-time dashboards, alerts, and reports that can drive real-time business decision making, such as whether to change pricing on a hot-selling product or whether to adjust an advertising strategy, according to Terry Hanold, VP, AWS cloud commerce.

Kinesis applications could collect data from server logs in real time and analyze what's happening on a website during a busy holiday shopping period, or collect data on dozens or hundreds of devices on the factory floor to spot where the next delay might occur.

One reason to do data-stream analysis in the cloud is that such a service can elastically expand to meet the data streams' demands. Hanold said in the announcement that customers can capture data streams with a few clicks on the Amazon management console or by programming an application with a simple API call.
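As a rough illustration of what that "simple API call" might look like, here is a minimal Python sketch. It assumes the Kinesis PutRecord operation as exposed by the later boto3 SDK (`put_record` with `StreamName`, `Data`, and `PartitionKey`); the stream name, event fields, and helper names are hypothetical and do not come from Amazon's announcement. A stub client stands in for the real service so the sketch runs without an AWS account.

```python
import json
from typing import Any, Dict, List


def build_put_record_request(stream_name: str, event: Dict[str, Any],
                             partition_key: str) -> Dict[str, Any]:
    """Build the keyword arguments for a Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        # Kinesis treats the payload as opaque bytes; JSON is our own choice.
        "Data": json.dumps(event).encode("utf-8"),
        # Records with the same partition key land on the same shard.
        "PartitionKey": partition_key,
    }


def send_event(client: Any, stream_name: str, event: Dict[str, Any],
               partition_key: str) -> Any:
    """Send one event through any client object that offers put_record()."""
    request = build_put_record_request(stream_name, event, partition_key)
    return client.put_record(**request)


class RecordingClient:
    """Hypothetical stand-in for a real client; it only records the calls."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def put_record(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {"accepted": True}


if __name__ == "__main__":
    stub = RecordingClient()
    # "ticker-stream" and the event fields are invented for illustration.
    send_event(stub, "ticker-stream", {"symbol": "AMZN", "price": 387.6}, "AMZN")
    print(stub.calls[0]["StreamName"], stub.calls[0]["PartitionKey"])
```

Against the real service, the stub would be swapped for something like `boto3.client("kinesis")`, which needs AWS credentials and an existing stream.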

Enterprise developers often develop such systems themselves, using open source Hadoop or other resources. But Hadoop 1.0 and data warehouses tend to need time to upload data, analyze it in batch mode, and report on the results. Real-time data feeds have not been a fit, although Hadoop 2.0 may change that.


Kinesis can absorb data feeds, perform analysis on them, and then route them to Amazon's Redshift data warehouse service, DynamoDB database system, or S3 object storage. It can use load balancing and elastic scaling to create clusters to host the data streams fed into it. It can also work with Amazon CloudWatch to supply throughput, latency, and utilization statistics back to the management console.

Khawaja Shams, a scientist at the NASA Jet Propulsion Laboratory, took the stage at Re:Invent Nov. 14 to say he had tested Kinesis by plugging in a Twitter stream of data and asking Kinesis to determine how often the word "Mars" was used. Shams hoped to measure the popularity of space exploration after India launched a mission to Mars. But he learned that the "Mars" that appeared most frequently in tweets was Bruno Mars, the singer, not the planet. Following up, he was able to learn that the largest concentration of the singer's fans is on the West Coast. It wasn't the information he originally sought, but he had discovered a power of Kinesis and found ways to query it.
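The kind of keyword tally Shams describes can be sketched in a few lines of Python. This is not his code or anything Kinesis provides; it is a hypothetical consumer-side function, with invented sample tweets, that separates "Bruno Mars" mentions from every other use of the word "Mars."

```python
import re
from collections import Counter
from typing import Iterable

# Whole-word matches only, so "Marshmallows" is not counted.
MARS = re.compile(r"\bmars\b", re.IGNORECASE)
BRUNO_MARS = re.compile(r"\bbruno\s+mars\b", re.IGNORECASE)


def count_mars_mentions(tweets: Iterable[str]) -> Counter:
    """Tally 'Mars' mentions, split into the singer and everything else."""
    counts: Counter = Counter()
    for text in tweets:
        total = len(MARS.findall(text))
        singer = len(BRUNO_MARS.findall(text))
        counts["singer"] += singer
        counts["other"] += total - singer
    return counts


# Invented sample data; a real consumer would read records from the stream.
SAMPLE_TWEETS = [
    "India just launched a mission to Mars!",
    "Bruno Mars tickets on sale today",
    "Can't stop listening to bruno mars",
    "Marshmallows are not a planet",
]

if __name__ == "__main__":
    result = count_mars_mentions(SAMPLE_TWEETS)
    print(result["singer"], result["other"])  # prints: 2 1
```

Even this toy version shows why the raw word count misled him: without the second pattern, all three matching tweets would be credited to the planet.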

Amazon hopes Kinesis will become a way for developers to add real-time analytics to their applications, letting Kinesis and EC2 scale the system as needed. With such a service, a developer could collect and analyze very large amounts of data without needing to know much more than an API call. "It does the heavy lifting so you don't have to," said Shams.

Charles Babcock is an editor-at-large for InformationWeek, having joined the publication in 2003. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week.

