December 31, 2014

Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.

1. There are many kinds of machine-generated data. Important categories include:

Web, network and other IT logs.

Game and mobile app event data.

CDRs (telecom Call Detail Records).

“Phone-home” data from large numbers of identical electronic products (for example set-top boxes).

Sensor network output (for example from a pipeline or other utility network).

Vehicle telemetry.

Health care data, in hospitals.

Digital health data from consumer devices.

Images from public-safety camera networks.

Stock tickers (if you regard them as being machine-generated, which I do).

That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.

2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly:

Government snooping on the contents of communications.

Communication traffic analysis.

Photos and videos (airport scanners, public cameras, etc.).

Commercial ad targeting.

Traditional medical records.

Other areas, however, continue to be overlooked, with the two biggies in my opinion being:

The potential to apply marketing-like psychographic analysis in other areas, such as hiring decisions or criminal justice.

The ability to track people’s movements in great detail, which will increase greatly yet again as the market for consumer digital health matures (and some think that will happen soon).

My core arguments about privacy and surveillance seem as valid as ever.

3. The natural database structures for machine-generated data vary wildly. Weblog data structure is often remarkably complex. Log data from complex organizations (e.g. IT shops or hospitals) might comprise many streams, each with a different (even if individually simple) structure. But in the majority of my example categories, record structure is very simple and repeatable. Thus, there are many kinds of machine-generated data that can, at least in principle, be handled well by a relational DBMS …
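To make the contrast concrete, here’s a small Python sketch. All the field names and values are invented for illustration; the point is only that a sensor reading flattens naturally into a fixed relational row, while a single weblog event carries nested, optional, variable-length structure.

```python
from dataclasses import dataclass

# A sensor reading maps naturally onto a flat, fixed relational row.
@dataclass
class SensorReading:
    sensor_id: str
    timestamp: float      # epoch seconds
    pressure_psi: float
    flow_rate: float

reading = SensorReading("sensor-42", 1419984000.0, 101.3, 7.5)

# A weblog event, by contrast, is semi-structured: nested fields,
# optional attributes, and variable-length collections.
weblog_event = {
    "timestamp": 1419984000.0,
    "client_ip": "203.0.113.7",
    "request": {"method": "GET", "path": "/products", "params": {"q": "widgets"}},
    "headers": {"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/"},
    "cookies": ["session=abc123", "ab_test=variant_b"],
}
```

Normalizing the second kind of record into relational tables is possible, but it takes real schema-design work; the first kind needs essentially none.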

4. … at least to some extent. In a further complication, much machine-generated data arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and others have tried to step up with various kinds of temporal predicates or datatypes. Event series are also a challenge for business intelligence vendors, and a potentially significant driver for competitive rebalancing in the BI market.
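To illustrate why event-series work strains purely relational thinking, here’s a minimal Python sketch of sessionization — grouping a user’s time-ordered events into sessions split by idle gaps. The sequential, order-dependent logic below is exactly what temporal predicates and event-series SQL extensions try to express; the 30-minute gap threshold and the event shapes are assumptions for illustration only.

```python
def sessionize(events, gap=1800):
    """Group time-ordered (timestamp, action) events into sessions,
    splitting whenever more than `gap` seconds elapse between
    consecutive events."""
    sessions, current = [], []
    last_ts = None
    for ts, action in sorted(events):
        if last_ts is not None and ts - last_ts > gap:
            sessions.append(current)
            current = []
        current.append((ts, action))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

clicks = [(0, "home"), (120, "search"), (300, "product"),
          (7200, "home"), (7260, "checkout")]
sessions = sessionize(clicks)  # two sessions, split at the 6,900-second gap
```

Expressing “compare each row to the one before it” is one line in a sequential scan, but awkward with plain self-joins — hence the vendor interest in dedicated temporal features.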

5. Event series even aside, I wish I understood more about business intelligence for non-tabular data. I plan to fix that.

6. Streaming and memory-centric processing are closely related subjects. What I wrote recently about them for Hadoop still applies: Spark, Kafka, etc. are still the base streaming case going forward; Storm is still around as an alternative; Tachyon or something like it will change the game somewhat. But not all streaming machine-generated data needs to land in Hadoop at all. As noted above, relational data stores (especially memory-centric ones) can suffice. So can NoSQL. So can Splunk.
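For concreteness, the core operation most streaming engines perform continuously — whether Spark, Storm, or something else — can be sketched in a few lines of plain Python as a tumbling-window aggregation. This is an illustrative toy, not any particular engine’s API, and the event stream is invented:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs=60):
    """Aggregate a stream of (timestamp, key) events into counts per
    non-overlapping (tumbling) time window."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = int(ts // window_secs) * window_secs
        counts[(window_start, key)] += 1
    return dict(counts)

stream = [(5, "error"), (42, "error"), (61, "ok"), (65, "error")]
print(tumbling_window_counts(stream))
# {(0, 'error'): 2, (60, 'ok'): 1, (60, 'error'): 1}
```

A real engine adds the hard parts — distribution, fault tolerance, late-arriving data — but the aggregation itself is this simple, which is why it can run in Hadoop, in a memory-centric relational store, in NoSQL, or in Splunk.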

Not all these considerations are important in all use cases. For one thing, latency requirements vary greatly. For example:

High-frequency trading is an extreme race; microseconds matter.

Internet interaction applications increasingly require data freshness to the last click or other user action. Computational latency requirements can go down to single-digit milliseconds. Real-time ad auctions have a race aspect that may drive latency lower yet.

Minute-plus response can be fine for individual remote systems. Sometimes they ping home even less often than that.

There’s also still plenty of true batch mode, but — and I say this as part of a conversation that’s been underway for over 40 years — interactive computing is preferable whenever feasible.

7. My views about predictive analytics are still somewhat confused. For starters:

The math and technology of predictive modeling both still seem pretty simple …

… but sometimes achieve mind-blowing results even so.

There’s a lot of recent innovation in predictive modeling, but adoption of the innovative stuff is still fairly tepid.

Adoption of the simple stuff is strong in certain market sectors, especially ones connected to customer understanding, such as marketing or anti-fraud.

So I’ll mainly just link to some of my past posts on the subject, and otherwise leave discussion of predictive analytics to another day.

WibiData has some innovative ideas in predictive experimentation.

Nutonian has some innovative ideas in non-linear modeling for pattern detection/root-cause analysis.

It’s still at the anecdotal level, but there have been interesting ideas in the rapid retraining of models.

Ayasdi reminded us that there’s room for innovation in clustering.

My Thanksgiving round-up post points to a lot of my prior comments on predictive modeling.

Finally, back in 2011 I tried to broadly categorize analytics use cases. Based on that and also on some points I just raised above, I’d say that a ripe area for breakthroughs is problem and anomaly detection and diagnosis, specifically for machines and physical installations, rather than in the marketing/fraud/credit score areas that are already going strong. That’s an old discipline; the concept of statistical process control dates back to before World War II. Perhaps such breakthroughs are already underway; the Conviva retraining example linked above is certainly imaginative. But I’d like to see a lot more in the area.
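Since statistical process control came up, here’s a minimal sketch of its classic building block: a Shewhart-style control-chart rule that flags readings more than three standard deviations from a baseline mean. The readings and threshold are invented for illustration:

```python
import statistics

def out_of_control(readings, baseline, k=3.0):
    """Flag readings more than k standard deviations from the baseline
    mean -- the classic Shewhart control-chart rule from statistical
    process control."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return [r for r in readings if abs(r - mu) > k * sigma]

baseline = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 10.0]
print(out_of_control([10.1, 12.5, 9.9], baseline))
# [12.5]
```

The modern opportunity is in what comes after the flag: diagnosis, root-cause analysis, and models that retrain as the machinery itself changes.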

Even more important, of course, could be some kind of revolution in predictive modeling for medicine.
