

Denver, CO

Post #: 28 user 4173184

Keynotes



Another 8 keynotes, but these were all devoid of information, except for one mildly interesting tidbit: Intel is going to have its own Hadoop distribution. Supposedly it will be optimized for Xeon and have added security, but there were glaringly few details. What would be interesting is if Intel's Hadoop supported their Xeon Phi GPGPU/HPC coprocessor, but there was no mention of that, and I doubt Intel would cannibalize its lucrative supercomputing revenue with free software.



Talks



Real-time Stream Processing and Visualization Using Kafka, Storm, and d3.js

Justin Langseth, Zoomdata



Using d3.js for real-time dashboarding visualizations; the interesting thing is that the visualizations are interactive, supporting drill-down and historical playback. For historical playback, they incorporated MongoDB into their Storm topology to keep a record of the historical data as it comes off the Kafka queue. Their "dashboard" really looks more like Tableau (it seems everyone imitates Tableau in their UI now), where additional fields can be dragged from the toolbox panel on the left onto the main visualization to effect drill-down.
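The playback idea can be sketched in a few lines (a hypothetical illustration of the pattern, not their code: plain Python lists stand in for Kafka, MongoDB, and the live d3.js feed, and all names are mine):

```python
# Each event coming off the queue is both forwarded to the live dashboard
# and appended to a history store, so the UI can later replay a time window.
history = []           # stand-in for the MongoDB collection (playback record)
live = []              # stand-in for the live d3.js feed

def on_event(event):
    """What a Storm bolt would do per tuple off the Kafka queue."""
    history.append(event)     # record for historical playback
    live.append(event)        # forward to the dashboard

for t, value in enumerate([5, 7, 6, 9]):
    on_event({"ts": t, "value": value})

def playback(start, end):
    """Replay the recorded window, like a Mongo-backed historical query."""
    return [e for e in history if start <= e["ts"] <= end]

print(playback(1, 2))   # -> [{'ts': 1, 'value': 7}, {'ts': 2, 'value': 6}]
```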



Yet another presenter indicated performance problems with Storm: he complained that Storm creates 80 threads per topology, which was a problem because they were dynamically creating and destroying Storm topologies as the user drilled down or zoomed out, and each transition was taking 30 seconds.



They have big plans to incorporate all the cool visualization styles featured on http://d3js.org





Four Pillars of Effective Visualization

Noah Iliinsky, IBM



He immediately made a distinction between visualization for the purpose of analysis/exploration and visualization for presentation, and said his talk would be on the latter, so I didn't stay.





Feedback Control for Programmers and Other Strangers

Philipp Janert (Principal Value)



Given a system configuration value, such as cache size, and a desired outcome, such as 80% cache hit rate, he used the PID algorithm from industrial control, a negative feedback algorithm, to dynamically control the cache size. An interesting idea.



(PID is like a thermostat algorithm, except it prevents overshoot by taking into account momentum of the measurements. I've used it in my other life as a scientific/embedded programmer to control burn-in ovens and air pressurization systems.)
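A minimal textbook PID loop driving a cache size toward an 80% hit rate might look like the following (my own sketch; the gains, names, and numbers are made up, not from the talk):

```python
class PID:
    """Classic PID controller: proportional + integral + derivative terms."""
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement          # P: how far off we are
        self.integral += error * dt                  # I: accumulated bias
        derivative = 0.0 if self.prev_error is None \
            else (error - self.prev_error) / dt      # D: damps overshoot
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Drive the cache size toward an 80% hit rate (illustrative gains):
pid = PID(kp=500, ki=50, kd=100, setpoint=0.80)
cache_size = 1000
hit_rate = 0.60                       # measured hit rate this interval
adjustment = pid.update(hit_rate)     # positive => grow the cache
cache_size += int(adjustment)
```

In practice the measured hit rate would be sampled each interval and the loop run repeatedly; the derivative term is what keeps the cache from ballooning past the target.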





High-Volume Data Collection and Real Time Analytics Using Redis

C. Aaron Cois, Carnegie Mellon University



Redis is an in-memory database that is half-database and half-queue. It allows querying for the "most recent n elements" without first having to do something like max(timestamp). Redis also has a built-in publish/subscribe mechanism. For all these reasons Redis is a good component in a typical dashboard/operational monitoring application.
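The "most recent n elements" pattern maps onto Redis's capped-list idiom (LPUSH plus LTRIM, queried with LRANGE). Here's a server-free sketch of the idea using a Python deque as a stand-in, with the equivalent Redis commands noted in comments:

```python
from collections import deque

# Keep only the N most recent readings, newest first.
# In real Redis:  LPUSH readings <value>  followed by  LTRIM readings 0 N-1
N = 100
readings = deque(maxlen=N)   # old entries fall off automatically, like LTRIM

def record(value):
    readings.appendleft(value)    # LPUSH readings <value>

def most_recent(n):
    return list(readings)[:n]     # LRANGE readings 0 n-1

for i in range(250):
    record(i)

print(most_recent(3))   # newest first -> [249, 248, 247]
```

No timestamp scan or max() is needed because the list order *is* the recency order, which is what makes this shape convenient for dashboards.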



In a theme common with Druid, they maintained both a historical database and the real-time database (which in their case was Redis) and created an API layer to make the two seem as one database for ad-hoc queries.





Sociometric Badges: Using Wearable Sensors to Change Management

Ben Waber, Sociometric Solutions



The idea is that employees wear badges to be tracked, for the purpose of optimizing productivity and the number of managers, and for epidemiology (e.g. if someone calls in sick, seeing whom they might have infected the day before). Creepy.





Third Generation Tools for Realizing Machine Learning Algorithms

Dr. Vijay Srinivas Agneeswaran, Impetus Technologies



As a careful reading of the talk title indicates, this talk was about third-generation implementations of machine learning algorithms (not third-generation algorithms). The speaker classified the first generation as desktop (or single-server) tools such as R, the second generation as MapReduce-based tools (e.g. Mahout), and the third generation as post-MapReduce. He actually called Spark a third-generation machine learning tool :-) So, another plug for Spark.



The speaker stated that although second-generation (map/reduce) can handle linear support vector machines, third generation is required for non-linear SVM.
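He didn't elaborate on why, but the usual reasoning is that a non-linear (kernelized) SVM couples every pair of training points through the kernel matrix, which doesn't decompose into independent map tasks the way a linear dot product does. A tiny illustration (my own, not from the talk) of the all-pairs structure:

```python
import math

# The RBF kernel value depends on the distance between TWO points, so
# training needs the full n x n pairwise matrix; a linear SVM only needs
# per-point dot products, which partition cleanly across mappers.
def rbf(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
K = [[rbf(p, q) for q in points] for p in points]  # full pairwise matrix

# Diagonal is always 1 (a point vs itself); off-diagonal decays with distance.
print(K[0][0], round(K[0][1], 4))
```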





Druid: Interactive Queries Meet Real-time Data

Eric Tschetter, Metamarkets and Danny Yuan, Netflix



Druid is another post-map/reduce solution I've been tracking for a couple of months. Some young guy realized no existing solution would do what he wanted, so he ended up writing his own real-time clustered database in the course of his job. It demos well, with sub-5-second response times on queries over large distributed datasets; it achieves this the way most other post-map/reduce solutions do: by caching on each node in the cluster. But I get a little queasy when it comes to Druid:



1. From what I gathered from Eric's October 2012 video, he had to talk his employer into open-sourcing Druid. I wonder how much continued Druid work they will allow (and pay) Eric to do.



2. Druid has a bus factor of 1: It's just Eric.



The big news since I first started following Druid is that Netflix has put Druid into production. They demonstrated querying 150,000 rows in 2 seconds. Someone asked Eric what the relationship was to Apache Drill, and he replied that "a lot of people are working on the same sorts of things." He also pointed out that a) Apache Drill can do JOINs and Druid cannot, and b) Apache Drill will only work on HDFS data, so its real-time results will always lag, whereas Druid employs the pattern (seen also in other combined real-time/archiving systems) of three node types: archiving, real-time, and a façade layer that blends the two to create the appearance of a single database.
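That three-node-type blending pattern can be sketched in a few lines (a hypothetical illustration; the store names and query shape are mine, not Druid's):

```python
# Façade over two stores: immutable batch-loaded history plus a
# still-accumulating real-time store, presented as one database.
historical = {"2013-02-25": 120, "2013-02-26": 135}   # archiving nodes
realtime   = {"2013-02-27": 98}                        # real-time nodes

def query(dates):
    """The façade: answer from history first, fall back to real-time."""
    out = {}
    for d in dates:
        if d in historical:
            out[d] = historical[d]
        elif d in realtime:
            out[d] = realtime[d]
    return out

print(query(["2013-02-26", "2013-02-27"]))
# -> {'2013-02-26': 135, '2013-02-27': 98}
```

The caller never knows (or cares) which tier answered, which is exactly the "appearance of a single database" both Druid and the Redis talk described.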





Vendor Expo



I finally got the "Excel Big Data" demo from Microsoft that some of my colleagues have been raving about. It has point-and-click querying of HDFS that translates into Map/Reduce, so it has that in common with Platfora. But then once the data was populated in the spreadsheet, it looked like you were just left to your own regular Excel visualization devices. One other cool thing it had that I didn't see in any other tool was rapidly JOINing across multiple databases, including web sources such as HTML tables scraped off web sites such as Wikipedia.
