

Denver, CO






Day 0



The "Ignite" event featured lightning talks from the local users group. Two that stood out:



1. A data-science look at "where fire trucks live," done for the speaker's two-year-old daughter. Using open data, he plotted the location and type of each emergency vehicle and incident. Standard data analysis surfaced interesting trends, such as peak activity at night and a mid-day lull. He then applied social-network-style graph analysis to measure the "social connectedness" of fire chief SUVs, in terms of how often they respond at the same time as fire engines.



2. Big Data Ah. The speaker applied open data and data science to finding and planning romantic dates. The first step was census.gov data that, plotted on a map, showed where the greatest density of single women aged 20-30 lived. He then used Craigslist "missed connections" data to figure out where to hang out for chance encounters. After that it was just planning -- having things to talk about, such as the weather and every downtown San Francisco filming location used in Star Trek IV: The Voyage Home.



Day 1: Tutorials



Day 1 of the conference is all half-day tutorials, whereas days 2 and 3 will be a series of 20-40 minute talks. My morning tutorial was riveting, the star of the conference for me; in the afternoon I skipped around.



Morning Tutorial that I attended: Berkeley Stack: Spark, Shark, and Spark Streaming



Map/Reduce is 5-10 years old now, and everyone loves to bash Hadoop's implementation of it for its "high latency". There are dozens of "replacements" for Hadoop Map/Reduce, including Impala (Cloudera), Apache Drill (MapR), Hawq (Greenplum, announced today), and Stinger (Hortonworks, announced last week).



Spark beats them all because it's both a) free and b) available today (and relatively "mature": version 0.7 shipped this week, and earlier versions have been out for a couple of years).



The development funding model is also unique. 40 students, 8 faculty, and 3 staff software engineers have been set up in a startup-like physical environment at Berkeley on a 6-year mission (they are 2 years into it). That is a ton of resources for an open source project, and the students who presented this morning were whip-smart.



Spark gets its speed, like most of the other Map/Reduce replacements, through caching and optimizing the master/node communications.
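To make the caching point concrete, here is a toy sketch in plain Python (not Spark's actual API, and the data and function names are my own invention): keep an expensive intermediate dataset in memory once, so follow-up queries skip the recompute that a naive Map/Reduce job would repeat on every pass.

```python
# Toy illustration of why caching intermediate results speeds up
# iterative Map/Reduce-style querying (plain Python, not Spark's API).

raw_logs = ["ERROR disk", "INFO ok", "ERROR net", "INFO ok"]

def expensive_parse(lines):
    # Stand-in for a full pass over data on disk/HDFS.
    return [line.split() for line in lines]

# Without caching, every query re-runs the expensive pass:
errors = [r for r in expensive_parse(raw_logs) if r[0] == "ERROR"]

# With caching (the idea behind Spark's in-memory datasets), the
# parsed data is computed once and later queries reuse it:
parsed = expensive_parse(raw_logs)      # computed once, kept in RAM
disk_errors = [r for r in parsed if r == ["ERROR", "disk"]]
net_errors  = [r for r in parsed if r == ["ERROR", "net"]]
```

The win compounds for iterative algorithms, which is exactly the workload where repeated full passes over disk-resident data hurt the most.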



Shark, like many other Hive replacements, maintains HiveQL compatibility.



Spark Streaming, according to the presenter, beats its competitor Storm at calculating metrics on the fly as events come off a queue like Kafka or Flume, because Spark has fault tolerance through node redundancy, and because Spark avoids Storm's double-counting problem by maintaining the full event history in memory for the specified window (e.g. 10 minutes). He said there is a layer over Storm that can prevent double-counting, but it achieves this by wrapping each individual event in its own transaction, and most users abandon that solution as non-performant.
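The windowing idea above can be sketched in a few lines of plain Python (this is my own toy model, not Spark Streaming's API): by retaining the raw events for the window, a count can always be recomputed exactly, and a replayed duplicate of an already-seen event doesn't inflate the result the way a bare incrementing counter would.

```python
from collections import deque

# Toy sliding-window counter: keep the events themselves for the
# window (e.g. 600 s) rather than a running total, so the metric is
# recomputable and replayed duplicates can be deduplicated.
class WindowedCounter:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, event_id) pairs

    def add(self, ts, event_id):
        self.events.append((ts, event_id))
        self._evict(ts)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()

    def count(self):
        # Recomputed from retained data; a replayed event with the
        # same id is counted once, not twice.
        return len({eid for _, eid in self.events})

w = WindowedCounter(600)
w.add(0, "a"); w.add(1, "b"); w.add(1, "a")   # "a" replayed after a failure
```

Storing the window's events costs memory, which is exactly the trade the presenter described: Spark spends RAM to get exact counts cheaply, where Storm spends per-event transactions.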



The other thing that I learned is that Shark and Spark Streaming are only the first two subsystems built on the Spark foundation. Other Spark subsystems are in the works, including a graph database/querying system. This makes Hadoop like J2EE and Spark like Spring. I see Hadoop and Spark living alongside each other for another year or two before Spark just takes over.



First afternoon tutorial that I attended: Python Data Science



I had no idea there was an interactive shell for Python called IPython Notebook. Combined with data analysis libraries, it turns Python into R. So I asked the presenter: why not just use R?



1. He answered that he can test ideas in the IPython Notebook, and then easily turn them into real Python programs.



2. I added that that would also enable easy adaptation into one of the many distributed/clustering Python solutions, and he agreed. Something I learned a couple of weeks ago is that the Python community has a big chip on its shoulder about Hadoop. There are a couple dozen Python clustering solutions available.
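The notebook-to-program path the presenter described looks something like this (the data and function below are invented for illustration): you poke at the data interactively in a notebook cell, and once the analysis works, you promote it to a plain function in an ordinary .py module.

```python
import statistics

# The kind of thing you might explore interactively in a notebook cell:
response_ms = [120, 95, 310, 88, 102, 99]

# ...then promote to a regular function in a .py module once the idea
# works -- the "test in the notebook, turn into a real program" path.
def latency_summary(samples):
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "worst": max(samples),
    }

summary = latency_summary(response_ms)
```

Because the notebook and the module are the same language, this promotion is copy-paste rather than a rewrite, which was his answer to "why not R."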



Second afternoon tutorial I attended: Apache Solr



Although I was familiar with both project names previously, I learned in this tutorial that the Lucene and Solr projects merged a while back. Solr provides search engine capability to your application.



Then I learned that a common use of Solr is to marry it to Cassandra or HBase, so that it effectively serves as an index to one of those two NoSQL databases. And finally I learned that DataStax is essentially a commercial marriage of Solr and Cassandra.
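The "Solr as an index over a NoSQL store" pattern boils down to this (dicts stand in for Cassandra/HBase and Solr here; the records and terms are made up): the key-value store holds the authoritative full records, the search index maps terms back to record keys, and a search returns keys that you then fetch from the store.

```python
# Toy sketch of the pattern: full records live in the NoSQL store,
# the search index holds only term -> key mappings.

kv_store = {  # Cassandra/HBase stand-in: key -> full record
    "row1": {"title": "Spark tutorial", "body": "...full text..."},
    "row2": {"title": "Solr basics",    "body": "...full text..."},
}

search_index = {  # Solr stand-in: indexed term -> matching keys
    "spark":    ["row1"],
    "solr":     ["row2"],
    "tutorial": ["row1"],
}

def search(term):
    # The index answers "which keys match"; the store supplies the data.
    return [kv_store[k] for k in search_index.get(term.lower(), [])]

hits = search("Spark")
```

The division of labor is the point: the NoSQL store is fast at key lookups but can't do full-text queries, and Solr fills exactly that gap.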



Vendor Expo



I expected, and ended up seeing, a lot of me-too Big Data solutions. I was absolutely floored by one product, though, and that was Platfora. It's like Tableau, but it works directly with Hadoop and allows end users to construct reduced datasets on the fly (called "lenses") without help from IT. With a point-and-click interface, you create a Map/Reduce query (which takes a few minutes to execute, of course) to build a lens, and with that lens you can then slice and dice interactively, in real time, Tableau-style.



This obviates the need for pre-segmenting and pre-computing data. As the developers of Druid point out, complete pre-segmentation on five columns is tractable, but on fourteen columns it is not.
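A back-of-the-envelope calculation (my own, not Druid's exact figures) shows why pre-segmentation stops scaling: even before counting the distinct values within each column, the number of group-by column subsets you would have to pre-aggregate doubles with every added column.

```python
# Why complete pre-segmentation explodes with column count: there are
# 2**n possible subsets of n group-by columns, each needing its own
# pre-computed rollup (ignoring the values within each column, which
# make things far worse).
def groupby_subsets(n_columns):
    return 2 ** n_columns

five = groupby_subsets(5)
fourteen = groupby_subsets(14)
```

Five columns means a few dozen rollups; fourteen means tens of thousands, which is why deferring the segmentation decision to query time is so attractive.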



With Platfora, you don't have to make segmentation decisions in advance -- the user can do it after the fact.



And it's only $60k/year/server. This is huge. It reduces every Big Data project to an ingestion problem; after that, you just hook up Platfora.



(And, yes, that means maybe even Shark won't be that useful anymore, save for the fact that it's free, but Spark Streaming would still be needed for real-time dashboarding/operational monitoring systems.)

Strata is O'Reilly's conference on Big Data, held four times per year in various flavors and cities. This year is just the second time it has been held in Silicon Valley, where most of the Big Data vendors (and users) are located. It runs Feb 26-28, 2013, with some pre-events yesterday, Monday, Feb. 25. I'll be providing daily updates.