

Denver, CO

Post #: 27 user 4173184

Keynotes



Of the nine keynote speakers, three stood out:



Rajat Taneja, Electronic Arts CTO



He covered real-time processing of big data, which is critical to monetizing their games now that the games are free and revenue comes from in-game ad placement and sales of virtual goods. I liked the word "instrument" (they "instrument" their games to capture events) and his soundbite "small is the new big data". They have moved beyond capturing everything and have instead optimized to capture the absolute minimum that gives them the value they need. They capture on the order of GB per day rather than TB per day.



Eric Colson, Stitch Fix



Formerly in charge of recommendation engines for Netflix, Eric is now at Stitch Fix, which takes recommendation to a whole new level: they preemptively ship new clothing to you every month and hope you like it. They improve the accuracy of recommendations by using humans in the final step.



Jeanne Harris, Accenture



She's been going to computer conferences since the 1970's, and she said that in 1990 everyone was quoting Field of Dreams ("Build it And They Will Come") in their PowerPoints when they should have been quoting Groundhog Day, because every five years there's a new cycle of "imagine the possibilities of databasing all this extra data." My colleague happened to express the same every-five-years opinion to me the evening before.



Talks



In each time slot, there were 8 simultaneous talks, so these are only the ones I went to.



Try, Learn, Buy: Operations Research Meets AI

Elisabeth Crawford, Birchbox



Yet another company that recommends products via monthly shipments — this time boxes of shampoo, lotion, etc.



For those of us who went to college in the 80's, Elisabeth Crawford brought back memories of working the Simplex Method from Operations Research by hand on paper. They use the Simplex Method to decide what goes in each box, and conventional recommender systems supply its inputs (i.e. they populate the matrix with weights before the Simplex Method optimizes).
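To make the "recommender scores in, Simplex out" idea concrete, here is a minimal sketch and not Birchbox's actual formulation. It assumes a single customer, four hypothetical products with made-up recommender scores, a box that holds two items, and at most one of each product. The function name simplex_max and all the numbers are assumptions for illustration only.

```python
from fractions import Fraction

def simplex_max(c, A, b):
    """Maximize c.x subject to A x <= b and x >= 0 (every b[i] must be >= 0).

    Textbook tableau Simplex with Bland's rule; a teaching sketch only.
    """
    m, n = len(A), len(c)
    # Each constraint row: [A | identity for slack variables | b].
    T = [[Fraction(v) for v in A[i]]
         + [Fraction(int(i == j)) for j in range(m)]
         + [Fraction(b[i])] for i in range(m)]
    # Objective row: negated weights, zeros for slacks, running value last.
    T.append([Fraction(-v) for v in c] + [Fraction(0)] * (m + 1))
    basis = [n + i for i in range(m)]  # slack variables start in the basis
    while True:
        # Entering variable: lowest index with a negative objective entry.
        col = next((j for j in range(n + m) if T[-1][j] < 0), None)
        if col is None:
            break  # no improving direction left, so this is optimal
        rows = [i for i in range(m) if T[i][col] > 0]
        if not rows:
            raise ValueError("problem is unbounded")
        # Ratio test, ties broken by the lowest basic variable index.
        row = min(rows, key=lambda i: (T[i][-1] / T[i][col], basis[i]))
        pivot = T[row][col]
        T[row] = [v / pivot for v in T[row]]
        for i in range(m + 1):
            if i != row and T[i][col] != 0:
                factor = T[i][col]
                T[i] = [a - factor * r for a, r in zip(T[i], T[row])]
        basis[row] = col
    x = [Fraction(0)] * n
    for i, var in enumerate(basis):
        if var < n:
            x[var] = T[i][-1]
    return x, T[-1][-1]

# Hypothetical recommender scores for one customer and four products.
scores = [9, 7, 4, 8]
# Row 0: the box holds at most 2 items. Rows 1-4: at most one of each product.
A = [[1, 1, 1, 1],
     [1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
b = [2, 1, 1, 1, 1]

picks, total = simplex_max(scores, A, b)
print([int(v) for v in picks], int(total))
```

With these constraints the optimum happens to be whole numbers (products 0 and 3 go in the box). A real assortment problem would add inventory limits across many customers and might need integer programming on top of the linear relaxation.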



Introduction to Forecasting

Michael Bailey, Facebook



This talk introduced me to ARIMA, which I would probably be learning this week anyway if I had time for the Coursera course I signed up for. Michael Bailey had an intuitive feel for modeling data, and really inspired me to do predictive modeling.



Next Gen Data Scientists

Rachel Schutt, Google



She started a data science class at Columbia University. She has a PhD in statistics, but by working at Google from 2008 to 2012 she also learned about code reviews and revision control, so she's in a unique spot to train the "next generation" of data scientists.



The point I found most interesting, though, was an aside she made. Her class was extremely diverse, with not just math and computer science students, but also many from the sciences. That's not too surprising, but what did surprise me is that she arranged with Kaggle to set up the final exam, and even the students who had never programmed before in their lives were writing machine learning algorithms by the end of the class. That breaks the common notion from 20-30 years ago, which was that it took a few months for a light bulb to go on before you got the hang of programming, and then it was easy, like riding a bicycle. But I guess it's different when you have bright Columbia University students motivated by the popular press touting $300k salaries for data scientists.



Real Time Analytics With Storm

Accenture Technology Labs



In a barely audible aside during the presentation, the presenters confirmed the weakness of Storm that had been raised during the previous day's Spark Streaming presentation: Trident, the layer on top of Storm that prevents double-counting, is not performant.
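For readers wondering what "prevents double-counting" involves, here is a minimal sketch of the general idea of transactional state: store the id of the last batch applied alongside each count, and skip the update if the same batch is replayed. This is plain Python for illustration, not Storm's or Trident's API. The class name TransactionalCounter is an assumption, and the sketch assumes batches are applied in order and that a replayed batch carries the same id and the same contents.

```python
class TransactionalCounter:
    """Counts keys per batch, ignoring a replay of an already-applied batch."""

    def __init__(self):
        # key -> (id of the last batch applied to this key, running count)
        self.state = {}

    def apply_batch(self, txid, keys):
        # Tally the batch first so each key is written once per batch.
        increments = {}
        for key in keys:
            increments[key] = increments.get(key, 0) + 1
        for key, inc in increments.items():
            last_txid, count = self.state.get(key, (None, 0))
            if last_txid == txid:
                continue  # this batch was already counted for this key
            self.state[key] = (txid, count + inc)

    def count(self, key):
        return self.state.get(key, (None, 0))[1]


counter = TransactionalCounter()
counter.apply_batch(1, ["click", "view", "click"])
counter.apply_batch(1, ["click", "view", "click"])  # replay after a failure
counter.apply_batch(2, ["click"])
print(counter.count("click"), counter.count("view"))
```

Without the stored batch id, the replayed batch would have pushed the "click" count to 5 instead of 3.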



Big Data is a Hotbed of Thoughtcrime, Part II: The Code

Jim Adler, inome



He started off with an interesting question posed by a data scientist from LinkedIn when he spoke at Strata a year earlier. The question was whether inferring private information using only public information was a thoughtcrime (i.e. unethical). Since Jim's talk today was about inferring whether someone was a felon based on their personal profile characteristics (tattoos, race, etc.) and non-felony convictions (misdemeanors and traffic offenses), the question became whether it was a thoughtcrime to detect thoughtcrimes.



One scary result that came out of their decision tree was that if a person with dark skin had any misdemeanors or traffic offenses whatsoever, then according to the data and their machine-learned decision tree, that person was extremely likely to also commit a felony.



So one of the many ethical guidelines he listed was that "profiling" (which he said the Supreme Court has upheld if it is one of several factors contributing to probable cause) has to be narrow to be ethical. There were other guidelines, too, such as the need to combine the general profiling with the specific circumstances of the case.



What to Do When Your Machine Learning Gets Attacked

Vishwanath Ramarao, Impermium



Spam and malicious attacks throw off machine learning, so Vish called this environment "adversarial machine learning". In adversarial machine learning, several strategies that run counter to conventional machine learning can help. One example strategy is using all possible variables/features instead of pruning them, as conventional machine learning does (pruning normally creates simple, robust models). The example he gave was a spammer posting web-forum comments with a distinctive User Agent string, where the compact machine learning model didn't include that variable/feature. So they had to add it back in and re-train the model.



Another strategy is intentional over-fitting. A conventional ML model is used in conjunction with a later adversarial machine learning stage that is over-fitted to specific recent attacks.
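A rough sketch of how such a two-stage arrangement might look, under heavy assumptions: the "conventional" stage is a stand-in callable, and the over-fitted stage simply memorizes exact fingerprints of recently reported attacks. The class name TwoStageFilter, the fingerprint fields, and the window size are all hypothetical and not taken from the talk.

```python
from collections import deque

class TwoStageFilter:
    """General model first, then a stage that memorizes recent attacks."""

    def __init__(self, general_model, window=1000):
        self.general_model = general_model   # callable: comment dict -> bool
        # Bounded window: the oldest fingerprints fall off automatically.
        self.recent = deque(maxlen=window)

    @staticmethod
    def fingerprint(comment):
        # Deliberately over-specific: exact User Agent plus normalized text.
        text = " ".join(comment["text"].lower().split())
        return (comment["user_agent"], text)

    def report_attack(self, comment):
        self.recent.append(self.fingerprint(comment))

    def is_spam(self, comment):
        if self.general_model(comment):
            return True
        return self.fingerprint(comment) in self.recent


def keyword_model(comment):
    # Stand-in for a trained classifier; flags one obvious keyword.
    return "viagra" in comment["text"].lower()

spam_filter = TwoStageFilter(keyword_model, window=3)
attack = {"user_agent": "SpamBot/1.0", "text": "Great post!  Visit my site"}
print(spam_filter.is_spam(attack))   # the general model misses it
spam_filter.report_attack(attack)
print(spam_filter.is_spam(attack))   # the memorizing stage catches the repeat
```

The second stage generalizes badly by design, which is why it sits behind the general model and forgets old attacks as the window fills.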



Another strategy is detecting and leveraging outliers. Rather than being discarded, outliers are treated as possible indications of an attack in progress.
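As an illustration of flagging outliers rather than dropping them, the sketch below uses a modified z-score based on the median absolute deviation (MAD), a standard robust outlier test swapped in here as an assumption; the talk did not say which method Impermium uses. The function name flag_outliers, the 3.5 threshold, and the sample posting rates are hypothetical.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the indices whose modified z-score exceeds the threshold."""
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # No spread around the median: anything off the median stands out.
        return [i for i, v in enumerate(values) if v != med]
    # 0.6745 scales MAD to be comparable to a standard deviation.
    return [i for i, d in enumerate(deviations)
            if 0.6745 * d / mad > threshold]

# Hypothetical posts per minute from eight forum accounts.
posts_per_minute = [3, 4, 2, 5, 3, 4, 120, 3]
suspects = flag_outliers(posts_per_minute)
print(suspects)  # indices to investigate, not rows to delete
```

The median-based score is used here because a single large burst would inflate an ordinary mean and standard deviation and could hide the very point being looked for.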



Vendor Floor



I dragged a colleague over to the Platfora booth and in the process learned about three more features:



It supports joins between multiple datasets residing on the Hadoop cluster when it employs Map/Reduce to create "lenses".



A lens does not have to fit completely in the RAM of the Platfora server, because they've implemented their own virtual memory system, which they call "fractal cache".



They have an instant message system built-in so users can chat about and annotate visualizations/dashboard snapshots that contain anomalous data. Yeah, kind of superfluous but I am reminded of the old adage that all software eventually incorporates an e-mail system. I guess that's changed now to be IM instead.


