Interesting talks from PyData London 2017

This year’s PyData London conference was held in Bloomberg’s offices on the 6th and 7th of May, with Tutorial Day on May 5th. As was the case with PyData Amsterdam 2017, I made the time to watch all of the talks from the conference, and write a blog post about the ones I found the most interesting.

So, let’s begin. As I’m a huge fan of Random Forests, and consider them pretty much Data Science 101, I thoroughly enjoyed the talk given by Nathan Epstein from conference host Bloomberg. He gave a very good intuitive introduction to how the algorithm works, and also spoke about its advantages over Neural Networks - something very useful in a time when everyone is really gung-ho about Deep Learning and “AI”.
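That core intuition - many weak trees, each trained on a bootstrap resample of the data, combined by majority vote - can be sketched in a few lines of plain Python. This is a toy illustration of my own using decision stumps, not code from the talk; for real work you’d reach for scikit-learn’s RandomForestClassifier:

```python
import random
from collections import Counter

random.seed(0)

# Toy dataset: two features in [0, 1), label is 1 when their sum exceeds 1.
points = [(random.random(), random.random()) for _ in range(200)]
data = [(x, int(x[0] + x[1] > 1.0)) for x in points]

def train_stump(sample):
    """Fit the best single-feature threshold split on a (bootstrap) sample."""
    best = None
    for feat in (0, 1):
        for (x, _) in sample:
            thresh = x[feat]
            preds = [int(xi[feat] > thresh) for (xi, _) in sample]
            acc = sum(p == y for p, (_, y) in zip(preds, sample)) / len(sample)
            if best is None or acc > best[0]:
                best = (acc, feat, thresh)
    _, feat, thresh = best
    return lambda x: int(x[feat] > thresh)

def train_forest(data, n_trees=25):
    """Each 'tree' sees a different bootstrap resample of the data."""
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(data) for _ in range(len(data))]
        forest.append(train_stump(sample))
    return forest

def predict(forest, x):
    """Majority vote over all trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

forest = train_forest(data)
accuracy = sum(predict(forest, x) == y for x, y in data) / len(data)
print(f"forest accuracy on toy data: {accuracy:.2f}")
```

Real Random Forests use full decision trees and also subsample features at each split, but the bootstrap-and-vote skeleton above is the part that gives the ensemble its robustness.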

Ian Ozsvald, author of the great “High Performance Python”, together with Gusztav Belteki and Giles Weaver, presented a piece of research they did for the NHS, using data collected from ventilators used in neonatal wards. They mentioned using Tom Augspurger’s engarde package for data sanity checking (something we should all be using in automated data cleaning scripts IMHO), Bokeh for visualization and hand-annotation of their time series data, and the aforementioned Random Forest algorithm for modelling. :)
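The pattern engarde encourages - declaring your assumptions about the data as decorators, so the pipeline fails fast the moment something drifts - looks roughly like this. Note this is a minimal stdlib sketch of the idea with made-up field names, not engarde’s actual pandas-based API:

```python
from functools import wraps

def none_missing(func):
    """Fail fast if any record returned by `func` contains a None value."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        records = func(*args, **kwargs)
        for i, rec in enumerate(records):
            if any(v is None for v in rec.values()):
                raise ValueError(f"record {i} has a missing value: {rec}")
        return records
    return wrapper

def within_range(field, lo, hi):
    """Fail fast if `field` falls outside [lo, hi] in any returned record."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            records = func(*args, **kwargs)
            for i, rec in enumerate(records):
                if not lo <= rec[field] <= hi:
                    raise ValueError(
                        f"record {i}: {field}={rec[field]} outside [{lo}, {hi}]")
            return records
        return wrapper
    return decorate

@none_missing
@within_range("pressure", 0.0, 40.0)  # hypothetical plausible-pressure bound
def load_ventilator_readings():
    # Stand-in for reading the real sensor data.
    return [{"pressure": 12.5, "flow": 4.1},
            {"pressure": 18.0, "flow": 3.7}]

readings = load_ventilator_readings()
print(f"{len(readings)} records passed all checks")
```

The win is that the checks live right next to the loading code, so a cron-driven cleaning script crashes loudly instead of silently propagating bad data downstream.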

Rebecca Bilbro from Bytecubed, lecturer in ML at Georgetown University and author of “Applied Text Analysis with Python”, talked about an interesting project called Yellowbrick for visualizing model diagnostics. She also gave an excellent overview of the many diagnostic techniques available. The project is actively looking for contributors, so it might be a good one to sink one’s teeth into.

In his keynote speech, Gene Kogan, author of “Machine Learning for Artists”, showed a bunch of examples of using CNNs and GANs for image recognition, style transfer, image and sound generation, and automated captioning, and also gave a demo of real-time video style transfer! It’s a very fun talk, and I LOL’d quite a few times. :) (Incidentally, he also recently taught a 3-day workshop at the Resonate festival in my home town of Belgrade.)

Coming from a marketing and social research background, I found Aileen Nielsen’s talk on polling really interesting. She covered the different sampling schemes used, methods for post-processing, and approaches to aggregating data from various polls. She also talked a bit about what she thinks went wrong with the predictions about the US elections, as well as the EU referendum in the UK.
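One of the post-processing methods she covered, post-stratification weighting, is simple enough to show in miniature: respondents are reweighted so the sample’s demographic mix matches the population’s before averaging. The numbers below are toy figures of my own, not from the talk:

```python
# Population's known age-group shares (hypothetical).
population_share = {"18-39": 0.35, "40-64": 0.40, "65+": 0.25}

# Each respondent: (age group, 1 if they support candidate A else 0).
# The sample over-represents older respondents.
sample = ([("18-39", 1)] * 10 + [("18-39", 0)] * 10
          + [("40-64", 1)] * 12 + [("40-64", 0)] * 18
          + [("65+", 1)] * 10 + [("65+", 0)] * 40)

n = len(sample)
sample_share = {g: sum(1 for grp, _ in sample if grp == g) / n
                for g in population_share}

# Weight = population share / sample share for the respondent's group.
weights = [population_share[grp] / sample_share[grp] for grp, _ in sample]

raw = sum(v for _, v in sample) / n
weighted = sum(w * v for w, (_, v) in zip(weights, sample)) / sum(weights)
print(f"raw support: {raw:.3f}, post-stratified: {weighted:.3f}")
```

Because the under-sampled young (and more supportive) group gets weighted up, the post-stratified estimate lands noticeably higher than the raw sample mean.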

Keeping up with the theme, Will Moy and Mevan Babakar from Full Fact talked about the work their organization has been doing in checking the various facts put forth in public discourse. They covered the various forms of fake news and fact distortions that exist, with ample examples from the UK, and showed the system they’re building using Python, Solr and CoreNLP for automating fact-checking in real time.

Nuno Castro from Expedia talked about how they used Keras (covered in a really good tutorial from the first day) and the VGG-16 pre-trained model to rank hotel images on their website, in order to provide a better user experience. They also used Amazon Mechanical Turk for building their dataset, which proved to be very quick and cost-effective. The Q&A session after the talk was also very interesting, with some great questions.

Andrew Patterson from NaturalMotion gave an interesting presentation about how they detect cheating in their mobile games. He mentioned the various methods they’ve seen cheaters use, and a large part of his talk was dedicated to explaining the difficulties in establishing a ground truth, with a distinction between cheaters that explicitly impact revenue and cheaters that implicitly impact revenue by causing frustration in legitimate players. As the line between a cheater and a really good player can sometimes be really blurry, he presented some interesting methods for detecting outliers.
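To give a flavour of what outlier detection looks like in this setting (a minimal sketch of my own, not Andrew’s actual method), here’s a robust score based on the median absolute deviation, which, unlike a plain mean/standard-deviation z-score, isn’t dragged around by the cheaters themselves:

```python
import statistics

# Hypothetical per-player feature: in-game currency earned per hour.
earn_rates = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13, 250, 13, 12, 480]

# The median and the median absolute deviation (MAD) are robust to the
# very outliers we are trying to find.
median = statistics.median(earn_rates)
mad = statistics.median(abs(x - median) for x in earn_rates)

def robust_score(x):
    """How many MADs away from the median a value sits."""
    return abs(x - median) / mad

# 3.5 is a conventional cutoff for MAD-based outlier flagging.
suspects = sorted(x for x in earn_rates if robust_score(x) > 3.5)
print("flagged earn rates:", suspects)
```

In practice a game studio would combine many such features and still face the ground-truth problem Andrew described: an extreme value alone doesn’t prove cheating.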

In a fun and engaging talk, Kathryn Harris from Not On The High Street drew parallels between working on academic projects as an astrophysicist and working as the first Data Scientist in a mid-size tech company. She shared a bunch of anecdotes and lessons learned, and the discussion was very lively and educational.

In a sort of a “Big Data 101” talk, Raoul Gabriel Urma and Valentin Dalibard from Cambridge Spark explained the 3 V’s of Big Data, the difference between batch and stream processing, and the difficulties of scaling. They did a live coding session using PySpark and Zeppelin Notebook, and also showed how to scale out by spinning up an Amazon EMR cluster.
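The batch-versus-stream distinction they explained can be shown in miniature without Spark at all (a toy sketch of my own, not their PySpark code): batch computes once over the complete dataset, while streaming keeps a running state that can be queried at any point without re-reading history.

```python
from collections import Counter

events = ["login", "purchase", "login", "view", "purchase", "login"]

def batch_counts(events):
    """Batch: collect everything first, then compute over the full dataset."""
    return Counter(events)

def stream_counts(event_iter):
    """Stream: update running state one event at a time, yielding a
    snapshot of the counts after each event arrives."""
    counts = Counter()
    for event in event_iter:
        counts[event] += 1
        yield dict(counts)

final_batch = batch_counts(events)
snapshots = list(stream_counts(iter(events)))
print("batch result:  ", dict(final_batch))
print("final snapshot:", snapshots[-1])
```

On a bounded dataset the two converge to the same answer; the streaming version just pays for it with per-event state management, which is exactly where the scaling difficulties they discussed come in.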

Zack Akil from Pivigo plays rugby in his spare time, and he needed a solution for recording his games that was better than shaky handheld videos made by his friends. So he built a robot using a wide angle camera, a Raspberry Pi, Python and a single-layer Neural Network, that almost perfectly follows and records the action on the pitch!
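The talk didn’t spell out the network’s internals, but the general idea of a single-layer model mapping what the wide-angle camera sees to a pan command can be sketched as a lone linear neuron trained by gradient descent. Everything below (the pan formula, the learning rate, the data) is hypothetical:

```python
import random

random.seed(1)

# Hypothetical target mapping: ball's horizontal position in the frame
# (0..1) to a camera pan angle, pan = 90*x - 45 degrees.
true_w, true_b = 90.0, -45.0
xs = [random.random() for _ in range(100)]
data = [(x, true_w * x + true_b) for x in xs]

# A single neuron with a linear activation, trained by stochastic
# gradient descent on squared error.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    for x, target in data:
        pred = w * x + b
        err = pred - target
        w -= lr * err * x  # gradient of squared error w.r.t. w
        b -= lr * err      # gradient of squared error w.r.t. b

print(f"learned: pan = {w:.1f}*x + {b:.1f}")
```

A single layer suffices here because the mapping is (assumed) roughly linear; the interesting part of Zack’s project is upstream, in locating the action within the frame at all.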

The never-ending debate about static versus dynamic typing has, naturally, spilled over into Data Science as well, especially when it comes to writing ETL pipelines, productionizing models, and generally creating data products. Marco Bonzanini gave his take on the topic when it comes to Python, and his talk is well worth a watch.

One topic that I don’t see discussed or taught enough is software engineering best practices for Data Scientists, testing being a particularly important one. Nick Radcliffe has attempted to bridge this gap with his tdda package, and he did a great tutorial on using it.
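The discover-then-verify workflow at the heart of tdda can be mimicked in a few lines: infer simple constraints from a known-good reference dataset, then check every new batch against them. This is a stdlib sketch of the idea with made-up fields, not tdda’s actual API:

```python
def discover_constraints(records):
    """Infer min/max bounds per field from a known-good reference set."""
    constraints = {}
    for field in records[0]:
        values = [r[field] for r in records]
        constraints[field] = {"min": min(values), "max": max(values)}
    return constraints

def verify(records, constraints):
    """Return a list of human-readable violations (empty means all good)."""
    failures = []
    for i, rec in enumerate(records):
        for field, c in constraints.items():
            v = rec.get(field)
            if v is None:
                failures.append(f"record {i}: {field} is missing")
            elif not c["min"] <= v <= c["max"]:
                failures.append(f"record {i}: {field}={v} outside "
                                f"[{c['min']}, {c['max']}]")
    return failures

reference = [{"age": 23, "score": 0.4}, {"age": 61, "score": 0.9}]
constraints = discover_constraints(reference)

new_batch = [{"age": 30, "score": 0.5}, {"age": 150, "score": 0.7}]
for failure in verify(new_batch, constraints):
    print(failure)
```

tdda itself discovers a much richer set of constraints (types, uniqueness, regex patterns, and so on) and serializes them to JSON, but the test-what-you-assume spirit is the same.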

Well, that’s it for this year’s PyData London conference! Hope you enjoyed these, and see you after Berlin Buzzwords 2017!