What do we mean when we talk about 'big data', and how can we become better critical consumers of it? Data scientist Vicki Boykis recommends the best books for learning Python, a language she says is as versatile as a Swiss Army knife, and shows that it's possible to teach yourself coding and data science.

First off, before we discuss your books on learning Python and data science, how did you become a data science consultant? What brought you on this path?

Like many other people in this field, I didn’t study it right from the start. I actually started my journey with an undergraduate degree in economics; I was interested in both English and maths, but didn’t want to pick either to the exclusion of the other. I then started working in Washington DC as a consultant, where I was mainly working with a ton of spreadsheets. I learned that my coworkers were using programs called SAS and R to work faster, and that’s how I first started working with data programmatically. I later moved to a few data analysis jobs at larger companies.

These jobs were never tied to the data engineering team; the data was provided to us and we just had to analyze it. And that’s when I decided to learn Python, how to code, and how to use big data technologies, so that I could manipulate and clean this data myself. Nowadays, I work as a consulting data scientist for many large companies in the United States, on a one-off project basis in data science and engineering.

What has been your experience of learning all of these skills by yourself? If somebody wanted to take the same path today, would you advise them to go through an official degree or to try and learn by themselves?

You can get to a certain point of competency by learning by yourself, which would be enough to get a first data science job. Once you get past this point, you do need some guardrails to really master some of the deeper subjects, especially in computer science.

When learning on your own, you tend to skip some fundamentals that can appear superficial or unneeded at the time, but would have allowed you to have a better understanding of some issues you’ll run into as a data scientist. You don’t have to complete a full program, but even a couple of courses will be extremely helpful. One of the best things you can do is also to look for a senior person to mentor you. I’ve been lucky to find a couple of those people in my career, and it has pushed me forward tremendously.

For your data science tasks, you personally choose to use Python much more than other languages, including R. What major differences have you found between those languages, and why do you tend to favor Python?

Usually in consulting jobs, you come in and whatever the programming environment is, you adapt to it. I learned very quickly to become a programming polyglot, but when working on my own projects, I prefer Python. For me, it’s an incredibly useful Swiss-army-knife-type of language. You can use it for data analysis, to put up a webserver, to build a web app, to interact with cloud services, for machine learning, and so on. It’s a little worse than R for hardcore statistics, but most of the time in real life projects, we’re working with the full data pipeline from ingestion to analysis; and for that I really prefer Python for its flexibility.

1. Learn Python the Hard Way by Zed A. Shaw

With that in mind, let’s talk about the first book you’ve chosen: Learn Python the Hard Way by Zed Shaw. Why did you pick this one, particularly over all the other books teaching Python?

Zed Shaw has written a lot of Learn X the Hard Way books. Initially he put them on his website for free, which is how a lot of people found out about them. He’s experimented with many ways to monetize his work, and I believe the only way to get the books now is to pay for them—but it’s completely worth it!

I chose this book because a lot of other Python books talk about theory, and Learn Python the Hard Way is first and foremost about building things. His approach is to say: ‘Coding is going to suck at first; you’re not going to understand anything. But just do this stuff that I tell you to do, and eventually it’s all going to make sense.’ He goes through all of the building blocks that you need to master Python. It’s been updated for Python 3, which is very important as well. It’s very practical and down to earth, with about 50 to 60 exercises, and it’s written in a way that doesn’t feel overwhelming and that really allows you to go through all of them.

2. Coders at Work: Reflections on the Craft of Programming by Peter Seibel

Your next choice is Coders at Work by Peter Seibel, subtitled ‘Reflections on the craft of programming’. Is this meant to teach you to code well, beyond what basic books usually teach you about programming?

This is more of a ‘cultural’ book about programming, in which you won’t get a lot of specific, technical advice on how to program in Python, C or Java. What you’ll get is an introduction to the industry by the people who founded it; people like Brendan Eich, who wrote JavaScript, or Joshua Bloch, who was one of the main contributors to Java. It’s about how those people got into programming and how they think about it. It’s a very conversational book that really helps you to learn the culture of this industry you’re coming into, and some of its terminology. And it’s a much lighter read than Learn Python the Hard Way, of course.

When you haven’t followed an engineering curriculum and haven’t been immersed in the engineer culture, is it important to catch up with that to be able to really work in this industry?

It’s extremely important. You’ll be working with people who have been writing code since they were 10 years old. They’re very much immersed in this world, and as a beginner, it can be intimidating to ask certain things, because it gives away your ‘status’ and the fact that you don’t know as much. To balance that, this book will, for example, give you a good overview of the history of computer development, what kinds of programming languages there are and how they came about: all the things that you wouldn’t necessarily get from online courses and technical books.

Have you found the computer/data science industry to be welcoming to outsiders, or at least people who haven’t received a proper degree in the field?

In spite of my answer to the last question, I still think it’s one of the most welcoming industries that you can get into. I worked hard but I do consider that I had some luck by stumbling into data just as it was getting big, around 2011-2012. None of my friends, be they nurses, doctors, lawyers, actuaries, could have found their job without a lot of advanced studies and official certifications. For all the problems that the computer science industry has, it’s probably one of the most egalitarian, at least on the particular issue of qualifications.

3. Big Data: Principles and Best Practices of Scalable Realtime Data Systems by Nathan Marz (with James Warren)

For your third book choice you decided to turn to ‘big data’, a very trendy concept, but one that seems really key to understanding how programmers deal with data in today’s world.

This book written by Nathan Marz is a bit ‘old’, at least in computer science time: it was published in 2015. It’s really one of the problems with books in this field: they’re very quickly considered ancient even though they were only written a few years ago.

This book, however, remains a great read if you want to understand how modern data architecture works, especially distributed data systems. For example, it explains really well what the Lambda architecture is: a design that combines batch processing and stream processing so that both can feed the same analysis. It also covers the differences between relational and ‘NoSQL’ databases. It’s very practical too, and walks you through how to implement some of the principles and frameworks that it presents. It’s basically a great way to catch up with all of the new fundamentals of computer science that have appeared in the last 10 years or so.
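To make the Lambda architecture idea a little more concrete, here is a minimal sketch in Python. It is not taken from the book: the function names (`batch_view`, `realtime_view`, `query`) and the page-view events are hypothetical, and a real system would run each layer on separate distributed infrastructure rather than in one script.

```python
from collections import Counter

def batch_view(master_events):
    """Batch layer: recompute page-view counts from the full, immutable event log."""
    return Counter(event["page"] for event in master_events)

def realtime_view(recent_events):
    """Speed layer: count only the events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)

def query(page, batch, realtime):
    """Serving layer: answer a query by merging the batch and real-time views."""
    # A Counter returns 0 for keys it has never seen, so unknown pages are safe.
    return batch[page] + realtime[page]

# Hypothetical data: three events already absorbed by the batch layer,
# plus one event that has only reached the speed layer so far.
master = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
recent = [{"page": "/home"}]

batch = batch_view(master)
speed = realtime_view(recent)
print(query("/home", batch, speed))
```

The point of the split is that the batch view can be slow but complete, while the real-time view only has to cover the short gap since the last batch run.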

Can you try to explain in simple terms what ‘big data’ means, and how it changes the day-to-day work of a programmer or data scientist?

When we talk about ‘big data’, we basically mean any dataset that you can no longer process in memory on a single computer, and so you need multiple computers to work with it. It also means that you need to build systems to coordinate this work, to make sure that none of your data is double-processed or forgotten at any point. For a programmer, it means going beyond languages and programs, and learning to work with data architectures and distributed systems.
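The coordination problem described above can be illustrated with a small, single-machine sketch. Everything here is an assumption made for illustration: the `partition` and `process_all` helpers are hypothetical, the "workers" are plain function calls, and the repeated task simulates the kind of retry that a real distributed scheduler might issue.

```python
def partition(records, size):
    """Split a dataset into fixed-size chunks, one per (simulated) worker task."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def process_all(partitions, worker):
    """Run a worker over every partition exactly once and combine the results."""
    done = {}  # partition id -> partial result, acting as the coordinator's ledger

    # Simulate an unreliable scheduler that hands out partition 0 twice.
    schedule = list(range(len(partitions))) + [0]
    for pid in schedule:
        if pid in done:
            continue  # already processed: skip so nothing is double-counted
        done[pid] = worker(partitions[pid])

    # Check that no partition was forgotten before trusting the total.
    missing = set(range(len(partitions))) - set(done)
    if missing:
        raise RuntimeError(f"unprocessed partitions: {sorted(missing)}")
    return sum(done.values())

records = list(range(1, 11))  # the numbers 1 to 10 stand in for a large dataset
total = process_all(partition(records, 3), sum)
print(total)
```

Real big data frameworks handle this bookkeeping for you, but the ledger of finished partitions is the essence of what they have to track.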

4. How To Lie With Statistics by Darrell Huff

After three books about programming and computer science, let’s now turn sideways and look at one of the other building blocks of data science: statistics. You recommended How to Lie with Statistics, a popular book written by Darrell Huff and published in 1954.

Indeed it was written in the 50s; Darrell Huff was not a statistician himself. He was a journalist, who tried to introduce how you can lie with statistics to people who may not be aware of this problem. It covers things like the typical ‘correlation doesn’t imply causation’, but also how random sampling works, how pie charts and bar charts can be misleading, and so on.

It’s a must-read for anyone who works in business or in this industry in general. You can very easily lie with data. Oftentimes, companies will say: ‘this is data, therefore it’s the truth’. But data can be manipulated and changed in so many ways. This book will teach you how to think honestly about the data that you are analyzing and presenting to different people, and how to be a more critical consumer of data as well, something that is essential in today’s world.
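As one small illustration of how a bar chart can mislead, the sketch below computes how much taller one bar looks than another depending on where the vertical axis starts. The figures and the `bar_height_ratio` helper are invented for this example and are not taken from the book.

```python
def bar_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights for values a and b when the axis starts at baseline."""
    return (b - baseline) / (a - baseline)

# Hypothetical figures: a value that rises from 50 to 52, i.e. by 4%.
last_year, this_year = 50.0, 52.0

honest = bar_height_ratio(last_year, this_year)                      # axis starts at 0
truncated = bar_height_ratio(last_year, this_year, baseline=49.0)    # axis starts at 49

print(f"Axis from 0:  second bar is {honest:.2f}x as tall")
print(f"Axis from 49: second bar is {truncated:.2f}x as tall")
```

The data is identical in both cases; only the choice of baseline changes, which is why a critical reader always checks the axis before reading the bars.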

Let’s go back to the topic of learning these subjects: what about statistics? Can you learn this by yourself, and should you think about going back to school if you want to understand all of the intricacies?

If you’re doing something very advanced like artificial intelligence—or if you want to work at Google Brain, for example—you probably do need to take formal courses. But for most data science projects, you really don’t.

An interesting thing that’s happening is that the models that data scientists have put together are starting to become commoditized. Software products by Amazon, Google or Microsoft let you construct models and train them in the cloud with tremendous computing power, without worrying too much about the mathematical implementation behind it. What matters more is knowing which algorithm to pick, how to tune the parameters of the model, and how to interpret the results in the right way. It’s become easier to create models, but it’s really the interpretation of those models that matters now.

5. Computer Organization and Design MIPS Edition: The Hardware/Software Interface by David A. Patterson & John L. Hennessy

Your last choice goes towards a different path: you picked Computer Organization and Design, a book by David Patterson and John Hennessy.

This is a textbook that covers how computers work from the ground up. It includes hardware, software, and operating systems. It’s a really thick book, but also a really good one! What I’ve found during my move to data science is that the more I work in this field, the more I need to understand how to optimize my code, how the instructions move through the various big data systems that I mentioned earlier, and how all of this can impact performance. It’s really important for me, for example, to understand how data moves through a network and can be slowed down by bandwidth limitations.

This is the book I’d recommend reading if you missed out on a formal computer science education. Some data scientists tend to forget about these aspects because the structures they use, such as R data tables or Python data frames, are abstracting away what’s going on under the hood. You don’t need to know everything from end to end, but having a general idea of why things can be under-optimized will be very helpful.

In a typical undergraduate computer science curriculum, this would be one of the first courses you’d take—but you chose to dedicate your last recommendation to this topic. For somebody learning this from scratch, does it make more sense to start with practical programming tasks, and wait a while before moving on to understanding the inner workings of computer systems?

Something that’s really motivated me in my learning was to have concrete projects to work on. The first time I really witnessed the power of computer programming was when I was doing some data cleaning at work, and made the effort of automating this cleaning; by the end I was blown away by how fast the process had become. This idea of making the computer work for me was very powerful. It’s only later, once you stumble upon a few problems in your programming, that you start taking an interest in the inner workings of computer hardware. So yes, it does make sense to read this book a bit later on, once you’ve mastered the basics of whatever language you are interested in.

Finally, is there anything you’d particularly recommend doing (or avoiding) to someone who’d like to take a path similar to yours?

It’s easy to get overwhelmed when you go down this path. People always say to start with projects, but it can be hard at the beginning to even think of what to do. Finding a mentor on specialized websites or at data science meetups can be a great way of solving this issue, by having an experienced person telling you where to start.

For people who are already working at a job that revolves around data, I’d recommend focusing on your personal ‘pain points’ and trying to solve them with programming; for me, this was automating things I was doing with Excel spreadsheets. There’s a very good book dedicated to these kinds of very actionable tasks, called Automate the Boring Stuff with Python; this would be another great book to read for those moving from typical office applications to programming.
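For a sense of what automating a spreadsheet ‘pain point’ can look like, here is a minimal sketch that uses only Python’s standard library. The sample data, the column names and the `clean_rows` helper are all hypothetical; it assumes the spreadsheet has been exported as CSV text and that the first column holds a name.

```python
import csv
import io

# Hypothetical export of a messy spreadsheet: stray spaces, inconsistent
# capitalization, an empty row and a duplicated entry.
RAW = """name, amount
 Alice , 10
bob,20

Alice,10
"""

def clean_rows(text):
    """Trim whitespace, normalize names, and drop empty and duplicate rows."""
    reader = csv.reader(io.StringIO(text))
    header = [column.strip().lower() for column in next(reader)]
    seen = set()
    cleaned = []
    for row in reader:
        values = [value.strip() for value in row]
        if not any(values):
            continue  # skip rows that are completely empty
        values[0] = values[0].title()  # assumes the first column is a name
        key = tuple(values)
        if key in seen:
            continue  # skip exact duplicates of a row already kept
        seen.add(key)
        cleaned.append(dict(zip(header, values)))
    return cleaned

rows = clean_rows(RAW)
for row in rows:
    print(row)
```

To run the same steps on a real file, the `io.StringIO(text)` wrapper could be swapped for a file opened with `open(path, newline="")`, which is the form the `csv` module expects.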