Interest in the search term ‘data science,’ as measured by Google, over the last five years. Source: https://trends.google.com/trends/explore?date=today%205-y&q=data%20science

In the tech industry, new skills and roles emerge faster than traditional education can keep up with. A recent example is the field of data science and the associated profession, Data Scientist.

The simplest definition of the data science field is the practice of collecting, analyzing, and interpreting data — aided by technology. Most Computer Science programs do not yet offer Data Science as a major and, as such, many Data Scientists are self-taught. For this reason, it is possible to become a Data Scientist without a formal degree. This article will explore what it’s like to be a Data Scientist, the skillset required, and how to acquire these skills using mostly free or cheap online resources.

About Me

I started my career as a non-technical Product Manager working closely with teams of software engineers. I had always thought of programming as a kind of superpower that gave my colleagues the ability to conjure up new products and ideas using nothing more than a computer and an internet connection. I decided to learn this skill and embarked on a journey to learn software engineering (with the help of a bootcamp and a whole bunch of self-learning and practice!).

I now work as a software engineer in Melbourne, Australia. I’m very interested in the potential of data science and see it as a skill that any software engineer, math geek, or budding statistician can learn how to practice. This article is based on my own experience learning data science techniques, as well as the experiences of people who have broken into the field.

How do data scientists differ from statisticians?

High-profile statistician Nate Silver argues that data scientists are no different from statisticians. This is probably true for leading statisticians in the field who use technological tools and programming languages to make sense of ever-larger data repositories. However, while all data scientists use these tools, not all statisticians do — this is the key difference between the two roles.

Another factor is the differing contexts in which statisticians and data scientists ply their trade. Statisticians have worked in all sorts of industries for many years, while data scientists are primarily found in the tech industry or in companies with a well-developed IT component. The prevalence of data scientists in the tech industry is likely due to the ability of tech companies to collect, store, and make sense of huge volumes of data — a capability that many traditional companies haven’t yet been able to master.

In practical terms, data scientists and statisticians differ on another important metric: salary. According to PayScale, the median salary for Data Scientists in the United States is $91,000, $19,000 higher than the median salary of statisticians ($72,000). While it may be true that data scientists and statisticians often do similar kinds of work, data scientists receive much higher financial compensation for doing so.

What do Data Scientists actually do?

At a high level, Data Scientists use mathematics, programming tools and techniques, software, and statistical methods to derive insights from data. In interviews with several Data Scientists, some of the things they reported doing day-to-day included:

Extracting, storing, and analyzing salary figures from job announcements

Simulating the spread of an epidemic

Leveraging industrial psychology to create better HR models

Dissecting data to obtain risk groups for low-socioeconomic status students

Using data, models, and analytics to make decisions on how to sell products more effectively

The Skills Data Scientists Need (and How to Learn Them)

Mathematics

The amount of mathematical skill required to be an effective Data Scientist is hotly debated. Some argue that deep mathematical knowledge is required, while others argue that since most statistical analyses are carried out via programming libraries like NumPy anyway, math knowledge is less important than you’d think. DataScienceWeekly offers a list of the minimum mathematical concepts you should be comfortable with in order to be a successful Data Scientist.

Even if you didn’t enjoy math at school, you may find that you enjoy it more in a data science context. The data in statistics represent real world concepts, unlike the numbers in many traditional math problems. For those with a practical bent, drawing insights from data about the prevalence of real world phenomena may be a more meaningful way to engage with mathematics than “solving for x.”

Programming

The ability to program helps data scientists in a variety of ways. They can write scripts to automate one of the most time-consuming tasks in data science: cleaning and preparing data for analysis. They can also write scripts to transform data from one format to another, such as turning the result of an SQL query into a neatly formatted CSV report or, conversely, persisting CSV data to a relational database. In most cases, data analysis is carried out using purpose-built libraries, such as pandas, that abstract away many of the repetitive or complex calculations involved, while Matplotlib can be used to visualize the results.
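As a minimal sketch of this kind of format-to-format transformation (the table, columns, and data are all invented for illustration), the following script persists CSV records to a relational database and then exports a query result back out as a CSV report, using only Python’s standard library:

```python
import csv
import io
import sqlite3

# Hypothetical sales records, as they might arrive in a CSV export.
raw_csv = """date,amount,region
2018-01-05,120.50,North
2018-01-06,89.99,South
2018-01-07,240.00,North
"""

# Persist the CSV rows to an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, amount REAL, region TEXT)")
rows = list(csv.DictReader(io.StringIO(raw_csv)))
conn.executemany("INSERT INTO sales VALUES (:date, :amount, :region)", rows)

# Transform a query result back into a neatly formatted CSV report.
query = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
report = io.StringIO()
writer = csv.writer(report)
writer.writerow(["region", "total"])
writer.writerows(conn.execute(query))
print(report.getvalue())
```

In real projects, a library like pandas would handle both steps (`read_sql` and `to_csv`) in a couple of lines; the point here is just to show the shape of the task being automated.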

A 2017 reader poll at KDnuggets (a popular data science website) showed that the lingua franca of the Data Science field is Python, closely followed by the R programming language. Python’s dominance is largely due to the number of essential data analysis and visualization libraries written in Python (NumPy, pandas, Matplotlib, scikit-learn, etc.).

Long-time Python programmer Michael R. Bernstein suggests that Python was rapidly adopted by statisticians and scientists in the 1990s and early 2000s, giving the language and its libraries a significant head start in these fields compared to rival programming languages.

Another language that is extremely popular with Data Scientists is R. Unlike Python, which is a general purpose programming language, R was created specifically with statistical computing and graphics tasks in mind. Most Data Scientists entering the field today will be expected to be comfortable with one or the other, but which one should you choose?

Being comfortable with both R and Python is ideal, as each language and its associated ecosystem of libraries have different strengths and weaknesses. Quartz magazine’s former data editor, Chris Groskopf, uses both languages. Groskopf has said he prefers Python for data manipulation and repeated tasks and R for ad hoc analysis and data set exploration.

If you only want to learn one of the two, I recommend Python and its ecosystem. My reasoning is that Python is a tool with wider applications than R. You can use Python for all kinds of things: from administering servers, to building web applications, to creating games.

R is much more difficult to adapt to use cases outside of its core focus on statistics and visualization. However, I’d recommend you try out a few basic tutorials on both languages and see which one you prefer. Ultimately, the tool you choose matters less than your skill with the tool, and you’re much more likely to become skillful at wielding a tool that you enjoy using.

Where to learn it

UC San Diego offers a free online course on Python for Data Science, which includes coverage of essential libraries like pandas, NumPy, and Matplotlib. Microsoft offers a similar online course teaching the R programming language for data science.

Machine learning

Machine learning is finding increasing application in the world of data science. It is the means by which computers can learn (and improve at) tasks without being explicitly programmed. Machine learning techniques can be used to make decisions and predictions based on data and have many applications in the field of data science.

Imagine that you are a Data Scientist working for a large online marketplace that is struggling to deal with an increasing number of fraudulent transactions. By the time a fraudulent transaction is discovered, it is usually too late and the damage has been done. Your company has recorded as much information as it can about the users, circumstances, and behaviors behind each fraudulent transaction. You are tasked with coming up with a way to prevent fraudulent transactions before they occur (e.g. freezing a transaction, subject to manual review).

As a Data Scientist working without machine learning, you would analyze the available data about past fraudulent transactions and look for patterns. For example, you might cluster the data and notice that transactions originating from a particular geographic location, purchasing products in a specific category, and/or using a particular payment method, are very likely to be fraudulent. Your Software Engineering team would then likely build a system to flag such transactions for manual review.

You could use machine learning to tackle the same problem by using records of both fraudulent and non-fraudulent transactions as training data to build a model. Using this model, the algorithm can identify patterns in fraudulent transactions that might be more nuanced and complex than human pattern matching could identify.

For example, a machine learning algorithm might detect patterns in variables that a human could miss, such as the time of day when fraudulent transactions are most likely to occur. Most powerfully, the algorithm can be adapted to rapidly predict whether an incoming transaction is likely to be fraudulent. Your Software Engineering team could use this prediction to handle the transaction accordingly, by freezing it and flagging it for review.
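To make the idea concrete, here is a deliberately tiny sketch of the fraud scenario above. All of the data is invented, and a nearest-centroid classifier written in plain Python stands in for what a real library such as scikit-learn would provide; it is an illustration of the train-then-predict pattern, not a production fraud model:

```python
from statistics import mean

# Illustrative training data: each transaction is (amount, hour_of_day),
# labelled True if it turned out to be fraudulent. Values are invented.
transactions = [
    ((950.0, 3), True), ((880.0, 2), True), ((990.0, 4), True),
    ((40.0, 14), False), ((25.0, 12), False), ((60.0, 16), False),
]

# "Train" by computing the centroid (average point) of each class --
# a toy stand-in for fitting a real machine learning model.
def centroid(label):
    points = [x for x, y in transactions if y == label]
    return tuple(mean(axis) for axis in zip(*points))

fraud_c, legit_c = centroid(True), centroid(False)

def predict(tx):
    """Classify a new transaction by its nearer class centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(tx, c))
    return sq_dist(fraud_c) < sq_dist(legit_c)

print(predict((900.0, 3)))   # True: resembles past fraud
print(predict((30.0, 13)))   # False: resembles legitimate activity
```

A real model would use many more features and a far more sophisticated algorithm, but the workflow is the same: fit on labelled historical transactions, then rapidly score each incoming transaction so that suspicious ones can be frozen and flagged for review.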

Because of the capabilities offered by machine learning, it’s becoming an integral part of data science. Familiarity with the basics of machine learning, and when they can be useful, will help you in your career as a Data Scientist.

Where to learn it

One of the most popular ways to learn the basics of machine learning is this course from Stanford University professor Andrew Ng.

SQL

SQL, or Structured Query Language, is a language used for interacting with relational databases. Worldwide, the majority of data is stored in relational databases. To work with this data, you need to be able to query the database to extract the data you need. This is why understanding the fundamentals of SQL is essential as a Data Scientist.
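The core of that fundamental skillset is filtering, aggregating, and ordering. As a small self-contained sketch (the table and data are invented), Python’s built-in sqlite3 module lets you practice these queries without installing a database server:

```python
import sqlite3

# Build a tiny in-memory database (table and data are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, signup_year INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Alice", 2016), ("Bob", 2017), ("Carol", 2017)],
)

# Fundamental SQL in one query: aggregate with COUNT and GROUP BY,
# then order the results.
query = """
    SELECT signup_year, COUNT(*) AS n
    FROM users
    GROUP BY signup_year
    ORDER BY signup_year
"""
for year, n in conn.execute(query):
    print(year, n)
```

The same SELECT / WHERE / GROUP BY building blocks carry over directly to the production databases (PostgreSQL, MySQL, and so on) you’ll query on the job.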

Where to learn it

SQL Zoo is a free SQL tutorial with fun practical exercises.

Software

Software packages used by Data Scientists include Tableau, Microsoft Excel, RapidMiner, and KNIME. You may be surprised to see Excel on this list, but CSV reports are sometimes the only common language between Data Scientists and the business at large (in 2016, Excel was almost as commonly used as SQL among Data Scientists).

If you are trying to become a Data Scientist, the only software package that you must be comfortable with is Excel. This is simply because it is guaranteed to be used at any given company you might apply to, while other software packages, such as Tableau and RapidMiner, may not be. It is worth noting that you will likely be using Excel as a communication tool to share your results, rather than solely doing data analysis inside Excel directly. As a Data Scientist, you will often be working with data sets that are too large to be analyzed using Excel alone.

Where to learn it

Udemy offers a number of courses teaching advanced Microsoft Excel skills. You should be comfortable both with doing data analysis and reporting on data with Excel.

Statistical Methods

A strong understanding of statistics is probably the most important skillset for Data Scientists. Simply put, all of the programming, mathematical, and software skills in the world will not help you if you don’t understand how to analyze and report on statistics accurately and fairly.

For example, if you don’t understand when it is appropriate to report on the median or the mean for a given set of values, you may produce output that is skewed by outliers, and as such, tells a misleading story. If you don’t understand the theory behind confidence intervals, appropriate sample size, and statistical significance, you may end up making definitive claims that should, in fact, be estimates.

All good Data Scientists differ in their skills and chosen technologies, but one thing they all share is a deep understanding of statistics.

If you don’t yet know a programming language, my suggestion is to learn the foundations of statistics first, without using programming libraries. Programming libraries like NumPy abstract away the internals of how a statistic is derived and make it too easy to get a result that you accept without really understanding it. Take a basic statistics course that focuses on calculating statistics by hand or using statistical software like IBM’s SPSS. You should also learn how to fairly, accurately, and clearly report on statistics.
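As a sketch of what “calculating by hand” looks like (with invented data), here is the population standard deviation computed step by step, then checked against the standard library function that would normally hide those steps:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Compute the population standard deviation "by hand" so each step
# of the formula is visible.
m = sum(data) / len(data)                    # the mean
squared_devs = [(x - m) ** 2 for x in data]  # squared deviations
std_by_hand = math.sqrt(sum(squared_devs) / len(data))

print(std_by_hand)               # 2.0
print(statistics.pstdev(data))   # 2.0 -- the library agrees
```

Working through the formula yourself once or twice makes it much harder to accept a library’s output without understanding what it represents.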

Where to learn it

You can learn about statistics and probability for free at Khan Academy. Another excellent resource is Andy Field’s book Discovering Statistics Using IBM SPSS Statistics, 4th Edition.

Getting Your First Job as a Data Scientist

Based on my research, the likelihood of being considered for an entry-level Data Scientist role at a company boils down to two factors: your education and your demonstrated skills. Experience is usually a factor too, but we’ll assume you don’t have any since you’re reading an article about breaking into the industry.

Any gaps in your education need to be compensated for by showcasing the depth of your skills. For example, someone who is completely self-taught will likely need an impressive portfolio of projects to compensate. Conversely, someone with a degree that is respected in the Data Science field, such as a mathematics or Computer Science degree, will likely need to demonstrate fewer applied skills to be considered for a role.

As always, these general rules apply to the majority of companies, but probably don’t apply to big tech companies like Google and Facebook. These companies are extremely competitive and are likely to expect a relevant post-graduate degree from a prestigious university, coupled with personal projects.

I’ve developed a simple points system to help determine where you fit. For most companies, a minimum of three points will be required to be considered for an entry-level role doing data science:

Portfolio of many interesting data science projects showcasing applied skills: 3 points

Degree in a strongly relevant field (mathematics, computer science, economics, statistics): 3 points

General degree (humanities, social sciences): 2 points

Completed a Data Science bootcamp or intensive course: 2 points

You’ve completed a couple of interesting data science projects you can share and talk about: 1 point

No personal data science projects: 0 points

No tertiary degree: 0 points

Keep in mind that some companies will refuse to hire a candidate who doesn’t have a tertiary degree. That being said, some companies will claim on their job ads that only candidates with tertiary degrees will be considered, but will nix this requirement if they like the candidate enough. My advice is to apply to these companies anyway... just in case.

Developing a Portfolio

From the above, you can see that an applied data science portfolio can be just as powerful as a relevant tertiary degree. Simply put, a relevant degree demonstrates that you have the potential to practice data science. A portfolio shows that you’re already doing it.

I suggest building up a data science portfolio before you apply for jobs in the field. One interesting project that you can talk about at length is better than many tiny projects that you’ve never really finished.

An excellent way to start building up your portfolio is to solve some of the challenges and competitions available at Kaggle. If possible, try to track your scripts and output using GitHub so recruiters can see how you approached solving the problem. Often, the way you go about solving the problem is much more important than coming up with the ‘right’ answer.

Lastly, put effort into how you present your results. A bunch of Python scripts alone is much less impressive than a PDF report including a clear summary of results and relevant visualizations of the data.

Data Science Interviews

This list of 20 questions to detect fake data scientists provides an example of the kinds of interview questions you might field during an interview. As you can see, common themes are: the relative strengths and weaknesses of particular techniques, when to use particular techniques, the appropriate approach to a given problem, and how to use and report on statistics appropriately. You can also review this list of 109 commonly asked data science interview questions. Keep in mind, however, that these questions may not all be appropriate for entry-level candidates.

You should be prepared to complete a data science challenge before, or during, the interview process. If your interview room has a white board (it should!), make use of it to communicate your ideas and build a shared understanding with the interviewer — even if you haven’t explicitly been given a “solve on the whiteboard” challenge.

Share Your Journey

If you do decide to enter the field of Data Science, I wish you the best of luck.

Are you interested in becoming a Data Scientist? If you’re currently working as a Data Scientist, how did you break into the industry? We’d love to hear about your journey in the comments below.