Completing your first project is a major milestone on the road to becoming a data scientist: it both reinforces your skills and gives you something to discuss during the interview process. It can be fun to sift through dozens of data sets to find the perfect one, but it can also be frustrating to download and import several CSV files, only to realize that the data isn’t that interesting after all. Luckily, there are online repositories that curate data sets and weed out the uninteresting ones.

1. AWS Public Data Sets

Amazon makes large data sets available on its Amazon Web Services platform. You can download the data and work with it on your own computer, or analyze the data in the cloud using EC2 and Hadoop via EMR. You can read more about how the program works here, and check out the data sets for yourself here (although you’ll need a free AWS account first).

Here are some examples:

Lists of n-grams from Google Books — common words and groups of words from a huge set of books.

Common Crawl Corpus — data from a crawl of over 5 billion web pages.

Landsat images — moderate resolution satellite images of the surface of the Earth.

2. Google Public Data sets

Google also has a cloud hosting service, which is called Google Cloud. With Google Cloud, you can use a tool called BigQuery to explore large data sets. Google lists all of the data sets on this page. You’ll need to sign up for a Google Cloud account to see it, but the first 1TB of queries you make each month are free, so as long as you’re careful, you won’t have to pay anything.

Here are some examples:

USA Names — contains all Social Security name applications in the US, from 1879 to 2015.

Github Activity — contains all public activity on over 2.8 million public Github repositories.

Historical Weather — data from 9000 NOAA weather stations from 1929 to 2016.
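
Queries against these public data sets are ordinary SQL, and staying inside the free tier is just a matter of watching how many bytes your queries bill. The sketch below assumes the `google-cloud-bigquery` client library and a table name of `bigquery-public-data.usa_names.usa_1910_current` (check the public data set listing for the exact name); the budgeting helper is plain arithmetic and runs without any credentials.

```python
# Sketch of querying a BigQuery public data set while staying inside the
# free tier described above. The table name in the SQL is an assumption;
# verify it against the public data set listing.

TB = 1024 ** 4  # bytes in a terabyte (BigQuery bills in bytes processed)

USA_NAMES_QUERY = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`  -- assumed table name
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""

def bytes_remaining_in_free_tier(bytes_used_this_month: int, free_tier: int = TB) -> int:
    """How many billable bytes are left before queries start costing money."""
    return max(free_tier - bytes_used_this_month, 0)

def run_query(sql: str):
    """Requires `pip install google-cloud-bigquery` plus configured Google
    Cloud credentials; defined here but not executed."""
    from google.cloud import bigquery  # deferred so the sketch runs without it
    client = bigquery.Client()
    return list(client.query(sql).result())

# 200 GiB used so far this month -- plenty of room left.
print(bytes_remaining_in_free_tier(200 * 1024 ** 3))
```

Before running an expensive query for real, you can also ask BigQuery for a dry-run estimate of bytes processed and feed that number into the helper above.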

3. Google Books Ngrams

If you’re interested in truly massive data, the Ngram viewer data set counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.

4. Wikipedia

Wikipedia is a free, online, community-edited encyclopedia with an astonishing breadth of knowledge, covering everything from the Ottoman-Habsburg Wars to Leonard Nimoy. As part of its commitment to advancing knowledge, Wikipedia offers all of its content for free and regularly generates dumps of all the articles on the site. Additionally, Wikipedia offers edit history and activity data, so you can track how a page on a topic evolves over time, and who contributes to it. Methods and a how-to guide for downloading the data are available here.

Here are some examples:

All images and other media from Wikipedia — all the images and other media files on Wikipedia.

Full site dumps — of the content on Wikipedia, in various formats.

Wikipedia provides instructions for downloading the text of English-language articles, in addition to other projects from the Wikimedia Foundation.
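
The dumps themselves are large compressed XML files made up of repeated page elements. As a toy illustration, here is a simplified stand-in for that structure parsed with Python’s standard library; real dumps wrap everything in an XML namespace and run to tens of gigabytes, so in practice you would stream them with `ElementTree.iterparse` over a bz2 file object rather than load them whole.

```python
import xml.etree.ElementTree as ET

# A simplified, namespace-free stand-in for the dump format.
SNIPPET = """<mediawiki>
  <page>
    <title>Leonard Nimoy</title>
    <revision><text>American actor ...</text></revision>
  </page>
  <page>
    <title>Ottoman-Habsburg wars</title>
    <revision><text>A series of conflicts ...</text></revision>
  </page>
</mediawiki>"""

def page_titles(xml_text: str) -> list[str]:
    """Extract the <title> of every <page> element."""
    root = ET.fromstring(xml_text)
    return [page.findtext("title") for page in root.iter("page")]

print(page_titles(SNIPPET))
```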

5. FiveThirtyEight

If you’re interested in data at all, you’ve almost certainly heard of FiveThirtyEight; it’s one of the best-established data journalism outlets in the world. They write interesting data-driven articles, like “Don’t blame a skills gap for lack of hiring in manufacturing” and “2016 NFL Predictions”.

What you may not know is that FiveThirtyEight also makes the data sets used in its articles available online on Github and on its own data portal.

Here are some examples:

Airline Safety — contains information on accidents from each airline.

US Weather History — historical weather data for the US.

Study Drugs — data on who’s taking Adderall in the US.
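
Because these data sets are plain CSV files on GitHub, you can load one straight into pandas with `read_csv` pointed at the raw file URL. To keep the sketch runnable offline, it uses an inline sample whose column names are assumptions about the airline-safety file’s layout.

```python
import io
import pandas as pd

# Offline stand-in for something like:
#   pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/"
#               "master/airline-safety/airline-safety.csv")
# The column names below are assumptions about that file's layout.
SAMPLE = io.StringIO("""airline,avail_seat_km_per_week,incidents_85_99
Aer Lingus,320906734,2
Aeroflot,1197672318,76
""")

df = pd.read_csv(SAMPLE)
# Incidents per billion available seat-kilometres: a simple normalized rate
# that makes airlines of different sizes comparable.
df["incident_rate"] = df["incidents_85_99"] / (df["avail_seat_km_per_week"] / 1e9)
print(df[["airline", "incident_rate"]])
```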

6. Walmart

Walmart has released historical sales data for 45 stores located in different regions across the United States.

7. Twitter

Twitter has a good streaming API that makes it relatively straightforward to filter and stream tweets. You can get started here. There are tons of options — you could figure out which states are the happiest, or which countries use the most complex language. If you’d like some help getting started with the Twitter API, check out our tutorial here.
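
Whatever client library you use to consume the stream, the filtering step itself is plain logic over tweet objects. The minimal dicts below are hypothetical stand-ins (the exact fields vary by API version), so this sketch runs without any network access.

```python
# The streaming client delivers tweets as JSON objects; the exact fields vary
# by API version, so the minimal dicts below are hypothetical examples.
def matches(tweet: dict, keywords: set[str]) -> bool:
    """True if any keyword appears in the tweet text (case-insensitive)."""
    text = tweet.get("text", "").lower()
    return any(kw in text for kw in keywords)

stream = [
    {"text": "Data science is fun", "place": "CA"},
    {"text": "Nothing to see here", "place": "NY"},
]
happy = [t for t in stream if matches(t, {"fun", "happy"})]
print(len(happy))  # one tweet matches
```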

8. Github

GitHub has an API that allows you to access repository activity and code. You can get started with the API here. The options are endless — you could build a system to automatically score code quality, or figure out how code evolves over time in large projects.
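
As a first look at repository activity, you can tally the event types the API returns. The live endpoint (`/repos/{owner}/{repo}/events`) returns a JSON list of events with a "type" field; the sample payload below is a hypothetical stand-in so the sketch runs offline.

```python
from collections import Counter

# A real call would look like:
#   import requests
#   events = requests.get("https://api.github.com/repos/pandas-dev/pandas/events").json()
# The sample below is a hypothetical stand-in with the same "type" field.
sample_events = [
    {"type": "PushEvent"},
    {"type": "PushEvent"},
    {"type": "IssuesEvent"},
]

def activity_breakdown(events: list) -> Counter:
    """Count how often each event type occurs in a list of API events."""
    return Counter(e["type"] for e in events)

print(activity_breakdown(sample_events))
```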

9. The World Bank

The World Bank is a global development organization that offers loans and advice to developing countries. The World Bank regularly funds programs in developing countries, then gathers data to monitor the success of these programs. You can browse World Bank data sets directly, without registering. The data sets have many missing values (which is great for cleaning practice), and it sometimes takes several clicks to actually get to the data.

Here are some examples:

World Development Indicators — contains country level information on development.

Educational Statistics — data on education by country.

World Bank project Costs — data on World Bank projects and their corresponding costs.
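
Those missing values are exactly what makes this good cleaning practice. A minimal pandas sketch, on an illustrative frame shaped like indicator data (the numbers are invented for illustration, not real WDI figures):

```python
import numpy as np
import pandas as pd

# Illustrative values only -- real figures come from the WDI download.
wdi = pd.DataFrame({
    "country": ["Ghana", "Ghana", "Peru", "Peru"],
    "year": [2014, 2015, 2014, 2015],
    "gdp_growth": [4.0, np.nan, 2.4, np.nan],
})

# Two common cleaning moves: drop rows with no value, or forward-fill
# within each country so a year inherits the last known figure.
dropped = wdi.dropna(subset=["gdp_growth"])
filled = wdi.copy()
filled["gdp_growth"] = filled.groupby("country")["gdp_growth"].ffill()

print(len(dropped), filled["gdp_growth"].isna().sum())
```

Which move is right depends on the analysis: dropping keeps only observed values, while forward-filling assumes an indicator changes slowly between reports.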

10. Reddit – /r/datasets

Reddit, a popular community discussion site, has a section devoted to sharing interesting data sets. It’s called the datasets subreddit, or /r/datasets. The scope and quality of these data sets vary a lot, since they’re all user-submitted, but they are often very interesting and nuanced. You can browse the subreddit here without an account (although a free account will be required to comment or submit data sets yourself). You can also see the most highly-upvoted data sets of all time here.

Here are some examples:

All Reddit Submissions — contains reddit submissions through 2015.

Jeopardy Questions — questions and point values from the gameshow Jeopardy.

New York City Property Tax Data — data about properties and assessed value in New York City.

Reddit released a really interesting data set of every comment that has ever been made on the site. It’s over a terabyte of data uncompressed.

11. Academic Torrents

Academic Torrents is a data aggregator geared toward sharing the data sets from scientific papers. It has all sorts of interesting (and often massive) data sets, although it can sometimes be difficult to get context on a particular data set without reading the original paper and/or having some expertise in the relevant domains of science. You can browse the data sets directly on the site. Since it’s a torrent site, all of the data sets can be immediately downloaded, but you’ll need a BitTorrent client. Deluge is a good free option that’s available for Windows, Mac, and Linux.

Here are some examples:

Enron Emails — a set of many emails from executives at Enron, a company that famously went bankrupt.

Student Learning Factors — a set of factors that measure and influence student learning.

News Articles — contains news article attributes and a target variable.

12. Quantopian

Quantopian is a site where you can develop, test, and optimize stock trading algorithms. In order to help you do that, the site gives you access to free minute-by-minute stock price data, which you can use to build a stock price prediction algorithm.
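
As a flavor of what you might test against minute-by-minute prices, here is a minimal moving-average crossover sketch in pandas. The price series is synthetic, invented for illustration; real bars would come from the platform’s data.

```python
import pandas as pd

# Synthetic minute-by-minute closing prices -- not real market data.
idx = pd.date_range("2016-01-04 09:30", periods=8, freq="min")
close = pd.Series([10.0, 10.1, 10.2, 10.1, 10.3, 10.4, 10.2, 10.6], index=idx)

# Classic moving-average crossover: go long when the fast (short-window)
# average sits above the slow (long-window) one.
fast = close.rolling(2).mean()
slow = close.rolling(4).mean()
signal = (fast > slow).astype(int)  # 1 = long, 0 = flat

print(signal.sum(), "long minutes out of", len(signal))
```

A real strategy would need transaction costs, slippage, and out-of-sample testing on top of this; the point here is only the shape of the computation.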

13. Wunderground

Wunderground has an API for weather forecasts that is free up to 500 API calls per day. You could use these calls to build up a set of historical weather data, and then use that to make predictions about the weather tomorrow.
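
With only 500 calls a day, it pays to cache every response so repeat look-ups are free. A minimal sketch of that idea, where `fake_fetch` is a hypothetical stand-in for the real HTTP call:

```python
# Cache responses keyed by (station, date) and track the daily call budget.
CALL_BUDGET = 500

class WeatherCache:
    def __init__(self, fetch, budget=CALL_BUDGET):
        self.fetch = fetch      # function that does the real HTTP call
        self.budget = budget    # API calls left today
        self.cache = {}

    def get(self, station: str, date: str):
        key = (station, date)
        if key not in self.cache:
            if self.budget <= 0:
                raise RuntimeError("daily API budget exhausted")
            self.cache[key] = self.fetch(station, date)
            self.budget -= 1
        return self.cache[key]

# Hypothetical stand-in for the real API call.
def fake_fetch(station, date):
    return {"station": station, "date": date, "temp_f": 61}

wc = WeatherCache(fake_fetch, budget=2)
wc.get("KSFO", "2016-07-01")
wc.get("KSFO", "2016-07-01")  # served from cache; budget untouched
print(wc.budget)  # 1 call left
```

Persist the cache to disk between runs and, over a few weeks of daily calls, you accumulate the historical set the paragraph above describes.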

14. Data.world

Data.world is a user-driven data collection site (among other things) where you can search for, copy, analyze, and download data sets. You can also upload your own data to data.world and use it to collaborate with others.

The site includes some key tools that make working with data from the browser easier. You can write SQL queries within the site interface to explore data and join multiple data sets. They also have SDKs for R and Python that make it easier to acquire and work with data in your tool of choice (and you might be interested in reading our tutorial on using the data.world Python SDK.) All of the data is accessible from the main site, but you’ll need to create an account, log in, and then search for the data you’d like.
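
The SQL you type into the site’s query box is ordinary SQL, so you can prototype the same join locally. Here it is run with Python’s built-in sqlite3 on toy tables (the table names, columns, and values are invented for illustration):

```python
import sqlite3

# The same kind of join you'd write in a data.world query, run locally
# on toy tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities (name TEXT, country TEXT);
CREATE TABLE health (city TEXT, life_expectancy REAL);
INSERT INTO cities VALUES ('Boston', 'US'), ('Madrid', 'ES');
INSERT INTO health VALUES ('Boston', 80.2), ('Madrid', 83.6);
""")
rows = conn.execute("""
    SELECT c.name, c.country, h.life_expectancy
    FROM cities c JOIN health h ON h.city = c.name
    ORDER BY h.life_expectancy DESC
""").fetchall()
print(rows)
```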

Here are some examples:

Climate Change Data — a large set of climate change data from the World Bank.

European soccer data — data on soccer/football in 11 European countries from 2008-2016.

Big Cities Health — health data for major cities in the US.

15. Data.gov

Data.gov is an aggregator of public data sets from a variety of US government agencies, as part of a broader push towards more open government. Data can range from government budgets to school performance scores. Much of the data requires additional research, and it can sometimes be hard to figure out which data set is the “correct” version. Anyone can download the data, although some data sets will ask you to jump through additional hoops, like agreeing to licensing agreements before downloading. You can browse the data sets on Data.gov directly, without registering. You can browse by topic area, or search for a specific data set.

Here are some examples:

Food Environment Atlas — contains data on how local food choices affect diet in the US.

School System Finances — a survey of the finances of school systems in the US.

16. Kaggle

Kaggle is a data science community that hosts machine learning competitions, with a variety of interesting, externally-contributed data sets on the site. Kaggle runs both live and historical competitions; each competition has its own associated data set, and to download it you’ll need to sign up for Kaggle and accept that competition’s terms of service. There are also user-contributed data sets available here, though these may be less well cleaned and curated than the data sets used for competitions.

Here are some examples:

Satellite Photograph Order — a set of satellite photos of Earth — the goal is to predict which photos were taken earlier than others.

Manufacturing Process Failures — a collection of variables that were measured during the manufacturing process. The goal is to predict faults with the manufacturing.

Multiple Choice Questions — a data set of multiple choice questions and the corresponding correct answers. The goal is to predict the answer for any given question.

17. UCI Machine Learning Repository

The UCI Machine Learning Repository is one of the oldest sources of data sets on the web. Although the data sets are user-contributed, and thus have varying levels of documentation and cleanliness, the vast majority are clean and ready for machine learning to be applied. UCI is a great first stop when looking for interesting data sets. You can download data directly from the UCI Machine Learning Repository, without registration. These data sets tend to be fairly small, and don’t have a lot of nuance, but they’re great for machine learning.

Here are some examples:

Email spam — contains emails, along with a label of whether or not they’re spam.

Wine classification — contains various attributes of 178 different wines.

Solar flares — attributes of solar flares, useful for predicting characteristics of flares.
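
Several of the small UCI data sets also ship bundled with scikit-learn, so you can prototype without downloading anything. A minimal sketch using the wine data mentioned above (assuming scikit-learn is installed):

```python
# The UCI wine data set, loaded via scikit-learn's bundled copy.
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)  # 178 wines, 13 chemical attributes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(X.shape, round(model.score(X_test, y_test), 2))
```

This is the typical workflow on UCI-sized data: split, fit a simple baseline model, and check held-out accuracy before trying anything fancier.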

18. Quandl

Quandl is a repository of economic and financial data. Some of this information is free, but many data sets require purchase. Quandl is useful for building models to predict economic indicators or stock prices. Due to the large amount of available data, it’s possible to build a complex model that uses many data sets to predict values in another.

19. BuzzFeed

BuzzFeed may have started as a purveyor of low-quality clickbait, but these days it also does high-quality data journalism. And, much like FiveThirtyEight, it publishes some of its datasets publicly on its Github page.

Here are some examples:

Federal Surveillance Planes — contains data on planes used for domestic surveillance.

Zika Virus — data about the geography of the Zika virus outbreak.

Firearm background checks — data on background checks of people attempting to buy firearms.

20. ProPublica

ProPublica is a nonprofit investigative reporting outlet that publishes data journalism focused on issues of public interest, primarily in the US. They maintain a data store that hosts quite a few free data sets in addition to some paid ones (scroll down on that page to get past the paid ones). Many of them are actively maintained and frequently updated. ProPublica also offers five data-related APIs, four of which are accessible for free.

Here are some examples:

Political advertisements on Facebook — a free collection of data about Facebook ads that is updated daily.

Hate crime news — regularly-updated data about hate crimes reported in Google News.

Voting machine age — data on the age of voting machines that were used in the 2016 election.

21. Socrata OpenData

Socrata OpenData is a portal that contains multiple data sets that can be explored in the browser or downloaded to visualize. The offerings here are less well curated, so you’ll have to sort through what’s available to find data that’s clean and up-to-date, but the ability to look at the data in table form right in the browser is very helpful, and it has some built-in visualization tools as well.

Here are some examples:

White House staff salaries — data on what each White House staffer made in 2010.

Workplace fatalities by US state — the number of workplace deaths across the US.

22. United States Census Data

The U.S. Census Bureau publishes reams of demographic data at the state, city, and even zip code level. It is a fantastic data set for students interested in creating geographic data visualizations and can be accessed on the Census Bureau website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr R package. In general, this data is very clean and very comprehensive.
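
The API returns JSON as a list of rows whose first row is the header, which takes one small helper to turn into records. The sample payload below is a hypothetical stand-in for a real response (such as one from `api.census.gov/data/2010/dec/sf1` asking for `NAME` and `P001001` by state), so the sketch runs offline.

```python
# Census API responses are lists of rows; the first row is the header.
sample_response = [
    ["NAME", "P001001", "state"],
    ["Alabama", "4779736", "01"],
    ["Alaska", "710231", "02"],
]

def rows_to_records(payload: list) -> list:
    """Zip the header row onto each data row to get a list of dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]

records = rows_to_records(sample_response)
print(records[0]["NAME"], records[0]["P001001"])
```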

23. FBI Crime Data

The FBI crime data is fascinating and one of the most interesting data sets on this list. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.

24. CDC Cause of Death

The Centers for Disease Control and Prevention maintains a database on cause of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.

25. Medicare Hospital Quality

The Centers for Medicare & Medicaid Services maintains a database on quality of care at more than 4,000 Medicare-certified hospitals across the U.S., providing for interesting comparisons.

26. SEER Cancer Incidence

The U.S. government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors. It comes from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program.

27. Bureau of Labor Statistics

Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.

28. Bureau of Economic Analysis

The Bureau of Economic Analysis also has national and regional economic data, including gross domestic product and exchange rates.

29. IMF Economic Data

For access to global financial statistics and other data, check out the International Monetary Fund’s website.

30. Dow Jones Weekly Returns

Predicting stock prices is a major application of data analysis and machine learning. One relevant data set to explore is the weekly returns of the Dow Jones Index from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.
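
A weekly return is just the percentage change between consecutive weekly closing prices, which is a two-line computation in pandas. The daily closes below are synthetic, invented for illustration; the UCI set provides real weekly figures.

```python
import pandas as pd

# Synthetic daily closes over two business weeks -- not real market data.
days = pd.bdate_range("2016-01-04", periods=10)
close = pd.Series([100, 101, 102, 101, 103, 104, 105, 104, 106, 107.0],
                  index=days)

# Take the last close of each week, then the week-over-week percent change.
weekly_close = close.resample("W-FRI").last()
weekly_return = weekly_close.pct_change() * 100

print(weekly_return.round(2).tolist())
```

The first entry is NaN by construction, since there is no prior week to compare against.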

31. UNICEF

If data about the lives of children around the world is of interest, UNICEF is the most credible source. The organization’s public data sets touch upon nutrition, immunization, and education, among others.

32. Airbnb

Inside Airbnb offers different data sets related to Airbnb listings in dozens of cities around the world.

33. Yelp

Yelp maintains a free dataset for use in personal, educational, and academic purposes. It includes 6 million reviews spanning 189,000 businesses in 10 metropolitan areas. Students are welcome to participate in Yelp’s dataset challenge.