Doing data analysis can be fun and rewarding. It’s something I do a lot of in my free time. Without the right tools, though, it can be frustrating and extremely time-consuming. I break the process of working with data into four steps:

- Data gathering: finding and getting the dataset you are interested in
- Data cleaning: getting the data into the proper format
- Data exploration: finding trends and interesting patterns
- Data visualization: visualizing the awesome trends you’ve found

Data gathering

The process of gathering data has gotten significantly better in the last 5 to 10 years. It’s now possible to find a huge number of datasets online.

Kaggle

Kaggle introduced a new Datasets feature in 2016, and it has quickly become my favorite place to browse and explore datasets. It allows you to upload your own datasets as well as freely access others’. Many people create their own “kernels,” which are little scripts that tell a story or walk through an analysis of a particular dataset. The caveat with this source is that it’s a bit of a free-for-all: some datasets aren’t well documented, and the source of the data isn’t always clear.

A visualization of a Pokemon dataset found on Kaggle. Don’t ask me to interpret it.

Google BigQuery

Another big player that has really matured in the past few years is Google BigQuery. It hosts a number of huge, public datasets. Additionally, it’s easy to explore the data via SQL, and queries often cost only pennies.
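As a sketch of what exploring a public dataset looks like, here’s a query against `bigquery-public-data.samples.shakespeare`, one of BigQuery’s sample public datasets, which you could paste straight into the BigQuery console:

```sql
-- Find the ten most frequent words across Shakespeare's works,
-- summing per-work counts into a single total per word.
SELECT
  word,
  SUM(word_count) AS total
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY total DESC
LIMIT 10;
```

BigQuery bills by bytes scanned, so small exploratory queries like this one against modest tables are where the “pennies” pricing comes from.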

Data.gov

Data.gov is a great place to start searching for data involving the government. I’ve found the site to be somewhat hit or miss, often linking me to some unworkable government website. The US government, though, is getting serious about open data, and this is a tool that I’m sure will improve with time.

Another player in the open-data-for-government realm is Socrata. Large city governments often host their data with them; examples include NYC Open Data and the Chicago Data Portal.

Reddit

/r/datasets often has new and nifty datasets. You can also post a request for a piece of information or a dataset, and occasionally you will get a response.

Another hack I use occasionally is browsing through /r/dataisbeautiful. All OC (Original Content) posts are required to include a comment stating where the dataset came from.

Awesome Public Datasets

The GitHub repository awesome-public-datasets has links to many types of datasets, aggregated by category.

Scrapy

Sometimes the best data isn’t available via a download button or an easily accessible API. I’ve tried multiple web scrapers, and time and again I return to Scrapy. If you have programming skills, or aren’t afraid to dive into a little bit of Python, Scrapy is a very approachable web scraping tool that works well and has great documentation and tooling around it.

My favorite feature is the scrapy shell <url> command, which scrapes a web page and opens a REPL where you can run Python commands against the response until you’ve determined the set of commands needed to extract the data you’re interested in.

Google

This one is obvious, but still worth mentioning. There are loads of other resources on the web. Googling whatever you’re looking for plus the word “dataset” is a good place to start.

Freedom of Information Act (FOIA)

Last but not least, if there is some data you really want to get your hands on, you can submit a FOIA request. This law allows you to request data from any US federal agency, which is required to hand it over unless the data falls under an exemption.