Data Analysis of 10,000 AI Startups

Extracting insights from AngelList companies

Introduction

AngelList is a platform that connects startups with investors and with job candidates looking to work at startups. Its goal is to democratize the investment process, helping startups with both fundraising and talent. Whether you're looking for a job, seeking investors for a startup, or simply making connections, it's a platform everyone in the tech field should be aware of. Since the website was created in 2010, more than 4M companies, 8M investors, and at least 1M candidates have registered there.

At a time when machine learning is growing exponentially, I wanted to take a look at AI startups and do an exploratory data analysis of them and their fields of activity. How big is investment in the AI sector? How do AI startups scale? Which markets are the most promising for them?

Data Extraction

To find commonly related words, a nice tool to use is sense2vec, by Explosion AI. It's a neural network model trained on every comment posted to Reddit in 2015, which builds a semantic map using word2vec and spaCy. You can search for a word or phrase and get the terms most similar to it (I even use it to look up synonyms once in a while). So I typed in machine learning and came up with terms like:

Data Science

Natural Language Processing

Computer Vision

And dozens more. After filtering some terms out, I used the remaining ones as queries to type into AngelList's search box.

The web scraper was made using Selenium and Beautiful Soup. It creates a driver that accesses the URL (https://angel.co/companies), clicks on the search bar, and types in a specific query. Then it scrolls through every company in the list and stores its data. Since the website limits each search to 400 companies, I opted to use filters and increase the number of queries, to make sure I'd capture almost all companies related to each term.

Angel Scraper
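The parsing half of such a scraper can be sketched as follows. This is a minimal illustration, not the scraper above: the HTML snippet, CSS class names, and field layout are all hypothetical stand-ins, since AngelList's real markup differs. In the actual pipeline, Selenium drives the browser (typing the query, scrolling the results) and hands `driver.page_source` to a parsing function like this one.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified results-page markup; the real AngelList
# pages use different structure and class names.
SAMPLE_HTML = """
<div class="company">
  <div class="name">Acme AI</div>
  <div class="pitch">Machine learning for everyone</div>
  <div class="raised">$1.2M</div>
</div>
<div class="company">
  <div class="name">VisionCo</div>
  <div class="pitch">Computer vision as a service</div>
  <div class="raised">$300K</div>
</div>
"""

def parse_companies(html):
    """Extract one record per company card from a results page."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.select("div.company"):
        records.append({
            "name": card.select_one(".name").get_text(strip=True),
            "pitch": card.select_one(".pitch").get_text(strip=True),
            "raised": card.select_one(".raised").get_text(strip=True),
        })
    return records

print(parse_companies(SAMPLE_HTML))
```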

After removing duplicates, the result was a CSV file containing 10,139 unique data points, with features like:

‘name’ → Name of the company

‘joined’ → Date the company joined AngelList

‘type’ → Company type (Startup, Private Company, Incubator…)

‘location’ → City where the company is based

‘market’ → Company’s field of activity (E-Commerce, Games…)

‘pitch’ → Company’s slogan

‘raised’ → Amount raised by the company in investments

‘tech’ → Main programming language (Python, Javascript…)
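Combining the per-query results and dropping duplicate companies might look roughly like this. It's a sketch under assumptions: the in-memory CSVs stand in for the scraper's output files, and deduplicating on the company name is one reasonable choice of key.

```python
from io import StringIO

import pandas as pd

# Stand-ins for the per-query CSV files produced by the scraper;
# the same company can appear under several queries.
csv_a = StringIO(
    "name,joined,market\n"
    "Acme AI,2015-03-01,Data Science\n"
    "VisionCo,2016-07-10,Computer Vision\n"
)
csv_b = StringIO(
    "name,joined,market\n"
    "Acme AI,2015-03-01,Data Science\n"
    "NLP Labs,2017-01-20,Natural Language Processing\n"
)

# Concatenate all query results, then keep one row per company.
df = pd.concat([pd.read_csv(f) for f in (csv_a, csv_b)], ignore_index=True)
df = df.drop_duplicates(subset="name").reset_index(drop=True)

print(len(df))  # number of unique companies
```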

Data Analysis

Before looking for insights in the data, I had to clean and pre-process it to make it useful for analysis. That included steps like formatting dates, normalizing text, and converting money strings to floats. After that, I imported the Geopy library to extract coordinate information from the location column, so that we can work with latitudes and longitudes later on. Here’s a sample of the processed data frame:
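The date-formatting and money-conversion steps can be sketched like this (the raw formats, such as "$1.2M" and "Mar 2015", are assumptions about what the scraped strings look like; the Geopy geocoding step is omitted since it requires network calls):

```python
import pandas as pd

def raised_to_float(raw):
    """Convert a string like '$1.2M' or '$300K' to a float in dollars."""
    if not isinstance(raw, str) or not raw.strip():
        return float("nan")
    raw = raw.strip().lstrip("$").replace(",", "")
    multiplier = 1.0
    suffix = raw[-1].upper()
    if suffix in ("K", "M", "B"):
        multiplier = {"K": 1e3, "M": 1e6, "B": 1e9}[suffix]
        raw = raw[:-1]
    try:
        return float(raw) * multiplier
    except ValueError:
        return float("nan")

# Toy rows standing in for the scraped data.
df = pd.DataFrame({
    "joined": ["Mar 2015", "Jul 2016"],
    "raised": ["$1.2M", "$300K"],
})

df["joined"] = pd.to_datetime(df["joined"], format="%b %Y")
df["raised"] = df["raised"].map(raised_to_float)
print(df)
```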