Funderbeam has more than 160k startups in its database. This is amazing, but it also creates a problem: navigating that many startups becomes very difficult without a coherent classification that lets users filter them by industry. Recognizing this, we’ve made it one of our priorities to make finding information on startups as user-friendly as possible. Here is how we’re accomplishing this.

We started out by pulling data from various sources where tags had been applied to startups, either by the databases or by the startups themselves. This gave us thousands of different tags covering most of the 160k startups. Some tags were far too general, like “software”, while others were too specific and rarely appeared on their own. For example, tags like music video, music streaming, music chart, music entertainment, music, music label, music venues, and independent music (there are hundreds more like this) are closely related, so we cluster them together as music and audio. Other broad tags, e.g. advertising, apps, and shopping, were too trivial to be useful. In addition, some tags did not describe a company’s actual business, and some were simply singular and plural variants of the same word, like designer vs. designers.

To overcome these limitations, we used a process called ‘hierarchical clustering’ to group tags with similar characteristics into clusters. To compute the similarity between tags, we used keyword information from Wikipedia. Specifically, we compared the content of the Wikipedia articles corresponding to different tags. If our algorithm found frequently occurring keywords (terms that are specific to and representative of a given industry) shared by two Wikipedia articles, the tags were considered similar. The result is the dendrogram below.
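To make the idea concrete, here is a minimal sketch of clustering tags by the similarity of their article text. It is an illustration under stated assumptions, not our production pipeline: the short `ARTICLES` snippets are made-up stand-ins for Wikipedia articles, and the TF-IDF weighting, cosine similarity, average linkage, and `threshold` value are all choices made for this example rather than details of the method described above.

```python
import math
from collections import Counter
from itertools import combinations

# Made-up stand-ins for the Wikipedia article text behind each tag.
ARTICLES = {
    "music streaming": "music streaming service songs audio listeners playlists",
    "music label": "music label records artists songs albums audio",
    "music venues": "music venues concerts live artists audio stage",
    "advertising": "advertising marketing brands campaigns audience media",
    "ad network": "advertising network publishers campaigns media brands",
}


def tfidf_vectors(docs):
    """Map each tag to a {term: weight} vector; rarer terms weigh more."""
    counts = {tag: Counter(text.lower().split()) for tag, text in docs.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(docs)
    return {
        tag: {t: tf * (math.log(n_docs / doc_freq[t]) + 1.0) for t, tf in c.items()}
        for tag, c in counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_tags(docs, threshold=0.1):
    """Agglomerative (bottom-up) clustering with average linkage.

    Repeatedly merges the two most similar clusters and stops once the best
    remaining similarity drops below `threshold`. The recorded merges are the
    information a dendrogram would be drawn from.
    """
    vecs = tfidf_vectors(docs)
    clusters = [(tag,) for tag in docs]
    merges = []

    def avg_sim(c1, c2):
        total = sum(cosine(vecs[a], vecs[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: avg_sim(*pair))
        sim = avg_sim(c1, c2)
        if sim < threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, sim))
    return clusters, merges


clusters, merges = cluster_tags(ARTICLES)
for left, right, sim in merges:
    print(f"{sim:.2f}  {left} + {right}")
print(clusters)
```

With these toy snippets, the music-related tags and the advertising-related tags end up in separate clusters because they share no keywords. On real article text, the similarity threshold and the keyword weighting would need tuning.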