Today, we’re excited to share that LinkedIn is open-sourcing our URL-Detector Java library. LinkedIn checks hundreds of thousands of URLs every second for malware and phishing. To guarantee that our members have a safe browsing experience, all user-generated content is checked by a backend service for potentially dangerous content. As a prerequisite for checking URLs for bad content at this scale, we need to be able to extract URLs from text at scale.

URLs get sent to our service in two different ways:

As a single URL

As a large piece of text

If it is sent as a single URL, we check the URL directly through our content-validation service. If it is sent as a large piece of text, we run our URL-Detector algorithm to search the text for any potential URLs. Before we go into how the URL-Detector works and what functionality it provides, let's look at the motivation behind this project.
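To make the two paths concrete, here is a minimal sketch of the dispatch logic described above. The class and method names are hypothetical, and a naive regular expression stands in for the real URL-Detector algorithm, purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the two input paths. A naive regex stands in
// for the real URL-Detector algorithm; it is NOT the library's API.
public class UrlIntake {
    private static final Pattern NAIVE_URL =
            Pattern.compile("https?://[\\w.-]+(?:/\\S*)?");

    // Path 1: the input is already a single URL -- check it as-is.
    static boolean looksLikeSingleUrl(String input) {
        return NAIVE_URL.matcher(input.trim()).matches();
    }

    // Path 2: the input is a large piece of text -- extract candidate
    // URLs first, then each candidate can be checked individually.
    static List<String> extractCandidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = NAIVE_URL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeSingleUrl("https://example.com/a"));
        System.out.println(extractCandidates("see http://example.com and more"));
    }
}
```

A pattern this simple both over- and under-matches real-world URLs, which is exactly the limitation that motivated the detector, as discussed below.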

We want to detect as many malicious links as possible. To do this, we do not limit ourselves to URLs as defined in RFC 1738; instead, we define a URL as anything that resolves to a real site when typed into a browser's address bar. The address bar's definition of a URL is very loose, while the RFC's is very strict. And of course, there are many browsers, and different browsers behave differently, so we try to find text that would resolve in the most popular ones. Thus, it wasn't as simple as following the grammar defined in the RFC.

Initially, we started with a solution based on regular expressions. It detected many potential URLs: many were real, intentional URLs; many were not; and many were missed entirely. Catching more URLs was a very iterative process. It would start out with something as innocent as: