Creating The Web Crawler

Before you can count the number of backlinks of a given web resource, you must first have browsed the web from URL to URL. The program that carries out this work is called a Web Crawler.

Our first job is to create a Web Crawler.

To do this, I create a MyCrawler class within which I define the following properties:

A String for the root domain from which the Web Crawler will start its work.

A Boolean indicating whether the Web Crawler should be limited to URLs in the same domain as the root.

An integer indicating the maximum depth to which the search should go.

A multi-valued map that will store all discovered links in memory. Thus, if site A makes an HTML link to site B with the anchor “Site A”, the map will store these three pieces of information: the source URL, the destination URL, and the anchor.
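The properties above can be sketched as a class skeleton. This is a minimal version: the constructor signature is an assumption, and a plain JDK map of lists stands in here for the multi-valued map mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MyCrawler {
    // Root domain from which the Web Crawler starts its work
    private final String rootDomain;
    // Whether crawling is restricted to URLs in the same domain as the root
    private final boolean sameDomainOnly;
    // Maximum depth to which the exploration should go
    private final int maxDepth;
    // Source URL -> outgoing links (a map of lists used here as a
    // simple stand-in for a multi-valued map)
    private final Map<String, List<Out>> links = new HashMap<>();

    public MyCrawler(String rootDomain, boolean sameDomainOnly, int maxDepth) {
        this.rootDomain = rootDomain;
        this.sameDomainOnly = sameDomainOnly;
        this.maxDepth = maxDepth;
    }
}
```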

In order to store the outgoing link from site A to site B with the anchor “Site A”, I create an Out object:
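A minimal sketch of such an Out object might look as follows (the accessor names are assumptions):

```java
// Represents an outgoing link: the destination URL plus the anchor
// text of the <a> element pointing to it.
public class Out {
    private final String url;
    private final String anchor;

    public Out(String url, String anchor) {
        this.url = url;
        this.anchor = anchor;
    }

    public String getUrl() { return url; }
    public String getAnchor() { return anchor; }
}
```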

I then have to create a crawl method that takes as input a URL and the current depth level associated with the URL being processed.

To start, I display the URL being visited and then retrieve the domain associated with this URL. This is necessary if you want to limit the work of the Web Crawler to the domain of the root URL.
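Extracting the domain can be done with the standard java.net.URI class. This helper is hypothetical (the original may extract the domain differently), but it shows the idea:

```java
import java.net.URI;

public class DomainUtil {
    // Extract the host part of a URL,
    // e.g. "https://example.com/page" -> "example.com".
    // Returns an empty string for malformed or host-less URLs.
    public static String getDomain(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? "" : host;
        } catch (Exception e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(getDomain("https://example.com/page"));
    }
}
```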

If the domain can be visited and the maximum exploring depth has not been reached, the URL passed as a parameter will then be processed. To do this, I connect to the URL using the static connect method of the Jsoup object by passing the URL as an input. Then, I call the get method to get the Document instance representing the content of this URL.

The Document object thus obtained offers a select method that lets you run selection queries against the HTML content it represents. In this specific case, I use the query “a[href]”, which selects every link element carrying an href attribute.
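The selection step can be illustrated in isolation. For the sake of a self-contained example, this sketch parses a small HTML fragment with Jsoup.parse rather than fetching a live URL; Jsoup.connect(url).get() would yield the same kind of Document.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectLinks {
    public static void main(String[] args) {
        // Parse a static fragment instead of connecting to a real URL
        Document doc = Jsoup.parse(
            "<p><a href=\"https://siteB.example\">Site A</a></p>");
        // "a[href]" selects every <a> element that has an href attribute
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```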

Then I increment the search depth and iterate over all the Element objects returned by this query.

For each link, I retrieve the anchor text by calling the text method of the Element instance, and the destination URL from the content of the href attribute. I then store the association between the URL being processed, this destination URL, and the anchor pointing to it.

Finally, all that remains is to recursively call the crawl method on this destination URL, which in turn becomes the URL being processed.

All this gives us the following code for the Web Crawler: