Google, I submit, is the 8th wonder of the world. What made Google synonymous with the internet, and turned it into the multinational conglomerate it is today, is none other than its core business: the search engine. Gaining a basic understanding of how a search engine works under the hood may seem unattainable; even some of the most tech-savvy among us feel this way. Yet our health, safety, and income increasingly come down to ones and zeros. It is imperative for the broader public, and especially our legislators, to understand how companies like Google serve us information and shape our society. It is my hope that more individuals like myself, in the fields of data science and artificial intelligence, will do more to inform our lawmakers and the broader public about the inner workings of such technologies. Needless to say, these technologies will continue to have enormous impacts on our lives. By informing the public in an easily digestible manner, perhaps we can avoid, in the future, some of the unproductive and flat-out silly discourse that prevailed during the Zuckerberg and Pichai congressional testimonies.

Comment below if you remember this!

The Web Crawler

A search engine is only as good as its web crawler. It is, by far, the most important component of a search engine, and we will therefore discuss it in detail in this first installment of How Search Engines Work. The web crawler is a program with one sole function: automatically locating and storing information from websites. It does not run just once; because the internet is constantly changing and websites are constantly updated with new content, it runs over and over again. The crawler, quite literally, crawls over the text, the media, and the internal and external links within a website.

The crawler does not just crawl one site; it follows the links within a site out to other websites. For example, when CNN.com runs an article about some new proposal by The White House, CNN may link to the whitehouse.gov press release, and the crawler will follow that link. It will then go through every (publicly accessible) nook and cranny of whitehouse.gov and follow whatever links it finds there. The sketch below illustrates the idea.
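To make this concrete, here is a minimal sketch of a crawler in Python. It illustrates the idea only and is not Google's actual implementation; the seed URL, the page limit, and the choice of the requests and BeautifulSoup libraries are my own assumptions.

```python
# A minimal crawler sketch, NOT Google's implementation.
# Assumes the `requests` and `beautifulsoup4` packages are installed;
# the depth of crawling is capped by an arbitrary page limit.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=50):
    """Fetch pages breadth-first, storing their text and following links."""
    queue = deque([seed_url])
    seen = {seed_url}
    store = {}  # url -> page text, standing in for a real document store

    while queue and len(store) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages

        soup = BeautifulSoup(response.text, "html.parser")
        store[url] = soup.get_text()  # keep the page's visible text

        # Follow every link on the page, internal and external alike
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)

    return store
```

Seeded with a CNN article, a crawler like this would queue up the whitehouse.gov link alongside every other link on the page and work through them one by one.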

Restricting the Crawler

The crawler can do this very fast, much faster than you or I could manually click through a website. Google is not the only entity that can create these crawlers (also referred to as robots or spiders, since they crawl through the web of interconnected links that is the internet). In fact, anyone with the know-how, myself included, can create crawlers to automate their information-gathering tasks. As you can imagine, these "robots" can take a toll on a website's servers and bandwidth, and even bring it down! Hackers often attack websites with scripts that behave like crawlers, overwhelming the site with requests until it goes offline. Those who use crawlers legitimately adhere to the "politeness policies," or crawling policies, of websites. These can be found in the robots.txt file on most websites, which states what you are allowed to crawl, and when and how often you are allowed to crawl it.
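For illustration, here is how a well-behaved crawler might consult a site's robots.txt before fetching anything, using Python's standard-library parser. The user-agent name and the URLs are placeholders of my own, not anything prescribed by the sites in question.

```python
# A sketch of honoring a site's politeness policy via robots.txt,
# using Python's standard library. "my-crawler" is a made-up user agent.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.whitehouse.gov/robots.txt")
rp.read()  # download and parse the site's crawling policy

# can_fetch() reports whether the policy allows this user agent
# to request a given URL
if rp.can_fetch("my-crawler", "https://www.whitehouse.gov/briefing-room/"):
    print("Allowed to crawl this page")

# Some sites also publish a Crawl-delay directive; honoring it keeps
# the crawler from overburdening the server
delay = rp.crawl_delay("my-crawler")  # None if the site specifies no delay
```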

When information on a site is hidden or restricted to account holders, crawlers, including Google's, generally cannot reach it. In addition, setting crawling policies on your website that bar Google or other crawlers from sifting through it usually leads to that website not appearing in searches. Such content is sometimes said to be in the "deep web." While the term sounds insidious, it simply means information on the internet that is not accessible to everyone who is online.

Aside from crawling policies, things that make a crawler's job difficult include large files, character-encoding issues, having to convert files to HTML or XML, password protection, inaccessible content blocks, and content that is nearly duplicated or plagiarized. That said, the crawlers behind most modern search engines have advanced capabilities for overcoming such challenges. The "freshness" and "age" of your content can also dictate whether, and how often, Google visits your website to crawl it. The more established your website is, and the more frequently new content is added, the more your site stands out to the crawler.
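As an aside, one common way to flag near-duplicate content, though by no means the only way, and not necessarily what Google does, is to compare overlapping word sequences, or "shingles," between pages:

```python
# An illustrative near-duplicate check using word shingles and
# Jaccard similarity. This is one textbook technique, not Google's
# actual algorithm.

def shingles(text, size=3):
    """Break text into overlapping word sequences of a fixed size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Share of shingles two pages have in common (1.0 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

page_a = "The quick brown fox jumps over the lazy dog"
page_b = "The quick brown fox leaps over the lazy dog"
print(f"{jaccard(shingles(page_a), shingles(page_b)):.2f}")
# A score near 1.0 flags likely duplication; unrelated pages score near 0.
```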

Helping the Crawler

Believe it or not, most website administrators are not trying to stop legitimate crawling requests, especially when they come from Google. Businesses of all sorts spend millions of dollars on the mysterious art and science of search engine optimization (SEO). They want their sites to be as accessible as possible to Google, not only by allowing full access to Google's crawler, but also by making sure their websites are composed in the most readable, accessible format. Businesses do this because how high your site appears in Google's search results can directly affect your bottom line. Appearing on the first page rather than the second can mean the difference between thriving and closing up shop.
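One concrete, well-documented way site owners make their content accessible is by publishing a sitemap that follows the sitemaps.org protocol, which tells crawlers which pages exist and when they last changed. The sketch below generates a minimal one; the URLs and dates are placeholders of my own.

```python
# Building a minimal sitemap.xml per the sitemaps.org protocol.
# The pages and dates below are invented for illustration.
import xml.etree.ElementTree as ET

urlset = ET.Element(
    "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
)
for page, last_modified in [
    ("https://example.com/", "2021-01-01"),
    ("https://example.com/about", "2021-01-15"),
]:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page
    ET.SubElement(url, "lastmod").text = last_modified  # freshness hint

ET.ElementTree(urlset).write(
    "sitemap.xml", xml_declaration=True, encoding="utf-8"
)
```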

What Happens Next?

What people do with web-crawled information is limitless; in Google's case, however, what naturally comes next is storage. A highly compressed version of each webpage is stored in Google's massive, decentralized data warehouses. More specifically, Google stores this data in a storage system that, simply put, arranges web data in rows and columns, aptly named Bigtable.
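To give a rough feel for that layout, loosely following the worked example in Google's published Bigtable paper, a crawled page's row might look something like the sketch below. The timestamps and values are invented for illustration.

```python
# A rough sketch of a Bigtable-style row for a crawled page, loosely
# modeled on the example in Google's Bigtable paper. Not real data.
row = {
    # Row key: the hostname reversed, so pages from the same domain
    # sort next to each other in storage
    "row_key": "com.cnn.www",
    "columns": {
        # The page contents, versioned by crawl timestamp, so multiple
        # crawls of the same page can coexist
        "contents:": {
            1618000000: "<compressed page HTML>",
        },
        # Anchor columns record the link text other sites use when
        # linking to this page; the referring site is a placeholder
        "anchor:referring-site.example": {
            1618000000: "CNN",
        },
    },
}
```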

Stay Tuned…

In the next installment of How Search Engines Work, we will explore how the data that has been crawled and stored is searched, retrieved, and ranked. Thanks for reading, and stay tuned!