The following comprehensive listings were produced by analyzing our large member database, extracting websites that our members mentioned or liked, and, for each website, identifying:

When it was first mentioned by one of our members

The number of times it was mentioned

Keywords found when visiting the front page with a web crawler, using a pre-selected list of seed keywords

The design of the member database (non-mandatory sign-up questions and choices offered to new members on sign-up) was done long ago by our in-house data scientist (me), precisely with the purpose of performing analyses like this one down the road. Other analyses produced in the past include: 6,000 companies hiring data scientists, best cities for data scientists, demographics of data scientists, and 400 job titles for data scientists: see related links at the bottom of this article.

The Whole Internet (Source: Wikipedia)

Seed Keywords

Seed keywords were used to identify, for each website, whether one or more of the keywords in our list was found on the front page, using a web crawler. This helps categorize websites, the final goal being the creation of a data science website taxonomy. The seed keywords that we used (hand-picked) are very popular data science related keywords:

analytics,

data science,

database,

hadoop,

predictive modeling,

big data,

business intelligence,

machine learning,

data mining,

text mining,

operations research,

statistics.
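As a rough illustration of the keyword check, here is a minimal Python sketch. It is not the script used for this study: the function name `find_seed_keywords` is hypothetical, and plain substring matching on lower-cased text is an assumption about how a keyword counts as "found".

```python
# Seed keywords copied from the list above, all in lower case.
SEED_KEYWORDS = [
    "analytics", "data science", "database", "hadoop",
    "predictive modeling", "big data", "business intelligence",
    "machine learning", "data mining", "text mining",
    "operations research", "statistics",
]


def find_seed_keywords(page_text, keywords=SEED_KEYWORDS):
    """Return the seed keywords that occur in the page, in list order.

    Assumption: a keyword is "found" if it appears as a substring of
    the lower-cased page text.
    """
    lowered = page_text.lower()
    return [kw for kw in keywords if kw in lowered]


sample = "<title>Big Data and Machine Learning news</title>"
print(find_seed_keywords(sample))  # ['big data', 'machine learning']
```

Substring matching is simple but permissive: "database" would also match "databases", which may or may not be desirable.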

General Methodology

We used a web crawler to browse all the URLs, after identifying and cleaning the website fields (URLs listed by members) in our member database. Click here to get the script used to summarize the data, as well as a sample of raw data. Note that improving this study is now a new project added to our list of projects for DSA candidates: in short, it consists of creating a niche search engine for data science, better than Google, and a taxonomy of these websites. Candidates interested in this project will have access to the full data. Because this is based on data submitted by users, the raw data is quite messy and requires both cleaning and filtering. Details are found in my script: it's a good example of code used to clean relatively unstructured data.

Here we categorize the websites in four major clusters:

Websites mentioned at least 3 times, containing at least one of the seed keywords in our list

Websites mentioned fewer than 3 times, containing at least one of the seed keywords in our list

Websites that we were unable to crawl, mentioned at least twice

Websites containing no seed keywords from our list, and mentioned at least 4 times

We provide direct clickable links for domains in category 1 (above and below) only. These parameters were chosen to guarantee robust results, to filter out noise, and for internal security reasons: listing hundreds of little-known websites (with clickable links) can get you penalized by Google, can result in many requests for link removal, and many of these links might die in the next few months, creating a bad user experience (and additional Google penalties).
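The four-way split described above can be written as a small decision function. This is only a sketch: the function name `categorize` and its arguments are hypothetical, and websites that fall outside all four categories (for example, a keyword-free website mentioned only twice) are assumed to be dropped from the listing.

```python
def categorize(mentions, crawled, keywords_found):
    """Return the category number (1 to 4), or None if the site is filtered out.

    mentions:       number of members who mentioned the website
    crawled:        True if the front page could be downloaded
    keywords_found: list of seed keywords found on the front page
    """
    if not crawled:
        # Category 3: uncrawlable, mentioned at least twice.
        return 3 if mentions >= 2 else None
    if keywords_found:
        # Category 1: at least 3 mentions; category 2: fewer than 3.
        return 1 if mentions >= 3 else 2
    # Category 4: no seed keyword, mentioned at least 4 times.
    return 4 if mentions >= 4 else None


print(categorize(5, True, ["big data"]))  # 1
print(categorize(2, False, []))           # 3
```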

The 2,500 Website Listing

Here are the links to the four major categories of data science websites:

The field between parentheses represents the year when the website in question was first mentioned. It does not represent when the website was created, though it's a good proxy to tell how old the website is. The member database goes as far back as 2007. The list of keywords attached to each website represents which seed keywords were found on the front page when crawling the website. The number of stars (1, 2 or 3) represents how popular the website is: it's an indicator of how many members mentioned it. Of course, brand new websites might not have 3 stars yet.

Data and Source Code

Source code (two scripts including a web crawler / parser / summarizer, and code to produce final HTML pages), as well as raw, intermediate and final data (samples, screen shots), and details about the 3-step procedure used to publish these listings, can be found here.

Detailed Methodology

Our methodology, to build our semi-categorized website listing, has the following additional features:

All webpages were stored as strings (after download), all in lower case. The seed keywords were also in lower case.

Within each sub-group in each of the four major categories, websites are displayed in random order: using a stars system (rather than a detailed score) makes for more robust, accurate results. Sorting websites by score (score = number of members mentioning the website in question) would have drawbacks, such as website owners complaining about their score, sometimes for good reasons!

My script takes about 20 minutes to run on one machine to crawl 2,800 websites. I only read the first 64K of each page, and each HTTP request times out after 1 second. It would be much faster if multi-threaded.

The fourth major category of websites (those containing no seed keywords, and mentioned at least 4 times) is interesting nevertheless: it shows which non-analytic (general, mainstream) websites our members also visit.

Some of the websites where no seed keywords were found are actually analytic websites, and the lack of analytic keywords might be caused either by a glitch in our script, or by the way the webpage is encoded (iFrames, heavy JavaScript, Flash and other page creation techniques giving a headache to our web crawler, and indeed to all web crawlers including Google). These represent only a small percentage (< 5%) of all websites. Crawling a few webpages, not just the front page (for each website returning no seed keywords), could fix the issue. This implies deep crawling, following internal links found on the front page.
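The crawl settings mentioned above (first 64K of each page, one-second time-out, and the remark that multi-threading would speed things up) can be sketched in Python as follows. This is not the script used for the study; the function names, the `opener` parameter (there to make the code testable without network access) and the default of 16 worker threads are all assumptions made for illustration.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

MAX_BYTES = 64 * 1024   # read only the first 64K of each page
TIMEOUT_SECONDS = 1.0   # socket time-out for each request


def fetch_front_page(url, opener=urllib.request.urlopen):
    """Return (url, lower-cased text, None) or (url, None, error message)."""
    try:
        with opener(url, timeout=TIMEOUT_SECONDS) as response:
            raw = response.read(MAX_BYTES)
        # Pages are stored as lower-case strings, like the seed keywords.
        return url, raw.decode("utf-8", errors="replace").lower(), None
    except Exception as exc:  # DNS failures, time-outs, HTTP errors, bad URLs
        return url, None, str(exc)


def crawl(urls, opener=urllib.request.urlopen, workers=16):
    """Fetch all front pages with a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch_front_page(u, opener), urls))


if __name__ == "__main__":
    for url, text, error in crawl(["http://www.example.com"]):
        print(url, "failed: " + error if error else "%d characters" % len(text))
```

Threads suit this workload because the crawler spends most of its time waiting on the network rather than computing. Keeping the error message for each failed URL also makes it easy to review uncrawlable websites afterwards.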

Uncrawlable websites, bad domains

As many as 800 of the original 2,800 websites could not be crawled. I re-ran the crawler on these websites a few hours later, increasing the value of the time-out parameter and using a different user agent string in the code (the $ua->agent argument, for those familiar with the web crawling LWP::UserAgent library). I then re-ran it a few more times the same day, and eventually managed to reduce the number of uncrawlable websites to about 300. Trying another day, with a different IP address, and following the robots.txt protocol (crawling robots.txt on each failed website) might further reduce the number of failed crawls. However, about 250 of the uncrawlable websites were simply non-existent, mostly because of typos in member fields (user-entered information) in our database. Some of the uncrawlable websites result from various redirect mechanisms that cause my script not to work, sometimes because the website redirects to an https address (rather than http).

I extracted the error information for all uncrawlable websites. Typically, the "500 Bad Domain" error means that the domain does not exist (rarely, it is a redirect issue). Sometimes adding www will help (changing mydomain.com to www.mydomain.com).

Some of the "bad domains" with 1 or 2 mentions were actually irrelevant and dead websites posted by spammers. So this analysis allowed us to find a few spammers, and eliminate them!
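Two of the fixes mentioned in this section (adding www, and trying https rather than http) amount to generating a few URL variants for each failed domain and retrying them in turn. Here is a minimal Python sketch; the helper name `candidate_urls` is hypothetical and not taken from the script used for the study.

```python
def candidate_urls(domain):
    """List URL variants worth retrying for a domain that failed to crawl."""
    domain = domain.strip().lower()
    # Drop any scheme the member typed, so every variant is rebuilt cleanly.
    for prefix in ("http://", "https://"):
        if domain.startswith(prefix):
            domain = domain[len(prefix):]
    hosts = [domain]
    if not domain.startswith("www."):
        hosts.append("www." + domain)
    return [scheme + host for host in hosts for scheme in ("http://", "https://")]


print(candidate_urls("mydomain.com"))
# ['http://mydomain.com', 'https://mydomain.com',
#  'http://www.mydomain.com', 'https://www.mydomain.com']
```

A retry loop would then try each variant until one can be crawled, and only flag the domain as bad if all of them fail.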

Possible Improvements, Next Steps

There are various ways to improve my methodology and the quality of the results. Here I mention a few:

Order results by year (showing most recent websites first)

Perform some real clustering on these websites, using the stars, year and keyword metrics available in my listings.

Create your own seed keywords list, by extracting all one- and two-token keywords found on these 2,500 webpages (nice seed keywords to add are data and visualization)

Break down websites into two groups: those containing data or analytic in the domain name, versus those that don't

Browse multiple webpages per website (identify internal pages with web crawler)

Browse multiple external pages per website, to grow your list of 2,500 websites to a much bigger list (make sure the new websites added are analytic-related, use the seed keywords list for this purpose). You can go two levels deep in your external crawling.

Create and use a segmented seed keywords list (keywords related to visualization, big data, infrastructure, storage, databases, analytics and so on; this will help with website clustering)

Run the crawler on Hadoop or at least use some distributed architecture

Run the crawler in batch mode. Allow your script to easily resume if it stops for whatever reason (Internet goes off etc.). One way to do this is to save the results for each website, one at a time, immediately after crawling it, and produce a log history of all websites that have been crawled, as your script progresses over the website list. This way, you can resume your crawling with a single command, at any time.

Use recent data only (2013, 2014). Some old websites (2008) have three stars because they were popular back then, but now have little traffic.

Handle https as well as http requests

Look at keyword density. Rather than checking if data science is found on a webpage, look at how many times it is found.
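The batch-mode suggestion in the list above (save each result right after crawling, keep a log, and resume from it) can be sketched as follows in Python. The function name `crawl_with_resume`, the `process` callback and the one-JSON-record-per-line log format are assumptions chosen for illustration, not part of the existing script.

```python
import json
import os


def crawl_with_resume(urls, process, log_path):
    """Process each URL once, appending one JSON line per finished URL.

    On restart, URLs already present in the log are skipped.
    Returns the set of URLs that are done after this run.
    """
    done = set()
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                try:
                    done.add(json.loads(line)["url"])
                except (ValueError, KeyError):
                    continue  # skip blank or half-written lines left by a crash
    with open(log_path, "a", encoding="utf-8") as log:
        for url in urls:
            if url in done:
                continue  # already crawled in an earlier run
            result = process(url)
            log.write(json.dumps({"url": url, "result": result}) + "\n")
            log.flush()  # hand the record to the OS before moving on
            done.add(url)
    return done
```

Re-running the same call with the same log file resumes where the previous run stopped, so the log doubles as both the result store and the crawl history.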

Also, if you want your website to be listed, create a DSC profile and publish your website on your profile (look at the question about "favorite website", on sign-up).

Finally, if interested, join our Data Science Apprenticeship to work on and improve this project, and turn it into a search engine and full taxonomy, changing automatically every day based on data gathered by the web crawlers. Check project #7 in our list of business/applied projects. Time permitting, I will publish more advanced web crawling based on my infringement detection app (currently paused).

Related Links