Here is a list of potential projects to help you complete your master's degree in data science or a related field.


Business and Applied Data Science

Clustering data science websites. Cluster 2,000+ data science websites by matching each of them against a pre-selected list of 100 top data science keywords (machine learning, AI, deep learning, IoT, Spark, NLP, business analytics, predictive modeling, big data, etc.). You must count the frequency of each keyword in the list on each website, and finally perform website clustering based on these counts. In addition, using publication date, relevancy (number of "likes" or comments per article), and posting frequency, if possible, will make the model more robust. The project requires data cleaning and the production of scores that measure website popularity and trends, broken down per year. See here for a solution.

RSS feed exchange. Detect reputable big data, data science, and analytics digital publishers that accept RSS feeds (click here for details), and create an RSS feed exchange where publishers can swap or submit feeds.

Analyze 40,000 web pages to optimize content. I can share some traffic statistics about 40,000 pages on DSC, and you work on the data to identify the types of articles and other metrics associated with success (and how do you measure success in the first place?): identifying great content for our audience, forecasting articles' lifetime and pageviews based on subject line or category, assessing the impact of re-tweets, likes, and sharing on traffic, and detecting factors impacting Google organic traffic. Designing a tool to identify new trends and hot keywords would also be useful. Lots of NLP (natural language processing) is involved in this type of project; it might also require crawling our websites. Finally, categorize each page (creating a page taxonomy) to suggest "related articles" at the bottom of each article or forum question.

URL shortener that correctly counts traffic. Another potential project is the creation of a redirect URL shortener like http://bit.ly, but one that correctly counts the number of clicks.
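The keyword-counting and clustering steps of the first project above could be sketched as follows. This is a minimal illustration using a short, assumed seed keyword list and a simple greedy clustering pass; a real solution would use the full 100-keyword list and a proper algorithm such as k-means.

```python
import re

# Assumed short seed list for illustration; the project calls for ~100 keywords.
KEYWORDS = ["machine learning", "deep learning", "nlp", "big data", "spark"]

def keyword_counts(text, keywords=KEYWORDS):
    """Count how often each seed keyword appears in a page's text."""
    text = text.lower()
    return {kw: len(re.findall(re.escape(kw), text)) for kw in keywords}

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(sites, threshold=0.6):
    """Greedy single-pass clustering: a site joins the first cluster whose
    centroid it is similar enough to, otherwise it starts a new cluster.
    sites: dict of site name -> crawled page text."""
    clusters = []  # each: {"centroid": count vector, "members": [site names]}
    for name, text in sites.items():
        counts = keyword_counts(text)
        vec = [counts[kw] for kw in KEYWORDS]
        for c in clusters:
            if cosine(vec, c["centroid"]) >= threshold:
                c["members"].append(name)
                break
        else:
            clusters.append({"centroid": vec, "members": [name]})
    return clusters
```

Swapping the greedy pass for k-means (e.g. scikit-learn on a TF-IDF matrix) is the natural next step once the keyword counts per website are in hand.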
Bit.ly (and also the Google URL shortener) provides statistics that are totally wrong for traffic originating from email clients (e.g. Outlook, which represents our largest traffic source); their numbers are inflated by more than 300%. It's possible that an easy solution consists of counting and reporting the number of users/visitors (after filtering out robots) rather than pageviews. Test your URL redirector and make sure only real human beings are counted (not robots or fake traffic).

Meaningful list and categorization of top data scientists. Another project: create a list of the top 500 data scientists or big data experts using public data such as Twitter, and rate them based on number of followers or better criteria. Also identify new stars and trends; note that new stars have fewer followers even though they might be more popular, as it takes time to build a list of followers. Classify top practitioners into a number of categories (unsupervised clustering) based on their expertise, identified by keywords or hashtags in their postings. Filter out automated tweets from real ones; in short, identify genuine tweets posted by the author rather than feeds automatically blended with the author's tweets (you can try with my account @AnalyticBridge, which is a blend of external RSS feeds with my own tweets, some posted automatically, some manually). Create groups of data scientists. I started a similar analysis a while back; click here for details.

Data science website. Create and monetize (maybe via Amazon books) a blog like ours from scratch, using our RSS feed to provide initial content to visitors: see http://businessintelligence.com/ for an example of such a website, one that does not produce content but instead syndicates content from other websites. Scoop.it (and many more, such as Medium.com, Paper.li, and StumbleUpon.com) have a similar business model.
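The visitor-counting idea behind the URL shortener project might look like the sketch below: count unique human visitors instead of raw redirect hits. The bot markers and the (ip, user_agent) log format are assumptions for illustration, not an actual production filter.

```python
# Assumed heuristic markers for non-human clients (crawlers, link
# previewers, email prefetchers); a real filter would use a maintained
# bot list and behavioral signals, not just user-agent substrings.
BOT_MARKERS = ("bot", "crawler", "spider", "preview", "fetch")

def is_bot(user_agent):
    """Crude user-agent check: flag anything containing a bot marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def traffic_stats(records):
    """records: iterable of (ip, user_agent) pairs logged for one
    shortened URL. Reports human pageviews and unique visitors."""
    pageviews = 0
    visitors = set()
    for ip, ua in records:
        if is_bot(ua):
            continue  # drop robot / prefetch traffic before counting
        pageviews += 1
        visitors.add(ip)
    return {"pageviews": pageviews, "unique_visitors": len(visitors)}
```

Reporting `unique_visitors` rather than `pageviews` is what the paragraph above suggests as the easy fix for inflated email-client click counts.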
Niche search engine and taxonomy. Create a niche search engine and taxonomy of all data science / big data / analytics websites, using selected fields from our anonymized member database and a web crawler. In short, it consists of creating a niche search engine for data science, better than Google, and a taxonomy for these websites. Candidates interested in this project will have access to the full data, not just the sample that we published. Because this is based on data submitted by users, the raw data is quite messy and requires both cleaning and filtering. I actually completed a basic version of this project myself in a couple of hours, and you can re-use all my tools, including my script (web crawler); it's a good example of code used to clean relatively unstructured data. However, it is expected that you will create a much better version of this taxonomy, using a better seed keyword list (you will have to create it) and true clustering of all the data science websites. Read the Possible Improvements section in our article Top 2,500 Data Science, Big Data and Analytics Websites: it describes what you might want to do to make this taxonomy better (more comprehensive, user friendly, etc.), and you will also find many of the tools and explanations there.

Detecting fake reviews. Click here for details. In this project, you will have to assess the proportion of fake book reviews on Amazon, test a fake review generator, reverse engineer an Amazon algorithm, and identify how the review scoring engine can be improved. Extra mile: create and test your own review scoring engine. Scrape thousands of sampled Amazon reviews and score them, as well as the users posting these reviews. Before starting, read this article.
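As one possible building block for a review scoring engine, near-duplicate text is a common signal of fake reviews posted across products or accounts. The sketch below flags review pairs with high word-set Jaccard similarity; the threshold is an assumption, and a real engine would combine many more features (posting bursts, rating skew, reviewer history).

```python
def word_set(text):
    """Bag of distinct lowercase words in a review."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity between the word sets of two reviews."""
    sa, sb = word_set(a), word_set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def suspicious_pairs(reviews, threshold=0.8):
    """reviews: dict of review_id -> text. Returns pairs of review ids
    whose texts are near-duplicates (similarity >= threshold)."""
    ids = list(reviews)
    flagged = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if jaccard(reviews[ids[i]], reviews[ids[j]]) >= threshold:
                flagged.append((ids[i], ids[j]))
    return flagged
```

The all-pairs loop is quadratic; on thousands of scraped reviews you would switch to minhashing or locality-sensitive hashing, but the signal being computed is the same.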

Data Science Research

Stochastic Processes

For more recent projects with a more theoretical, probabilistic flavor, yet solved with the help of data science techniques, you can check the following:

If you are looking for data sets, check out this resource.

Good luck!
