Tools for Discovering Patterns in Data: Extracting Value from Tables, Text, and Links

Presenter:

John Elder, Ph.D.



Charlottesville, Virginia

Coming in Fall 2018



Can't attend the on-site course?

Register here for the online video course.

Course Description

Find the useful information hidden in your data! This course surveys computer-intensive methods for inductive classification and estimation, drawn from Statistics, Machine Learning, and Data Mining. Dr. Elder will describe the key inner workings of leading algorithms, compare their merits, and (briefly) demonstrate their relative effectiveness on practical applications. We'll first review classical statistical techniques, both linear and nonparametric, then outline the ways in which these basic tools are modified and combined into powerful modern methods. The course emphasizes practical advice and focuses on the essential techniques of Resampling, Visualization, and Ensembles. Actual scientific and business examples will illustrate proven techniques employed by expert analysts. Along the way, relative strengths and distinctive properties of the leading commercial software products for Data Mining will be discussed.

Instructor

John F. Elder IV, Ph.D. heads the US’s top data mining consulting team, based in Charlottesville, Virginia, and in Washington DC, Baltimore MD, and Raleigh NC. Founded in 1995, Elder Research, Inc. focuses on commercial, investment, and security applications of advanced analytics, including stock selection, text mining, social networks, image recognition, biometrics, process optimization, drug efficacy, credit scoring, and fraud detection. John holds Engineering degrees from Rice University and the University of Virginia, where he’s an Adjunct Professor teaching Optimization or Data Mining. Prior to founding ERI, he spent a decade in aerospace consulting, investment management, and academia.

Dr. Elder has authored innovative data mining tools, is a frequent keynote speaker, and chairs international analytics conferences. He was honored to serve five years on a panel appointed by President Bush to guide technology for national security. He has co-authored award-winning books on practical data mining, ensembles, and text mining. John is grateful to be a follower of Christ and the father of 5.

Intended Audience

This course is intended for those from industry and academia who work with data and wish to understand recent developments in data science and machine learning. At the conclusion of this course, one should be able to discern the basic strengths of competing methods and select the appropriate tools for one's applications. Participants should have prior working experience with computers and an interest in applied statistical techniques. (It helps, as well, to have a motivating application you wish to solve.)

Course Outline

I. Pattern Discovery: An Overview

Inducing Models from Data: Benefits and Dangers

Example Projects from Science and Business

Characteristics of successful projects

Leading Software Tools and Vendors

II. Classical Statistical Techniques (brief review)

Regression

Principal Components

Nearest Neighbors

III. Modern Methods

Neural Networks

Decision Trees

IV. Key General Tools

Scientific Visualization: Grand Tour, Projection Pursuit, limitations

Bootstrapping/Resampling: Essential!

Optimization: local and global

Target Shuffling: learning true significance

V. Data Trouble-Shooting

Case Diagnostics (Outlying, Influential, Leverage, & Missing points)

Feature Creation and Selection

VI. Text Mining

Stemming, Collocation, Feature Engineering

Statistical vs. Language-dependent methods

“Bag of Words” & Vector Space

Active Learning

VII. Social Network Analysis

The power of the "network effect"

Visualization, modeling tools, and examples

VIII. Comparing and Combining Algorithms

Adaptive model structure

Matching an algorithm to your application

Experimental test results

Combining models to improve accuracy

Bagging & Boosting

Why Ensembles work

IX. Top 10 Data Mining Mistakes

Lack data

Focus on Training

Rely on 1 technique

Ask the wrong question

Listen (only) to the data

Leaks from the Future

Discount pesky cases

Extrapolate

Answer every inquiry

Sample without care

Believe the best model

A note about the course scope

Each of the major topics discussed could comprise a semester-long course if presented in full detail! What this (intensive) short course provides is a broad overview of the highlights, drawing connections between major developments in the diverse fields that contribute to Predictive Analytics, including cutting-edge ways to mine text and graphical networks. Previous participants have found this "big picture" to be very useful for identifying techniques to use immediately, as well as approaches worthy of further exploration, for research or practical problem-solving.

Comments from previous attendees