INTRODUCTION TO PYTHON FOR DATA MINING

Python is a great language for data mining, with many libraries for exploring, modeling, and visualizing data. To get started, I recommend downloading the Anaconda distribution. It comes with most of the libraries you will need and provides an IDE and a package manager.

I do most of my work from the command line, but Anaconda comes with a launcher app that can be found in the ~/anaconda directory. To get the launcher to work on a Mac, you need to do the following:

1. Open your terminal (press Command-Space and then type terminal)
2. Type conda install -f launcher
3. After that runs, type conda install -f node-webkit

Now you can open the launcher and see:

glueviz - Lets you link multiple plots across files
IPython Notebook - A great way to display and work on your data mining projects
IPython qtconsole - Basically an IPython terminal for coding
Spyder - An IDE for Python with a built-in IPython console

IPython vs Python

IPython is an enhanced interactive shell for Python: you can type some code, get some results, and then type some more code. This is very useful for exploring data, because you don't always know what you are looking for, and it is annoying to rerun your entire program every time you make a change.
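To make the workflow concrete, here is a minimal sketch of the kind of back-and-forth an interactive session allows. The numbers are made up for illustration. In IPython each step would be typed at its own prompt, with the result of one step deciding what to ask next.

```python
from statistics import mean, median

# Step 1: create (or load) the data once. These values are hypothetical.
values = [3, 7, 7, 2, 9, 4]

# Step 2: take a first look at it.
print(sorted(values))

# Step 3: after seeing the data, ask a follow-up question
# without reloading anything.
summary = {"mean": mean(values), "median": median(values)}
print(summary)
```

Because `values` stays in memory between steps, only the new question has to be run, not the whole script.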

Libraries You Should Know About

Pandas - Provides R-like data structures and a high-level API for working with data
NumPy - Provides fast numerical computing, such as arrays and linear algebra
SciPy - For scientific computing, such as drawing from distributions
Matplotlib - For plotting
Seaborn - To make your plots look better
Scikit-Learn - For machine learning; great documentation and tutorials
Statsmodels - For more traditional statistics
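If you are not sure which of these libraries your installation already has, a small check like the following can help. This is only a sketch: the helper name `find_missing` is my own, and the list uses each library's import name (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the libraries listed above.
# Note that scikit-learn is imported as "sklearn".
LIBRARIES = ["pandas", "numpy", "scipy", "matplotlib",
             "seaborn", "sklearn", "statsmodels"]


def find_missing(names):
    """Return the names that cannot be found by the current Python interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing(LIBRARIES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All libraries are available.")
```

Anything reported as not installed can be added with conda or pip from the terminal.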

Getting Seaborn

In the terminal, type pip install seaborn

An Example

Read in Data

I will use pandas to read in some data from the web and quickly drop the rows that contain missing (NA) values.