Learning Python for Social Scientists

Neal Caren - University of North Carolina, Chapel Hill

I’ve compiled a list of Python tutorials and annotated analyses. I've tried to list pages that are accessible to social scientists with little background in Python and/or machine learning.

If you are totally new to Python, I would recommend installing Continuum's Anaconda Python distribution. It works on Macs and Windows, makes using IPython notebooks trivial, and solves most of the problems associated with installing various packages.

If you know of anything I've left out or if links go dead, please let me know.

Walkthroughs

One of the great things about IPython notebooks is that they can easily blend text and code. This has led to a sharp increase in the number of data analysis projects where people carefully explain an entire research project, including data collection/importation, management and analysis. The code is right there, and you can usually run it and/or modify it yourself. Looking at a few of these is an excellent introduction to what people are currently doing, even if you don't understand everything.

Overviews

Introductions to using Python for data analysis that make sense to social scientists.

Using APIs

When a service wants you to use their data, they often provide it through an API. There are often specific Python libraries for accessing popular or complex APIs, or APIs that require authentication. Otherwise, requests is quite useful.
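To give a flavor of what a plain requests call looks like, here is a minimal sketch. The endpoint, the query parameters, and the fetch_json helper are all made up for illustration; a real API will have its own URL, parameters, and probably an authentication step.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
BASE_URL = "https://api.example.org/v1/articles"
params = {"q": "protest", "page": 1}

# Build the request without sending it, to see the URL requests would call.
prepared = requests.Request("GET", BASE_URL, params=params).prepare()
print(prepared.url)


def fetch_json(url, params=None, timeout=10):
    """Send a GET request and return the decoded JSON body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.json()
```

With a real endpoint, fetch_json(BASE_URL, params) would typically hand back a dictionary or list that you can loop over or pass to pandas.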

Web Scraping

When they don't want to give you the data, you can sometimes grab it anyway by visiting one or more web pages and then extracting the parts you need. requests is a useful library for accessing web pages, and BeautifulSoup is a popular choice for pulling out the good stuff. If you don't know any HTML, regular expressions can sometimes work well too.
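As a rough sketch of both approaches, the snippet below pulls links out of a small page with BeautifulSoup and then grabs the years with a regular expression. The HTML is an invented stand-in; in practice it would usually come from something like requests.get(url).text.

```python
import re

from bs4 import BeautifulSoup

# Invented page content; normally this would be downloaded with requests.
html = """
<html><body>
  <h1>Department News</h1>
  <ul class="stories">
    <li><a href="/news/1">Survey results released</a> (2014)</li>
    <li><a href="/news/2">New methods workshop</a> (2015)</li>
  </ul>
</body></html>
"""

# Parse the page and collect the text and target of every link.
soup = BeautifulSoup(html, "html.parser")
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

# Regular expression route: four digits inside parentheses.
years = re.findall(r"\((\d{4})\)", html)

print(links)
print(years)
```

The regular expression is quicker to write but breaks more easily when the page layout changes, which is one reason BeautifulSoup is usually the safer choice.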

Data Management

Going from raw data--numbers of words--to Xs that can be included in a regression equation is about 80% of the work. There's a lot of data management in the walkthroughs, but I've found a couple of others that show the process quite clearly. Pandas is popular and super useful, especially the data frames.
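Here is a small sketch of that raw-data-to-Xs step with a pandas data frame. The three survey rows and the column names are invented for illustration; the point is only the kind of transformation involved.

```python
import pandas as pd

# Invented survey responses, for illustration only.
raw = pd.DataFrame({
    "respondent": [1, 2, 3],
    "party": ["Dem", "Rep", "Dem"],
    "comment": ["taxes are too high", "schools need funding now", "no opinion"],
})

# Turn free text into numeric features.
raw["n_words"] = raw["comment"].str.split().str.len()
raw["mentions_taxes"] = raw["comment"].str.contains("tax").astype(int)

# Turn a categorical variable into indicator columns,
# dropping the first category so it serves as the reference group.
dummies = pd.get_dummies(raw["party"], prefix="party", drop_first=True, dtype=int)

# X is now all numeric and ready for a regression routine.
X = pd.concat([raw[["n_words", "mentions_taxes"]], dummies], axis=1)
print(X)
```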

Text Management

Playing with words.

Introduction to data analysis

Introductions and/or overviews of data analysis, usually using scikit-learn.

Classification

When the outcome variable is categorical. Social scientists usually start and stop with variations on logistic regression. Turns out, there are a lot of other options out there.
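To make the comparison concrete, this sketch fits a logistic regression and one alternative, a random forest, using scikit-learn. The data are simulated with make_classification, so the accuracy numbers say nothing about any real problem; the random forest is just one example of the other options.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Simulated data: 500 cases, 10 predictors, binary outcome.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit each model on the training cases and score it on held-out cases.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # share classified correctly
    print(f"{name}: {scores[name]:.2f}")
```

Because every scikit-learn classifier shares the same fit and score methods, swapping in another model is usually a one-line change.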

Unsupervised Learning

When you don't have an outcome variable and/or want to combine your explanatory variables. Sociologists usually learn about factor analysis and then never use it. For text data, topic modeling is what all the cool kids are doing.
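For a sense of what topic modeling involves, here is a toy sketch using scikit-learn's LatentDirichletAllocation. The four "documents" are invented and far too few for a meaningful model; real applications use hundreds or thousands of texts, and other libraries such as gensim are also common.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini-corpus, for illustration only.
docs = [
    "the union called a strike over wages",
    "workers voted to strike for higher wages",
    "the senate passed the budget bill",
    "the house debated the budget and the tax bill",
]

# Step 1: turn the texts into a document-by-word count matrix.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Step 2: fit a two-topic model; each row of doc_topics gives
# one document's estimated mix of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Step 3: list the highest-weighted words for each topic.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {', '.join(top)}")
```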

Regression

While continuous outcomes are common in the social sciences, machine learning folks rarely talk about them.

Multiple Regression using Statsmodels by DataRobot. The stuff you already know how to do but this time in Python.

Gradient Boosted Regression Trees by DataRobot. Scikit-learn analysis of a continuous outcome measure.
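As a quick taste of the "stuff you already know" in Python, here is a minimal OLS sketch with the statsmodels formula interface. It is not taken from either tutorial above; the data are simulated and the variable names are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: income depends on education and age plus noise.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "education": rng.integers(8, 21, size=n),
    "age": rng.integers(18, 80, size=n),
})
df["income"] = (
    5 + 2.0 * df["education"] + 0.1 * df["age"] + rng.normal(0, 3, size=n)
)

# R-style formula: outcome on the left, predictors on the right.
model = smf.ols("income ~ education + age", data=df).fit()
print(model.summary())
```

The summary table should look familiar to anyone coming from Stata or R, with coefficients, standard errors, and fit statistics.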

Model/Feature Selection

Picking which model or variables to use often happens offstage in social science research. It doesn't have to be that way, though.

Networks

NetworkX and igraph are both fairly powerful tools for network analysis. I don't think you can use them for regression analysis, but you can use them to do things like compute centrality measures and make pretty pictures. You can also use Python to create/manipulate your network data for analysis/display elsewhere.
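Computing centrality measures with NetworkX takes only a few lines. The sketch below uses an invented five-person friendship network; the names and ties are purely illustrative.

```python
import networkx as nx

# Invented friendship ties, for illustration only.
edges = [
    ("Ana", "Ben"),
    ("Ana", "Cal"),
    ("Ana", "Dee"),
    ("Ben", "Cal"),
    ("Dee", "Eve"),
]

G = nx.Graph()
G.add_edges_from(edges)

# Degree centrality: share of the other nodes each person is tied to.
degree = nx.degree_centrality(G)

# Betweenness centrality: how often a person sits on shortest paths
# between other pairs of people.
betweenness = nx.betweenness_centrality(G)

for person in sorted(G.nodes):
    print(f"{person}: degree={degree[person]:.2f}, "
          f"betweenness={betweenness[person]:.2f}")
```

Both functions return plain dictionaries keyed by node, so the results are easy to move into a pandas data frame for further analysis.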

Plotting

matplotlib is the default plotting library for data scientists and plays well with pandas. seaborn makes it prettier. Other libraries, like mpld3, Plotly, or bokeh, are also worth trying out, especially for putting stuff together on the web.
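Here is a minimal sketch of pandas and matplotlib working together. The turnout figures are made-up numbers and the output filename is arbitrary; in an IPython notebook the figure would normally just appear inline.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, for illustration only.
df = pd.DataFrame({
    "year": [2000, 2004, 2008, 2012],
    "turnout": [54.2, 60.1, 61.6, 58.6],
})

# pandas draws onto the matplotlib axes passed in through ax.
fig, ax = plt.subplots(figsize=(6, 4))
df.plot(x="year", y="turnout", marker="o", legend=False, ax=ax)
ax.set_xlabel("Year")
ax.set_ylabel("Turnout (%)")
ax.set_title("Turnout by year (made-up data)")

fig.savefig("turnout.png", dpi=150)
```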

Images as Data

Social scientists don't really analyze images much, but that might be the next big thing.