Exercise to detect Algorithmically Generated Domain Names

In this notebook we use some great Python modules to explore, understand and classify domains as either 'legit' or having a high probability of being generated by a DGA (Domain Generation Algorithm). 'Legit' is in quotes because we use the domains in Alexa as the 'legit' set. The primary motivation is to explore the nexus of IPython, Pandas and scikit-learn, with DGA classification as a vehicle for that exploration. The exercise intentionally shows common missteps, warts in the data, paths that didn't work out well and results that could definitely be improved upon. In general, capturing what worked and what didn't is not only more realistic but often much more informative. :)

Python Modules Used:

Pandas: Python Data Analysis Library (http://pandas.pydata.org)

Scikit-learn: Machine Learning in Python (http://scikit-learn.org). Citation: Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Matplotlib: Python 2D plotting library (http://matplotlib.org)

Suggestions/Comments: Please send suggestions or bug reports (I'm sure there are some) to clicklabs at clicksecurity.com. Also, if you have some datasets or would like to explore alternative approaches, please get in touch.