Data mining is introduced through Rattle in the new book, Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery (Use R), published in 2011 by Springer-Verlag (DOI: 10.1007/978-1-4419-9890-3).

An extended, in-progress version of the book (consisting of early drafts of the chapters published as above) is freely available as an open source book, The Data Mining Desktop Survival Guide (ISBN 0-9757109-2-3). The books explain the otherwise complex algorithms and concepts of data mining in simple terms, with examples illustrating each algorithm using the statistical language R. The book is being written by Dr Graham Williams, based on his 20 years of research and consulting experience in machine learning and data mining.

Books on my Bookshelf Include:

A literate, agile approach to data mining projects means the data miner's toolbox will include R and LaTeX (for use with Sweave).

Other Resources

Using R for Data Mining

The open source statistical programming language R (based on S) is in daily use in academia and in business and government. We use R for data mining within the Australian Taxation Office. Rattle is used by those wishing to interact with R through a GUI.

R is memory based, so on 32-bit CPUs you are limited to smaller datasets (perhaps 50,000 up to 100,000, depending on what you are doing). Deploying R on 64-bit multi-CPU (AMD64) servers running GNU/Linux with 32GB of main memory provides a powerful platform for data mining.
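Because R holds data in memory, a rough size estimate helps when judging whether a dataset will fit. The following is a minimal sketch in base R, assuming an all-numeric dataset stored as doubles (8 bytes per value). The helper name estimate_mb and the 100,000 by 50 shape are hypothetical, chosen only for illustration.

```r
# Estimate the in-memory size (in MB) of an all-numeric dataset.
# Assumption: every value is stored as a double, i.e. 8 bytes.
estimate_mb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^2
}

# Hypothetical dataset shape, for illustration only.
n_rows <- 100000
n_cols <- 50

estimated <- estimate_mb(n_rows, n_cols)

# Build a numeric matrix of that shape and ask R how much memory it uses.
x <- matrix(0, nrow = n_rows, ncol = n_cols)
measured <- as.numeric(object.size(x)) / 1024^2

cat(sprintf("estimated: %.1f MB, measured: %.1f MB\n", estimated, measured))
```

The estimate covers only the data itself. Many modelling functions copy their input while they work, so the memory needed in practice can be several times larger.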

R is open source, which assures us that there will always be the opportunity to fix and tune things to suit our specific needs, rather than having to convince a vendor to fix or tune their product for us.

Also, because R is open source, we can be sure that the code will always be available, unlike some of the data mining products that have disappeared (e.g., IBM's Intelligent Miner).

Open standards are important for users, but vendors resist them for obvious reasons and would prefer to lock you in to their products. A number of commercial tools claim to support, for example, the open standard PMML for interoperability (sharing models between applications). But the support is patchy and not worth the effort. We have started a PMML effort in R to attempt to address the desire for interoperability.

Specific commercial statistical products are excellent at handling very large datasets, but they are limited in the analytic algorithms they provide. Commercial vendors, naturally, need to be convinced of the usefulness of implementing new algorithms. In R, on the other hand, a vast selection of algorithms has long been available for deployment.

Copyright © 2006-2014 Togaware Pty Ltd


Last Modified 2014-04-09 06:11:23 gjw