[This article was kindly contributed to R-bloggers.]

At the Bay Area R User Group meeting this week, Antonio Piccolboni gave an overview of the design goals and implementation of the RHadoop Project packages that connect Hadoop and R: rhdfs, rhbase and rmr.

(The image above was captured from Antonio's slides.) The most revealing part of the talk for me was the comparison between implementing the K-means clustering algorithm the “standard” way (using Python, Pig and Java, as shown on slides 8-10) and implementing it in R alone (with the rmr package, as shown on slides 14-15): the R version takes much less code and stays within a single language. Antonio expands on this example at the RHadoop wiki, which is a great place to start if you're looking to implement big-data statistical models with the rmr package.

RHadoop wiki: Comparison of high level languages for mapreduce: k means