When I first started working as a data scientist (or something like it) I was told to program in C++ and Java. Then R came along and it was liberating: my ability to do data analysis increased substantially. As my applications grew in size and complexity, I started to miss the structure of Java/C++. At the time, Python felt like a good compromise, so I switched again. After joining Mango Solutions I noticed I was not an anomaly: most data scientists here know both Python and R.

Nowadays, whenever I do my work in R there is a constant nagging voice in the back of my head telling me “you should do this in Python”. And when I do my work in Python it tells me “you can do this faster in R”. So when the reticulate package came out I was overjoyed, and in this blogpost I will explain why.

## re-tic-u-late (rĭ-tĭkˈyə-lĭt, -lātˌ)

So what exactly does reticulate do? Its goal is to facilitate interoperability between Python and R. It does this by embedding a Python session within the R session, which enables you to call Python functionality from within R. I’m not going to go into the nitty-gritty of how the package works here; RStudio have done a great job in providing some excellent documentation and a webinar. Instead I’ll show a few examples of the main functionality.

Just like R, the House of Python was built upon packages. Except in Python you don’t load functionality from a package through a call to `library`; instead you `import` a module. reticulate mimics this behaviour and opens up all the goodness from the module that is imported.

```r
library(reticulate)

np <- import("numpy")
np$kron(c(1, 2, 3), c(4, 5, 6))
## [1]  4  5  6  8 10 12 12 15 18
```

In the above code I import the numpy module, which is a powerful package for all sorts of numerical computations. reticulate then gives us an interface to all the functions (and objects) from the numpy module. I can call these functions just like any other R function and pass in R objects; reticulate will make sure the R objects are converted to the appropriate Python objects.

You can also run Python code through `source_python` if it’s an entire script, or `py_eval`/`py_run_string` if it’s a single line of code. Any objects (functions or data) created by the script are loaded into your R environment. Below is an example of using `py_eval`.

```r
data("mtcars")
py_eval("r.mtcars.sum(axis=0)")
## mpg      642.900
## cyl      198.000
## disp    7383.100
## hp      4694.000
## drat     115.090
## wt       102.952
## qsec     571.160
## vs        14.000
## am        13.000
## gear     118.000
## carb      90.000
## dtype: float64
```

Notice the use of the `r.` prefix in front of the mtcars object in the Python code. The `r` object exposes the R environment to the Python session; its equivalent in the R session is the `py` object.
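To make the two-way bridge concrete, here is a small sketch (assuming reticulate and a working Python installation are available; the object names are my own illustration) of how objects created with `py_run_string` become reachable through the `py` object, and vice versa:

```r
library(reticulate)

# Create an object on the Python side...
py_run_string("squares = [x ** 2 for x in range(5)]")

# ...and read it back in R through the `py` object;
# reticulate converts the Python list to an R object.
py$squares

# The bridge works both ways: assign an R value into the Python session...
py$threshold <- 10

# ...and use it from Python code.
py_run_string("big_squares = [s for s in squares if s > threshold]")
py$big_squares
```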
The mtcars data.frame is converted to a pandas DataFrame, on which I then apply the sum function to each column. Clearly RStudio have put in a lot of effort to ensure a smooth interface to Python, from the easy conversion of objects to the IDE integration. Not only will reticulate enable R users to benefit from the wealth of functionality in Python, I believe it will also enable more collaboration and increased sharing of knowledge.
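The conversions usually happen automatically, but they can also be done explicitly. A minimal sketch (assuming pandas is installed) of what happens to mtcars under the hood:

```r
library(reticulate)

# Explicitly convert an R data.frame to its Python counterpart...
mtcars_py <- r_to_py(mtcars)
class(mtcars_py)  # a pandas DataFrame, wrapped as a python object in R

# ...call a pandas method on it directly...
summed <- mtcars_py$sum(axis = 0L)

# ...and convert the result back to a plain R object.
py_to_r(summed)
```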

## Enter mailman

So what is it exactly that you can do with Python that you can’t with R? I asked myself the same question until I came across the following use case. While helping a colleague out with a blogpost it was suggested that I should publish it on a Tuesday. No rationale was given, so naturally I wondered if I could provide one using data. The data would have to come from R-bloggers. This is a great resource for reading blogposts about R (and related topics), and they also provide a daily newsletter with a link to the blogposts from that day. At the time, the newsletter seemed the easiest way to collect data.

All I needed to do now was extract the data from my Gmail account. Therein lay the problem, as I wanted to avoid querying the Gmail server (it wouldn’t make it easy to reproduce). Fortunately, Google have made it easy to download your data (thanks to the Google Data Liberation Front) through Google Takeout. Unfortunately, all the e-mails are exported in the mbox format. Although this is a plain-text-based format, it would take some effort to write a parser in R, something I wasn’t willing to do. And then along came Python, which has a built-in mbox parser in the mailbox module. Using reticulate I extracted the necessary information from each e-mail.

```r
mailbox <- import("mailbox")

cnx <- mailbox$mbox("rblogs_box.mbox")
message <- cnx$get_message(10L)
message$get("Date")
## [1] "Mon, 12 Dec 2016 23:56:19 +0000"
message$get("Subject")
## [1] "[R-bloggers] Building Shiny App exercises part 1 (and 7 more aRticles)"
```

And there we have it! I just read an e-mail from an mbox file with very little effort. Of course I will need to do this for all messages, so I wrote a function to help me. And because we’re living in the Age of R, I placed this function in an R package. You can find it on the MangoTheCat GitHub repo; it is called mailman.
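The loop over all messages can be sketched roughly as follows (this is my own illustration of the idea, not necessarily how mailman implements it; `extract_headers` is a hypothetical name, and `values()` is the mailbox module’s method for listing every message in a box):

```r
library(reticulate)

mailbox <- import("mailbox")

# Walk over every message in an mbox file and collect the
# headers of interest into one data.frame.
extract_headers <- function(path) {
  box <- mailbox$mbox(path)
  msgs <- box$values()
  do.call(rbind, lapply(msgs, function(m) {
    data.frame(Date    = m$get("Date"),
               Subject = m$get("Subject"),
               stringsAsFactors = FALSE)
  }))
}

headers <- extract_headers("rblogs_box.mbox")
```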

## To publish or not to publish?

I have yet to provide a rationale for publishing a blogpost on a particular day, so let’s quickly get to it. With the package all sorted I can now call the function mailman::read_messages to get a tibble with everything I need. We can extract the number of blogposts on a particular date from the subject of each e-mail. Aggregating that to day of week will then give us a good overview of which day is popular.

```r
library(dplyr)
library(mailman)
library(lubridate)
library(stringr)

messages <- read_messages("rblogs_box.mbox", type = "mbox") %>%
  mutate(Date = as.POSIXct(Date, format = "%a, %d %b %Y %H:%M:%S %z"),
         Day_of_Week = wday(Date, label = TRUE, abbr = TRUE),
         Number_Articles = str_extract(Subject, "[0-9](?=[\
]* more aRticles)"),
         Number_Articles = as.numeric(Number_Articles) + 1,
         Number_Articles = ifelse(is.na(Number_Articles), 1, Number_Articles)) %>%
  select(Date, Day_of_Week, Number_Articles)
```

Judging by the graph, weekends would be a good time to publish a blogpost as there is less competition. Then again, not many people might read blogposts at the weekend. The next candidate would then be Monday, which has the lowest average among the weekdays. Coming back to my original quest, I can conclude that publishing on a Tuesday is not the best option.
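The graph itself is not reproduced here, but the aggregation behind it can be sketched like this (the column names follow the pipeline above; the plotting choices are my own, not necessarily those of the original chart):

```r
library(dplyr)
library(ggplot2)

# Average number of articles per day of the week.
daily <- messages %>%
  group_by(Day_of_Week) %>%
  summarise(Avg_Articles = mean(Number_Articles))

# A simple bar chart of the averages.
ggplot(daily, aes(x = Day_of_Week, y = Avg_Articles)) +
  geom_col() +
  labs(x = "Day of week", y = "Average number of articles")
```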