Some Thoughts on R Packages and Namespaces
John Koo

Package management in R looks fairly straightforward. The library function attaches a package, making all of its exported functions (and other objects, such as datasets) available by name in your session.

library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats': filter, lag

The following objects are masked from 'package:base': intersect, setdiff, setequal, union

Now that dplyr is loaded, the function mutate lives in our namespace.

mutate

function (.data, ...)
{
    mutate_(.data, .dots = lazyeval::lazy_dots(...))
}
<environment: namespace:dplyr>

head(mutate(iris, SPECIES = toupper(Species)))

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species SPECIES
1          5.1         3.5          1.4         0.2  setosa  SETOSA
2          4.9         3.0          1.4         0.2  setosa  SETOSA
3          4.7         3.2          1.3         0.2  setosa  SETOSA
4          4.6         3.1          1.5         0.2  setosa  SETOSA
5          5.0         3.6          1.4         0.2  setosa  SETOSA
6          5.4         3.9          1.7         0.4  setosa  SETOSA

Pretty simple, right? If there’s a package containing some functions you want to use, just load it using library, and repeat this for every package whose functions you need.

library(plyr)

-------------------------------------------------------------------------

You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)

-------------------------------------------------------------------------

Attaching package: 'plyr'

The following objects are masked from 'package:dplyr': arrange, count, desc, failwith, id, mutate, rename, summarise, summarize

library(tidyr)
library(magrittr)

Attaching package: 'magrittr'

The following object is masked from 'package:tidyr': extract

library(ggplot2)
library(foreach)

But if you look closely at some of these messages, you start to see where there might be problems. plyr and dplyr share many function names, and one of the messages explicitly states that the function name extract is shared between magrittr and tidyr. R does not merge or rename anything: whichever package is attached last masks the others, so if you load magrittr after tidyr, then extract refers to magrittr's version, and vice versa. Recently, RStudio’s very own Hadley Wickham has come under fire for the lag function in dplyr. lag happens to be a widely used function from the stats package, which is attached automatically when you start R. However, once you load dplyr, a bare call to lag resolves to dplyr::lag instead of stats::lag (the original is still reachable, but only by writing stats::lag explicitly). Not only are conflicting function names confusing, but such masking by widely used packages can easily break legacy code.
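The masking behaviour described above can be reproduced without installing anything. The sketch below uses only base R; fake_pkg is a hypothetical stand-in for a package that exports its own lag, attached with attach() so that it sits ahead of stats on the search path.

```r
# fake_pkg is a hypothetical stand-in for a package that exports lag()
fake_pkg <- list(lag = function(x, ...) "lag from fake_pkg")

# attach() places it at position 2 of the search path, ahead of package:stats
attach(fake_pkg, name = "fake_pkg", warn.conflicts = FALSE)

# a bare call now resolves to the most recently attached definition
masked_result <- lag(1:3)

# the original was never deleted; it is still reachable with an explicit prefix
qualified_result <- stats::lag(ts(1:3), 1)

print(masked_result)
print(class(qualified_result))

# detaching restores the original lookup
detach("fake_pkg", character.only = TRUE)
restored_source <- environmentName(environment(lag))
print(restored_source)
```

Nothing is overwritten here. The bare name simply resolves to whichever definition comes first on the search path, which is why load order matters.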

In Python, the analogue of library(package.name) is from module import *, as in “I want to load everything from this module to the namespace.” This is often considered bad practice for the same reasons library(package.name) causes problems. Instead, it’s preferred that you load only the objects that will be used: from module import function_1, function_2, dataset_1. If there are too many to list conveniently, then it’s best to use import module, and then whenever an object from that module is used, it is prefixed with the module name, e.g., module.function_1(x).

Some consider R’s analogue of Python’s import module to be best practice when writing R: regardless of whether a package is loaded, every function that comes from a package is prefixed with the package name (e.g., dplyr::select, stats::lag, xgboost::xgboost) to avoid conflicts. The problem is that this makes R scripts very verbose, especially when it comes to operators like magrittr::`%>%` or foreach::`%do%`. In Python, a combination of import module and from module import <objects> is used. In general, import module is for when a large number of objects from the module are needed, and from module import <objects> is for when only a select few are needed. import module can further be modified to import module as mod to make prefixing less verbose.
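To make the verbosity concrete, here is a small sketch using only packages that ship with R. Ordinary functions read fine when fully qualified, but a qualified operator has to be backquoted and called in prefix form, which is where the bloat comes from.

```r
# fully qualified calls work whether or not the package is attached
qualified_median <- stats::median(c(3, 1, 2))

# a qualified operator cannot be written infix; it needs backticks
# and ordinary function-call syntax
qualified_in  <- base::`%in%`(2, 1:3)
qualified_sum <- base::`+`(1, 2)

print(qualified_median)
print(qualified_in)
print(qualified_sum)
```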

So for R, it would be great if we had the option to do something like:

import dplyr as dp
from magrittr import `%>%`, `%<>%`
from ggplot2 import *

iris %>%
  dp$filter(Species == 'versicolor') %>%
  ggplot() +
  geom_histogram(aes(x = Petal.Length))

It turns out that this is possible in R with the base::loadNamespace and import::from functions!

loadNamespace('package') (I won’t prefix this, since it’s from the base package and the name loadNamespace probably isn’t used elsewhere) returns the package’s namespace environment, which is similar to a Python module object. This environment can be assigned to a variable, and that variable then gives access to all the objects in the package (including internal ones that are not exported) using $. For example,

# first, make sure dplyr is unloaded
detach(package:dplyr, unload = TRUE)

# assign dp as the dplyr environment
dp <- loadNamespace('dplyr')

# dp contains all of dplyr's objects
sample(names(dp), 6)

[1] "Progress"          "print.sql_variant" "op_sort.tbl_lazy"
[4] "recode.factor"     "select_vars"       "arrange"

# for example ...
dp$mutate

function (.data, ...)
{
    mutate_(.data, .dots = lazyeval::lazy_dots(...))
}
<environment: namespace:dplyr>

# and it works as expected
head(dp$mutate(iris, SPECIES = toupper(Species)))

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species SPECIES
1          5.1         3.5          1.4         0.2  setosa  SETOSA
2          4.9         3.0          1.4         0.2  setosa  SETOSA
3          4.7         3.2          1.3         0.2  setosa  SETOSA
4          4.6         3.1          1.5         0.2  setosa  SETOSA
5          5.0         3.6          1.4         0.2  setosa  SETOSA
6          5.4         3.9          1.7         0.4  setosa  SETOSA

So in short, R’s mod <- loadNamespace('package') is basically the analogue of Python’s import module as mod! This presents a nice compromise between keeping code concise and avoiding namespace conflicts: dp$mutate is much easier to swallow than dplyr::mutate, and it is just as clear about where the function comes from.
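The same pattern can be tried with a package that ships with R, so nothing needs to be installed. This is only a sketch: the alias tl is an arbitrary name chosen here, and whether package:tools shows up on the search path depends on your session (in a plain R session it normally does not).

```r
# bind the tools namespace to a short alias without attaching the package
tl <- loadNamespace("tools")

# the alias is an ordinary environment, so $ retrieves objects from it
alias_is_env <- is.environment(tl)
ext <- tl$file_ext("report.csv")

print(alias_is_env)
print(ext)

# loading a namespace does not add the package to the search path by itself
print("package:tools" %in% search())

# the namespace environment holds internal objects as well as exported ones,
# so the first count is usually larger than the second
print(length(ls(tl, all.names = TRUE)))
print(length(getNamespaceExports("tools")))
```

One consequence worth keeping in mind is that $ on a namespace environment will also reach non-exported internals, which :: deliberately does not.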

But what about operators like `%>%`? Even if we use m <- loadNamespace('magrittr'), having to write m$`%>%` every time we pipe is annoying. It would be nice if there were something like from magrittr import `%>%`. The import package lets us do exactly that.

# first, detach magrittr
detach(package:tidyr, unload = TRUE) # tidyr has to be detached first due to dependencies
detach(package:magrittr, unload = TRUE)

# confirm that %>% is gone
exists('%>%')

[1] FALSE

# load only the pipe operators from magrittr
import::from(magrittr, `%>%`, `%<>%`, `%T>%`)

# confirm that it worked
sapply(c('%>%', '%<>%', '%T>%'), exists)

 %>% %<>% %T>%
TRUE TRUE TRUE
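For intuition, the selective-import idea can be approximated in base R. The helper below, import_from, is hypothetical and written only for this illustration; it is not how the import package is implemented, and it copies objects into the calling environment, which may differ from what import::from does.

```r
# import_from() is a hypothetical helper, not part of base R or the import package
import_from <- function(pkg, ..., into = parent.frame()) {
  object_names <- c(...)
  for (nm in object_names) {
    # getExportedValue() fetches a single exported object without attaching pkg
    assign(nm, getExportedValue(pkg, nm), envir = into)
  }
  invisible(object_names)
}

# bring in two functions from tools (a package that ships with R)
import_from("tools", "file_ext", "file_path_sans_ext")

ext  <- file_ext("report.csv")
stem <- file_path_sans_ext("report.csv")

print(ext)
print(stem)
```

Only the names listed in the call become available; everything else in the package stays out of the way.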

# confirm that it works as expected
iris %>% head()