By on •

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.

We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.

We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.

For the rest of this post, we give a quick how-to on using the RpostgreSQL package to interact with Postgres databases in R.

You have your PostgresSQL database up and running. Now you want to work with the data in that database in R. First, let’s create a data frame that we want to insert into the database.

# An example data frame to play with iris = as.data.frame(iris) summary(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## Species ## setosa :50 ## versicolor:50 ## virginica :50

The column names of this data frame are problematic for databases (and especially PostgreSQL) for a few reasons: the “.”s in the names can be an issue, and PostgreSQL expects column names to be all lowercase. Here’s a function to make the column names db safe:

# make names db safe: no '.' or other illegal characters, # all lower case and unique dbSafeNames = function(names) { names = gsub('[^a-z0-9]+','_',tolower(names)) names = make.names(names, unique=TRUE, allow_=TRUE) names = gsub('.','_',names, fixed=TRUE) names } colnames(iris) = dbSafeNames(colnames(iris)) summary(iris) ## sepal_length sepal_width petal_length petal_width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## species ## setosa :50 ## versicolor:50 ## virginica :50

Now let’s open up a database connection and insert the table.

# Create a connection to the database library('RPostgreSQL') ## Loading required package: DBI pg = dbDriver("PostgreSQL") # Local Postgres.app database; no password by default # Of course, you fill in your own database information here. con = dbConnect(pg, user="ninazumel", password="", host="localhost", port=5432, dbname="ninazumel") # write the table into the database. # use row.names=FALSE to prevent the query # from adding the column 'row.names' to the table # in the db dbWriteTable(con,'iris',iris, row.names=FALSE) ## [1] TRUE

The function dbWriteTable() returns TRUE if the table was successfully written. Note this call will fail if iris already exists in the database. Use overwrite=TRUE to force overwriting of an existing table, and append=TRUE to append to an existing table.

Now you can read the table back out.

# read back the full table: method 1 dtab = dbGetQuery(con, "select * from iris") summary(dtab) ## sepal_length sepal_width petal_length petal_width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## species ## Length:150 ## Class :character ## Mode :character ## # read back the full table: method 2 rm(dtab) dtab = dbReadTable(con, "iris") summary(dtab) ## sepal_length sepal_width petal_length petal_width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## species ## Length:150 ## Class :character ## Mode :character

Of course, the point of using a database is to extract subsets or transformations of your data, using SQL.

# get part of the table rm(dtab) dtab = dbGetQuery(con, "select sepal_length, species from iris") summary(dtab) ## sepal_length species ## Min. :4.300 Length:150 ## 1st Qu.:5.100 Class :character ## Median :5.800 Mode :character ## Mean :5.843 ## 3rd Qu.:6.400 ## Max. :7.900

You can use dbSendQuery for sending queries that don’t return a data-frame-like result.

# remove table from database dbSendQuery(con, "drop table iris") # commit the change dbCommit(con)

When you are done, disconnect.

# disconnect from the database dbDisconnect(con)

And that’s it!

Extra: RPostgreSQL and sqldf

If you are accustomed to manipulating data with SQL, you may prefer SQL notation to the sometimes convoluted calling conventions of the analogous R operations like aggregate() , or the functions in the dplyr package. In this case you have probably already discovered the sqldf package, which allows you to manipulate data frames using SQL. If you are a sqldf user, there is an additional subtlety if you are also using RPostgreSQL or other R packages for talking to databases: sqldf uses its own internal (and ephemeral) database to perform its operations, but if RPostgreSQL is loaded, sqldf will pick up your PostgreSQL driver by default. This is probably not what you want.

options(gsubfn.engine = "R") library(sqldf) ## Loading required package: gsubfn ## Loading required package: proto ## Loading required package: RSQLite ## sqldf will default to using PostgreSQL <=== NOTE THIS! # Now try to use sqldf to aggregate sepal_length by species query = "select avg(sepal_length) avg_sepal_length, species from dtab group by species" sqldf(query) # Error in postgresqlNewConnection(drv, ...) : # RS-DBI driver: (could not connect postgres@localhost on dbname "test" # ) # Error in !dbPreExists : invalid argument type

To use sqldf on local data frames, you must specify the driver and dbname explicitly.

sqldf(query, drv="SQLite", dbname=":memory:") ## avg_sepal_length species ## 1 5.006 setosa ## 2 5.936 versicolor ## 3 6.588 virginica

Extra: PostgreSQL and dplyr

If you do use dplyr , the good news is that you can connect to a PostgreSQL database directly through the dplyr function src_postgres() .

library('dplyr') # Connect to local PostgreSQL via dplyr localdb <- src_postgres(dbname = '', host = 'localhost', port = 5432, user = 'ninazumel', password = '') # cheat and access the db connection directly # assume we have made the colnames db safe dbWriteTable(localdb$con,'iris',iris, row.names=FALSE) ## [1] TRUE

The tbl() command lets you access tables in the database remotely, and sql() lets you send queries.

# this is not a data frame; it's a dplyr PostgreSQL handle into the database d = tbl(localdb, "iris") d ## Source: postgres 9.5.0 [ninazumel@localhost:5432/ninazumel] ## From: iris [150 x 5] ## ## sepal_length sepal_width petal_length petal_width species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5.0 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## .. ... ... ... ... ... # this is a data frame dtab = as.data.frame(d) # send a query through dplyr query = "select avg(sepal_length) avg_sepal_length, species from iris group by species" dsub = tbl(localdb, sql(query)) dsub ## Source: postgres 9.5.0 [ninazumel@localhost:5432/ninazumel] ## From: [?? x 2] ## ## avg_sepal_length species ## 1 5.936 versicolor ## 2 6.588 virginica ## 3 5.006 setosa ## .. ... ... # make it local dsub = as.data.frame(dsub) summary(dsub) ## avg_sepal_length species ## Min. :5.006 Length:3 ## 1st Qu.:5.471 Class :character ## Median :5.936 Mode :character ## Mean :5.843 ## 3rd Qu.:6.262 ## Max. :6.588 # shuts down database rm(list=c('d','localdb')); gc()

Share this: Twitter

LinkedIn

Facebook

Reddit

Email

Like this: Like Loading... Related

Categories: Coding data science Expository Writing Practical Data Science Pragmatic Data Science Tutorials