EDIT: In response to this post I have received good suggestions, both in the comments here and on reddit in /r/statistics. Thanks to all!

I use R every day. I can think of very few times when I have booted up my computer and not had an instance of R or RStudio running. Whether for data exploration, making graphs, or just fiddling around, R is a staple of my academic existence.

So I love R, but there are a few things that drive me nuts about its behaviour:

1. Plotting a large data.frame produces unreadable plots.
2. read.csv and write.csv don’t behave the same with respect to row names.
3. stringsAsFactors = FALSE isn’t the default.

1. Plotting a large data.frame:

I get it, it’s a way to tell me to stop doing it, but I can’t help it. Every once in a while I forget to stick the columns into my plot command, and I wind up getting something that looks like the floor of my bathroom (Figure 1). The default plot behaviour for data.frames is frustrating, especially when you have a very large data set. With one simple command you can unleash minutes of processing time for a plot that yields no information at all. I decided to time it, for no reason other than to include a code snippet here:

timer <- rep(NA, 50)
for(i in 1:50){
  rowtest <- i * 20
  example <- matrix(runif(rowtest * 20, 0, 3), ncol=20) %*% matrix(rt(400, 2), ncol=20)
  # This is going to take a while...
  timer[i] <- system.time(plot(as.data.frame(example)))[1]
}
plot(1:50 * 20, timer, type='b', xlab='Data frame rows', ylab='Wasted Time')

I’m sure there are good reasons for this, and it might be possible to change the default behaviour with a code snippet of my own, but overriding the default behaviour of base classes is probably pretty dangerous. It would be nice, though, if the default behaviour were to plot all columns up to some maximum number, possibly based on the pch size, so that you could at least plot until minimum readability was reached.
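In the meantime, here’s a minimal sketch of the kind of guard I mean. plot_sane and its max.cols cap are names I’m inventing for illustration; nothing like this exists in base R:

# A hypothetical wrapper, not a change to base R: trim a wide
# data.frame to a readable number of columns before plotting.
plot_sane <- function(df, max.cols = 6, ...) {
  if (is.data.frame(df) && ncol(df) > max.cols) {
    warning("Plotting only the first ", max.cols, " of ",
            ncol(df), " columns for readability.")
    df <- df[, seq_len(max.cols), drop = FALSE]
  }
  plot(df, ...)
}

plot_sane(as.data.frame(example))  # a scatterplot matrix of at most 6 columns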

2. read.csv and write.csv don’t agree on how to handle row names by default.

Try this out:

aa <- data.frame(column.one = runif(10, 0, 2), column.two = rnorm(10, 0, 1))
write.csv(aa, 'dead.file.csv')
bb <- read.csv('dead.file.csv')

How many columns does bb have? Three. And if you’re not paying attention, or you’re not careful coding your functions, all that great analysis you did the first time has completely changed. Of course, if you had used write.table and read.table, everything would work out fine: you’d have only two columns. It’s frustrating, and the fact that we all account for it in one way or another probably means that a major change to this bit of R’s bedrock would break a lot of code.
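For the record, either of these fixes the round trip; both are just standard uses of the row.names arguments to write.csv and read.csv:

# Option 1: don't write row names in the first place
write.csv(aa, 'dead.file.csv', row.names = FALSE)
bb <- read.csv('dead.file.csv')                 # ncol(bb) is 2

# Option 2: keep them, but tell read.csv which column they're in
write.csv(aa, 'dead.file.csv')
bb <- read.csv('dead.file.csv', row.names = 1)  # ncol(bb) is 2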

3. stringsAsFactors = FALSE isn’t the default.

Oh data.frames, oh read.table, sometimes you drive me bonkers. Maybe it’s just me, maybe it’s just the scripts I’m writing, but the fact that my strings get converted to factors all the time drives me crazy. As far as I’m concerned, a factor is a derived data type: you ought to convert to a factor, not convert from one. It leads to even more complicated problems when, for whatever reason, you have a column that is largely numeric with some sort of character symbol stuck in at random (multiple NA-type strings indicating different kinds of unusable data, for example). Getting the numbers back out of a factor is more complicated than just converting a string with as.numeric, because as.numeric on a factor returns the underlying level codes, not the values; you’ve got to call as.numeric(as.character(whatever)).
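Here’s a small self-contained demonstration of the trap, on a version of R where strings still default to factors (the n/a sentinel is just an example value):

# A mostly-numeric column with one stray string comes in as a factor.
dirty <- read.csv(text = "x\n1.5\n2.7\nn/a\n4.1")
class(dirty$x)                     # "factor"
as.numeric(dirty$x)                # wrong: the level codes 1 2 4 3
as.numeric(as.character(dirty$x))  # right: 1.5 2.7 NA 4.1, plus a coercion warning

# The per-call escape hatch:
clean <- read.csv(text = "x\n1.5\n2.7\nn/a\n4.1", stringsAsFactors = FALSE)
class(clean$x)                     # "character"; now convert deliberately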

Okay, I’ve got that off my chest. I am genuinely interested, though: is there a reason for these default behaviours? Are they some sort of legacy that hasn’t been changed, or am I pretty much the only one who encounters them? If you know, just pop your answer in the comments.