Recently I was analysing a bunch of access logs searching for some tricky to find bug and found out that R may be quite handy in such a situation. So let’s take a look at a quick analysis of access logs written in Common Log Format.

Each line of an access log file looks like this:



so the first thing we need to do is to read the data into our R session. This is a bit tricky as timestamps have whitespaces separating the time zone:

ReadLogFile <- function(file = log.file) { # http://en.wikipedia.org/wiki/Common_Log_Format access_log <- read.table(file, col.names = c("ip", "client", "user", "ts", "time_zone", "request", "status", "response.size")) access_log$ts <- strptime(access_log$ts, format = "[%d/%b/%Y:%H:%M:%S") access_log$time_zone <- as.factor(sub("\\]", "", access_log$time_zone)) access_log$status <- as.factor(sub("\\]", "", access_log$status)) access_log } log.file <- "access_log" access_log <- ReadLogFile(log.file)

Now let’s plot the server load:

library(ggplot2) library(colorspace) ggplot(access_log, aes(x = ts)) + geom_density(stat = "bin", binwidth = 3600, colour = "black", fill = "darkgreen") + ylab("Requests/hour") + xlab("Time")

which looks like this



so a clear user activity pattern emerges – a lot more users during working hours and very few at nighttime.

A quick analysis of the the response codes reveals that there were some 5** errors:

df <- as.data.frame(table(access_log$status)) colnames(df) <- c('status','freq') ggplot(df , aes(x="",y=freq, fill=status) )+ geom_bar(position= "fill",alpha=0.8,width=0.5, stat="identity")+ scale_fill_manual (values=rainbow_hcl(9, start = 200, end = 20))+ coord_flip() + xlab("") + ylab("Frequency")

Let’s take a closer look:

server.errors <- grep('5..',access_log$status) ggplot(access_log[server.errors,], aes(x=status)) + geom_histogram(colour="black", fill="red")

And finally: how the errors were distributed over time?

ggplot(access_log[server.errors,], aes(x=ts)) + geom_density(stat='bin',binwidth=3600, colour="black",fill="red") + ylab('Errors/hour') + xlab('Time')