[This article was first published on Weird Data Science, and kindly contributed to R-bloggers.]

The British Isles are ancient, haunted places. Pre-Roman legends and folk songs are filled with dragons, magic, and strange, wild creatures. Spirits, fairies, and all kinds of imps and goblins roam the countryside with intent ranging from the mischievous to the malevolent. Will-o’-the-wisps lead unwary travellers deep into marshes in the night, before vanishing. The sad ghosts of past tragedies reside in ancient castles and stately homes.

In more recent history, strange beasts have been rumoured to live wild in the open spaces, whether large predators escaped from zoos, the last survivals of prehistory, or spirits. Every village, town, and county has its own stories and traditions.

The Paranormal Database is a collection of both traditional and recent paranormal events in Britain and Ireland. It contains almost 20,000 hauntings, cryptozoological sightings, legends, monsters, UFOs, and other strange phenomena, each with the date and time of the sighting, the location, and a brief description.

The data is not easily accessible beyond reading the pages directly, and it took some time and effort to scrape and make usable. Paranormal Database entries hold names, dates, locations, and comments as unstructured text, so a more thorough analysis will require further processing. The R code used to scrape the website is included at the end of this post.
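As a rough illustration of what "unstructured text" means here, the sketch below applies the regular expression from the scraping code at the end of this post to a single entry string. The entry is invented for illustration and is not a real record. The `" - "` separator used to split the location is an assumption based on the code's comment about a "space-surrounded hyphen". Base R's `regexec()` and `regmatches()` stand in for the `tidyr::extract()` call used in the original code.

```r
# Hypothetical entry text, made up for this example (not a real database record).
entry <- paste0(
  "The Grey Lady Location: Exampleton - Old Manor House ",
  "Type: Haunting Manifestation Date / Time: Unknown ",
  "Further Comments: A made-up entry used only for illustration."
)

# The pattern used in the scraping code, split over two lines for readability.
pattern <- paste0(
  "(.*[[:graph:]])\\s*Location: (.*[[:graph:]])\\s*Type: (.*[[:graph:]])",
  "\\s*Date / Time: (.*[[:graph:]])\\s*Further Comments: (.*[[:graph:]])\\s*"
)

# regexec() returns the full match followed by the five capture groups.
parts <- regmatches(entry, regexec(pattern, entry, perl = TRUE))[[1]]
fields <- setNames(parts[-1], c("title", "location", "type", "date", "comments"))

# Assumed separator: split 'location' into a main part and a sub-part on " - ".
location.parts <- strsplit(fields[["location"]], " - ", fixed = TRUE)[[1]]

print(fields)
print(location.parts)
```

Entries that do not follow the "Location / Type / Date / Further Comments" layout would not match the pattern at all, which is presumably why the original code checks for a match before processing a page.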

To understand the range and breadth of the paranormal life of the British Isles, we will focus on the data stored in the Paranormal Database. For this initial entry, we will take a first look at the data and get an overview of what mind-numbing horrors are most commonly encountered by the unsuspecting traveller in the United Kingdom and beyond.

Paranormal Manifestations in the British Isles

Manifestation Type               Occurrences
Haunting Manifestation           12376
Legend                           1662
Cryptozoology                    835
Shuck                            688
Poltergeist                      625
Unknown Ghost Type               614
Fairy                            550
Other                            427
Alien Big Cat                    382
UFO                              336
Crisis Manifestation             277
Dragon                           190
Curse                            183
Post-Mortem Manifestation        179
Environmental Manifestation      152
Manifestation of the Living      44
Werewolf                         36
Vampire                          34
Spontaneous Human Combustion     28
Experimental Manifestation       2

As the diagram and the frequency table show, hauntings are by far the most common manifestation in paranormal Britain, outnumbering recorded legends by more than seven to one. Beyond “Haunting Manifestation”, several of the other common types are variants of haunting: poltergeist activity and unknown types of ghost each account for a significant number of the recorded events.
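The proportions behind these observations can be checked directly from the frequency table. The following is a minimal sketch that re-enters the published counts by hand rather than loading the scraped data; the variable names are illustrative and do not come from the original analysis code.

```r
# Counts copied from the frequency table in this post.
counts <- c(
  "Haunting Manifestation" = 12376, "Legend" = 1662, "Cryptozoology" = 835,
  "Shuck" = 688, "Poltergeist" = 625, "Unknown Ghost Type" = 614,
  "Fairy" = 550, "Other" = 427, "Alien Big Cat" = 382, "UFO" = 336,
  "Crisis Manifestation" = 277, "Dragon" = 190, "Curse" = 183,
  "Post-Mortem Manifestation" = 179, "Environmental Manifestation" = 152,
  "Manifestation of the Living" = 44, "Werewolf" = 36, "Vampire" = 34,
  "Spontaneous Human Combustion" = 28, "Experimental Manifestation" = 2
)

# Total number of recorded events across all types.
total <- sum(counts)

# Percentage share of each type, rounded to one decimal place.
share <- round(100 * counts / total, 1)

# How many times more common hauntings are than legends.
haunting.to.legend <- counts[["Haunting Manifestation"]] / counts[["Legend"]]

# Combined share of the three ghost-related types discussed in the text.
ghost.types <- c("Haunting Manifestation", "Poltergeist", "Unknown Ghost Type")
ghost.share <- round(100 * sum(counts[ghost.types]) / total, 1)

print(total)
print(head(share, 5))
print(haunting.to.legend)
print(ghost.share)
```

On these figures the table holds 19,620 events in total, with hauntings alone making up a little under two thirds of them.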

Cryptozoology, in its various forms, is also well represented. The Black Shuck, a ghostly black dog, is the next largest category after the main cryptozoology category, and alien big cats are not far behind. Dragons, werewolves, and vampires perhaps deserve to be classed as monstrous entities rather than cryptozoological oddities and are, in any case, far less common.

In brief conclusion, then, the unquiet dead are by far the most numerous beings to trouble the unhappy folk of the British Isles, although twisted mockeries of natural fauna are far from rare.

Do particular phenomena cluster in regions and, if so, where? Are werewolves truly more commonly seen when the moon is full? Have certain manifestations become more common as time passes? Are certain sightings clustered temporally as well as geographically? Are the most haunted areas also the most cryptozoologically active? With access to the full horror of the data we can begin to answer these questions about the darkest corners of the United Kingdom.

Full code for scraping the data and producing the plot is given below.

You can keep up to date with our latest visions of the statistical unknown on Twitter at @WeirdDataSci.

Data:
The Paranormal Database: http://www.paranormaldatabase.com

Other:
JSL Ancient font: http://www.1001fonts.com/jsl-ancient-font.html
Rvest web scraping library: http://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/ | https://cran.r-project.org/web/packages/rvest/index.html

xkcd styling for ggplot2: http://xkcd.r-forge.r-project.org/

Paranormal Database Scraping Code:

library(rvest)
library(magrittr)
library(tidyr)
library(stringr)
library(dplyr)

# Base URL for scraping
paranormal.base <- [...] %>%
   html_nodes( '.hero-unit' ) %>%
   html_nodes( 'a' ) %>%
   html_attr( 'href' ) %>%
   url_absolute( base.url )
}

# Function to extract a paranormal database entries page into a dataframe.
# Returns a dataframe of the entries from the given HTML.
extract.entries <- [...] %>%
   html_nodes( xpath='//td[ contains( @width, "100" ) ]' ) %>%
   html_text() %>%
   str_replace_all( "[\r\n]" , "" )

# If any entries were found, process them.
if( ( length( page.entries ) > 0 ) && (str_detect(page.entries, "(.*[[:graph:]])\\s*Location: (.*[[:graph:]])\\s*Type: (.*[[:graph:]])\\s*Date / Time: (.*[[:graph:]])\\s*Further Comments: (.*[[:graph:]])\\s*")) ) {

   # Each entry is a string containing:
   # - Location:
   # - Type:
   # - Date / Time:
   # - Further Comments:
   # So split the string on these and put the entries into a dataframe.
   page.entries.df <- [...] %>%
      tidyr::extract(entry, into=c('title', 'location', 'type', 'date', 'comments'), "(.*[[:graph:]])\\s*Location: (.*[[:graph:]])\\s*Type: (.*[[:graph:]])\\s*Date / Time: (.*[[:graph:]])\\s*Further Comments: (.*[[:graph:]])\\s*" )

   # Further split these into subtypes where appropriate. Both 'location' and 'type' are often split into a main and a subtype by a space-surrounded hyphen.
   # As we'll do this twice, make it a function. This takes a base column name, a new name for the sub-column, and the separator string.
   coalesce.join <- [...] %>%
      tidyr::extract( base.column, into=c("tmp.main", sub.column), separator, remove=FALSE)

      # Combine the old base column with the new tmp.main column, using base.column to fill in the NAs in tmp.main
      # where the separator was not matched.
      base.data[ base.column ] <- [...] %>%
         select(-one_of("tmp.main"))
   }

   # Split 'type'
   page.entries.df <- [...] %>%
      bind_rows()

# Bind entries from sub-pages to this page's entries.
all.entries <- [...]

Paranormal Manifestations in the British Isles Plotting Code:

library(ggplot2)
library(jpeg)
library(xkcd)
library(showtext)
library(grid)
library(gridExtra)

# Create a summary barplot for paranormal activity in the UK.

# Load the data from scraping http://www.paranormaldatabase.com/
load( "../data/paranormal.Rdata" )

# Process the data to frequency counts of each type of event.
counts