Kaggle Public LB Scraper

In this project, we’re going to build an advanced scraper using R, leveraging RSelenium and rvest to extract the complete Leaderboard of a Kaggle Competition.

We need an advanced scraper for this project because, by default, the Kaggle Public Leaderboard (LB) page displays only the top entries; to see the full list, a user would need multiple browser clicks to expand the table.

Our scraper will perform all of these steps programmatically and finally output two visualization plots from the extracted LB data. This advanced Kaggle LB scraper could serve a variety of use cases, such as automated Kaggle LB score alerts or tracking fellow teams' scores.

Code Design

As a prerequisite for this project, let’s begin by installing the required R packages:

RSelenium — R Bindings for Selenium 2.0 Remote WebDriver

rvest — R package for Web scraping

tidyverse — Collection of R packages designed for Data science

Packages Installation

All these packages can be installed from CRAN using the following code:

install.packages("RSelenium")
install.packages("rvest")
install.packages("tidyverse")

Packages Loading

After installation, we have to load these R packages into our current R session.

library(RSelenium) # Selenium Automation
library(rvest)     # Web Scraping
library(tidyverse) # Data Manipulation and Visualization

Starting a Selenium Server and Browser

The next step is to start a Selenium server and browser. We need this step because our scraping requires advanced browser emulation. In our case, we'll use Chrome as our browser of choice (using Firefox will produce similar results). The port value is optional; explicitly specifying one helps avoid conflicts with ports already in use.

rD <- rsDriver(port = 124L, browser = "chrome")
remDr <- rD[["client"]]

Scraping URL

In this section, we’ll specify the URL from which we’re scraping our required data. For our project, a Kaggle public LB is associated with a Kaggle Competition, so for ease of use, we’ll specify the Kaggle Competition URL and then build the public LB URL from that.

Please note, this code is written for the current active competition and can be modified with minimal changes to make it work for past competitions.
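As a minimal sketch of the URL construction described above: the competition URL shown here is only a placeholder (any competition slug can be substituted), and the public LB URL is assumed to be the competition URL with `/leaderboard` appended.

```r
## Assumed competition URL -- replace with the competition you're tracking
competition_url <- "https://www.kaggle.com/c/titanic"

## The public LB page sits under the competition URL at "/leaderboard"
lb_url <- paste0(competition_url, "/leaderboard")
```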

URL in the Browser

Now that we’re ready with our URL and browser, we just have to request that our browser navigate to the specified URL.

remDr$navigate(lb_url)

More Browser Emulations

As specified above, we’re in the process of building an advanced scraper because the Kaggle public LB page doesn’t display the entire table when the web page is loaded initially. So we have to scroll down to the bottom of the page and then click the expand button at the end. (As in the below screenshot).

Kaggle Public LB Page Bottom Expand Button

### scrolling the page to its bottom
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight)")
### clicking the expand button at the end of the table
smart_list <- remDr$findElement("class name", "competition-leaderboard__load-more-count")
smart_list$clickElement()
### waiting for the expanded table to load, then scrolling to the bottom again
remDr$setImplicitWaitTimeout(milliseconds = 10000)
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight)")

After the above code execution and along with some page load time, the browser should show the expanded Leaderboard table (as in the below screenshot).

Kaggle Public LB Page with Complete Data

Data Extraction

Now that the web page displays the complete LB table, we can simply extract the page source and use traditional web scraping methods to extract the cleaned-up table data. The below code does exactly that.

source <- remDr$getPageSource() # web page source
lb <- read_html(as.character(source)) %>%
  html_table() %>%
  as.data.frame() # content extraction using rvest
write.csv(lb, "lb.csv", row.names = F)

As you can see in the above code, we first save the complete web page source as the R object source. This source contains a large body of HTML, which is then parsed using read_html(). Since the Kaggle public LB uses a standard HTML table, we can extract it with html_table() and save it as a dataframe in the R object lb.

Finally, we’ll save the dataframe lb in a file lb.csv for archiving purposes and other possible uses.

Data Visualization and Insights

With the extracted LB data, we can build a few charts to derive insights about the competition and team performances.

Public LB Score Density Plot

## Public LB Score Density Plot
ggplot(lb) +
  geom_density(aes(Score)) +
  scale_x_log10() +
  theme_minimal() +
  labs(title = "Public LB Score Density Plot",
       subtitle = "with Logarithmic Score")

Public LB Score Density Plot

This plot helps us understand which score range is most crowded, i.e., where the majority of scores fall.

Number of Entries Density Plot

Another data point in Kaggle competitions is the number of times each Kaggler has submitted a solution. The below code generates a density plot of the Number of Entries.

## Number of Entries Density Plot
ggplot(lb) +
  # geom_histogram(aes(Entries)) +
  geom_density(aes(Entries)) +
  scale_x_log10() +
  theme_minimal() +
  labs(title = "Number of Entries Density Plot")

Number of Entries Density Plot

The above plot shows a first peak at 1 entry, which is expected, since many participants become inactive after their first submission. A second peak appears just before 5 entries, followed by a steep decline.

Summary

The primary objective of this tutorial was to introduce the concept of advanced scraping and build an advanced web scraper using RSelenium and rvest. We then used this scraper to extract Kaggle public leaderboard data, which could help Kagglers who are active in competitions. The entire code and plots used in this tutorial are available on my GitHub.

Do you have any interesting use cases for an advanced web scraper? Would love to hear about them in the comments!