Do you want to access vast amounts of data and test your website without lifting a finger? Do you have a mountain of online grunt work that a computer program could handle? Then you need browser automation with the Selenium WebDriver. Selenium is the premier tool for testing and scraping JavaScript-rendered web pages, and this tutorial will cover everything you need to set up and use it on any operating system. Because Selenium's dependencies are most easily supplied by a Docker container running on a Linux virtual machine, these technologies are introduced and discussed along the way. Lastly, an introduction to programming with Docker and a step-by-step protocol for setting up Selenium and binding it to RStudio are given. Get ready to enter the world of browser automation, where your tedious tasks will be delegated to a daemon, and your prefrontal faculties will be freed to focus on philosophy.

Selenium: 34th in the periodic table, 1st in browser automation

The Why

Why use Selenium to automate browsers?

Why use Selenium to automate web browsers? As mentioned above, the two main reasons are web testing and data scraping. Without web testing, programmers at companies like Apple would be unable to check whether new features work as expected before they go live, which could lead to unfortunate bugs for users (like those that occurred in the iOS 12 update). While customers are usually shocked when a company such as Apple releases buggy software, the sheer complexity of an iPhone and the number of new or updated features in each update (nearly 100 for iOS 12) make at least some mishaps extremely likely. Not only must each new component be tested, but its interactions with the rest of the phone must also be checked.

However, bugs can be avoided by thorough testing, which is where browser automation comes in. While manual testing remains an integral component of a testing protocol, it is impractical to test so many complex functionalities and their interactions entirely by hand. With browser automation, use cases can be tested thousands of times in different environments, flushing out bugs that only occur under unusual circumstances. Then, when Apple rolls out another big update, it can rerun a saved testing protocol instead of devising a new one, a practice known as regression testing. Thus, automated testing allows companies to increase customer satisfaction and avoid bugs.

Some of the new features in iOS 12, including notification grouping and Screen Time.

The second reason for driving a web browser inside a programming environment is web scraping, the process of extracting content from web pages to use in your own projects or applications. While web scraping can be performed without a WebDriver like Selenium, the capabilities of such tools are limited. These "driver-less" packages, which include rvest in R and Beautiful Soup in Python, cannot execute JavaScript in a browser and thus cannot access any JavaScript-rendered elements. While they are able to download a website's source code as an HTML document, they cannot access any data that results from user interaction. This is due to the particular way that HTML, CSS, and JavaScript cooperate to build modern web pages.

When you first open a website, the content you see comes from its source code, which you can view in Google Chrome at any time by pressing Ctrl+U (Ctrl+Shift+I opens the developer tools, which show the rendered page instead). This source code is primarily written in HTML and CSS, with the former responsible for the website's structure and the latter for its style. While a detailed discussion of HTML and CSS is beyond the scope of this article, all we need to know is that HTML tags and CSS selectors format web elements, and the combination of the two gives each web element its own unique identifier. These unique identifiers allow driverless web scrapers to differentiate web elements and pull out only the relevant information from the source code.
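As a quick illustration, here is a minimal sketch of driverless scraping in R with rvest; the URL and the h1 selector are placeholders for this example only, not something used later in the tutorial:

library(rvest)

page <- read_html("https://example.com")           # download the page's static source code
headlines <- html_text(html_elements(page, "h1"))  # select elements by CSS selector, then extract their text
headlines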

As a user interacts with the website, however, new elements are formed and existing ones are altered by JavaScript. These elements are not contained in the page's source code, and are thus inaccessible to the packages mentioned above. For example, most websites require a user to log in before accessing information, meaning everything after login will be beyond the reach of driverless web scrapers. If the internet is really the "information highway," source code is only the onramp, and it ends at the entrance. With Selenium, we can drive our way to Destination Data, population: infinite, provided we know how to steer.

The internet as a highway: Once you’re on, there’s no telling where you’ll go.

The How

Virtual Machines and Containers

Selenium can't simply be downloaded and run on its own, because it needs 1) specific libraries and 2) a specific operating environment (the standalone Selenium server images we will use, for example, run on Linux rather than OS X). While in the past virtual machines (VMs) supplied such dependencies, today it is much more efficient to use the combination of a container and Docker. To understand why, we first need to review the uses and underpinnings of these tools.

VMs are emulations of one computer inside of another. Multiple VMs can run inside one host, and the interactions between host and emulation are managed by a hypervisor. Since VMs contain everything a physical computer does, they are often used to “test drive” new operating systems (OSs) without having to buy new hardware. The reason for this is that, while many people think of operating systems as virtually interchangeable, there can be significant differences between them. An OS is more than just an insignia on a motherboard; instead, it is the toolset that you must use to get the hardware to do what you want. If the operator is a painter then the OS is the paintbrush, the translator from thought to reality that can restrict or expand what is possible. Therefore, since your choice of OS can have far-reaching consequences, the ability to test-drive a new one with ease is invaluable.

While VMs remain useful for testing new OSs, containers are a better way of providing the libraries a piece of software needs. Since each VM you download comes with its own OS, using one simply to provide libraries is like packing up your entire closet for a weekend camping trip: sure, you may have everything you need, but the four ties and the wingtips certainly seem like overkill. (Maybe just take the quarter brogues, bro.) This is where containers come in. If VMs are voracious overpackers, containers are cutthroat carry-on crammers, because each container comes with only the libraries necessary to run one piece of software. Every container on a machine shares a single (Linux) OS, which greatly reduces storage requirements and startup time. But where do they get this Linux OS to run on? That is where Docker comes in.

The different architectures of VMs and containers.

Docker

Docker is the leading software for running and distributing containers, and its primary purpose is to provide the Linux OS that containers run on. This Linux OS is managed on Windows and Mac by each platform's native hypervisor (Hyper-V and HyperKit, respectively). Containers can thus make use of every aspect of the Linux operating system, including its filesystem and kernel, during bootup and runtime. Two important components, the daemon and the command line interface (CLI), together comprise the Docker Engine, which is used to perform most of your container's tasks. The Docker daemon (also called dockerd) is a server that runs unobtrusively in the background, waiting for a specific event or for the user to call on it. When we want something done in Docker, we use the CLI to send a message to the daemon. The commands that call upon the Docker daemon, called Docker commands, follow a general template for their usage, which I review below.

Most docker commands contain an action, a path, and options. The action is written as docker followed by what we want the daemon to do. For example, if we want the daemon to start a container, it has to run an image, so the action is docker run. (An image is simply a file that, when executed, starts the container: if the container is a cake, the image is the recipe.) The path specifies which file we want the daemon to perform the action on and where that file is located. In docker run, the path tells the daemon where to find the image (by default, Docker Hub, Docker's cloud-based image repository) and the name of the image file. If there are different versions of a file, you can choose one by supplying a tag; if no tag is specified, the latest tag is pulled automatically. Lastly, options modify the command. In the case of docker run, there are hundreds of options (which you can see on its reference page). While you can ignore or use the default for most of them, some do need to be specified, as we will see below.
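For example, to pull a specific version of the Selenium image instead of the default latest tag, you append the tag after a colon. The version below is only an illustration; check Docker Hub for the tags that actually exist:

docker pull selenium/standalone-chrome:3.141.59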

The Docker Engine: When you type docker pull in the CLI, the Docker daemon looks in the Docker Hub for your image.

Options in docker run

Since docker run starts our containers, it is one of the most important Docker commands. It thus makes sense that it has so many options. This can make the command look complex, as you can see in the example below, which starts the Selenium ChromeDriver:

docker run -d -v LOCAL_PATH:/home/seluser/Downloads -p 4445:4444 --shm-size=2g --name YOUR_CONTAINER_NAME selenium/standalone-chrome

In fact, the above command is quite simple; there are only five options specified between the action and the path. Let's review these five options below.

The -d option tells the container to run in detached mode, i.e., in the background. This keeps the application's output hidden, allowing us to continue using the terminal.

The -v option creates a bind mount, and it is essential for data scraping. It tells Docker to bind a directory inside the container (which lives in the Linux VM) to a folder on the host machine (i.e., our home computer), so anything downloaded to that container directory is mirrored in the folder we specified on our machine. When you shut down Docker and your running containers, data saved inside them does not persist, so this is a very important step in saving our data! (Another option is using persistent volumes.) To use -v, first specify the folder on your home computer that you want the data transferred to and then the directory in the container that you want to use, separated by a colon. When you actually start running Selenium, make sure your code saves your data to the container directory you specified!

The --shm-size option increases the size of the /dev/shm directory, a shared-memory temporary file system, because the container's default shared memory is too small for Chrome to run reliably. I have had success with the size set at 2 gigabytes, following this GitHub discussion.

The -p option specifies which ports the Linux VM and the container should connect through: first the port on the Linux VM (the host side), then the port on the container. The Selenium image exposes port 4444 by default, and here we use 4445 for the host port. When we later bind RStudio to the Selenium container running inside the Linux VM, we will use this outward-facing port 4445.

The --name option lets us give our container a specific name. If we don't specify one, Docker names the container for us using its default naming system, which is actually really cool. Instead of using the UUID, a long jumble of numbers and letters that is difficult to read and remember, Docker randomly merges an adjective and a famous scientist, and somehow the combination is always catchy. (It also feels pretty sweet coding in a container called kickass_chandrasekhar.)

And that's it! These five options are all you need to build the complicated-looking docker run command above.
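For instance, a filled-in version of the command might look like the line below; the host folder and container name are placeholders of my own, so substitute the path and name you actually want to use (on Windows, the host path would look more like C:/Users/you/Downloads):

docker run -d -v /home/you/selenium-downloads:/home/seluser/Downloads -p 4445:4444 --shm-size=2g --name my_selenium selenium/standalone-chrome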

Now that we have a firm grasp of Selenium, VMs, containers, and Docker, it’s time to finally download and set up the Selenium ChromeDriver. Let’s go!

Steps to download and set up Docker and Selenium ChromeDriver

1. Download the right version of Docker for your OS and work type (business, personal, etc.). Docker provides both enterprise and community editions (CE); for those looking to go deeper with containers, Moby splits up the components of containers and allows users to assemble them individually, like Legos. For our purposes, Docker CE will work fine.

2. On Windows, make sure virtualization is enabled so that Docker can start the Linux VM. You can do this by entering BIOS and enabling Virtualization (usually labeled VT-x or AMD-V, depending on your processor). (On many machines you reach BIOS by pressing F10 during startup and going to System Configuration; the key varies by manufacturer.)

3. Execute the steps to install and set up Docker. At the end, Docker's characteristic whale will appear in the terminal.

4. Download the ChromeDriver: pull the image for the Selenium ChromeDriver by typing docker pull selenium/standalone-chrome in the terminal. Since we did not specify a version, the latest tag will be pulled. You should see Using default tag: latest: Pulling from selenium/standalone-chrome, followed by Status: Downloaded newer image for selenium/standalone-chrome:latest.

5. Run the Selenium ChromeDriver using the command from above. Remember to replace LOCAL_PATH and YOUR_CONTAINER_NAME with the folder and name you want to use.

docker run -d -v LOCAL_PATH:/home/seluser/Downloads -p 4445:4444 --shm-size=2g --name YOUR_CONTAINER_NAME selenium/standalone-chrome
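To confirm the container actually started, you can list your running containers and peek at the container's logs (replace YOUR_CONTAINER_NAME with the name you chose):

docker ps
docker logs YOUR_CONTAINER_NAME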

Now that we have Docker set up and running, I'll show you how to bind it to RStudio using RSelenium. If you are not an R user, there are articles on how to bind Selenium to other programming languages, such as Python and Ruby, or you can simply write scripts in the Docker CLI.

Downloading RSelenium

1. Type install.packages("RSelenium") in the RStudio console.

2. Load the package: library(RSelenium)

3. Set the options for the ChromeDriver. There are others that you can set, but these three are essential. The first will block popups, the second will ensure that files download without needing a prompt from you, and the third will determine where downloaded files end up (which should be the directory in the Linux VM that you specified earlier in docker run).

eCaps <- list(
  chromeOptions =
    list(prefs = list(
      "profile.default_content_settings.popups" = 0L,
      "download.prompt_for_download" = FALSE,
      "download.default_directory" = "/home/seluser/Downloads"
    ))
)

4. Create the bind from R to the Linux VM. The browser name is Chrome, the port is the port specified in docker run, the extra capabilities were specified above in step 3, and the remoteServerAddr is the IP of the Linux VM (192.168.99.100 is the default for docker-machine/Docker Toolbox; if you are running Docker Desktop, localhost generally works instead).

remDr <- remoteDriver(browserName = "chrome", port = 4445L, extraCapabilities = eCaps, remoteServerAddr = "192.168.99.100")

5. Finally, typing remDr$open() will bind R to the virtual OS. In your global environment you should see that remDr is an <Object containing active binding> .
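Once the session is open, a quick sanity check might look like the sketch below; the URL and CSS selector are placeholders for illustration only:

remDr$navigate("https://example.com")                            # load a page in the containerized Chrome
elem <- remDr$findElement(using = "css selector", value = "h1")  # locate an element by CSS selector
elem$getElementText()                                            # returns the rendered text of the element
remDr$close()                                                    # end the browser session when finished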

That's all, folks! You're now ready to start using Docker for your amazing web testing and data scraping projects! Thank you so much for reading, and please feel free to follow up with any questions on Twitter at @halfinit. See you next time!