Pyppeteer, the snake charmer

Or how to remotely control a browser from Python.

Photo by Jordan Gellie

You can also read this article in Spanish here.

After years of developing software, one of the tasks I enjoy most when I join a new project is investigating the possible solutions. Thanks to the enormous amount of free software available today, doing so can help you find the most appropriate approach to a problem and sometimes, with luck, a direct solution. Even if you are not that lucky, along the way you will surely find utilities, libraries, and software that may prove useful in the future, or that are at least interesting.

That is how I came across pyppeteer, a port of puppeteer to Python, while looking for a way to meet some of the requirements of one of my latest projects at Commite Inc., which consists of extracting and analyzing data from different web pages and apps.

To be honest, this is hardly an unexplored field in web development, especially with the languages we usually use in the Commite stack: Python and JavaScript. Both languages have a massive number of projects and libraries for web scraping and for testing web applications, so many that it becomes difficult to decide which ones to use.

But back to the requirements of the project, and in particular to extracting information from different websites. Some of them are very dynamic: some are built with React, others with Angular, and others have parts written in JavaScript with our old friend jQuery. None of this, a priori, should be a problem. The problem is that ‘the web’ today is quite complicated, and the data we need can often only be reached by interacting with the interface, for example by clicking a button that makes an Ajax call to the server. A section may even appear only when the cursor is placed over an element or, worse, the whole structure of the page may be dynamic, as in an SPA.

What if we could directly extract the information from a browser and manage it in an automated way? What if we could control the pointer or simulate the entry of data by keyboard?

The solution: a browser! Enter pyppeteer

Pyppeteer, written in Python, is a port of puppeteer, a JavaScript library developed by Google for controlling and automating Chrome / Chromium. It is a modern snake charmer for our browser. Pyppeteer gives us almost total control of Chromium / Chrome: we can open tabs, analyze the DOM in real time, execute JavaScript, connect to a running browser, and even download Chromium.
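To make this more concrete, here is a minimal sketch of the kind of interaction described above: open a tab, hover over and click a button, wait for the page to update, and read the result with JavaScript. The function names and the selectors passed to them are hypothetical, and I am assuming pyppeteer is installed and that the target page really exposes such elements.

```python
import asyncio


async def extract_after_click(url: str, button_selector: str, result_selector: str) -> str:
    """Open `url`, click a button and return the text that appears afterwards.

    Hypothetical helper: the selectors depend entirely on the target page.
    """
    # Imported lazily so this module can be loaded without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(headless=True)  # may download Chromium on first use
    try:
        page = await browser.newPage()               # open a new tab
        await page.goto(url)                         # navigate to the page
        await page.hover(button_selector)            # place the pointer over the element
        await page.click(button_selector)            # e.g. trigger the Ajax call
        await page.waitForSelector(result_selector)  # wait for the DOM to be updated
        # Execute JavaScript inside the page to read the freshly rendered text.
        return await page.evaluate(
            "(selector) => document.querySelector(selector).textContent",
            result_selector,
        )
    finally:
        await browser.close()


def extract_after_click_sync(url: str, button_selector: str, result_selector: str) -> str:
    """Blocking wrapper for callers that are not already running an event loop."""
    return asyncio.run(extract_after_click(url, button_selector, result_selector))
```

A call such as `extract_after_click_sync("https://example.com", "#load-more", "#results")` would then return the text of the element, provided those (made-up) selectors exist on the page.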

Until relatively recently, using a browser for this type of task required projects such as PhantomJS or “trimmed” browsers, usually built from the Chromium project code. With the addition of “headless” modes to Firefox and Chrome, even that is no longer necessary. Headless mode renders and parses a web page without the user interface, producing the same result as the traditional mode. This means browsers can run remotely on a server without a desktop environment, and even inside a Docker container.
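As a rough sketch of what running headless on a server might look like with pyppeteer: the helper names below are hypothetical, and the two Chromium flags are only the ones commonly suggested for containers (running without the sandbox has security implications, so treat them as an assumption to check against your own setup).

```python
def headless_launch_options(in_container: bool = False) -> dict:
    """Build keyword arguments for pyppeteer's launch() (hypothetical helper)."""
    options = {"headless": True, "args": []}
    if in_container:
        # Assumption: flags often needed when Chromium runs as root in Docker
        # and /dev/shm is small. Adjust or drop them for your environment.
        options["args"] += ["--no-sandbox", "--disable-dev-shm-usage"]
    return options


async def open_headless_browser(in_container: bool = False):
    """Start a headless Chromium; no desktop environment is required."""
    # Imported lazily so the options helper works without pyppeteer installed.
    from pyppeteer import launch

    return await launch(**headless_launch_options(in_container))
```

Keeping the options in a plain function makes it easy to inspect or log what the browser will be started with before anything is launched.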

What are the available alternatives?

The idea of controlling a browser starts with the venerable Selenium. Without going into too much detail, Selenium is a set of technologies for controlling a browser remotely, and it has long been the de facto standard for the task. Developed in Java, it works with practically any browser and has libraries for almost any language. Meanwhile, the W3C is in the process of standardizing WebDriver, a protocol for the remote control of browsers, with GeckoDriver and ChromeDriver being the respective implementations for Firefox and Chrome.

In particular, Firefox has Marionette, which is quite simple to use and decently documented. In fact, it was my initial choice for the project. However, it has some drawbacks: at the moment the library only supports Python 2.7 (let’s go, Mozilla!) because of its base dependencies, and it is not asynchronous, so it feels a bit strange to work with.

Chromium, for its part, has the DevTools Protocol as its low-level communication protocol, which offers a lot of functionality. On top of it sits the best-known option, Puppeteer in JavaScript: widely used, well documented, and used as a base for other libraries.
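To give an idea of what "low-level" means here, the DevTools Protocol exchanges JSON messages, each command carrying an id, a method name, and its parameters. The sketch below only builds such messages; `cdp_command` is a hypothetical helper of mine, and actually sending them over the browser's WebSocket is exactly the plumbing that libraries like Puppeteer take care of.

```python
import itertools
import json

# Each command needs a unique id so the response can be matched to it.
_command_ids = itertools.count(1)


def cdp_command(method: str, **params) -> str:
    """Serialize a DevTools Protocol command as JSON (hypothetical helper)."""
    return json.dumps({"id": next(_command_ids), "method": method, "params": params})


# Two commands a higher-level library would send on our behalf:
navigate = cdp_command("Page.navigate", url="https://example.com")
read_title = cdp_command("Runtime.evaluate", expression="document.title")
```

Seen this way, a call like `page.goto(url)` in a higher-level library is essentially a friendlier wrapper around messages of this shape.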

On a related note, in Python there is of course Scrapy, and I have also found this little gem, http://html.python-requests.org/, from the creator of requests and pipenv, among others (digging into its code was how I discovered pyppeteer).