PhantomJS is a headless WebKit browser, which lets you run JavaScript in a browser from the command line. It adds API calls that facilitate automated testing, screenshots, and scraping. I thought it would be interesting to write a script that retrieves AdSense destination URLs and ad text with PhantomJS.

Extracting advertisement blocks requires only fairly simple CSS selectors. Google can’t change the format too often, since each publisher must paste a code snippet into their site. Some ad networks render advertisements inside an iframe, so a script running in the page may hit the browser’s same-origin security restrictions. Extracting ad data from a page of Home Depot’s website gives us the following results:

Drywall Materials Sale, http://www.compare99.com/compare.html%3Fq%3Ddrywall-products%26ort%3DDrywall-Materials-Sale%26adid%3DiaCkp56m1aqplM3OkH6Tp8bUzJKepofRzm52pdrZxJ2eYK7D15aknMLO1lelcMjD2KRYlsnD1W6W
Sheetrock, http://shopping.yahoo.com/search%3B_ylc%3DX3oDMTJ1dGkyY2Y5BF9TAzk2MDc5MjYwBGsDc2hlZXRyb2NrBHNlbV9hY3QDMjYyOTkxMDA5MARzZW1fYWRnAzE5NjgwNTY2MwRzZW1fY21wAzM3NDI5MTMEc2VtX2t3aWQDMTU0NTgwMDE-%3Fp%3Dsheetrock%26sem%3DGoogle
Sheetrock Material Sale, http://www.buycheapr.com/us/result.jsp%3Fga%3Dus19%26q%3Dsheetrock%2Bmaterial
Installation Framing Door, http://www.moifriefacility.com
Architectural GFRG, http://www.sbgrace.com
WallBuilders Library, http://www.logos.com/products/details/2982%3Fgoogleads
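
The destination URLs in these results are still percent-encoded (%3F stands for ?, %26 for &). A minimal sketch of decoding them for display follows; the helper name decodeDestination is hypothetical and is not part of the demo script.

```javascript
// Hypothetical helper: decode a percent-encoded destination URL for display.
// decodeURIComponent throws a URIError on malformed escape sequences,
// so fall back to returning the input unchanged in that case.
function decodeDestination(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (e) {
    return encoded;
  }
}

// One of the destination URLs from the results above.
var sample = 'http://www.buycheapr.com/us/result.jsp%3Fga%3Dus19%26q%3Dsheetrock%2Bmaterial';
console.log(decodeDestination(sample));
// prints http://www.buycheapr.com/us/result.jsp?ga=us19&q=sheetrock+material
```

The fallback matters when scraping, since ad markup is not guaranteed to contain well-formed escapes and one bad URL should not abort the whole run.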

I’ve written a short demo, which retrieves ad text and a screenshot for testing. It is invoked as follows (the source is below, and on GitHub):

phantomjs adsense.js "http://www.homedepot.com/Building-Materials-Drywall/FibaTape/h_d1/N-5yc1vZar3dZ38m/h_d2/Navigation?catalogId=10053&Nu=P_PARENT_ID&langId=-1&storeId=10051"

The code is almost too easy: tell PhantomJS to load a page, run JavaScript in the page context, and parse the AdSense URL format. Tracking scope takes some care, though, since some code runs in the PhantomJS context and some in the page context. PhantomJS also does not exit when a script reaches its end, because many browser actions are asynchronous, so scripts must track state and call phantom.exit() at the end of every branch.
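
To make the two contexts and the exit handling concrete, here is a minimal sketch. It is not the demo script itself. It assumes that ad links carry their destination in an adurl query parameter whose value is percent-encoded, and that those links can be found with the selector a[href*="adurl="]. The helper name getQueryParam and the output file screenshot.png are likewise hypothetical. The sketch only inspects the main document, so ads rendered inside an iframe would be missed.

```javascript
// Return the raw (still percent-encoded) value of one query parameter,
// or null if it is absent. ES5 style, since PhantomJS uses an older engine.
function getQueryParam(url, name) {
  var queryStart = url.indexOf('?');
  if (queryStart === -1) { return null; }
  var pairs = url.slice(queryStart + 1).split('#')[0].split('&');
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf('=');
    if (eq !== -1 && pairs[i].slice(0, eq) === name) {
      return pairs[i].slice(eq + 1);
    }
  }
  return null;
}

// Only run the driver when the PhantomJS global is available.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();

  if (system.args.length < 2) {
    console.log('Usage: phantomjs sketch.js <url>');
    phantom.exit(1); // exit on the usage-error branch
  } else {
    page.open(system.args[1], function (status) {
      if (status !== 'success') {
        console.log('Failed to load page');
        phantom.exit(1); // exit on the load-failure branch
        return;
      }
      // Page context: this function cannot see getQueryParam or any
      // other outer variable, and must return JSON-serializable data.
      var ads = page.evaluate(function () {
        var links = document.querySelectorAll('a[href*="adurl="]'); // assumed selector
        var found = [];
        for (var i = 0; i < links.length; i++) {
          found.push({ text: links[i].textContent, href: links[i].href });
        }
        return found;
      });
      // PhantomJS context again: parse the raw hrefs here.
      for (var i = 0; i < ads.length; i++) {
        console.log(ads[i].text + ', ' + getQueryParam(ads[i].href, 'adurl'));
      }
      page.render('screenshot.png');
      phantom.exit(0); // exit on the success branch
    });
  }
}
```

Keeping the parsing in the PhantomJS context means the page-context function only collects plain data, which sidesteps most of the scope confusion. Every branch ends in phantom.exit(), so the process cannot hang after an error.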