



artoo.js

The client-side scraping companion

artoo.js is a piece of JavaScript code meant to be run in your browser’s console to provide you with some scraping utilities.

This nice droid is loaded into the JavaScript context of any webpage through a handy bookmarklet you can instantly install by dropping the above icon onto your bookmark bar.

Disclaimer

Web security has widely improved in the last years and most websites prevent JavaScript code injection nowadays by relying on Content Security Policy headers.

But fear not, it is always possible to shunt them using browser extensions and/or proper configuration:

For chrome see the Disable Content-Security-Policy extension, for instance

For firefox, check this stackoverflow response explaining how to shunt CSP

Bootcamp

Now that you have installed artoo.js let’s scrape the famous Hacker News in four painless steps:

Copy the following instruction.

artoo . scrape ( ' td.title:nth-child(3) ' , { title : { sel : ' a ' }, url : { sel : ' a ' , attr : ' href ' } }, artoo . savePrettyJson );

Go to Hacker News. Open your JavaScript console and click the freshly created bookmarklet (the droid should greet you and tell you he is ready to roll). Paste the instruction and hit enter.

That’s it. You’ve just scraped Hacker News front page and downloaded the data as a pretty-printed json file*.

* If you need a more thorough scraper, check this out.

Features

Scrape everything, everywhere : invoke artoo in the JavaScript context of any web page.

: invoke artoo in the JavaScript context of any web page. Loaded with helpers : Scrape data quick & easy with powerful methods such as artoo.scrape.

: Scrape data quick & easy with powerful methods such as artoo.scrape. Data download : Make your browser download the scraped data with artoo.save methods.

: Make your browser download the scraped data with artoo.save methods. Spiders : Crawl pages through ajax and retrieve accumulated data with artoo’s spiders.

: Crawl pages through ajax and retrieve accumulated data with artoo’s spiders. Content expansion : Expand pages’ content programmatically thanks to artoo.autoExpand utilities.

: Expand pages’ content programmatically thanks to artoo.autoExpand utilities. Store : stash persistent data in the localStorage with artoo’s handy abstraction.

: stash persistent data in the localStorage with artoo’s handy abstraction. Sniffers : hook on XHR requests to retrieve circulating data with a variety of tools.

: hook on XHR requests to retrieve circulating data with a variety of tools. jQuery : jQuery is injected alongside artoo in the pages you visit so you can handle the DOM easily.

: jQuery is injected alongside artoo in the pages you visit so you can handle the DOM easily. Custom bookmarklets : you can use artoo as a framework and easily create custom bookmarklets to execute your code.

: you can use artoo as a framework and easily create custom bookmarklets to execute your code. User Interfaces: build parasitic user interfaces easily with a creative usage of Shadow DOM.

Philosophy

« Why on earth should I scrape on my browser? Isn’t this insane? »

Well, before quitting the present documentation and run back to your beloved spiders, you should pause for a minute or two and read the reasons why artoo.js has made the choice of client-side scraping.

Usually, the scraping process occurs thusly: we find sites from which we need to retrieve data and we consequently build a program whose goal is to fetch those site’s html and parse it to get what we need.

The only problem with this process is that, nowadays, websites are not just plain html. We need cookies, we need authentication, we need JavaScript execution and a million other things to get proper data.

So, by the days, to cope with this harsh reality, our scraping programs became complex monsters being able to execute JavaScript, authenticate on websites and mimic human behaviour.

But, if you sit back and try to find other programs able to perform all those things, you’ll quickly come to this observation:

Aren’t we trying to rebuild web browsers?

So why shouldn’t we take advantage of this and start scraping within the cosy environment of web browsers? It has become really easy today to execute JavaScript in a a browser’s console and this is exactly what artoo.js is doing.

Using browsers as scraping platforms comes with a lot of advantages:

Fast coding : You can prototype your code live thanks to JavaScript browsers’ REPL and peruse the DOM with tools specifically built for web development.

: You can prototype your code live thanks to JavaScript browsers’ REPL and peruse the DOM with tools specifically built for web development. No more authentication issues : No longer need to deploy clever solutions to enable your spiders to authenticate on the website you intent to scrape. You are already authenticated on your browser as a human being.

: No longer need to deploy clever solutions to enable your spiders to authenticate on the website you intent to scrape. You are already authenticated on your browser as a human being. Tools for non-devs: You can easily design tools for non-dev people. One could easily build an application with a UI on top of artoo.js. Moreover, it gives you the possibility to create bookmarklets on the fly to execute your personnal scripts.

The intention here is not at all to say that classical scraping is obsolete but rather that client-side scraping is a possibility today and, what’s more, a useful one.

You’ll never find yourself crawling pages massively on a browser, but for most of your scraping tasks, client-side should enhance your productivity dramatically.

Contribution

Contributions are more than welcome. Feel free to submit any pull request as long as you added unit tests if relevant and passed them all.

To install the development environment, clone your fork and use the following commands:

# Install dependencies npm install # Testing npm test # Compiling dev & prod bookmarklets gulp bookmarklets # Running a test server hosting the concatenated file npm start # Running a https server hosting the concatenated file # Note that you'll need some ssl keys (instructions to come...) npm run https

Authors

artoo.js is being developed by Guillaume Plique @ SciencesPo - médialab.

Logo by Daniele Guido.

R2D2 ascii logo by Joan Stark aka jgs .

Under a MIT License.