Cool stuff we’ve learned thus far

Listed below are some tips & tricks we use almost daily at Phantombuster. The code examples use our own scraping library, but they’re easy to rewrite for any other Headless Chrome tool. We’re really more interested in sharing the concepts here.

Put the cookies back in the cookie jar 🍪

Scraping with a full-featured browser gives you peace of mind. No need to worry about CORS, sessions, cookies, CSRF and other modern web stuff. Just simulate a human and you’re in.

But sometimes login forms are so hardened that restoring a previously saved session cookie is the only solution to get in. Some sites will send emails or text messages with codes when they feel something is off. We don’t have time for that. Just open the page with your session cookie already set.
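Here’s a minimal sketch of what that looks like with NickJS. The cookie value is obviously a placeholder, and the selector we wait for at the end is just a guess at an element that’s only visible when logged in:

```javascript
// Sketch: set the session cookie *before* opening the page
const Nick = require("nickjs")
const nick = new Nick()

nick.newTab().then(async (tab) => {
  await nick.setCookie({
    name: "li_at", // LinkedIn's session cookie
    value: "a-session-cookie-value-copied-from-your-browser", // placeholder
    domain: "www.linkedin.com"
  })
  // No login form: we land directly on the feed
  await tab.open("https://www.linkedin.com/feed/")
  await tab.untilVisible("#extended-nav") // hypothetical selector, only present when logged in
  await nick.exit()
})
```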

Bypassing the LinkedIn login form by setting a cookie

A famous example of that is LinkedIn. Setting the li_at cookie will guarantee your scraper bot access to their social network (please note: we encourage you to respect your target website’s ToS).

We believe websites like LinkedIn can’t afford to block a real-looking browser with a valid session cookie. It’s too risky for them, as false positives would trigger too many support requests from angry users!

jQuery will never let you down

If there’s one important thing we’ve learned, it’s this one. Extracting data from a page with jQuery is very easy. In retrospect, it’s obvious. Websites give you a highly structured, queryable tree of data-containing elements (it’s called the DOM) — and jQuery is a very efficient DOM query library. So why not use it to scrape? This “trick” has never failed us.

A lot of sites already come with jQuery so you just have to evaluate a few lines in the page to get your data. If that’s not the case, it’s easy to inject it:
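Here’s a sketch with NickJS. The Hacker News selectors were valid at the time of writing but may have changed since:

```javascript
// Sketch: inject jQuery into the page, then scrape from the page context
const Nick = require("nickjs")
const nick = new Nick()

nick.newTab().then(async (tab) => {
  await tab.open("news.ycombinator.com")
  await tab.untilVisible("#hnmain") // make sure the page has loaded
  await tab.inject("https://code.jquery.com/jquery-3.2.1.min.js") // jQuery wasn't already there
  const hackerNewsLinks = await tab.evaluate((arg, callback) => {
    // We're in the page context here, just like in your browser's inspector
    const data = []
    $(".athing .storylink").each((index, element) => {
      data.push({ title: $(element).text(), url: $(element).attr("href") })
    })
    callback(null, data)
  })
  console.log(hackerNewsLinks)
  await nick.exit()
})
```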

Scraping the Hacker News homepage with jQuery (yes, we know they have an API)

What do India, Russia and Pakistan have in common?

Screenshot from anti-captcha.com (they’re not kidding 😀)

The answer is CAPTCHA solving services*. You can buy solves by the thousand for a few dollars, and it generally takes less than 30 seconds per CAPTCHA. Keep in mind that it’s usually more expensive during their nighttime, as fewer humans are available.

A simple Google search will give you multiple choices of APIs for solving any type of CAPTCHA, including the latest reCAPTCHAs from Google ($2 per 1000).

Hooking your scraper code to these services is as easy as making an HTTP request. Congratulations, your bot is now a human!

On our platform, we make it easy for our users to solve CAPTCHAs should they require it. Our buster library can make calls to multiple solving services:
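A sketch of what that looks like, assuming our phantombuster SDK module and a buster.solveCaptchaImage() call; the selectors are hypothetical:

```javascript
// Sketch: detect a CAPTCHA image, solve it through a service, submit the answer
// (.captchaImage and .captchaForm are hypothetical selectors; jQuery is already injected)
const Buster = require("phantombuster")
const buster = new Buster()

if (await tab.isVisible(".captchaImage")) {
  // Grab the URL of the generated CAPTCHA image from the page context
  const captchaImageLink = await tab.evaluate((arg, callback) => {
    callback(null, $(".captchaImage").attr("src"))
  })
  // One call to a solving service and we get the answer back as text
  const captchaAnswer = await buster.solveCaptchaImage(captchaImageLink)
  // Fill and submit the form with the answer
  await tab.fill(".captchaForm", { "captcha-answer": captchaAnswer }, { submit: true })
}
```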

Handling a CAPTCHA problem like it’s nothing

*It’s a joke. I have to say it otherwise I receive emails…

Wait for DOM elements, not seconds

We often see scraping beginners make their bot wait 5 or 10 seconds after opening a page or clicking a button, to be sure the action has had time to take effect.

But that’s not how it should be done. Our 3-step theory applies to any scraping scenario: wait for the specific DOM elements you want to manipulate next. It’s faster, clearer, and you’ll get more accurate errors if something goes wrong:
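A sketch of the pattern (the selectors are made up; tab is a NickJS tab):

```javascript
// Sketch: wait for an element, act on it, then wait for the effect of the action
await tab.open("example.com")
await tab.untilVisible("#first-button")   // step 1: wait for the button to exist
await tab.click("#first-button")          // step 2: act on it
await tab.untilVisible(".result-list")    // step 3: wait for what the click produces, not N seconds
```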

It’s true that in some cases it might be necessary to fake human delays. A simple await Promise.delay(2000 + Math.random() * 3000) (using Bluebird) will do the trick.

MongoDB 👍

We’ve found MongoDB to be a good fit for most of our scraping jobs. It has a great JS API, and the Mongoose ODM is handy. Since Headless Chrome already puts you in a NodeJS environment, why do without it?
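Persisting a scraped item takes just a few lines. A sketch, where the schema and connection string are made up:

```javascript
// Sketch: a made-up Mongoose schema for scraped products
const mongoose = require("mongoose")
mongoose.connect("mongodb://localhost/scraping")

const Product = mongoose.model("Product", new mongoose.Schema({
  title: String,
  url: String,
  scrapedAt: { type: Date, default: Date.now }
}))

// Somewhere in your async scraping code:
await Product.create({ title: "Some product", url: "https://example.com/p/1" })
```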

JSON-LD & microdata exploitation

Sometimes scraping is not about making sense of the DOM but more about finding the right “export” button. Remembering this has saved us a lot of time on multiple occasions.

Kidding aside, some sites will be easier than others. Let’s take Macys.com as an example. All of their product pages come with the product’s data embedded as JSON-LD directly in the DOM. Seriously, go to any of their product pages and run: JSON.parse(document.querySelector("#productSEOData").innerText)

You’ll get a nice object ready to be inserted into MongoDB. No real scraping necessary!
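From a scraper, fetching it is a one-liner in the page context. A sketch with NickJS (tab is a NickJS tab):

```javascript
// Sketch: pull the embedded JSON-LD straight from the page context
const productData = await tab.evaluate((arg, callback) => {
  callback(null, JSON.parse(document.querySelector("#productSEOData").innerText))
})
// productData is a plain object, ready to be inserted into MongoDB
```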

Intercepting network requests

Because we’re using the DevTools API, the code we write has the equivalent power of a human using Chrome’s DevTools. That means your bot can intercept, examine and even modify or abort any network request.

We tested this by downloading a PDF CV export from LinkedIn. Clicking the “Save to PDF” button from a profile triggers an XHR in which the response content is a PDF file. Here’s one way of intercepting the file and writing it to disk:
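(A sketch; we match on the response’s MIME type rather than on LinkedIn’s exact XHR URL, which may be more robust.)

```javascript
// Sketch: tab.driver.client exposes the Chrome Remote Interface API (see note below)
tab.driver.client.Network.responseReceived(async (e) => {
  if (e.response.mimeType === "application/pdf") { // the "Save to PDF" XHR answers with a PDF
    const { body, base64Encoded } = await tab.driver.client.Network.getResponseBody({
      requestId: e.requestId
    })
    // The file comes back in the XHR response body, possibly base64-encoded
    require("fs").writeFileSync("cv.pdf", Buffer.from(body, base64Encoded ? "base64" : "utf8"))
  }
})
```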

Here, “tab” is a NickJS tab instance from which we get the Chrome Remote Interface API.

By the way, the DevTools protocol is evolving rapidly. There’s now a way to set how and where incoming files are downloaded with Page.setDownloadBehavior(). We have yet to test it, but it looks promising!
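If the protocol docs are to be believed, it should be as simple as this (an untested sketch):

```javascript
// Untested sketch, based on the protocol docs: route downloads to a folder of our choice
await tab.driver.client.Page.setDownloadBehavior({
  behavior: "allow",
  downloadPath: "/tmp/downloads"
})
```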

Ad-blocking

Below is an example of an extremely aggressive request filter. The blacklist further blocks requests that passed the whitelist.
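(A sketch using NickJS’s whitelist/blacklist options; the patterns are illustrative.)

```javascript
// Sketch: NickJS whitelist/blacklist options, with illustrative patterns
const Nick = require("nickjs")
const nick = new Nick({
  loadImages: false, // don't even fetch images
  whitelist: [
    /.*\.aspx/,     // only let through what the target site needs to function
    /.*\/signin.*/
  ],
  blacklist: [
    /.*google-analytics.*/, // block trackers and ads that slipped past the whitelist
    /.*doubleclick.*/
  ]
})
```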

In the same vein, we can speed up our scraping by blocking unnecessary requests. Analytics, ads and images are typical targets. However, you have to keep in mind that it will make your bot less human-like (for example LinkedIn will not serve their pages properly if you block all images — we’re not sure if it’s deliberate or not).

In NickJS, we let the user specify a whitelist and a blacklist populated with regular expressions or strings. The whitelist is particularly powerful but can easily break your target website if you’re not careful.

The DevTools protocol also has Network.setBlockedURLs(), which takes an array of strings with wildcards as input.

What’s more, new versions of Chrome will come with Google’s own built-in “ad-blocker” — it’s more like an ad “filter” really. The protocol already has an endpoint called Page.setAdBlockingEnabled() for it (we haven’t tested it yet).
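Both endpoints can be reached through the Chrome Remote Interface client. An untested sketch:

```javascript
// Untested sketch: call both endpoints through the Chrome Remote Interface client
await tab.driver.client.Network.setBlockedURLs({
  urls: ["*://*.doubleclick.net/*", "*google-analytics*"] // wildcard patterns
})
await tab.driver.client.Page.setAdBlockingEnabled({ enabled: true })
```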

That’s it for our tips & tricks! 🙏