Jordan Scrapes With scrapestack

Demo code here

This is a sponsored blog post by scrapestack. All reviews and opinions expressed here are, however, based on my personal experience.

I am a regular reader of the scraping hub subreddit. Just this last week someone posted about getting product status from Amazon without getting blocked. This kind of question is fairly common there and on other subreddits that discuss web scraping.

The slow down

My experience has been that it is very rare for companies to outright block you. You are their customer, and blocking is something very final. They want you to use their products; if they block you, you can't pay them.

A much more effective way of deterring web scraping is to slow you down with captchas or rate limiting. Getting timed out for even a minute hurts your scraping badly, and if a captcha pops up, you're pretty much out of luck. The website keeps you as a customer while still slowing down the web scraping.

Enter scrapestack

scrapestack is an extremely simple, and affordable, tool that helps with both the slowdown and the risk of getting blocked. This is literally all you need to get started:

http://api.scrapestack.com/scrape?access_key=YOUR_ACCESS_KEY&url=https://apple.com

Bam. Done. It handles proxies, IP addresses, and rotating geolocations. It has a very high success rate AND, most impressive of all, it's FAST. Way faster than I would have thought possible using proxies. I did a post earlier this year where I rolled my own proxying, and the slowdown there was a lot more significant than what I experienced with scrapestack.
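That single URL is the whole integration: wrap the page you want in the `url` parameter and send the request to the scrapestack endpoint. A minimal helper for building that request URL might look like this (YOUR_ACCESS_KEY is a placeholder for your own key):

```javascript
// Build a scrapestack request URL for any target page.
// URLSearchParams handles encoding the target URL for us.
function scrapestackUrl(accessKey, targetUrl) {
  const params = new URLSearchParams({ access_key: accessKey, url: targetUrl });
  return `http://api.scrapestack.com/scrape?${params}`;
}

// scrapestackUrl('YOUR_ACCESS_KEY', 'https://apple.com')
// → 'http://api.scrapestack.com/scrape?access_key=YOUR_ACCESS_KEY&url=https%3A%2F%2Fapple.com'
```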

I did testing with scrapestack using both axios and puppeteer. It performed amazingly well with axios; using it with puppeteer, while still pretty good, was a bit more complicated. I think the ideal way to use something like scrapestack is with something like axios exclusively. If you want to do page manipulation, however, puppeteer is still the way to go.

scrapestack + axios

Axios and scrapestack were made to be friends forever. I tested against both Amazon and Google.

When testing with Amazon, I hit the same page 10 times in a row, pretty much as fast as I could. The requests all went to a single product page, https://www.amazon.com/dp/B00170EB1Q. Here are the results:

| Method | Proxied | Speed | Batch size | Errors | Target |
| --- | --- | --- | --- | --- | --- |
| axios | yes | 6.705 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |
| axios | yes | 6.349 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |
| axios | yes | 6.187 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |
| axios | no | 5.526 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |
| axios | no | 4.747 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |
| axios | no | 4.867 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |

As you can see, the speed difference is very small. That was the most impressive part to me. I'm classifying an error here as getting blocked, captcha'd, or detected in some way. In the axios runs against Amazon I didn't hit a single one.
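For anyone curious what a harness like this looks like, here is a minimal sketch of the batch test, using Node 18+'s built-in fetch (swapping in axios.get is a one-line change). The phrases in the block check are my own guesses at what a block page contains, not an official list:

```javascript
// Classify a response the way the Errors column does: blocked,
// captcha'd, or otherwise detected. The phrases are assumptions.
function isBlocked(body) {
  return body === null || /captcha|robot check|unusual traffic/i.test(body);
}

// Fire batchSize concurrent requests at the same page and record
// total seconds and error count.
async function timeBatch(url, batchSize = 10) {
  const start = Date.now();
  const bodies = await Promise.all(
    Array.from({ length: batchSize }, () =>
      fetch(url).then(r => r.text()).catch(() => null)
    )
  );
  const errors = bodies.filter(isBlocked).length;
  return { seconds: (Date.now() - start) / 1000, batchSize, errors };
}
```

To run it through scrapestack, you'd pass the scrapestack-wrapped URL instead of the raw target.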

When testing with Google, the results were still crazy fast, but Google is a lot stricter about robot checking, and it showed.

| Method | Proxied | Speed | Batch size | Errors | Target |
| --- | --- | --- | --- | --- | --- |
| axios | no | 1.126 | 10 | 0 | https://www.google.com/search?q=javascript+web+scraping+guy |
| axios | no | 1.622 | 10 | 0 | https://www.google.com/search?q=javascript+web+scraping+guy |
| axios | no | 1.179 | 10 | 0 | https://www.google.com/search?q=javascript+web+scraping+guy |
| axios | yes | 1.727 | 10 | 1 | https://www.google.com/search?q=javascript+web+scraping+guy |
| axios | yes | 1.345 | 10 | 1 | https://www.google.com/search?q=javascript+web+scraping+guy |
| axios | yes | 1.242 | 10 | 3 | https://www.google.com/search?q=javascript+web+scraping+guy |

There were a couple of times when Google detected the automation, and that is where the errors came from. In those instances, the request hit Google's reCAPTCHA screen:

The nice thing about this reCAPTCHA screen is that it displays the IP address, so I could easily confirm that the IP was rotating on every request. Since a fresh IP was used each time, the errors mean some of the IPs in the rotation are already known (to Google) as bad. Because the failures weren't consistent, in this case I'd just add a check for the captcha screen and retry the request when it appears.
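That check-and-retry idea is simple to sketch. Because scrapestack rotates the IP on every request, a retry gets a fresh chance at a clean IP. The phrases I match on are assumptions based on Google's interstitial page, so adjust them to what you actually see:

```javascript
// Detect Google's "unusual traffic" reCAPTCHA interstitial.
// The phrases checked here are assumptions, not an official list.
function looksLikeCaptcha(body) {
  return /unusual traffic|recaptcha/i.test(body);
}

// Retry the request until we get past the captcha screen or give up.
async function fetchWithRetry(url, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const body = await fetch(url).then(r => r.text());
    if (!looksLikeCaptcha(body)) return body; // clean IP this time
  }
  throw new Error(`Captcha'd on all ${maxAttempts} attempts: ${url}`);
}
```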

Note: I feel it is worth mentioning that I saw something similar when doing my own proxying. A lot of the IP addresses I proxied through were blocked on my very first request, which leads me to believe there is a list of IP addresses that are already blacklisted.

scrapestack + puppeteer

Using scrapestack and puppeteer together ran into a few more issues. I seemed to get flagged by Google more often, though that could be because of my smaller sample size.

The other interesting thing is that when I hit Google through scrapestack, it served up a different, more lightweight version of Google, and as a result it was a LOT faster. I think this was less due to scrapestack itself and more because this lightweight version of Google is probably what gets served to the IP addresses or geolocations scrapestack was using.

Google version that was served to scrapestack

| Method | Proxied | Speed | Batch size | Errors | Target |
| --- | --- | --- | --- | --- | --- |
| puppeteer | yes | 7.525 | 10 | 2 | https://www.google.com/search?q=javascript+web+scraping+guy |
| puppeteer | yes | 8.551 | 10 | 3 | https://www.google.com/search?q=javascript+web+scraping+guy |
| puppeteer | yes | 8.991 | 10 | 0 | https://www.google.com/search?q=javascript+web+scraping+guy |
| puppeteer | no | 19.021 | 10 | 0 | https://www.google.com/search?q=javascript+web+scraping+guy |
| puppeteer | no | 13.253 | 10 | 0 | https://www.google.com/search?q=javascript+web+scraping+guy |
| puppeteer | no | 16.641 | 10 | 0 | https://www.google.com/search?q=javascript+web+scraping+guy |

The speed test here isn't really apples to apples because of the different versions of Google being served, but that is pretty quick for a proxy. Like, barely distinguishable from what I'd expect from basic puppeteer usage. It should also be noted that I ran these Google tests concurrently, which is always a bit tricky with puppeteer because the results depend more on the machine running it than on the speed of the HTTP requests.

For the Amazon test I ran the requests sequentially so the timing wouldn't suffer from RAM or processor contention from Chrome opening 10 windows at the exact same time. As a result the overall times for both are a lot slower, but I feel it's a more accurate comparison between scrapestack and normal puppeteer usage.

| Method | Proxied | Speed | Batch size | Errors | Target |
| --- | --- | --- | --- | --- | --- |
| puppeteer | yes | 74.965 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |
| puppeteer | yes | 77.31 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |
| puppeteer | yes | 83.401 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |
| puppeteer | no | 56.443 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |
| puppeteer | no | 53.615 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |
| puppeteer | no | 51.528 | 10 | 0 | https://www.amazon.com/dp/B00170EB1Q |

With this, the speed difference from the proxy is more noticeable. Still, it honestly isn't that much slower compared to other proxy services I've used.
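The sequential harness I described is straightforward: one browser, one page, ten navigations back to back. A sketch (for the proxied rows you'd pass the target wrapped in the scrapestack endpoint instead of the raw URL):

```javascript
// Time a sequential batch of page loads with puppeteer. The require is
// inside the function so this file loads even without puppeteer installed.
async function timeSequentialBatch(url, batchSize = 10) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const start = Date.now();
  for (let i = 0; i < batchSize; i++) {
    // One navigation at a time, so Chrome isn't juggling 10 windows at once.
    await page.goto(url, { waitUntil: 'domcontentloaded' });
  }
  const seconds = (Date.now() - start) / 1000;
  await browser.close();
  return { seconds, batchSize };
}
```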

One final note with puppeteer: click navigation didn't work so well when proxying. I think it has to do with relative routes resolving against the proxy's domain instead of the target's. Whenever I clicked a link, it did not work. The better way to handle this is to collect the URLs you want to visit and then navigate to them directly rather than navigating via click.
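That collect-then-visit pattern can be sketched like this. `page.$$eval` and `page.goto` are standard puppeteer APIs; the link selector is a placeholder, and the small helper shows how to resolve a relative href against the real target domain rather than the proxy's:

```javascript
// Resolve a (possibly relative) href against the page it belongs to,
// so links point at the target site rather than the proxy domain.
function toAbsolute(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

// Collect link URLs up front, then navigate to each one directly with
// page.goto() instead of clicking through. Lazy require so this file
// loads without puppeteer installed.
async function visitLinksDirectly(startUrl, linkSelector) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(startUrl);
  const hrefs = await page.$$eval(linkSelector, els =>
    els.map(a => a.getAttribute('href'))
  );
  const urls = hrefs.map(href => toAbsolute(href, startUrl));
  for (const url of urls) {
    await page.goto(url); // direct navigation, no click
    // ...scrape each page here...
  }
  await browser.close();
  return urls;
}
```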

Final verdict

Honestly, I’m super impressed. I realize this is a sponsored post but the numbers speak for themselves. The speed is barely impacted when using scrapestack and their pricing is crazy affordable. They even have a free plan that allows you to make 10,000 requests per month!

10/10, will scrape with again. I highly recommend scrapestack!

Demo code here

Looking for business leads?

Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome web data. Learn more at Cobalt Intelligence!