Posted 26 February, 2018 by screamingfrog in Screaming Frog SEO Spider

I’m delighted to announce the release of Screaming Frog SEO Spider 9.0, codenamed internally as ‘8-year Monkey’.

Our team have been busy working on exciting new features. In our last update we released a new user interface; in this release we have a new and extremely powerful hybrid storage engine. Here’s what’s new.

1) Configurable Database Storage (Scale)

The SEO Spider has traditionally used RAM to store data, which has given it some amazing advantages: it is lightning fast and super flexible, and it provides real-time data and reporting, filtering, sorting and search during crawls.

However, storing data in memory also has downsides, notably crawling at scale. This is why version 9.0 now allows users to choose to save to disk in a database, which enables the SEO Spider to crawl at truly unprecedented scale for any desktop application while retaining the same, familiar real-time reporting and usability.

The default crawl limit is now set at 5 million URLs in the SEO Spider, but it isn’t a hard limit; the SEO Spider is capable of crawling significantly more (with the right hardware). Here are 10 million URLs crawled, of 26 million (with 15 million sat in the queue), for example.

We hate pagination, so we made sure the SEO Spider is powerful enough to let users still view data seamlessly. For example, you can scroll through 8 million page titles as if it were 800.

The reporting and filters are all instant as well, although sorting and searching at huge scale will take some time.

It’s important to remember that crawling remains a memory intensive process regardless of how data is stored. If data isn’t stored in RAM, then plenty of disk space will be required, with adequate RAM and ideally SSDs. So fairly powerful machines are still required, otherwise crawl speeds will be slower compared to RAM, as the bottleneck becomes the writing speed to disk. SSDs allow the SEO Spider to crawl at close to RAM speed and read the data instantly, even at huge scale.

By default, the SEO Spider will store data in RAM (‘memory storage mode’), but users can select to save to disk instead by choosing ‘database storage mode’ within the interface (via ‘Configuration > System > Storage’), based upon their machine specifications and crawl requirements.

Users without an SSD, or who are low on disk space and have lots of RAM, may prefer to continue to crawl in memory storage mode, while other users with SSDs might prefer to crawl using ‘database storage mode’ by default. The configurable storage allows users to dictate their experience, as both storage modes have advantages and disadvantages, depending on machine specifications and scenario.

Please see our guide on how to crawl very large websites for more detail on both storage modes.

The saved crawl format (.seospider files) is the same in both storage modes, so you are able to start a crawl in RAM, save, and resume the crawl at scale while saving to disk (and vice versa).

2) In-App Memory Allocation

First of all, apologies for making everyone manually edit a .ini file to increase memory allocation for the last 8 years. You’re now able to set memory allocation within the application itself, which is a little more user-friendly. This can be set under ‘Configuration > System > Memory’. The SEO Spider will even report the physical memory installed on the system, and allow you to configure it quickly.

Increasing memory allocation will enable the SEO Spider to crawl more URLs, particularly when in RAM storage mode, but also when storing to database. The memory acts like a cache when saving to disk, which allows the SEO Spider to perform quicker actions and crawl more URLs.
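To picture how memory can act as a cache in front of disk storage, here is a minimal Python sketch. It is not how the SEO Spider is implemented: the `CachedUrlStore` class, the table layout and the tiny cache size are all assumptions made purely for illustration, with SQLite from the standard library standing in for the database.

```python
import os
import sqlite3
import tempfile
from collections import OrderedDict


class CachedUrlStore:
    """Hypothetical disk-backed store with a small in-memory LRU cache."""

    def __init__(self, db_path, cache_size=2):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.disk_reads = 0  # counts lookups that had to go to disk

    def put(self, url, title):
        # Write through to disk so nothing is lost when the cache evicts it.
        self.db.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", (url, title)
        )
        self.db.commit()
        self._remember(url, title)

    def get(self, url):
        if url in self.cache:
            self.cache.move_to_end(url)  # mark as recently used
            return self.cache[url]
        self.disk_reads += 1
        row = self.db.execute(
            "SELECT title FROM pages WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            return None
        self._remember(url, row[0])
        return row[0]

    def _remember(self, url, title):
        self.cache[url] = title
        self.cache.move_to_end(url)
        while len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry


db_file = os.path.join(tempfile.mkdtemp(), "crawl.db")
store = CachedUrlStore(db_file, cache_size=2)
for i in range(3):
    store.put(f"https://example.com/page-{i}", f"Page {i}")

store.get("https://example.com/page-2")  # still cached, served from memory
store.get("https://example.com/page-0")  # evicted earlier, so read from disk
print("disk reads:", store.disk_reads)
```

In this sketch a larger `cache_size` means fewer lookups fall through to disk, which is the general idea behind giving a disk-backed crawl more memory.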

3) Store & View HTML & Rendered HTML

You can now choose to store both the raw HTML and rendered HTML to inspect the DOM (when in JavaScript rendering mode) and view them in the lower window ‘view source’ tab.

This is super useful for a variety of scenarios, such as debugging the differences between what is seen in a browser and in the SEO Spider (you shouldn’t need to use WireShark anymore), or just when analysing how JavaScript has been rendered, and whether certain elements are within the code.

You can view the original HTML and rendered HTML at the same time, to compare the differences, which can be particularly useful when elements are dynamically constructed by JavaScript.

You can turn this feature on under ‘Configuration > Spider > Advanced’ by ticking the appropriate ‘Store HTML’ & ‘Store Rendered HTML’ options, and you can export all the HTML code using the ‘Bulk Export > All Page Source’ top-level menu.

We have some additional features planned here, to help users identify the differences between the static and rendered HTML.

4) Custom HTTP Headers

The SEO Spider already provided the ability to configure user-agent and Accept-Language headers, but now users are able to completely customise the HTTP header request.

This means you’re able to set anything from accept-encoding, cookie and referer to any unique header name you supply. This can be useful when simulating the use of cookies or cache control logic, testing the behaviour of a referer, or other troubleshooting.
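For readers who want to see what a customised header request looks like outside the tool, the following is a small Python sketch using only the standard library. The URL, the cookie value and the `X-Example-Header` name are placeholders chosen for illustration, and the request is only built here, not sent.

```python
from urllib.request import Request


def build_request(url, extra_headers):
    """Return a Request carrying the given custom headers (not yet sent)."""
    request = Request(url)
    for name, value in extra_headers.items():
        request.add_header(name, value)
    return request


custom_headers = {
    "Accept-Encoding": "gzip",
    "Cookie": "session=abc123",            # placeholder cookie value
    "Referer": "https://www.example.com/",
    "X-Example-Header": "testing",         # hypothetical custom header name
}

request = build_request("https://www.example.com/page", custom_headers)
for name, value in request.header_items():
    print(f"{name}: {value}")

# To actually send it you could pass `request` to urllib.request.urlopen().
```

Comparing the responses to two such requests, one with and one without a given header, is a quick way to check whether a server varies its behaviour on a cookie or referer.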

5) XML Sitemap Improvements

You’re now able to create XML Sitemaps with any response code, rather than just 200 ‘OK’ status pages. This allows flexibility to quickly create sitemaps for a variety of scenarios, such as for pages that don’t yet exist, pages that 301 to new URLs which you wish to force Google to re-crawl, or pages that return a 404/410 which you want removed quickly from the index.

If you have hreflang on the website set-up correctly, then you can also select to include hreflang within the XML Sitemap.

Please note – The SEO Spider can only create XML Sitemaps with hreflang if the hreflang annotations are already present on the site (as attributes or via the HTTP header). More to come here.
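As a rough illustration of what an XML Sitemap with hreflang contains, here is a short Python sketch that builds one with the standard library. It is not the SEO Spider's export code; the `build_sitemap` helper and the example URLs are assumptions, while the two namespaces and the `xhtml:link` element follow the published sitemap and hreflang conventions.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_sitemap(pages):
    """pages: list of dicts, each mapping hreflang code -> URL for one set of alternates."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for alternates in pages:
        for url in alternates.values():
            url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = url
            # Each URL lists every alternate in its set, including itself.
            for lang, href in alternates.items():
                ET.SubElement(
                    url_el,
                    f"{{{XHTML_NS}}}link",
                    {"rel": "alternate", "hreflang": lang, "href": href},
                )
    return ET.tostring(urlset, encoding="unicode")


pages = [
    {"en": "https://www.example.com/", "de": "https://www.example.com/de/"},
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Note that the `xhtml` namespace has to be declared on the root element for the `xhtml:link` entries to be valid.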

6) Granular Search Functionality

Previously when you performed a search in the SEO Spider it would search across all columns, which wasn’t configurable. The SEO Spider will now search against just the address (URL) column by default, and you’re able to select which columns to run the regex search against.

This obviously makes the search functionality quicker, and more useful.
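The difference between searching one column and searching several can be sketched in a few lines of Python. The rows, the column names and the `search` helper below are made up for illustration and do not reflect the SEO Spider's internals.

```python
import re

rows = [
    {"address": "https://example.com/blog/seo-tips", "title": "SEO Tips", "h1": "Tips"},
    {"address": "https://example.com/about", "title": "About our blog", "h1": "About"},
    {"address": "https://example.com/contact", "title": "Contact", "h1": "Get in touch"},
]


def search(rows, pattern, columns=("address",)):
    """Return rows where the regex matches any of the selected columns."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        row for row in rows
        if any(regex.search(row.get(column, "")) for column in columns)
    ]


address_only = search(rows, r"blog")
address_and_title = search(rows, r"blog", columns=("address", "title"))
print(len(address_only), len(address_and_title))
```

Restricting the search to the address column checks one value per row rather than every column, which is why a narrower search is less work.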

7) Updated SERP Snippet Emulator

Google increased the average length of SERP snippets significantly in November last year, where they jumped from around 156 characters to over 300. Based upon our research, the default max description length filters have been increased to 320 characters and 1,866 pixels on desktop within the SEO Spider.

The lower window SERP snippet preview has also been updated to reflect this change, so you can view how your page might appear in Google.

It’s worth remembering that this is for desktop. Mobile search snippets also increased, but from our research, are quite a bit smaller – approx. 1,535px for descriptions, which is generally below 230 characters. So, if a lot of your traffic and conversions are via mobile, you may wish to update your max description preferences under ‘Config > Spider > Preferences’. You can switch ‘device’ type within the SERP snippet emulator to view how these appear different to desktop.

As outlined previously, the SERP snippet emulator might still be occasionally a word out in either direction compared to what you see in the Google SERP due to exact pixel sizes and boundaries. Google also sometimes cut descriptions off much earlier (particularly for video), so please use it just as an approximate guide.
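If you want to check descriptions against these limits outside the tool, a simple character-based check might look like the Python sketch below. It uses the 320 (desktop) and roughly 230 (mobile) character figures quoted above and ignores pixel widths, which would need font metrics; the `description_status` helper is hypothetical.

```python
# Character limits taken from the figures quoted in this post.
MAX_DESCRIPTION_CHARS = {"desktop": 320, "mobile": 230}


def description_status(description, device="desktop"):
    """Report whether a meta description fits the character limit for a device."""
    limit = MAX_DESCRIPTION_CHARS[device]
    length = len(description)
    if length > limit:
        return f"over by {length - limit} characters"
    return f"within limit ({length}/{limit})"


description = "a" * 250  # stand-in for a 250-character meta description
print("desktop:", description_status(description, "desktop"))
print("mobile:", description_status(description, "mobile"))
```

As with the emulator itself, a character count is only an approximation, since truncation depends on pixel width.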

8) Post Crawl API Requests

Finally, if you forget to connect to Google Analytics, Google Search Console, Majestic, Ahrefs or Moz after you’ve started a crawl, or realise at the very end of a crawl, you can now connect to their API and ‘request API data’, without re-crawling all the URLs.

Other Updates

Version 9.0 also includes a number of smaller updates and bug fixes, outlined below.

While we have introduced the new database storage mode to improve scalability, regular memory storage performance has also been significantly improved. The SEO Spider uses less memory, which will enable users to crawl more URLs than previous iterations of the SEO Spider.

The ‘exclude‘ configuration now works instantly, as it is applied to URLs already waiting in the queue. Previously the exclude would only work on new URLs discovered, rather than those already found and waiting in the queue. This meant you could apply an exclude, and it would be some time before the SEO Spider stopped crawling URLs that matched your exclude regex. Not anymore.
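The idea of applying an exclude to URLs already in the queue can be sketched in Python as follows. This is an illustration only: the `apply_exclude` helper and the queue contents are invented, and the sketch assumes each exclude regex has to match the whole URL.

```python
import re
from collections import deque


def apply_exclude(queue, exclude_patterns):
    """Return a new queue without URLs that fully match any exclude regex."""
    compiled = [re.compile(pattern) for pattern in exclude_patterns]
    return deque(
        url for url in queue
        if not any(regex.fullmatch(url) for regex in compiled)
    )


queue = deque([
    "https://example.com/",
    "https://example.com/shop?colour=red",
    "https://example.com/about",
])

# Exclude every URL containing a query string (assumed whole-URL match).
queue = apply_exclude(queue, [r".*\?.*"])
print(list(queue))
```

Filtering the pending queue as soon as the exclude changes means matching URLs are dropped before they are ever requested.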

The ‘inlinks’ and ‘outlinks’ tabs (and exports) now include all sources of a URL, not just links (HTML anchor elements) as the source. Previously if a URL was discovered only via a canonical, hreflang, or rel next/prev attribute, the ‘inlinks’ tab would be blank and users would have to rely on the ‘crawl path report’, or various error reports to confirm the source of the crawled URL. Now these are included within ‘inlinks’ and ‘outlinks’ and the ‘type’ defines the source element (ahref, HTML canonical etc).

In line with Google’s plan to stop using the old AJAX crawling scheme (and rendering the #! URL directly), we have adjusted the default rendering to text only. You can switch between text only, old AJAX crawling scheme and JavaScript rendering.

You can now choose to ‘cancel’ either loading in a crawl, exporting data or running a search or sort.

We’ve added some rather lovely line numbers to the custom robots.txt feature.

To match Google’s rendering characteristics, we now allow blob URLs during JS rendering crawl.

We renamed the old ‘GA & GSC Not Matched’ report to the ‘Orphan Pages‘ report, so it’s a bit more obvious.

URL Rewriting now applies to list mode input.

There’s now a handy ‘strip all parameters’ option within URL Rewriting for ease.
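For a sense of what stripping all parameters does to a URL, here is a short Python sketch using the standard library. The `strip_all_parameters` function is hypothetical, and keeping the fragment is an assumption of this example rather than a statement about the SEO Spider's behaviour.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_all_parameters(url):
    """Drop the query string from a URL, leaving the rest untouched."""
    parts = urlsplit(url)
    # Assumption: the fragment (if any) is kept as-is.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


print(strip_all_parameters("https://example.com/shoes?colour=red&size=9"))
print(strip_all_parameters("https://example.com/shoes?page=2#reviews"))
```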

We have introduced numerous JavaScript rendering stability improvements.

The Chromium version used for rendering is now reported in the ‘Help > Debug’ dialog.

List mode now supports .gz file uploads.

The SEO Spider now includes Java 8 update 161, with several bug fixes.

Fix: The SEO Spider would incorrectly crawl all ‘outlinks’ from JavaScript redirect pages, or pages with a meta refresh with ‘Always Follow Redirects’ ticked under the advanced configuration. Thanks to our friend Fili Weise for spotting that one!

Fix: Ahrefs integration requesting domain and subdomain data multiple times.

Fix: Ahrefs integration not requesting information for HTTP and HTTPS on (sub)domain level.

Fix: The crawl path report was missing some link types, which has now been corrected.

Fix: Incorrect robots.txt behaviour for rules ending *$.
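For context on how rules ending in *$ are usually interpreted, the Python sketch below converts a robots.txt path rule into a regular expression, treating `*` as "any sequence of characters" and a trailing `$` as "end of URL", as in Google's documented matching. The `rule_to_regex` and `rule_matches` helpers are illustrative and are not the SEO Spider's code.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule into a compiled regular expression."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape literal text and turn each '*' into '.*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    # \Z anchors at the very end of the path when the rule ends in '$'.
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(rule, path):
    """True if the rule matches the path, starting from its first character."""
    return rule_to_regex(rule).match(path) is not None


for rule, path in [
    ("/private*$", "/private/page.html"),
    ("/*.pdf$", "/docs/file.pdf"),
    ("/*.pdf$", "/docs/file.pdf?download=1"),
]:
    print(rule, path, rule_matches(rule, path))
```

Under this interpretation a rule ending in *$ behaves the same as the rule without the suffix, because `*` can already consume everything up to the end of the URL.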

Fix: Auth Browser cookie expiration date invalid for non UK locales.

That’s everything for now. This is a big release and one which we are proud of internally, as it’s new ground for what’s achievable for a desktop application. It makes crawling at scale more accessible for the SEO community, and we hope you all like it.

As always, if you experience any problems with our latest update, then do let us know via support and we will help and resolve any issues.

We’re now starting work on version 10, where some long standing feature requests will be included. Thanks to everyone for all their patience, feedback, suggestions and continued support of Screaming Frog, it’s really appreciated.

Now, please go and download version 9.0 of the Screaming Frog SEO Spider and let us know your thoughts.

Small Update – Version 9.1 Released 8th March 2018

We have just released a small update to version 9.1 of the SEO Spider. This release is mainly bug fixes and small improvements –

Monitor disk usage on user configured database directory, rather than home directory. Thanks to Mike King, for that one!

Stop monitoring disk usage in Memory Storage Mode.

Make sitemap reading support utf-16.

Fix crash using Google Analytics in Database Storage mode.

Fix issue with depth stats not displaying when loading in a saved crawl.

Fix crash when viewing Inlinks in the lower window pane.

Fix crash in Custom Extraction when using xPath.

Fix crash when embedded browser initialisation fails.

Fix crash importing crawl in Database Storage Mode.

Fix crash when sorting/searching main master view.

Fix crash when editing custom robots.txt.

Fix jerky scrolling in View Source tab.

Fix crash when searching in View Source tab.

Small Update – Version 9.2 Released 27th March 2018

We have just released a small update to version 9.2 of the SEO Spider. Similar to 9.1, this release addresses bugs and includes some small improvements as well.

Speed up XML Sitemap generation exports.

Add ability to cancel XML Sitemap exports.

Add an option to start without initialising the Embedded Browser (Configuration->System->Embedded Browser). This is for users that can’t update their security settings, and don’t require JavaScript crawling.

Increase custom extraction max length to 32,000 characters.

Prevent users from setting database directory to read-only locations.

Fix switching to tree view with a search in place shows “Searching” dialog, forever, and ever.

Fix incorrect inlink count after re-spider.

Fix crash when performing a search.

Fix project save failing for list mode crawls with hreflang data.

Fix crash when re-spidering in list mode.

Fix crash in ‘Bulk Export > All Page Source’ export.

Fix webpage cut off in screenshots.

Fix search in tree view not keeping up to date while crawling.

Fix tree view export missing address column.

Fix hreflang XML sitemaps missing namespace.

Remove needless namespaces from XML sitemaps.

Fix blocked by Cross-Origin Resource Sharing policy incorrectly reported during JavaScript rendering.

Fix crash loading in large crawl in database mode.

Small Update – Version 9.3 Released 29th May 2018

We have just released a small update to version 9.3 of the SEO Spider. Similar to 9.1 and 9.2, this release addresses bugs and includes some small improvements as well.

Update SERP snippet pixel widths.

Update to Java 1.8 update 171.

Shortcuts not created for user account when installing as admin on Windows.

Can’t continue with Majestic if you load a saved crawl.

Removed URL reappears after crawl save/load.

Inlinks vanish after a re-spider.

External inlinks counts never updated.

HTTP Canonicals wrong when target URL contains a comma.

Exporting ‘Directives > Next/Prev’ fails due to forward slash in default file name.

Crash when editing SERP description using Database Storage Mode.

Crash in Ahrefs when crawling with no credits.

Crash on startup caused by user installed java libraries.

Crash removing URLs in tree view.

Crash crawling pages with utf-7 charset.

Crash using Database Storage Mode in a Turkish Locale.

Loading of corrupt .seospider file causes crash in Database Storage Mode.

Missing dependencies when initializing embedded browser on Ubuntu 18.04.

Small Update – Version 9.4 Released 7th June 2018

We have just released a small update to version 9.4 of the SEO Spider.