The problems with speed optimization tools

If you know me, you know I tend to obsess over site speed. It is something I have become very passionate about over the years, and I find its correlation with e-commerce immensely interesting. At the same time, I find it confusing as hell. Speed has been a ranking factor in Google for a couple of years, so obviously it is important. There is also a plethora of tools on the market that can measure your site’s speed, diagnose problems, and suggest fixes. So with all the tools on the market, which one do you listen to? Which ones give the best insight into an end user’s experience? Which ones do not? It is hard to tell. I come across many site optimization requests, and one thing I have learned is that people favor specific tools. Also, optimizing a site for one tool does not necessarily translate into the site being optimized in a different tool, which I imagine is the height of frustration for any developer. I have decided it is time to break the most common tools apart and figure out their strengths and weaknesses and how they relate to real-world optimization.

The tools

As I mentioned before, there are several different tools used to measure site performance. For the purpose of this article I am going to go over the five most popular ones: Google PageSpeed Insights, GTmetrix, Pingdom, WebPageTest, and Test My Site from Think with Google.

Each tool uses different metrics, which makes it nearly impossible for one site to satisfy all of them. But which metrics are better, and which metrics does each tool use? First we have to figure out what each tool measures, because some of those choices become detrimental to scores in later tests. One of the important variables for me is screen size. Most sites are designed responsively these days, so they should be able to handle screen size changes without huge issues, but when you are optimizing a site you start to notice problems. A good example is a full-width slider on your home page. Most developers and designers build a full-width slider around 1920 pixels wide, since that is the standard 1080p width and the most common desktop size. But what happens when a user’s monitor is only 1366 pixels wide? You have two choices: you can scale the image down using CSS and serve the same image, or you can use a responsive breakpoint to serve a smaller version of that image. Both HTML5 and CSS have this ability built in. But what happens when you test your site against a tool that uses an odd screen resolution? It will likely ding your score and lead you astray. That is why I started the first test with screen resolutions.

Google PageSpeed Insights Desktop – 1366px wide x 768px tall

Google PageSpeed Insights Mobile – 412px wide x 732px tall

GTmetrix – 1367px wide x 858px tall using Google Chrome (the default setting)

GTmetrix – 1366px wide x 861px tall using Firefox

GTmetrix – 360px wide x 640px tall using Chrome Android Galaxy Nexus

Pingdom – 1280px wide x 1024px tall

WebPageTest – 1920px wide x 1200px tall using the default setting (they have the most options)

Google Test My Site – 412px wide x 732px tall

Why does this matter? Not knowing or accounting for these differences can lower your score on one or more tests in a meaningless way. A great case in point is srcset. It is becoming more common, and we have already included it in a limited way in thirty bees. A normal part of front-end design is picking responsive breakpoints, and with srcset you can fine-tune even further and pick mid breakpoints, especially when you are optimizing for quickly loading sites. What would happen if we chose the normal 1366px width as a breakpoint on a site that was made to scale to 1920px? Using the default setting on GTmetrix, we would be hit with a scaled image penalty. But is there really a problem? Or is the issue just an uncommon browser window size?
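To make the mid breakpoint idea concrete, here is a small Javascript sketch that builds a srcset covering both common widths and a mid step. The file naming scheme and the chosen widths are my assumptions for illustration, not anything from thirty bees:

```javascript
// Build a srcset string covering common breakpoints plus a mid
// breakpoint (1600w) for layouts that scale between 1366 and 1920.
// The "name-640w.jpg" naming convention here is hypothetical.
function buildSrcset(base, ext, widths) {
  return widths
    .map(function (w) { return base + '-' + w + 'w.' + ext + ' ' + w + 'w'; })
    .join(', ');
}

const sliderSrcset = buildSrcset('slider', 'jpg', [640, 1366, 1600, 1920]);

// In a browser, apply it to the full-width slider image.
if (typeof document !== 'undefined') {
  const img = document.querySelector('.slider img');
  if (img) {
    img.setAttribute('srcset', sliderSrcset);
    img.setAttribute('sizes', '100vw'); // the slider spans the viewport
  }
}
```

The browser then picks the closest candidate for the actual viewport, so a 1366px window gets the 1366w file instead of a scaled-down 1920w one.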

Another issue is designing around expected screen height. The most common desktop resolution is 1366 x 768, and in speed optimization and design it is one of the sizes that gets designed around, specifically because, as Google puts it, you need to “prioritize above the fold content”, which is true. Imagine a site that, when loaded at 1366 x 768, has the normal e-commerce elements: a navigational menu, a header with the company logo, cart, search, and account login. Below that it has a slider. Most designers end the slider close to the 768 pixel mark, give or take. Now imagine that below the slider, products sit in a 5 x 4 grid. Best practice would be to lazy load either the whole container or at the very least the images it contains. But run GTmetrix, Pingdom, or WebPageTest with their taller viewports and those images will be loaded anyway. This produces a different score on each of these tests, since different assets are loaded depending on which testing engine is run.
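For reference, a minimal lazy loading sketch, assuming images keep their real URL in a data-src attribute; the attribute name and the 200px preload margin are my choices, not anything the tests require:

```javascript
// Swap the real URL into src once an image approaches the viewport.
function swapInSource(img) {
  if (img.dataset && img.dataset.src) {
    img.src = img.dataset.src;
    delete img.dataset.src;
  }
  return img.src;
}

// Browser wiring: observe every image that starts out unloaded.
if (typeof IntersectionObserver !== 'undefined') {
  const io = new IntersectionObserver(function (entries) {
    entries.forEach(function (entry) {
      if (entry.isIntersecting) {
        swapInSource(entry.target);
        io.unobserve(entry.target); // each image only needs loading once
      }
    });
  }, { rootMargin: '200px' }); // start fetching slightly before visible
  document.querySelectorAll('img[data-src]').forEach(function (img) {
    io.observe(img);
  });
}
```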

Connection Speed

Another factor playing into the tests is the connection speed of the testing service. Some services simulate mobile connections, others are “unthrottled”, and still others simulate an unspecified cable connection speed. But none of them outright display the connection speed used for tests. So what I did to test the connection speeds was a little tricky: I used an image of a known size, 5mb, and timed how long it took to download. I used Javascript to print the calculated connection speed on the page. Since all of the testing services screenshot your site as it loads for them, I just used big text to let me know how fast it loaded. Have I mentioned that I am old? I actually came up with this method almost 18 years ago when I was a Flash developer. For hosting, I used a Digital Ocean droplet in New York in the 2nd zone. Digital Ocean advertises 1Gbps uplinks with an expected 300Mbps transfer rate, so I imagined that would satisfy the testing engines. Let us see how each did.

Google PageSpeed Insights Desktop – 876.23Mbps

Google PageSpeed Insights Mobile – 876.23Mbps

GTmetrix Dallas location unthrottled Chrome – 105.87Mbps

GTmetrix Dallas location unthrottled FireFox – 98.99Mbps

GTmetrix Vancouver Chrome Android unthrottled – 23.98Mbps

Pingdom New York City location – 86.62Mbps

WebPageTest – 21.6Mbps

Google Test My Site – 786.45Mbps

All of the tests were run 5 times and the times were averaged. None of the tests varied by more than 3%, which is to be expected with dynamic network and load conditions. It is also worth mentioning that all the tests were done with a static HTML page, accessed by IP address, so there was no domain lookup or SSL handshake to slow the raw speed tests down. We can see that the connection speeds all of the tests use are in the normal range and pretty closely grouped together, with the exception of the mobile tests. But even though the network speeds are fairly close, when you are optimizing for milliseconds you can start to see issues.
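For the curious, the measurement trick described above boils down to something like this. The image URL is a placeholder, and the hard-coded 5mb must match the actual test file size:

```javascript
// Derive megabits per second from a known payload size and a
// measured download time.
function mbps(bytes, milliseconds) {
  return (bytes * 8) / (milliseconds / 1000) / 1e6;
}

if (typeof window !== 'undefined') {
  const start = Date.now();
  const probe = new Image();
  probe.onload = function () {
    const speed = mbps(5 * 1024 * 1024, Date.now() - start);
    // Print it in big text so it is readable in the testing
    // service's screenshot of the page.
    document.body.insertAdjacentHTML(
      'beforeend',
      '<h1>' + speed.toFixed(2) + ' Mbps</h1>'
    );
  };
  // Cache-bust so the test image is downloaded fresh on every run.
  probe.src = '/speed-test-5mb.jpg?t=' + start;
}
```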

Connection Protocol

Now this starts to get real world. A few years back Google began recommending SSL for all sites; the Chrome team has even started implementing notices for non-SSL sites, and Google has gone so far as to make the use of SSL a ranking factor. If you keep up with the latest technologies, security procedures, and speed optimization, you are likely familiar with HTTP/2 and its benefits over HTTP/1.1. Currently, Can I Use puts the share of users whose browsers support HTTP/2 at between 77% and 82% globally, and between 87% and 93% in the US. Needless to say, using HTTP/2 is pretty standard stuff nowadays. Much like HTTP/1.1, the newer HTTP/2 comes with its own set of recommended design patterns for speed.

With HTTP/1.1, to attain the best speed you needed to minify and concatenate your resources into as few bundles as possible. HTTP/1.1 was limited by download channels: when you visited a website, your browser would only download 8 files at a time, and when one finished, the next file in the queue started downloading. With HTTP/2 that is no longer the case, since the connection streams and multiple downloads are handled over the same connection without an SSL handshake renegotiation for each one.

Needless to say, if you are like me you have started building development patterns around HTTP/2. You might be splitting your CSS or Javascript into multiple bundles now. An example would be one bundle that represents the site’s grid and above the fold content; this speeds up the first paint and reduces the repaints a user would notice when first landing on your site. Then another bundle covers the middle of the page, and yet another the footer content, downloaded last. The same technique can be applied to Javascript to an even greater degree: load the minimal library that brings your site up to functionality, then after the page has loaded, push external bundles such as Google Analytics, remarketing tags, live chats, etc. Remember, though, that these methods are not recommended practice with HTTP/1.1. So how do the different testing engines handle HTTP/2 support?

Google PageSpeed Insights Desktop – Not supported

Google PageSpeed Insights Mobile – Not supported

GTmetrix Dallas Chrome – Enabled and Supported

GTmetrix Dallas FireFox – Enabled and Supported

GTmetrix Vancouver Chrome Android – Enabled and supported

Pingdom – Not Supported

WebPageTest – Enabled and Supported

Google Test My Site – Not supported

These results are kind of surprising. None of the Google testing engines use HTTP/2, and neither does Pingdom. Google has been a leader in pushing a faster web, yet its own testing engines are not using HTTP/2. This means some of the techniques and advice they give out are not valid under current speed optimization best practices. To me this is pretty backward and counterproductive to what they are trying to accomplish, and the same goes for Pingdom.
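The deferred loading pattern mentioned earlier, pushing third-party tags until after the load event, can be sketched roughly like this. The URLs are placeholders, and the injectable doc parameter is only there so the logic can be exercised outside a browser:

```javascript
// Third-party scripts that can wait until the page is usable.
const deferredTags = [
  'https://www.google-analytics.com/analytics.js',
  'https://example.com/live-chat.js' // hypothetical chat widget
];

// Create and append a script tag for each deferred URL.
function loadDeferred(doc, urls) {
  return urls.map(function (url) {
    const s = doc.createElement('script');
    s.src = url;
    s.async = true;
    doc.body.appendChild(s);
    return s;
  });
}

if (typeof window !== 'undefined') {
  // Only fire once the main page has finished loading.
  window.addEventListener('load', function () {
    loadDeferred(document, deferredTags);
  });
}
```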

Real-world Tests

Knowing all of this about the different testing engines, it is time to do some real-world tests and see how different sites stack up. I selected some well-known e-commerce sites with large budgets and large dev teams behind them. To keep this article short, we are only testing the index pages. The sites are Amazon.com, Walmart.com, and Ebay.co.uk. The American sites will be tested from the locations we have used throughout this article; for Ebay.co.uk we will use European locations in the tests that allow changing location. We are also only using one GTmetrix location/browser for these tests.

Amazon.com

Google PageSpeed Insights Desktop

Good 88/100

Google PageSpeed Insights Mobile

Poor 54/100

GTmetrix Dallas Chrome

PageSpeed 58%

YSlow 68%

Load time – 6.3 seconds

Pingdom New York

Performance 81%

Load time – 2.21 seconds

WebPageTest

First Byte Time – A

Keep-alive Enabled – A

Compress Transfer – A

Compress Images – A

Cache Static Content – A

CDN – Yes

First byte – .375 seconds

Fully loaded 10.52 seconds

Google Test My Site

Load time – 3 seconds

Excellent, visitor loss low

The variance in these scores is pretty interesting, especially between the two different Google mobile tests. One gives a poor score, the other an excellent one. Which product is Google’s search algorithm using for ranking? For the testing engines that display loading time there is also a huge variance: over 7 seconds from the fastest test to the slowest.

Walmart.com

Google PageSpeed Insights Desktop

Needs work 72%

Google PageSpeed Insights Mobile

Poor 57%

GTmetrix Dallas Chrome

PageSpeed 53%

YSlow 54%

Load time- 6.3 seconds

Pingdom New York

Performance 69%

Load time – 5.13 seconds

WebPageTest

First Byte Time – A

Keep-alive Enabled – A

Compress Transfer – A

Compress Images – B

Cache Static Content – B

CDN – Yes

First byte – .59 seconds

Fully loaded – 6.08 seconds

Google Test My Site

Load time – 7 seconds

Fair, 26% visitor loss

This grouping comes in fairly close. There are no big irregularities between the tests on the Walmart.com site; everything points the same way, namely that Walmart needs to work on its optimization techniques.

Ebay.co.uk

Google PageSpeed Insights Desktop

Needs work 34%

Google PageSpeed Insights Mobile

Poor 57%

GTmetrix London Chrome

PageSpeed 32%

YSlow 72%

Load time- 5.4 seconds

Pingdom Stockholm

Performance 89%

Load time – 1.79 seconds

WebPageTest

First Byte Time – A

Keep-alive Enabled – A

Compress Transfer – A

Compress Images – A

Cache Static Content – B

CDN – Yes

First byte – .454 seconds

Fully loaded – 2.98 seconds

Google Test My Site

Load time – 5 seconds

Good, 19% visitor loss

This test is like the first one, only more pronounced, with different tests showing different results. If Ebay.co.uk only used Pingdom to test, everything would look great. Their site loads in well under 2 seconds, with 270 requests and 2.6mb of files. That is not bad. Sure, they could build on the 89 performance grade Pingdom gives them and raise it further. But which test is the right one to go by in this circumstance? Is it the first Google test, PageSpeed Insights, that developers should be using? Or is it Google Test My Site?

The tests are broken

There, I said it. They are horribly broken and a horrible metric for judging whether a site is optimized. What they do is give suggestions; they do not actually judge optimization. There is a huge difference between the two. One blindly recommends optimizations wherever they are not present; the other would look at the whole picture. For any test to work, the whole picture has to be considered and actionable issues raised. Let me go over the good and bad points of each of the testing engines, so we can get an idea of what needs to change and why some of them are just wrong.

Google PageSpeed Insights

This test is generally considered the gold standard by a lot of developers, and its results are considered a ranking factor for the Google search engine. It is one of the most flawed tests, though. It gives guidance without actually looking at meaningful issues. A case in point: page size is not calculated or figured into the metric. I have seen a site with 35mb pages score in the 90s on both the mobile and desktop tests. The site was full of non-lazy-loaded images, standard image tags with no height or width, just full-size backgrounds in a vertical list. Can you imagine loading a 35mb page on a mobile connection? Yet that site scored in the 90s, and it is hard to convince a site owner that a test is broken.

Another issue with the Google PageSpeed test is that it dings you for Google’s own products. If you use a Google web font, you get a render-blocking CSS ding; if you use Google Analytics, you get two cache lifetime dings. I understand that both of these services slow a page load down, but do they really slow it that much? With fonts, if you select just one font to load from Google Fonts, the Fonts interface tells you the load time will be fast. Not according to the test, though. The test has a limit of one CSS file in the header, and once you exceed it, a hard-set number of points is taken off. You get the same penalty for one 150kb CSS file as for one 200kb CSS file that embeds another 300kb of fonts.

The test also gives a flat rating on speed. Page load time is not taken into consideration, only time to first byte. This is a horrible metric to use on its own. Time to first byte is important, but using a single metric with a single penalty value leads people astray. For example, say your time to first byte is .5 seconds, which is not great, but your page fully displays in 1.3 seconds, which is not terrible. Another site with a lower time to first byte, say .3 seconds, that takes 15 seconds to fully load will score better than yours, even if your site is interactive after 1.3 seconds and theirs takes the full 15 seconds to become interactive, simply because their time to first byte was lower.

I do understand that cache expiration headers on resources are a good thing and should be used most of the time: long ones for static resources, short ones for resources that might change. It is a simple principle. But using it as a metric in these types of tests seems off base to me. The reason is that PageSpeed is optimizing for search, since it is a ranking factor in search. Generally, people arriving from search are visiting for the first time, and an expires header on an image the user has never loaded before is pointless. It helps on subsequent reloads, but not on the first load.

There are several failings in this test that do not represent real-world conditions and run counter to the actual point of optimizing a site. That makes it hard to rely on this test, and it is a little bit scary that it is an SEO factor.

GTMetrix

There are a few features that I really care for in GTmetrix, the first three being that the complete page load time, the number of requests, and the page size are all shown. To me as a developer, those are among the most useful metrics. Right now I am working on a project for a future version of thirty bees that combines srcset, webp, and a jpg polyfill as a fallback. Knowing exactly how many resources I am loading, and their size, is vital to making something like that work.

GTmetrix does, however, share some shortcomings with the other testing engines. One of the first that presents itself is that it asks for image dimensions to be specified and deducts from your score if they are not. I understand the usefulness of this in a controlled environment, but it does not translate to the dynamic web. There was a time when desktops and laptops ruled the internet; with one set size for your site, you could speed up painting by putting the image sizes in the img tag. Now, with srcset, fluid mobile design, and responsive breakpoints, they just add unneeded overhead.

Another issue is the serving of scaled images. I get that this is important too; there is no need to downscale a 1920 pixel wide image for a mobile device, and you should have a dedicated mobile image for that size. But come on, their default test viewport is 1px wider than the most common screen resolution. If the design of your site starts at 1920px and breaks to a smaller format at 1366px, your score drops because the test misses your breakpoint by one pixel and never loads the 1366px size.

This is another test, like Google PageSpeed, that I feel gets nitpicky, especially on Javascript and CSS minification. Yes, we all know that minifying and gzipping these resources saves download time. But, as most developers do, we depend on libraries, and libraries have license headers. I am looking at a GTmetrix test right now on a page that loads 7mb of images, not lazy loaded, not deferred, just all at once in the HTML. I have been deducted 5 points for 6kb of Javascript license headers and another point for a 366 byte CSS header: a total of 6 points lost in this fiasco of respecting licenses. That equates to less than one second on a 56k modem and an inconsequential fraction of a second on a 1mb connection. Yet the overload of images gets no penalty. They could be lazy loaded and bring the page impact down from 7mb to 2mb. I could serve the JPGs as webp and save even more. But there is no mention of that. I also see a 1 point deduction for HTML minification; 657 bytes of white space causes this deduction. Seven points lost over a millisecond of slowness.

To be fair, GTmetrix is depending on libraries itself: the Google PageSpeed library and the Yahoo YSlow library. I did not get into any of the YSlow issues because, frankly, that library has been abandoned for so long it is just not relatable to modern web problems. It has been abandoned for 4 years; it is not coming back, and it is not good advice for today’s web.

Pingdom

Pingdom, I feel, is a minimal try at a testing engine. The engine behind it is one of the most behind the times, yet at the same time it does test some surprisingly good metrics. One that is exclusive to Pingdom is minimizing hostname lookups. This is something that does cause a slowdown, especially depending on how the DNS is hosted. But then you are hit with the behind-the-times penalties as well.

One that stands out is the combining of files. There are two penalties: combining external CSS files and combining external Javascript files. This was good practice most of the time with http/1.1, which this testing engine still uses, but it is no longer considered good practice under http/2. With http/2 you want to use multiple packages and cascade the resources from most important to least important and off-screen. An extremely simple example would be three CSS files: one that loads first with your grid and above the fold content; a second for the main off-screen content, like the middle of your site; and a third for your footer and popup modal content. This provides the best experience for the user, since CSS paints top down in order, and with http/2’s streaming of downloads you will not see an impact on page load times.
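As a sketch of that cascade: the first bundle ships as a normal link tag in the head, while the later bundles can be attached once the page has painted. The file names here are hypothetical, and the injectable doc parameter exists only to keep the logic testable:

```javascript
// Attach a stylesheet bundle to the document head.
function appendStylesheet(doc, href) {
  const link = doc.createElement('link');
  link.rel = 'stylesheet';
  link.href = href;
  doc.head.appendChild(link);
  return link;
}

if (typeof window !== 'undefined') {
  // The grid/above-the-fold bundle is already a <link> in the head;
  // the mid-page and footer bundles arrive after the load event.
  window.addEventListener('load', function () {
    ['/css/mid-page.css', '/css/footer-modals.css'].forEach(function (href) {
      appendStylesheet(document, href);
    });
  });
}
```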

Like Google PageSpeed, Pingdom dings you with a browser caching penalty if you are not caching for long enough. With Pingdom I mind it less, since their tool is not, in my mind, a tool purely for SEO. One penalty does stick out, though: the minimize request size penalty. This is a pure http/1.1 issue that is not present in http/2; the latency problems of http/1.1 are gone, yet the penalty remains. Another holdover from http/1.1 is parallelizing downloads across hostnames. Under http/1.1 a site only had so many download channels, normally 8; files would queue until there was a free slot, then another file was downloaded. This is not the case with http/2, where many files can be downloaded at once, upwards of 50 depending on the client viewing the site. This advice is actually counterproductive to having a fast site today: looking up the extra hostnames takes more time than simply using http/2 to stream the files.

WebPageTest

WebPageTest is a fairly barebones tester. It mainly relies on loading times, the number of resources, and, lately, the first point at which your site is interactive, which are all great metrics in my opinion. These are the metrics that are truly actionable. It does get derailed, however, by its A through F ratings for certain aspects it deems necessary.

As far as issues with this testing engine go, there really are not any that I can see. It gives timed results, which are actionable, but at the same time it does not give much real advice. That might be for the best, since every other test seems to give bad advice.

Google Test My Site

This is Google’s latest testing engine. From what I can tell about the advice it gives, it seems to be using a mix of Google PageSpeed and Lighthouse. It is a mobile-first, or for that matter a purely mobile, test, so most of the advice is geared toward mobile optimizations. You can only access the suggested optimizations by email, though: you have to submit your email address to get a report. This can raise privacy concerns for some people, and it slows the testing process as well.

Most of the issues with this testing method have already been covered under Google PageSpeed Insights. One that is annoying, however, is the image compression check. We all know image compression is great and can reduce the amount of file data a site downloads. But come on, Google, at least say which image compression engine you use for your metric. It would give people a starting place for optimizing their images.

A better test

Better tests need to be created, I think we can all agree on that: tests that are in-depth for developers, highlight the issues, and point to actionable fixes for faster loading sites. The current tests use a shotgun approach and in some cases have little to no real-world basis. Here is a list of some things I think could be added to make the tests more relevant for developers.

Test for http/2 on the server

Test common responsive breakpoints (really looking at Google and their Nexus screen size with 4% market share)

Highlight issues in DNS waits and SSL negotiation

Have scores based more off of loading/rendering time than arbitrary metrics

The last feature that needs to be added needs more explanation than a bullet point: testing for interactivity. This is becoming crucial, and a fair number of developers never test for it. Over the last 6 months we have had several requests from clients with issues relating to this. Their sites’ speeds are fine, they might even be very fast loading sites, but users abandon them. The issue is the overloading of Javascript, to the point that a mobile processor cannot keep up, so the device freezes, stutters, and struggles while on the site. This is something WebPageTest is trying to capture with its first interactive metric, and I applaud them for doing so. This is the next great problem in optimization. With every SaaS company wanting to insert a snippet on your site for a tracking script, a chat script, a product suggestion feature, etc., mobile browsers are being overloaded in terms of processing. No good comes of a site being fully downloaded in 2 seconds if it freezes the device for 8 seconds.
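One way to see this problem on your own site is to watch for long tasks. The sketch below sums main-thread blockage beyond the 50 ms budget, which is roughly how interactivity-oriented metrics think about jank; the 50 ms threshold comes from the Long Tasks definition, and the reporting is my own choice:

```javascript
// Sum the time each long task runs past the 50 ms budget; this is
// the script work that makes a loaded page feel frozen.
function totalBlockingMs(entries) {
  return entries.reduce(function (sum, e) {
    return sum + Math.max(0, e.duration - 50);
  }, 0);
}

if (typeof PerformanceObserver !== 'undefined') {
  const po = new PerformanceObserver(function (list) {
    const blocked = totalBlockingMs(list.getEntries());
    if (blocked > 0) {
      console.warn('Main thread blocked roughly ' + blocked + ' ms');
    }
  });
  try {
    po.observe({ entryTypes: ['longtask'] });
  } catch (e) {
    // longtask entries are not supported in every browser
  }
}
```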

A side note on Google tests

Since most of the industry is of the opinion that the PageSpeed test is a ranking factor, it matters. The results of the test matter. Years ago I remember Google stating that they wanted to create a smarter web. While the current standards still lack a lot of the abilities needed for a totally smart web, you sometimes have to work within the framework you have. Using Google products on your site adds no value for the end user; most of the time they never know you are using a Google product, be it remarketing, analytics, or a font. The same is true of most other services, such as Facebook or VK. That is why I like to suggest hiding those resources from Google. By simply grabbing the user agent string and matching “google” in it, you can hide your font resources, analytics, and remarketing tags. This automatically lifts the dings Google gives you for using them in the first place. One thing to note as well: Google Bot has the Google Fonts library installed, so rendered screenshots will look the same, since the fonts load locally.
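A sketch of that user agent trick follows. Detection by UA string is trivially spoofable, and whether you gate server-side or client-side is up to you (Google Bot does execute Javascript); the font family chosen here is just an example:

```javascript
// Decide whether the current visitor looks like a Google crawler.
function isGooglebot(userAgent) {
  return /google/i.test(userAgent || '');
}

if (typeof document !== 'undefined' &&
    typeof navigator !== 'undefined' &&
    !isGooglebot(navigator.userAgent)) {
  // Only non-Google visitors get the external font stylesheet;
  // Google Bot renders with its locally installed fonts instead.
  const fontLink = document.createElement('link');
  fontLink.rel = 'stylesheet';
  fontLink.href = 'https://fonts.googleapis.com/css?family=Open+Sans';
  document.head.appendChild(fontLink);
}
```

The same gate can wrap the analytics and remarketing snippets.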

This article turned out a bit longer than I anticipated; if you made it this far, you are dedicated. Leave me a comment to let me know your thoughts, opinions, or any questions you might have.