When we published the results of our eyetracking study of flat design, I was surprised to find that the topic is still quite controversial (despite being a popular design style since 2011). Many of the comments, threads, and tweets posted in response to the article were positive. Others were critical, and merit a response.

About the Series of Flat-Design Studies

Before I respond to the individual complaints, I'll point out that our conclusions regarding the problems of flat design stem from two years of UX research, using a variety of approaches, only one of which was eyetracking. Our eyetracking study was simply the latest in a series, and we triangulated insights from a much broader range of inquiry than what was described in the article reporting on it.

This two-year effort provided 3 contextual benefits that we would have lacked if we had simply run a single study:

1. By gradually deepening our understanding of flat design, we were able to define the hypotheses for the eyetracking research. We did not just go on a fishing expedition to see if anything would turn up with this expensive methodology. Our qualitative findings suggested metrics that could be meaningful.

2. By triangulating insights across methods, we increased our confidence that our conclusions were not caused by a weakness of any individual study.

3. Across the two years, we looked at a much broader range of web designs than the ones chosen for the study. This prior experience allowed us to (a) pick study stimuli that were representative of many of the sites we saw, and (b) feel confident that our conclusions apply to a broad range of websites.

Our full body of work on flat design has been published in a series of articles, videos, and an online seminar.

Criticisms of the Flat-Design Eyetracking Study

As mentioned above, our article generated a fair amount of skepticism. Below are the most important complaints we received, along with my responses to them.

“The title was misleading.”

Compared to the others, this is the argument that I personally find most valid. I always struggle with writing headlines for my articles. I find it particularly difficult when I’m trying to summarize a complex and nuanced research study into 70 characters. If I could’ve, I would’ve titled the article “Weak Signifiers Used in Flat Design Can Attract Less Attention and Cause Uncertainty.”

Unfortunately, writing article titles necessarily involves simplification. We love technical and theoretical terminology (like “weak signifiers”) but those aren’t the terms that work best in headlines, because that isn’t the language the majority of our readership is familiar with. I thoroughly agree with Cigna’s Sean Dexter that you shouldn’t make decisions based solely on article titles.

Some people complained that our study didn’t just test 3D/skeuomorphic designs against 2D designs, but also compared practices like ghost buttons and links styled as static text. The strictest, simplest definition of flat design would be an interface without any 3D or skeuomorphic effects. However, in practice, when people use the term flat design, they commonly refer to more than just the absence of drop shadows. Flat design is a reaction against the heavy-handed skeuomorphism of traditional clickability clues, and so it often includes design patterns like ghost buttons and static-looking links. We decided to use that meaning of flat design in designing our study, rather than simply a lack of depth.

Not every flat UI will use each of the associated techniques we tested. Flat design doesn’t always mean broken interaction design, of course. But flat design does almost always mean subtler, less noticeable visual cues that don’t help users recognize their available actions in an interface. And that’s what we attempted to test.

“The comparisons weren’t fair to flat design.”

Our study goal was to compare strong, traditional visual signifiers against weak or absent visual signifiers, which are strongly associated with flat design and frequently found in flat UIs.

Of course, not every single instance of flat design will look like our stimuli. But very many do. Every single change we made to the weak signifier UIs we tested came from real flat websites. Many designers creating flat UIs do use links styled exactly like static text and empty ghost buttons instead of colored buttons.

Some critics pointed out that the weak-signifier designs were low-contrast compared to the strong-signifier versions. For example, empty ghost buttons lacked contrast when compared to traditional 3D buttons. But differences in contrast were precisely what we were trying to compare: strong, traditional, high-contrast, consistent clickability clues against their flatter, thinner, subtler alternatives.

“I would never have designed those interfaces like that.”

As a variant of the previous complaint, some tweeters claimed that they would never have designed the pages that lost in the eyetracking study. Well, that’s a common reaction. Once usability research data has pinpointed a certain number of design problems with a screen, it becomes obvious to anybody with a decent amount of UX knowledge why that UI is wrong and should be changed. Usability problems become obvious after they are identified.

More to the point, even if you would not have designed the UI that failed the test, somebody did design those sites. Those designers are not all stupid and incompetent: most flat sites come from big, well-funded companies, and many of the live sites that inspired our study have no obvious usability problems beyond their flat visual style.

“NN/g just hates flat design.”

We’ve often spoken out against new design patterns when they emerge. That’s because we do the research. When the data suggests that a trendy new approach is having a detrimental effect on users, we’ll say so, even at the risk of being unpopular.

That said, we aren’t arguing against flat design. As I said in my previous article, flat design can be done well. Flat design isn’t the enemy — weak or absent signifiers are. The inherent danger of flat design is that it looks easy to execute but is difficult to implement without causing usability problems.

Again, experienced designers who conduct usability research on their products can probably create a flat UI that works just fine. We didn’t find any quantitative evidence to suggest that users can’t figure out flat designs — just that they sometimes struggle more to find what they’re looking for.

“The pages in the study are no longer live on the websites in question.”

As stated in the original article, we didn’t aim to study the usability of any of the websites that inspired the research stimuli. Sites change all the time, for the better or sometimes for the worse. If any of the companies had hired us to consult on how to improve their online business performance, we would have conducted a very different study, looking at broader issues in user experience, and not just comparing signifiers of different degrees of flatness.

For our research, we needed pairs of stimuli that were realistic for current websites, which is why we derived them from pages that were live on the web back when we planned the study. There are many reasons why some of the companies might have changed their websites during the several months between our initial planning and the publication of our results. Maybe they ran their own user testing studies and discovered issues that needed fixing. We don’t claim that these companies have (or had) bad sites. We simply claim that a certain design style has a high potential for usability problems (unless one follows our guidelines for reducing the UX risks of flat design).

“There weren’t enough users/enough sites.”

That’s what statistical significance is for: it tells us how confident we can be that the findings are real, given the number of participants or sites in the study. No study will be 100% guaranteed to be correct for all the people or sites in the world, unless you test every last person and every single site. But we can have a high degree of confidence in our results if the statistical analysis says so. In our study, the results were statistically significant at the p<0.05 level, which is the conventional threshold for published academic research.

No, p<0.05 doesn’t give us 100% certainty, but it does mean that a difference this large would show up by chance less than 5% of the time if there were no real effect. And for sure, this level of statistical significance is vastly better than chance, which is what one would get by guessing in the absence of any data.

Some data is always better than no data, and in this case our data is good.
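To make the arithmetic behind such a claim concrete, here is a minimal sketch in Python of the kind of paired comparison that produces a p-value. The fixation counts below are invented for illustration and are not our study’s data, and the specific test used (a paired t-test via scipy) is an assumption for the sake of the example, not necessarily the analysis we ran.

# Hypothetical example only: these fixation counts are invented, not study data.
from scipy import stats

# Average fixations needed to find the target on each of 9 paired page designs
strong_signifiers = [12, 9, 14, 11, 10, 13, 8, 12, 11]
weak_signifiers = [15, 13, 17, 12, 14, 16, 11, 15, 13]

t_statistic, p_value = stats.ttest_rel(weak_signifiers, strong_signifiers)
print(f"t = {t_statistic:.2f}, p = {p_value:.4f}")
# If p < 0.05, a difference this large would rarely arise by chance alone,
# which is the sense in which a finding is "statistically significant."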

“Sites were not representative of the entire internet.”

We tested 9 very different sites across 6 very different domains (ecommerce, hotel, travel, technology, finance, and nonprofit). That’s 800% more sites and 500% more domains than most people test — and many don’t even bother to test their own website.

As always, it’s surely best if you test your own website to make sure that you are not a special case. However, many mainstream websites fall roughly within the design space spanned by our stimuli and therefore our results should be roughly applicable to these sites.

“More research is needed.”

Of course! More research is always needed, since no study can explore all variants and details of a research question. In the article, I devote an entire section to the limitations of our study. We only had the time and resources to run a 70-user eyetracking study on 9 paired designs. We only used small findability tasks, rather than full, realistic, natural tasks. I would love to see more research on this topic, even if it contradicts our findings.

Keep in mind that the findings from our study were based on fine-grained metrics (the number and duration of fixations on a single page) that are rarely, if ever, used in usability studies. It’s very possible that coarser performance measures, such as task time and success, would show no difference between flat and non-flat sites, because the page-level time differences may be absorbed by the larger variability inherent in more realistic tasks. However, that is the strength of our study: it shows that there is a hidden, hard-to-measure cost of weak signifiers.
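As a rough illustration of how a real page-level cost can be hidden by task-level noise, here is a small simulation sketch in Python. All the numbers in it (the 0.4-second penalty per page, the task durations, the amount of variability) are invented assumptions for illustration, not measurements from our study.

# Hypothetical simulation, not study data: a small per-page cost of weak
# signifiers gets swamped by the much larger variability of realistic tasks.
import random

random.seed(0)

def page_time(weak_signifiers):
    """Seconds spent scanning one page; weak signifiers add a small cost."""
    base = random.gauss(mu=3.0, sigma=0.5)            # invented page-scan time
    return base + (0.4 if weak_signifiers else 0.0)   # invented 0.4 s penalty

def task_time(weak_signifiers):
    """A realistic task spans several pages plus other, highly variable work."""
    pages = sum(page_time(weak_signifiers) for _ in range(5))
    other_work = random.gauss(mu=60.0, sigma=20.0)    # reading, typing, deciding
    return pages + other_work

weak = [task_time(True) for _ in range(70)]
strong = [task_time(False) for _ in range(70)]
print(f"mean task time, weak signifiers:   {sum(weak) / len(weak):.1f} s")
print(f"mean task time, strong signifiers: {sum(strong) / len(strong):.1f} s")
# The built-in ~2-second difference is small relative to the ~20-second
# task-level noise, so a coarse task-time comparison may miss it even
# though the page-level cost is real.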

I’d be particularly interested in seeing how weak or absent signifiers affect discoverability — that is, whether users notice features or elements they weren’t expecting to find.

Future research will surely discover many more nuances about the impact of flat design and different styles of signifiers on the total user experience, and that impact will certainly evolve over time. But we believe that our core finding, that weak visual signifiers like those often used in flat design attract less attention than strong visual signifiers, is likely to remain valid.