Browser maker Opera has published the early results of an ongoing study that aims to provide insight into the structure of Internet content. To conduct this research project, Opera created the Metadata Analysis and Mining Application (MAMA), a tool that crawls the web and indexes the markup and scripting data from approximately 3.5 million pages.

Statistical analysis of the data collected by MAMA has provided Opera's engineers with a unique understanding of emerging trends in web development and the way that standards-based web technologies are used on the Internet. Opera plans to take the project to the next level by building a search engine on top of the indexed data so that web designers, browser implementers, and standards experts can easily obtain information about real-world usage of web technologies.

The preliminary data published today by Opera provides some intriguing statistics about the use of specific HTML elements. Among the pages analyzed by MAMA, the most popular HTML tags were HEAD, TITLE, HTML, BODY, A, META, IMG, and TABLE. The list of least popular tags includes VAR, DEL, and BDO.
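Opera has not described MAMA's internals here, but the basic idea of ranking tags by how many pages use them is easy to illustrate. The following is a minimal sketch using only Python's standard library. The `TagCollector` class, the `tag_page_counts` function, and the sample pages are hypothetical names and data made up for this illustration, not part of MAMA or the study.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the set of distinct element names seen in one page."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        self.tags.add(tag)


def tag_page_counts(pages):
    """Return a Counter mapping tag name -> number of pages that use it."""
    counts = Counter()
    for markup in pages:
        collector = TagCollector()
        collector.feed(markup)
        collector.close()
        # Each page contributes at most 1 per tag, however often it repeats.
        counts.update(collector.tags)
    return counts


# Hypothetical sample pages, not data from the study.
SAMPLE_PAGES = [
    "<html><head><title>One</title></head>"
    "<body><a href='#'>x</a></body></html>",
    "<html><head><title>Two</title></head>"
    "<body><img src='a.png'><var>n</var></body></html>",
]

if __name__ == "__main__":
    for tag, n in tag_page_counts(SAMPLE_PAGES).most_common():
        print(f"{tag}: {n} of {len(SAMPLE_PAGES)} pages")
```

Counting each tag once per page, rather than once per occurrence, matches the way the article's figures are phrased: popularity is about how many pages use a tag, not how many times it appears.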

Opera also studied the prevalence of rich web content technologies and scripting mechanisms that are typically associated with Ajax. The study found that Adobe Flash is used on roughly 35 percent of all web sites. Flash is most popular in China, where it was found on 67 percent of the web sites analyzed by MAMA, and least popular in Denmark, where it was found on 25 percent of web sites. The XMLHttpRequest scripting mechanism, one of the cornerstones of Ajax, is used on roughly 3.2 percent of the indexed web sites. It is most popular in Norway, where it was found on 10 percent of pages.

The study found that Cascading Style Sheets (CSS) are very widely used, appearing inline or by reference on 80 percent of the sites indexed by MAMA. The most popular CSS properties relate to color and fonts. JavaScript is also extremely common and is found on 75 percent of indexed web sites.

Standards compliant?



Opera also ran the pages indexed by MAMA through the W3C's validation tools to see how many conform to standards. The results show that only 4.13 percent are valid. A more startling conclusion that Opera derived from its MAMA data is that only 50 percent of sites that display a badge touting validation are actually valid. This could indicate that many sites that are initially designed with valid HTML later cease to be valid as changes are made and new content is added.

Opera analyzed page meta tags to see if there were any correlations between editing tools and validation rates. Surprisingly, Apple's iWeb delivered the highest rate of valid pages: the study shows that 81 percent of pages created with iWeb were valid. By comparison, only 3.4 percent of pages created with Adobe Dreamweaver were valid.
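Many editing tools identify themselves in a `<meta name="generator">` tag, which is presumably the kind of meta tag such an analysis would rely on, though the article does not say exactly which tags Opera examined. The sketch below shows one way to group pages by declared generator and compute a per-tool validity rate. It assumes the valid-or-invalid verdict for each page has already been obtained from an external validator; `GeneratorFinder`, `find_generator`, and `validity_rate_by_generator` are hypothetical names for this illustration, not Opera's code.

```python
from collections import defaultdict
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Captures the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.generator is not None:
            return
        # Attribute values can be None when an attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() == "generator":
            self.generator = attr_map.get("content", "").strip() or None


def find_generator(markup):
    """Return the declared generator string, or None if absent."""
    finder = GeneratorFinder()
    finder.feed(markup)
    finder.close()
    return finder.generator


def validity_rate_by_generator(pages):
    """pages: iterable of (markup, is_valid) pairs.

    is_valid is assumed to come from an external validator; this
    function performs no validation of its own.
    """
    totals = defaultdict(lambda: [0, 0])  # generator -> [valid, total]
    for markup, is_valid in pages:
        generator = find_generator(markup) or "(none declared)"
        totals[generator][1] += 1
        if is_valid:
            totals[generator][0] += 1
    return {g: valid / total for g, (valid, total) in totals.items()}
```

A real analysis would also need to normalize generator strings, since the same tool typically reports different version numbers across pages.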

The initial results of Opera's study are fascinating, but its true value hasn't yet been fully unlocked. Opera's efforts to build a search engine on top of MAMA will open the door for some really exciting analysis and will enable third parties to use and repurpose the data for their own studies and projects.

"The Web is fragmented, complex and always evolving. MAMA's vast database provides us with detailed information about how Web technologies are used," said Opera vice president of quality assurance, Snorre M. Grimsby, in a statement. "This is key in our efforts to test and ensure high-quality compatibility, stability, and performance of our products, and we want to share it with our peers, so they can benefit from it, too."

Indeed, this project is a laudable gift to the web development community and web standards bodies. Its usefulness will continue to grow as Opera extends its scope and adds more functionality to accommodate broader research.
