Google’s Jeff Dean gave an excellent talk at Stanford as part of EE380 – it is well worth your time to listen. Very informative, instructive, and innovative. As I listened, I jotted down a few quick notes.

Interesting comparison of the scale of search from 1999 to 2010: docs and queries are up 1000X, while query latency has decreased 5X. Interesting to hear that in 1999 they used to update the web page store every month or two, but now that latency is down 50000X, to seconds!

They have had 7 significant revisions in 11 years.

Trivia: They encounter very expensive queries; for example, “circle of death” requires ~30 GB of I/O.

Trivia: In 2004, they did a rethink and rebuilt the systems infrastructure from scratch.

He discussed encodings a little – an informative discussion of byte-aligned variable-length and group encoding schemes << I have to try it out …
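If the “group” scheme here is what is commonly called group varint encoding (my assumption – this is not code from the talk), the idea is to encode integers four at a time behind a single prefix byte holding four 2-bit length fields, so the decoder reads byte-aligned values without the per-byte continuation checks of a classic varint. A minimal toy sketch in Python:

```python
def byte_len(v: int) -> int:
    """Bytes needed to hold a uint32 value (1 to 4)."""
    if v < (1 << 8):
        return 1
    if v < (1 << 16):
        return 2
    if v < (1 << 24):
        return 3
    return 4

def encode_group(values):
    """Encode exactly four uint32s as one prefix byte plus their payload bytes."""
    assert len(values) == 4
    prefix = 0
    payload = bytearray()
    for i, v in enumerate(values):
        n = byte_len(v)
        prefix |= (n - 1) << (2 * i)             # 2-bit length field per value
        payload += v.to_bytes(4, "little")[:n]   # keep only the low-order bytes
    return bytes([prefix]) + bytes(payload)

def decode_group(buf):
    """Decode one group; return the four values and the bytes consumed."""
    prefix, pos = buf[0], 1
    values = []
    for i in range(4):
        n = ((prefix >> (2 * i)) & 0x3) + 1
        values.append(int.from_bytes(buf[pos:pos + n], "little"))
        pos += n
    return values, pos

encoded = encode_group([7, 300, 70000, 5])
print(len(encoded), decode_group(encoded))   # 8 ([7, 300, 70000, 5], 8)
```

Because all four lengths are known up front from the prefix byte, a real decoder can replace the loop with a 256-entry lookup table, which is where the speed-up over one-byte-at-a-time varint decoding comes from.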

Trivia: They have had long-distance links taken down by wild dogs, sharks, dead horses, and (in Oregon) drunken hunters!

Jeff talked at length about MapReduce, with an interesting set of statistics about MapReduce usage over time: MapReduce at Google is now at 4 million jobs, processing ~1000 PB with 130 PB of intermediate data and 45 PB of output. From ’07 to ’10, the data has doubled while the number of machines has stayed constant, and machine usage has quadrupled while job completions have doubled.

Trivia: Jeff shared an anecdote where the network engineers were rewiring the network while Jeff & Co. were running a MapReduce. They were losing machines in a strange pattern and wondering what was going on, but the job succeeded, a little slower than normal, and of course the machines came back up! Only after the fact did they hear about the network rewiring!
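For readers who have not used it, here is a toy illustration of the MapReduce programming model those numbers refer to (just the map / shuffle / reduce shape in plain Python – nothing to do with Google’s actual implementation): word count, with a map function emitting (word, 1) pairs and a reduce function summing the values for each word.

```python
from collections import defaultdict

def map_fn(doc):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in doc.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: fold all the values emitted for one key."""
    return word, sum(counts)

def mapreduce(docs, map_fn, reduce_fn):
    groups = defaultdict(list)            # the "shuffle": group values by key
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(mapreduce(["the cat sat", "the dog sat"], map_fn, reduce_fn))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```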

He is working on a project called Spanner that spans data centers. Looks like this is one of their hairy problems; Dean also mentioned it during Q & A. All their systems work well inside a datacenter, but they have no way of spanning datacenters. They have manual methods and task-specific tools to copy data across datacenters, monitor tasks across datacenters, and so forth, but no systematic infrastructure. Declarative policies, common namespaces, transactions, strong and weak consistency, and automation are all parts that Spanner addresses.

I think the most important part of the talk was the final section on experiences and patterns.

Break large systems into smaller services << we have heard this from Amazon as well; the Google page touches 200+ services (same with the Amazon page).

One should be able to estimate performance based on back-of-the-envelope calculations (a rough worked example is sketched below). I am compelled to insert Jeff’s “Numbers Everyone Should Know” as it is a very useful chart; I hope Jeff doesn’t mind. [Update 11/17/10] Thanks to Kevin Le – this chart comes from Norvig’s blog.

Identify common problems and build general systems to address them. It is very important not to be all things to all people; paraphrasing Jeff, “that final feature will drive you over the edge of complexity”!

Don’t design to scale infinitely – consider 5X–50X growth, but > 100X requires a redesign << very insightful.

He likes the centralized-master design and so does not suggest a fully distributed system; of course, the master should not be in the data plane but can be a control-plane artifact. He also prefers multiple small units of work per machine to one mongo job running on a big machine – smaller units of work are easier to recover, load-balance, and scale << agreed!
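Since the chart itself did not make it into this post, here is a rough back-of-the-envelope sketch in its spirit. The constants are era-appropriate approximations I am assuming, not values copied from Jeff’s slide: estimate how long it takes to assemble a results page of 30 thumbnails from ~256 KB source images, reading serially versus in parallel.

```python
# Rough, assumed constants for disks of that era (not figures from the talk).
SEEK_MS = 10.0          # one disk seek
DISK_MB_PER_S = 30.0    # sequential read throughput
IMAGES = 30             # thumbnails on the page
IMAGE_KB = 256          # size of each source image

# Time to read one source image sequentially after the seek.
read_ms = (IMAGE_KB / 1024.0) / DISK_MB_PER_S * 1000.0

serial_ms = IMAGES * (SEEK_MS + read_ms)   # read one image after another
parallel_ms = SEEK_MS + read_ms            # issue all reads in parallel

print(f"serial:   ~{serial_ms:.0f} ms")    # ~550 ms
print(f"parallel: ~{parallel_ms:.0f} ms")  # ~18 ms
```

The point is the shape of the reasoning: a handful of remembered constants – seek time, sequential throughput, round-trip latencies – is enough to choose between designs before writing any code.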

He concludes the talk by saying there is a lot of interesting “Big Data” available. I have seen this emergence of Data Scientists from multiple sources, here and here. << I agree – this is my main focus as well …

A couple of insights from the Q & A session:

They run chained sequences of M/Rs that implement parts of a larger algorithm as a sequence of steps, rather than a Map-Reduce-Reduce-… pattern (a toy sketch of this chaining follows below).

Predictive search is more of a resource issue than a fundamental change in the underlying infrastructure.

He wishes they had incorporated distributed transactions as part of their infrastructure, for example in GFS et al.; many internal apps have rolled their own. BTW, Spanner has distributed transactions.
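A toy sketch of that chaining pattern, in the same spirit as the word-count example above (my framing, not code from the talk): each step is a complete map + shuffle + reduce pass, and the output records of one step become the input of the next.

```python
from collections import defaultdict

def run_step(records, map_fn, reduce_fn):
    """One MapReduce step: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

docs = ["the cat sat", "the dog sat on the cat"]

# Step 1: word count -> records of (word, count).
counts = run_step(docs,
                  lambda doc: ((w, 1) for w in doc.split()),
                  lambda w, ones: (w, sum(ones)))

# Step 2: invert -> records of (count, [words]), fed directly from step 1's output.
by_count = run_step(counts,
                    lambda kv: [(kv[1], kv[0])],
                    lambda n, words: (n, sorted(words)))

print(sorted(by_count))
# [(1, ['dog', 'on']), (2, ['cat', 'sat']), (3, ['the'])]
```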

Overall an excellent talk, as always … and a big thanks to Jeff Dean …

[Update 11/12/10] Last year Jeff gave a similar talk on scalability at WSDM 2009. Video [here] and slides [here]. Notes [here] and [here].

[Update 5/11/12] Question on Quora: “What are the numbers that every computer engineer should know, according to Jeff Dean?”