BigTable: Google's Distributed Structured Storage System

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

MapReduce Used on Large Data Sets

MapReduce is a programming model and library designed to simplify distributed processing of huge datasets on large clusters of computers. This is achieved by providing a general mechanism which largely relieves the programmer from having to handle challenging distributed computing problems such as data distribution, process coordination, fault tolerance, and scaling. While working on Google maps, I've used MapReduce extensively to process and transform datasets which describe the earth's geography. In this talk, I'll introduce MapReduce, demonstrating its broad applicability through example problems ranging from basic data transformation to complex graph processing, all the in the context of geographic data.

Abstractions for Handling Large Datasets

MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets Jeff Dean, Google, Inc. Search is one of the most important applications used on the internet, but it also poses some of the most interesting challenges in computer science. Providing high-quality search requires understanding across a wide range of computer science disciplines, from lower-level systems issues like computer architecture and distributed systems to applied areas like information retrieval, machine learning, data mining, and user interface design. In this talk, I'll highlight some of the behind-the-scenes pieces of infrastructure that we've built in order to operate Google's services.

Building Large Systems at Google

Google deals with large amounts of data and millions of users. We'll take a behind-the-scenes look at some of the distributed systems and computing platform that power Google's various products, and make the products scalable and reliable.

YouTube Scalability

This talk will discuss some of the scalability challenges that have arisen during YouTube's short but extraordinary history. YouTube has grown incredibly rapidly despite having had only a handful of people responsible for scaling the site. Topics of discussion will include hardware scalability, software scalability, and database scalability.

Blaine Cook on Scaling Twitter

Blaine gave this talk on how they scaled Twitter. Twitter is probably the largest Ruby on Rails application in production today, so it was filled with lots of great insights including many not so obvious tips.

Wikipedia and MediaWiki



Over four years, MediaWiki has evolved from a quick hack to run a little-known encyclopedia web site to the monster engine behind a heavily-used public site, while maintaining the simplicity needed for an entry-level intranet wiki. Brion reviews past and future directions for Wikipedia's software and hardware, and how modern buzzword technologies could power and simplify the wiki world.

Behind the Scenes at LiveJournal: Scaling Storytime

The history and lessons learned while scaling a community site (LiveJournal.com) from a single server with a dozen friends to hundreds of machines and 10M+ users. What's worked, what hasn't, and all the things we've had to build ourselves, now in common use thoughout the "Web 2.0" world, including memcached, MogileFS, Perlbal, and our job dispatch systems.

Lessons In Building Scalable Systems

Since launching Google Talk in the summer of 2005, we have integrated the service with two large existing products: Gmail and orkut. Each of these integrations provided unique scalability challenges as we had to handle a sudden big increase in the number of users. Today, Google Talk supports millions of users and handles billions of packets per day. I will discuss several practical lessons and key insights from our experience that can be used for any project. These lessons will cover both engineering and operational areas. Reza Behforooz is a Staff Engineer at Google and is currently the technical lead for the Google Talk servers. He's passionate about building large systems and working on communication products in an attempt to make the world a smaller place. While at Google, he has primarily worked on Google Talk, Gmail, orkut, Google Groups, and shared infrastructure used by several Google applications.

Distributed Caching Essential Lessons

In this presentation, recorded at Javapolis, Cameron Purdy shows how to improve application performance & scalability via caching architectures to reduce load on the database tier and & clustered caching to provide transparent fail-over by reliably sharing live data among clustered JVMs.

Lustre File System

Lustre is a scalable open source Linux cluster file system that powers 6 of the top 10 computers in the world. It is resold by HP, SUN, Dell and many other OEM and storage companies, yet produced by a small powerful technology company, Cluster File Systems, Inc. This lecture will explain the Lustre architecture and then focus on how scalability was achieved. We will address many aspects of scalability mostly from the field and some from future requirements, from having 25,000 clients in the Red Storm computer to offering exabytes of storage. Performance is an important focus and we will discuss how Lustre serves up over 100GB/sec today going to 100TB/sec in the coming years. It will deliver millions of metadata operations per second in a cluster and, write 10's of thousands of small files per second on a single node.

Scalable Test Selection Using Source Code

As the number of automated regression tests increase, the ability to run all of them in a reasonable amount of time becomes more and more difficult, and simply doesn't scale. Since we are looking for regressions, it is useful to hone in on the parts of the code that have changed from the last run to help select a small subset of tests that are likely to find the regression. In this way we are only running the tests that need to be run as your system gets larger and the number of possible tests scales outward. We have devised a method to select a subset of tests from an existing test set for scalable regression testing based on source code changes, or deltas. The selection algorithm is a static data mining technique that establishes the relationship between source code deltas and test case execution results. Test selection is then based on the established correlation. In this talk, we will discuss the benefits and also the pitfalls involved in having such an infrastructure. Finally, we will talk about how best to add it to a nightly or continuous test automation infrastructure. Ryan Gerard is currently an SQA Engineer at Symantec. He has a BS in Computer Science and Engineering from UCLA, and is currently pursuing his MS in Information Security. Ryan’s particular specialties are in web technologies and security testing, although his interests span kernel-level technologies to process improvements to data analysis.

SCTPs Reliability and Fault Tolerance

Low cost clusters are usually built from commodity parts and use standard transport protocols like TCP/IP. Once systems become large enough, reliability and fault tolerance become an important issue and TCP/IP often requires additional mechanisms to ensure reliability of the application. The Stream Control Transmission Protocol (SCTP) is a newly standardized transport protocol that provides additional mechanisms for reliability beyond that of TCP. The added reliability and fault tolerance of SCTP may function better for MapReduce-like distributed applications on large commodity clusters. SCTP has the following features that provide additional levels of reliability and fault tolerance. Selective acknowledgment (SACK) is built-in to the protocol with the ability to express larger gaps than TCP; as a result, SCTP outperforms TCP under loss. For cluster nodes with multiple interfaces, SCTP supports multihoming, which transparently provides failover in the event of network path failure. SCTP has the stronger CRC32c checksum which is necessary with high data rates and large scale systems. SCTP also allows multiple streams within a single connection, providing a solution to the head- of-line blocking problem present in TCP-based farming applications like Google's MapReduce. Like TCP, SCTP provides a reliable data stream by default, but unlike TCP, messages can optionally age or reliability can be disabled altogether. The SCTP API provides both a one-to-one (like TCP) and a one-to-many (like UDP) socket style; use of a one-to-many style socket can reduce the number of file descriptors required by an application, making it more scalable.

Building a Scalable Resource Management

This talk will describe the architecture and implementation details for building a highly scalable resource management layer that can support a variety of applications and workloads. This technology has evolved from large scale computing grids deployed in production at customers such as Texas Instruments, AMD, JP Morgan, and various government labs. We will show how to build a centralized dynamic load information collection service that can handle up to 5000 nodes/20,000 cpus in a single cluster. The service is able to gather a variety of system level metrics and is extensible to collect up to 256 dynamic or static attributes of a node and actively feed them to a centralized master. A built-in election algorithm ensures timely failover of the master service ensuring high-availability without the need for specialized interconnects. This building block is extended to multiple clusters that can be organized hierarchically to support a single resource management domain that can span multiple data centers. We believe the current architecture could scale to 100,000 nodes/400,000 cpus. Additional services such as a distributed process execution service, and a policy-based resource allocation engine which leverage this core scale-out clustering service are described. The protocols, communication overheads, and various design tradeoffs that were made the development of these services will be presented along with experimental results from various tests, simulations and production environments.

VeriSign's Global DNS Infrastucture

VeriSign's global network of nameservers for the .com and .net domains sees 500,000 DNS queries per second during its daily peak, and ten times that or more during attacks. By adding new servers and bandwidth, we've recently increased capacity to handle many times that query volume. Name and address changes are distributed to these nameservers every 15 seconds -- from a provisioning system that routinely receives one million domain updates in an hour. In this presentation we describe VeriSign's production DNS implementation as a context for discussing our approach to highly scalable, highly reliable architectures. We will talk about the underlying Advanced Transactional Lookup and Signaling software, which is used to handle database extraction, validation, distribution and name resolution. We also will show the central heads-up display that rolls up statistics reported from each component in the infrastructure.

Scaling Google for Every User

Marissa Mayer, Vice President, Search Products & User Experience, leads the product management efforts on Google's search products – web search, images, groups, news, Froogle, the Google Toolbar, Google Desktop, Google Labs, and more. She joined Google in 1999 as Google's first female engineer and led the user interface and webserver teams at that time.

Scalability and Efficiency on Data Mining Applied to Internet Applications

The Internet went well beyond a technology artefact, increasingly becoming a social interaction tool. These interactions are usually complex and hard to analyze automatically, demanding the research and development of novel data mining techniques that handle the individual characteristics of each application scenario. Notice that these data mining techniques, similarly to other machine learning techniques, are intensive in terms of both computation and I/O, motivating the development of new paradigms, programming environments, and parallel algorithms that support scalable and efficient applications. In this talk we present some results that justify not only the need for developing these new techniques, as well as their parallelization.

Customizable Scalable Compute Intensive Stream Queries

GSDM is a data stream management system running on cluster computers. The system is extensible through user-defined data representations and computations. The computations are specified as stream queries continuously computed over windows of data sliding over the streams.



Our applications include a virtual radio telescope requiring advanced computations over huge streams of data. The system needs to be highly scalable w.r.t. both data and computations. Rather than providing only built-in distribution strategies, the system allows the user to define distribution templates to specify customized distribution strategies for user functions in stream queries. The distribution templates are shown to provide scalability for stream computations that grow more expensive with larger windows.



Distribution templates can be defined in terms of other distribution templates, enabling specification of large distribution patterns of communicating computing nodes. This also allows optimizing templates that generate new templates based on profiling the computations.

Performance Tuning Best Practices for MySQL

Learn where to best focus your attention when tuning the performance of your applications and database servers, and how to effectively find the "low hanging fruit" on the tree of bottlenecks. It's not rocket science, but with a bit of acquired skill and experience, and of course good habits, you too can do this magic! Jay Pipes is MySQL's Community Relations Manager for North America.

A Googly MySQL Cluster Talk

Introduction to MySQL Cluster The NDB storage engine (MySQL Cluster) is a high-availability storage engine for MySQL. It provides synchronous replication between storage nodes and many mysql servers having a consistent view of the database. In 4.1 and 5.0 it's a main memory database, but in 5.1 non-indexed attributes can be stored on disk. NDB also provides a lot of determinism in system resource usage. I'll talk a bit about that.



New features in 5.1 including cluster to cluster replication, disk based data and a bunch of other things. anybody that is attending the mysql users conference may find this eerily familiar.

MySQL Scaling and High Availability Architectures

Related Posts

This month I present to you the amazingly good videos on scalability and scalable architectures.They cover real life scalability issues and solutions at YouTube, Twitter, LiveJournal and Wikipedia. Some other videos are on more theoretical aspects of scalability such as design of effective data storage systems and effective algorithms to read and modify the data.Also I found some video lectures on MySQL tuning, clustering and high availability.Enjoy!Along the way I found some nice blogs on scalability, here is one with various presentations:And here is one collecting slides, talks, audios and videos on scalability:That's it for this month! Hope you enjoy it! Have fun!

Labels: bigtable, clusters, data, data sets, distributed, dns, filesystem, google, high availability, livejournal, map reduce, mining, mysql, scalablility, systems, twitter, verisign, wikipedia, youtube