Big Data Reading Group

Information

What? This reading group will cover some of the most influential modern (last 5-10 years) developments in algorithms for large data processing. No background knowledge (except general mathematical maturity) is required.

When? Weekly meetings on Fridays, 3:30pm, in Towne 311 during the Fall 2014 semester.

How can I join? Please send an e-mail to Grigory Yaroslavtsev, grigory (at) grigory (dot) us, to join the group's mailing list. Each participant will be expected to pick one of the papers for presentation and discussion with the rest of the group. Each presentation is expected to cover the technical parts of the paper in detail. Preparing slides is not necessary, and it's perfectly fine to use the paper and/or notes during the presentation.

What's next? Stay tuned: I will be teaching a related class next semester (Spring 2015). This reading group and the class are complementary and will cover different sets of topics. The main difference is that the class will cover the foundations of algorithms for massive data in more detail, while the reading group will focus on the most recent developments. Try some open problems: sublinear.info



Topics

Increasingly large datasets are being collected by governments, private companies and research institutions. This motivates increased interest in the design of algorithms that can analyze such data with rigorous guarantees. In this reading group we will consider scenarios in which the data is too large to fit into the main memory of a single machine. The two main paradigms of computation that we will focus on are massively parallel computation (applicable to frameworks such as Yahoo!'s Hadoop, Google's MapReduce and Microsoft's Dryad) and streaming algorithms (Apache Storm and Spark Streaming).

Massively Parallel Algorithms In massively parallel computational systems (clusters) the data is partitioned between a large number of identical machines connected via a high-speed network. An algorithm proceeds in synchronous rounds, each consisting of local computation performed by each machine followed by an exchange of information through the network. The typical goal of algorithm design is to minimize the number of synchronous rounds, while also keeping the local running time, the space per machine and the total communication low.
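The round structure described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the model from any of the papers below: the helper names `run_round`, `count_locally` and `word_count` are hypothetical, the "machines" are plain Python lists, and routing by `zlib.crc32` is just one convenient deterministic choice. It computes word counts using one round of communication followed by a final local merge.

```python
import zlib
from collections import Counter, defaultdict


def run_round(machines, local_step):
    """Simulate one synchronous round: local computation, then exchange.

    `machines` is a list of per-machine inputs. `local_step(data, m)` yields
    (destination_machine, message) pairs; delivered messages form the
    machines' inputs for the next round.
    """
    num_machines = len(machines)
    inboxes = defaultdict(list)
    for data in machines:
        for dest, message in local_step(data, num_machines):
            inboxes[dest].append(message)
    return [inboxes[i] for i in range(num_machines)]


def count_locally(words, num_machines):
    """Local step: pre-aggregate counts, then route each word by its hash."""
    for word, count in Counter(words).items():
        # Hypothetical routing rule: all copies of a word go to one machine.
        dest = zlib.crc32(word.encode("utf-8")) % num_machines
        yield dest, (word, count)


def word_count(machines):
    """Word count using one communication round plus a final local merge."""
    inboxes = run_round(machines, count_locally)
    totals = {}
    for inbox in inboxes:
        local = Counter()
        for word, count in inbox:
            local[word] += count
        totals.update(local)  # each word lives on exactly one machine
    return totals
```

For example, `word_count([["a", "b", "a"], ["b", "c"], ["a"]])` returns the counts a: 3, b: 2, c: 1. Pre-aggregating with `Counter` before the exchange is what keeps communication low: each machine sends at most one message per distinct word it holds.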

Some papers suggested for reading:

Howard J. Karloff, Siddharth Suri, Sergei Vassilvitskii: A Model of Computation for MapReduce. SODA 2010.

Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, Sergei Vassilvitskii: Filtering: a method for solving graph problems in MapReduce. SPAA 2011.

Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii: Scalable K-Means++. VLDB 2012.

Jon Feldman, S. Muthukrishnan, Anastasios Sidiropoulos, Clifford Stein, Zoya Svitkina: On distributing symmetric streaming computations. SODA 2008.

Siddharth Suri, Sergei Vassilvitskii: Counting triangles and the curse of the last reducer. WWW 2011.

Bahman Bahmani, Kaushik Chakrabarti, Dong Xin: Fast personalized PageRank on MapReduce. SIGMOD 2011.

Streaming Algorithms Data streams represent a large dataset as a sequence of updates to its entries. Streaming algorithms extract only a small amount of information about the data stream (a "sketch") and compute an answer based on that. Such algorithms are typically allowed to make only one pass over the data (or very few passes). The typical goal of algorithm design is to minimize the number of passes and the space used, while achieving a good approximation guarantee.
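As a concrete instance of a sketch, here is a minimal version in the spirit of the Count-Min sketch from the reading list below. It is an illustrative assumption-laden sketch rather than the paper's exact construction: the class name, the `seed` parameter and the specific hash family (a linear map modulo the prime 2^61 - 1, applied to Python's built-in `hash`) are choices made here for brevity. The structure keeps `depth` rows of `width` counters, processes the stream in one pass, and answers frequency queries with the minimum counter over the rows, so for non-negative updates it never underestimates a count.

```python
import random


class CountMinSketch:
    """Approximate frequency counts in space width * depth."""

    def __init__(self, width, depth, seed=0):
        self.width = width
        self.depth = depth
        self.prime = (1 << 61) - 1  # Mersenne prime used for modular hashing
        rng = random.Random(seed)
        # One (a, b) pair per row; an assumed pairwise-independent-style family.
        self.params = [
            (rng.randrange(1, self.prime), rng.randrange(self.prime))
            for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.params[row]
        return ((a * hash(item) + b) % self.prime) % self.width

    def update(self, item, count=1):
        """Process one stream update: add `count` to one counter per row."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        """Return an upper bound on the item's count (for counts >= 0)."""
        return min(
            self.table[row][self._bucket(row, item)]
            for row in range(self.depth)
        )


sketch = CountMinSketch(width=64, depth=4)
for x in [1, 2, 1, 3, 1, 2]:
    sketch.update(x)
print(sketch.estimate(1))  # at least 3, the true count of item 1
```

In the analysis by Cormode and Muthukrishnan, choosing `width` around e/ε and `depth` around ln(1/δ) bounds the overestimate by ε times the total count with probability at least 1 - δ; the small values above are only for demonstration. Integer items are used because Python randomizes string hashes between runs.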

Some papers suggested for reading:

Graham Cormode, S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004, Imre Simon Test-of-Time Paper Award.

Edo Liberty: Simple and deterministic matrix sketching. KDD 2013, Best Paper Award.

Daniel M. Kane, Jelani Nelson, David P. Woodruff: An optimal algorithm for the distinct elements problem. PODS 2010, Best Paper Award.

Madhav Jha, C. Seshadhri, Ali Pinar: A space efficient streaming algorithm for triangle counting using the birthday paradox. KDD 2013, Best Student Paper Award.

Atish Das Sarma, Sreenivas Gollapudi, Rina Panigrahy: Estimating PageRank on graph streams. PODS 2008, Best Paper Award.

Schedule