With all the hype and buzz surrounding NoSQL, I decided to have a look at it. I quickly found that there is not one NoSQL I could learn. Rather, there are various different solutions with different purposes and trade offs. Those various solutions tend to have one thing in common: processing of data in NoSQL storage is usually done using MapReduce framework.

Search on MapReduce found various scattered blog posts, some universities courses pages and one book that seems to contain almost everything other sources did.

This post contains MapReduce questions and answers based on the book. Basically, if I would be a student, this is what I would have made as a test preparation notes. If I would be a teacher, this is what I would ask on the exam.

First chapter gives credit where the credit is due, the rest contains questions. Last chapter contains hands-on coding exercises.

The Book

The book is named Data-Intensive Text Processing with MapReduce. If you are unsure whether you want to buy it or not, pre-production manuscript is available for free.

Do not be fooled by the title. The book is more about MapReduce itself and less about text processing. First half of the book describes general purpose tricks (design patterns) useful for any task. Second half contains a chapters on text processing, graph algorithms and expectation maximization.

The book contains almost everything I found on various blogs, university courses pages and much more.

Questions

Questions are split by book chapters. With one minor exception (counters), questions are mostly based on the first half of the book. Second half contains concrete algorithms and I wanted to focus on general purpose knowledge.

It does not mean that learning them is not useful. Especially Graph Algorithms chapter contains easily generalizable ideas.

2 MapReduce Basics

2.2 Mappers and Reducers

Describe general MapReduce algorithm. Split it into phases. For each phase include:

who is responsible (framework/programmer/customizable),

what it does,

phase input,

phase output.

The answer is:

MapReduce has four phases:

map,

combine,

shuttle and sort,

reduce.

Map phase is done by mappers. Mappers run on unsorted input key/values pairs. The same physical nodes that keeps input data run also mappers. Each mapper emits zero, one or multiple output key/value pairs for each input key/value pair. Output key/value pairs are called intermediate key/value pairs. Their type is usually different from input key/value pair type. Mapper must be supplied by programmers.

Combine phase is done by combiners. Combiner should combine key/value pairs with the same key together. Each combiner may run zero, once or multiple times. Framework decides whether and how many times to run the combiner, programmer has no control over it. Combiners output key/value pair type must be the same as its input key/value pair types.

Shuttle and sort phase is done by framework. Data from all mappers are grouped by the key, split among reducers and sorted by the key. Each reducer obtains all values associated with the same key. Programmer may supply custom compare function for sorting and partitioner for data split. All key/value pairs going to the same reducer are sorted by the key, but there is no global sorting.

Reducer obtains sorted key/[values list] pairs sorted by the key. Values list contains all values with the same key produced by mappers. Each reducer emits zero, one or multiple output key/value pairs for each input key/value pair. Output key/value pair type is usually different from input key/value pair type. Reducer must be supplied by programmers.

If the algorithm requires multiple MapReduce iterations, each combiner may increment global counter. Driver program would read the counter after the reduce phase. It then decides whether next iteration is needed or not.

Note: chapter 2 does not mention counters. They are explained later, in the chapter 5. Decide if the statement is true or false: All MapReduce implementations implement exactly same algorithm.

Decide if the statement is true or false: All MapReduse implementations implement exactly same algorithm.

The answer is:

False. For example, Google’s implementation does not allow change of key in the reducer, but provides sorting for values. Hadoop does not provide values sorting, but reducer can change the key.

True or false: Each mapper must generate the same number of key/value pairs as its input had.

The answer is:

False. Mapper may generate any number of key/value pairs (including zero).

True or false: Mappers input key/value pairs are sorted by the key.

The answer is:

False. Mapper’s input is not sorted in any way.

True or false: Mappers output key/value must be of the same type as its input.

The answer is:

False. Mapper may produce key/value pairs of any type.

True or false: Reducer is applied to all values associated with the same key.

The answer is:

True. Reducer is applied to all values associated with the same key.

True or false: Reducers input key/value pairs are sorted by the key.

The answer is:

True. Reducers input key/value pairs are sorted by the key.

implementation.

True or false: Each reducer must generate the same number of key/value pairs as its input had.

The answer is:

False. Reducer may generate any number of key/value pairs (including zero).

True or false: Reducers output key/value pair must be of the same type as its input.

The answer is:

False. The statement is false in Hadoop and true in Google’s implementation.

2.3 The Execution Framework

What happens in case of hardware/software failure?

The answer is:

MapReduce framework must be able to recover from both hardware (disk failures, RAM errors) and software (bugs, unexpected exceptions) errors. Both are common and expected.

Is it possible to start reducers while some mappers still run? Why?

The answer is:

No. Reducer’s input is grouped by the key. The last mapper could theoretically produce key already consumed by running reducer.

Define a straggler.

The answer is:

Straggler is either map or reduce task that takes unusually long time to complete.

What is speculative execution (also called backup tasks)? What problem does it solve?

The answer is:

Identical copy of the same task is executed on multiple nodes. Output of the fastest task used.

Speculative execution helps if the task is slow because of hardware problem. It does not help if the distribution of values over keys is skewed.

2.4 Partitioners and Combiners

What does partitioner do?

The answer is:

Partitioner divides key/values pairs produced by map tasks between reducers.

What does combiner do?

The answer is:

Combiner does local aggregation of key/values pairs produced by mapper before or during shuttle and sort phase. In general, it reduces amount of data to be transferred between nodes.

The framework decides how many times to run it. Combiner may run zero, one or multiple times on the same input.

Decide if the statement is true or false: Each combiner runs exactly once.

The answer is:

False. The framework decides whether combiner runs zero, once or multiple times.

2.5 The Distributed File System

Briefly describe HDFS architecture.

The answer is:

HDFS has one namenode and a lot of datanodes. Namenode is master and coordinates file operations, ensures integrity of the system and keeps namespace (metadata, directory structure, file to block mapping etc.).

Data are stored in big blocks on data nodes. Each block is stored on multiple, by default three, data nodes. Name node checks whether data nodes work correctly and manages data replication.

Client contacts name node which answers with data block id and data node id. Data node then sends data directly to the client.

Decide if the statement is true or false: HDFS is good at managing large number of small files.

The answer is:

False. HDFS is good at managing large files.

2.6 Hadoop Cluster Architecture

Explain difference between jobtracker and tasktracker?

The answer is:

Client executes jobs on jobtracker. Jobtracker runs on the master. Jobtracker monitors MapReduce jobs. It also coordinates mappers and reducers.

Tasktracker runs both user code and datanode daemon on slave nodes. It is never contacted by the client.

Explain mapper lifecycle.

The answer is:

Initialization method is called before any other method is called. It has no parameters and no output.

Map method is called separately for each key/value pair. It process input key/value pairs and emits intermediate key/value pairs.

Close method runs after all input key/value have been processed. The method should close all open resources. It may also emit key/value pairs.

Explain reducer lifecycle.

The answer is:

Initialization method is called before any other method is called. It has no parameters and no output.

Reduce method is called separately for each key/[values list] pair. It process intermediate key/value pairs and emits final key/value pairs. Its input is a key and iterator over all intermediate values associated with the same key.

Close method runs after all input key/value have been processed. The method should close all open resources. It may also emit key/value pairs.

3 MapReduce Algorithm Design

3.1 Local Aggregation

What is local aggregation and why is it used?

The answer is:

Either combiner or a mapper combines key/value pairs with the same key together. They may do also some additional preprocessing of combined values. Only key/value pairs produced by the same mapper are combined.

Key/Value pairs created by map tasks are transferred between nodes during shuffle and sort phase. Local aggregation reduces amount of data to be transferred.

If the distribution of values over keys is skewed, data preprocessing in combiner helps to eliminate reduce stragglers.

What is in-mapper combining? State advantages and disadvantages over writing custom combiner.

The answer is:

Local aggregation (combining of key/value pairs) done inside the mapper.

Map method does not emit key/value pairs, it only updates internal data structure. Close method combines and preprocess all stored data and emits final key/value pairs. Internal data structure is initialized in init method.

Advantages:

It will run exactly once. Combiner may run multiple times or not at all.

We are sure it will run during map phase. Combiner may run either after map phase or before reduce phase. The latter case provides no reduction in transferred data.

In-mapper combining is typically more effective. Combiner does not reduce amount of data produced by mappers, it only groups generated data together. That causes unnecessary object creation, destruction, serialization and deserialization.

Disadvantages:

Scalability bottleneck: the technique depends on having enough memory to store all partial results. We have to flush partial results regularly to avoid it. Combiner use produce no scalability bottleneck.

3.2 Pairs and Stripes

Explain Pair design patter on a co-occurence example. Include advantages/disadvantages against Stripes approach, possible optimizations and their efficacy.

The answer is:

Mapper generates keys composed from pairs of words that occurred together. The value contains the number 1. Framework groups key/value pairs with the same work pairs together and reducer simply counts the number values for each incoming key/value pairs.

Each final pair encodes a cell in co-occurrence matrix. Local aggregation, e.g. combiner or in-mapper combining, can be used.

Advantages:

Simple values, less serialization/deserialization overhead.

Simpler memory management. No scalability bottleneck (only if in-mapper optimization would be used).

Disadvantages:

Huge amount of intermediate key/value pairs. Shuffle and sort phase is slower.

Local aggregation is less effective – too many distinct keys.

Explain Stripes design patter on a co-occurence example. Include advantages/disadvantages against Pairs approach, possible optimizations and their efficacy.

The answer is:

Mapper generates a distinct key from each encountered word. Associated value contains a map of all co-occurred words as map keys and number of co-occurrences as map values. Framework groups same words together and reducer merges value maps.

Each final pair encodes a row in co-occurrence matrix. Combiner or in-mapper combining can be used.

Advantages:

Small amount of intermediate key/value pairs. Shuffle and sort phase is faster.

Intermediate keys are smaller.

Effective local aggregation – smaller number of distinct keys.

Disadvantages:

Complex values, more serialization/deserialization overhead.

More complex memory management. As value maps may grow too big, the approach has potential for scalability bottleneck.

Explain scalability bottleneck caused by stripes approach.

The answer is:

Stripes solution keeps a map of co-occurred words in memory. As the amount of co-occurred words is unlimited, the map size is unlimited too. Huge map does not fit into the memory and causes paging or out of memory errors.

3.3 Computing Relative Frequencies

Relative frequencies of co-occurrences problem:

Input: text documents

key: document id

value: text document

Output: key/value pairs where

key: pair(word1, word2)

value: #co-occurrences(word1, word2)/#co-occurrences(word1, any word)

Fix following solution to relative frequencies of co-occurrences problem:

class MAPPER method INITIALIZE H = new hash map method MAP(docid a, doc d) for all term w in doc d do for all term u patri neighbors(w) do H(w) = H(w) + 1 emit(pair(u, w), count 1) method CLOSE for all term w in H emit(pair(w, *), H(w)) class REDUCER variable total_occurrences = 0 method REDUCE(pair (p, u), counts[c1, c2, ..., cn]) s = 0 for all c in counts[c1, c2, ..., cn] do s = s + c if u = * total_occurrences = s else emit(pair p, s/total_occurrences) class SORTING_COMPARATOR method compare(key (p1, u1), key (p2, u2)) if p1 = p2 AND u1 = * return key1 is lower if p1 = p2 AND u2 = * return key2 is lower return compare(p1, p2)

The answer is:

Partitioner is missing, framework could send key/value pairs with totals to different reducer than key/pairs with word pairs.

class PARTITIONING_COMPARATOR method compare(key (p1, u1), key (p2, u2)) if p1 = p2 return keys are equal return keys are different

Describe order inversion design pattern.

The answer is:

Order inversion is used if the algorithm requires two passes through mapper generated key/value pairs with the same key. The first pass generates some overall statistic which is then applied to data during the second pass. The reducer would need to buffer data in the memory just to be able to pass twice through them.

First pass result is calculated by mappers and stored in some internal data structure. The mapper emits the result in closing method, after all usual intermediate key/value pairs.

The pattern requires custom partitioning and sort. First pass result must come to the reducer before usual key/value pairs. Of course, it must come to the same reducer.

3.4 Secondary Sorting

Describe value-to-key design pattern.

The answer is:

Hadoop implementation does not provide sorting for grouped values in reducers input. Value-to-key is used as a workaround.

Part of the value is added to the key. Custom sort then sorts primary by the key and secondary by the added value. Custom partitioner must move all data with the same original key to the same reducer.

3.5 Relational Joins

Describe reduce side join between tables with one-on-one relationship.

The answer is:

Mapper produces key/value pairs with join ids as keys and row values as value. Corresponding rows from both tables are grouped together by the framework during shuffle and sort phase.

Reduce method in reducer obtains join id and two values, each represents row from one table. Reducer joins the data.

Describe reduce side join between tables with one-to-many relationship.

The answer is:

We assume that the join key is primary key in table called S. Second table is called T. In other words, the table S in on the ‘one’ side of the relationship and the table T is on the ‘many’ side of the relationship.

We have to implement mapper, custom sorter, partitioner and reducer.

Mapper produces key composed from join id and table flag. Partitioner splits the data in such a way, that all key/value pairs with the same join id goes to the same reducer. Custom sort puts key/value pair generated from the table S right before key/value pair with the same join id from the table T.

Reducers input looks like this:

((JoinId1, s)-> row)

((JoinId1, t)-> [rows])

((JoinId2, s)-> row)

((JoinId2, t)-> [rows])

...

((JoinIdn, s), row)

((JoinIdn, t), [rows])



The reducer joins all rows from s pair with all rows from following t pair.

Describe reduce side join between tables with many-to-many relationship.

The answer is:

We assume that data are stored in tables called S and T. The table S is smaller. We have to implement mapper, custom sorter, partitioner and reducer.

Mapper produces key composed from join id and table flag. Partitioner splits the data in such a way, that all key/value pairs with the same join id goes to the same reducer. Custom sort puts the key/value pairs generated from the table S is right before all key/value pair with the data from the table T.

Reducers input looks like this:

((JoinId1, s)-> [rows])

((JoinId1, t)-> [rows])

((JoinId2, s)-> [rows])

((JoinId2, t)-> [rows])

...

((JoinIdn, s), [rows])

((JoinIdn, t), [rows])



The reducer buffers all rows with the same JoinId from the table S into the memory and joins them with following T table rows.

All data from the smaller table must fit into the memory – the algorithm has scalability bottleneck problem.

Describe map side join between two database tables.

The answer is:

Map side join works only if following assumptions hold:

both datasets are sorted by the join key,

both datasets are partitioned the same way.

Mapper maps over larger dataset and reads corresponding part of smaller dataset inside the mapper. As the smaller set is partitioned the same way as bigger one, only one map task access the same data. As the data are sorted by the join key, we can perform merge join O(n) .

Describe memory backed join.

The answer is:

Smaller set of data is loaded into the memory in every mapper. Mappers loop over larger dataset and joins it with data in the memory. If the smaller set is too big to fit into the memory, dataset is loaded into memcached or some other caching solution.

Which one is faster? Map side join or reduce side join?

The answer is:

Map side join is faster.

Go to Part 2

Reference: MapReduce Questions and Answers from our JCG partner Maria Jurcovicova at the This is Stuff blog.