Big Data is an extremely broad domain, typically addressed by a hybrid team of data scientists, data analysts, software engineers, and statisticians. Finding a single individual knowledgeable across the entire breadth of this domain (as opposed to, say, a Microsoft Azure expert) is therefore extremely unlikely. Rather, one will most likely be searching for multiple individuals with specific sub-areas of expertise. This guide is therefore divided at a high level into two sections:

Big Data Algorithms, Techniques, and Approaches: Geared primarily toward data scientists and statisticians.

Big Data Technologies: Geared primarily toward software engineers.

This guide highlights questions related to key concepts, paradigms, and technologies in which a big data expert can be expected to have proficiency. Bear in mind, though, that not every “A” candidate will be able to answer them all, nor does answering them all guarantee an “A” candidate. Ultimately, effective interviewing and hiring is as much of an art as it is a science.

Big Data Algorithms, Techniques, and Approaches

When it comes to big data and data management, fundamental knowledge of and work experience with relevant algorithms, techniques, big data platforms, and approaches is essential. Generally speaking, mastering these areas requires more time and skill than becoming an expert with a specific set of software languages or tools. As such, software engineers who do have expertise in these areas are both hard to find and extremely valuable to your team. The questions that follow can be helpful in gauging such expertise.

Q: Given a stream of data of unknown length, and a requirement to create a sample of a fixed size, how might you perform a simple random sample across the entire dataset? (i.e., given N elements in a data stream, how can you produce a sample of k elements, where N > k, whereby every element has a k/N chance of being included in the sample?)

One of the effective algorithms for addressing this is known as Reservoir Sampling.

The basic procedure is as follows:

Create an array of size k.

Fill the array with the first k elements from the stream.

For each subsequent element E (with zero-based index i) read from the stream, generate a random integer j between 0 and i, inclusive.

If j is less than k, replace the jth element in the array with E.

This approach gives each element in the stream the same probability, k/N, of appearing in the output sample.
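The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):  # i is the zero-based index of the item
        if i < k:
            # Fill the reservoir with the first k elements.
            reservoir.append(item)
        else:
            # Pick j uniformly from 0..i (inclusive); keep the item only if j < k.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

The stream is consumed in a single pass and only k items are held in memory at any time, which is what makes the technique suitable for data of unknown length.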

Q: Describe and compare some of the more common algorithms and techniques for cluster analysis.

Cluster analysis is a common unsupervised learning technique used in many fields. It has a huge range of applications both in science and in business. A few examples include:

Bioinformatics: Organizing genes into clusters by analyzing similarity of gene expression patterns.

Marketing: Discovering distinct groups of customers and then using this knowledge to structure a campaign that targets the right marketing segments.

Insurance: Identifying categories of insurance holders that have a high average claim cost.

Clustering algorithms can be logically categorized based on their underlying cluster model as summarized in the table below.

CLUSTERING ALGORITHMS

Connectivity-based (a.k.a. hierarchical) clustering

Description: Based on the core idea of objects that are "close" to one another being more related, these algorithms connect objects to form clusters based on distance. Employs a distance algorithm (such as Levenshtein distance in the case of string comparison) to determine "nearness" of objects. Linkage criteria can be based on minimum distance (single linkage), maximum distance (complete linkage), average distance, centroid distance, or any other algorithm of arbitrary complexity. Clustering can be agglomerative (starting with single elements, aggregating them into clusters) or divisive (starting with the complete dataset, dividing it into partitions).

Advantages: Does not require pre-specifying the number of clusters. Can be useful for proof-of-concept or preliminary analyses.

Disadvantages: Produces a hierarchy from which the user still needs to choose appropriate clusters. Complexity generally makes them too slow for large datasets. "Chaining phenomenon", whereby outliers either show up as additional clusters or cause other clusters to merge erroneously.

Centroid-based (a.k.a. K-means) clustering

Description: Clusters are represented by a central vector, which is not necessarily a member of the set. When the number of clusters is fixed to K, K-means clustering gives a formal definition as an optimization problem: find the K cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.

Advantages: With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small). K-Means may produce tighter clusters than hierarchical clustering, especially if clusters are globular.

Disadvantages: Requires the number of clusters (K) to be specified in advance. Prefers clusters of approximately similar size, which often leads to incorrectly set borders between clusters. Unable to represent density-based clusters.

Distribution-based clustering

Description: Based on distribution models, clusters objects that appear to belong to the same distribution. Closely resembles the way artificial datasets are generated (i.e., by sampling random objects from a distribution).

Advantages: Produces complex models for clusters that can also capture correlation and dependence of attributes.

Disadvantages: Can suffer from overfitting (i.e., describing random error or noise instead of the underlying relationships). Requires selection of appropriate data models to optimize (which can be quite challenging for many real-world datasets).

Density-based clustering

Description: Clusters are defined as areas of higher density than the remainder of the dataset. Objects in sparse areas are usually considered to be noise and/or border points. Connects points based on distance thresholds (similar to linkage-based clustering), but only connects those that satisfy a specified density criterion. The most popular density-based clustering method is DBSCAN, which features a well-defined cluster model called "density-reachability".

Advantages: Doesn't require specifying the number of clusters a priori. Can find arbitrarily-shaped clusters; can even find a cluster completely surrounded by (but not connected to) a different cluster. Mostly insensitive to the ordering of the points in the database.

Disadvantages: Expects a density "drop" or "cliff" to detect cluster borders. DBSCAN is unable to detect intrinsic cluster structures that are prevalent in much of real-world data. On datasets consisting of mixtures of Gaussians, almost always outperformed by methods such as EM clustering that are able to precisely model such data.
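To make the centroid-based optimization problem more concrete, here is a minimal sketch of the standard iterative heuristic for K-means (commonly known as Lloyd's algorithm). The function names, the fixed `seed`, and the choice of initial centers (K points drawn at random from the data) are assumptions made for illustration; practical implementations typically use more careful initialization.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: Sequence[Point]) -> Point:
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points: Sequence[Point], k: int, iterations: int = 100,
            seed: int = 0) -> Tuple[List[Point], List[int]]:
    """Alternate between assigning points to the nearest center and recomputing centers."""
    rng = random.Random(seed)
    centers = rng.sample(list(points), k)  # assumption: random data points as initial centers
    assignment: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = centroid(members)
    return centers, assignment
```

Note that K must be supplied by the caller, which is exactly the "number of clusters specified in advance" limitation listed for this category.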

Q: Define and discuss ACID, BASE, and the CAP theorem.

ACID refers to the following set of properties that collectively guarantee reliable processing of database transactions, with the goal being “immediate consistency”:

Atomicity. Requires each transaction to be “all or nothing”; i.e., if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged.

Consistency. Requires every transaction to bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including (but not limited to) constraints, cascades, triggers, and any combination thereof.

Isolation. Requires concurrent execution of transactions to yield a system state identical to that which would be obtained if those same transactions were executed sequentially.

Durability. Requires that, once a transaction has been committed, it will remain so even in the event of power loss, crashes, or errors.

Many databases rely upon locking to provide ACID capabilities. Locking means that the transaction marks the data that it accesses so that the DBMS knows not to allow other transactions to modify it until the first transaction succeeds or fails. An alternative to locking is multiversion concurrency control in which the database provides each reading transaction the prior, unmodified version of data that is being modified by another active transaction.

Guaranteeing ACID properties in a distributed transaction across a distributed database where no single node is responsible for all data affecting a transaction presents additional complications. Network connections might fail, or one node might successfully complete its part of the transaction and then be required to roll back its changes, because of a failure on another node. The two-phase commit protocol (not to be confused with two-phase locking) provides atomicity for distributed transactions to ensure that each participant in the transaction agrees on whether the transaction should be committed or not.
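The voting and decision phases of two-phase commit can be sketched as follows. This is a deliberately simplified, in-memory illustration: the `Participant` class and its `can_commit` flag are hypothetical, and a real implementation would also need durable logging, timeouts, and recovery after coordinator failure.

```python
from typing import Iterable


class Participant:
    """Hypothetical in-memory participant; a real one would persist its vote durably."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local part of the transaction can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: Iterable[Participant]) -> bool:
    """Coordinator logic: commit everywhere only if every participant votes yes."""
    participants = list(participants)
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): a single "no" vote rolls back all participants.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

A participant that has already prepared successfully is still rolled back when another participant votes no, which is the situation the paragraph above describes.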

In contrast to ACID (and its immediate-consistency-centric approach), BASE (Basically Available, Soft State, Eventual Consistency) favors availability over consistency of operations.

BASE was developed as an alternative for producing more scalable and affordable data architectures. Allowing less constantly updated data gives developers the freedom to build other efficiencies into the overall system. In BASE, engineers embrace the idea that data has the flexibility to be “eventually” updated, resolved or made consistent, rather than instantly resolved.

The Eventual Consistency model employed by BASE informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is sometimes criticized as increasing the complexity of distributed software applications.

In nearly all models, elements like consistency and availability often are viewed as resource competitors, where adjusting one can impact another. Accordingly, the CAP theorem (a.k.a. Brewer’s theorem) states that it’s impossible for a distributed computer system to provide more than two of the following three guarantees concurrently:

Consistency (all nodes see the same data at the same time)

Availability (a guarantee that every request receives a response about whether it was successful or failed)

Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

In the decade since its introduction, designers and researchers have used the CAP theorem as a reason to explore a wide variety of novel distributed systems. The CAP Theorem has therefore certainly proven useful, fostering much discussion, debate, and creative approaches to addressing tradeoffs, some of which have even yielded new systems and technologies.

Yet at the same time, the “2 out of 3” constraint does somewhat oversimplify the tensions between the three properties. By explicitly handling partitions, for example, designers can optimize consistency and availability, thereby achieving some trade-off of all three. Although designers do need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them.

Aspects of the CAP theorem are often misunderstood, particularly the scope of availability and consistency, which can lead to undesirable results. If users cannot reach the service at all, there is no choice between C and A except when part of the service runs on the client. This exception, commonly known as disconnected operation or offline mode, is becoming increasingly important. Some HTML5 features make disconnected operation easier going forward. These systems normally choose A over C and thus must recover from long partitions.

Q: What is dimensionality reduction and how is it relevant to processing big data? Name some techniques commonly employed for dimensionality reduction.

Dimensionality reduction is the process of converting data of very high dimensionality into data of lower dimensionality, typically for purposes such as visualization (i.e., projection onto a 2D or 3D space), compression (for efficient storage and retrieval), or noise removal.

Some of the more common techniques for dimensionality reduction include:

Note: Each of the techniques listed above is itself a complex topic, so each is provided as a hyperlink to further information for those interested in learning more.
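As one illustration of the idea, the sketch below assumes principal component analysis (PCA), a widely used technique, and reduces each row to a single coordinate along the direction of greatest variance. The function names are hypothetical, and the power-iteration approach with a fixed all-ones starting vector is an assumption chosen to keep the example dependency-free; numerical libraries would normally be used instead.

```python
import math
from typing import List, Sequence, Tuple


def first_principal_component(rows: Sequence[Sequence[float]],
                              iterations: int = 100) -> Tuple[List[float], List[float]]:
    """Estimate column means and the top principal direction via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    covariance = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
                  for i in range(d)]
    # Assumption: the all-ones start vector is not orthogonal to the top direction.
    direction = [1.0] * d
    for _ in range(iterations):
        product = [sum(covariance[i][j] * direction[j] for j in range(d))
                   for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # degenerate data with no variance along the current direction
        direction = [x / norm for x in product]
    return means, direction


def project(rows: Sequence[Sequence[float]], means: Sequence[float],
            direction: Sequence[float]) -> List[float]:
    """Map each d-dimensional row to one coordinate along the principal direction."""
    return [sum((row[j] - means[j]) * direction[j] for j in range(len(direction)))
            for row in rows]
```

For points lying on a straight line, the single projected coordinate preserves their ordering and spacing, so the second dimension carries no additional information.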

Q: Discuss some common statistical sampling techniques, including their strengths and weaknesses.

When analyzing big data, processing the entire dataset would often be operationally untenable. Exploring a representative sample is easier, more efficient, and can in many cases be nearly as accurate as exploring the entire dataset. The table below describes some of the statistical sampling techniques that are more commonly used with big data.

COMMON SAMPLING TECHNIQUES

Simple random sampling

Description: Every element has the same chance of selection (as does any pair of elements, triple of elements, etc.).

Advantages: Minimizes bias. Simplifies analysis of results.

Disadvantages: Vulnerable to sampling error.

Systematic sampling

Description: Orders data and selects elements at regular intervals through the ordered dataset.

Advantages: Easy to implement. Can be efficient (depending on ordering scheme).

Disadvantages: Vulnerable to periodicities in the ordered data. Theoretical properties make it difficult to quantify accuracy.

Stratified sampling

Description: Divides data into separate strata (i.e., categories) and then samples each stratum separately.

Advantages: Able to draw inferences about specific subgroups. Focuses on important subgroups; ignores irrelevant ones. Improves accuracy/efficiency of estimation. Different sampling techniques can be applied to different subgroups.

Disadvantages: Can increase complexity of sample selection. Selection of stratification variables can be difficult. Not useful when there are no homogeneous subgroups. Can sometimes require a larger sample than other methods.
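Systematic and stratified sampling are simple enough to sketch directly. The helper names and parameters below (`step`, `key`, `fraction`) are assumptions for illustration, and the stratified version samples every stratum with plain simple random sampling at the same rate, which is only one of several possible allocation schemes.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def systematic_sample(items: Sequence[T], step: int, rng: random.Random) -> List[T]:
    """Pick a random starting offset, then take every step-th element."""
    start = rng.randrange(step)
    return list(items[start::step])


def stratified_sample(items: Sequence[T], key: Callable[[T], Hashable],
                      fraction: float, rng: random.Random) -> List[T]:
    """Group items into strata by key, then sample the same fraction from each stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample: List[T] = []
    for members in strata.values():
        # Take at least one element so that small strata are still represented.
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample
```

If the input happens to repeat with a period equal to `step`, the systematic sample will see only one phase of that pattern, which is the vulnerability to periodicities noted above.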

Q: Explain the term “cache oblivious”. Discuss some of its advantages and disadvantages in the context of processing big data.

A cache oblivious (a.k.a., cache transcendent) algorithm is designed to take advantage of a CPU cache without knowing its size. Its goal is to perform well – without modification or tuning – on machines with different cache sizes, or for a memory hierarchy whose levels are of different cache sizes.

Typically, a cache oblivious algorithm employs a recursive divide and conquer approach, whereby the problem is divided into smaller and smaller sub-problems, until a sub-problem size is reached that fits into the available cache. For example, an optimal cache oblivious matrix multiplication is obtained by recursively dividing each matrix into four sub-matrices to be multiplied.
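The recursive structure of such a matrix multiplication can be sketched as follows. This is only an illustration of the divide-and-conquer pattern, under the assumption of square matrices whose size is a power of two; Python lists of lists are not laid out contiguously in memory, so the sketch will not exhibit the cache behavior itself. The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def _multiply_add(a: Matrix, b: Matrix, c: Matrix,
                  ar: int, ac: int, br: int, bc: int, cr: int, cc: int, n: int) -> None:
    """Accumulate the product of an n x n block of a and an n x n block of b into c."""
    if n == 1:
        c[cr][cc] += a[ar][ac] * b[br][bc]
        return
    h = n // 2
    # Split each block into four quadrants and recurse on the eight sub-products.
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                _multiply_add(a, b, c, ar + i, ac + k, br + k, bc + j, cr + i, cc + j, h)


def recursive_matmul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes a square matrix whose size is a power of two")
    c = [[0] * n for _ in range(n)]
    _multiply_add(a, b, c, 0, 0, 0, 0, 0, 0, n)
    return c
```

No cache size appears anywhere in the code: the recursion simply keeps halving the blocks, so at some depth the working set fits in whatever cache the machine has.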

In tuning for a specific machine, one may use a hybrid algorithm which uses blocking tuned for the specific cache sizes at the bottom level, but otherwise uses the cache-oblivious algorithm.

The ability to perform well, independent of cache size and without cache-size-specific tuning, is the primary advantage of the cache oblivious approach. However, it is important to acknowledge that this lack of any cache-size-specific tuning also means that a cache oblivious algorithm may not perform as well as a cache-aware algorithm (i.e., an algorithm tuned to a specific cache size). Another disadvantage of the cache oblivious approach is that it typically increases the memory footprint of the data structure, which may further degrade performance. Interestingly though, in practice, the performance of cache oblivious algorithms is often surprisingly comparable to that of cache-aware algorithms, making them that much more interesting and relevant for big data processing.

Big Data Technologies

These days, it’s not only about finding a single tool to get the job done; rather, it’s about building a scalable architecture to effectively collect, process, and query enormous volumes of data. Armed with a strong foundational knowledge of big data algorithms, techniques, and approaches, a big data expert will be able to employ tools from a growing landscape of technologies that can be used to exploit big data to extract actionable information. The questions that follow can help evaluate this dimension of a candidate’s expertise.

Q: What is Hadoop? What are its key features? What modules does it consist of?

Apache Hadoop is a software framework that allows for the distributed processing of large datasets across clusters of computers. It is designed to effectively scale from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer.

The project includes these modules:

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce: A system for parallel processing of large datasets.

Hadoop YARN: A framework for job scheduling and cluster resource management.

Hadoop Common: The common utilities that support the other Hadoop modules.

Q: Provide an overview of HDFS, including a description of an HDFS cluster and its components.

HDFS is a highly fault-tolerant distributed filesystem, commonly used as a source and output of data in Hadoop jobs. It builds on a simple coherence model of write-once-read-many (with append possible) access. It supports a traditional hierarchical organization of directories and files.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

An HDFS cluster also contains what is referred to as a Secondary NameNode, which periodically downloads the current NameNode image and edit log files, merges them into a new image, and uploads the new image back to the NameNode. (Despite its name, though, the Secondary NameNode does not serve as a backup to the primary NameNode in case of failure.)

It’s important to note that HDFS is meant to handle large files, with the default block size being 128 MB. This means that, for example, a 1 GB file will be split into just 8 blocks. On the other hand, a file that’s just 1 KB still occupies a block of its own: it consumes only about 1 KB of disk space on the DataNodes, but the NameNode must track it like any other block. Selecting a block size that will optimize performance for a cluster can be a challenge and there is no “one-size-fits-all” answer in this regard. Setting block size to too small a value might increase network traffic and put huge overhead on the NameNode, which processes each request and locates each block. Setting it to too large a value, on the other hand, might limit parallelism, since fewer blocks means fewer units of work to spread across the cluster. Since the focus of HDFS is on large files, one strategy could be to combine small files into larger ones, if possible.
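The block-count arithmetic is easy to check. The short sketch below (with a hypothetical `block_count` helper) compares how many blocks the NameNode has to track for one 1 GB file versus the same amount of data stored as 1 KB files, assuming the 128 MB default block size mentioned above.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
DEFAULT_BLOCK_SIZE = 128 * MB


def block_count(file_size: int, block_size: int = DEFAULT_BLOCK_SIZE) -> int:
    """Number of blocks needed for a file of file_size bytes (ceiling division)."""
    return -(-file_size // block_size)


# One large file versus the same volume of data split into tiny files.
one_large_file = block_count(1 * GB)
many_small_files = (1 * GB // KB) * block_count(1 * KB)
```

Here `one_large_file` is 8, while `many_small_files` is 1,048,576: the same gigabyte of data requires over a million block entries when stored as 1 KB files, which is why combining small files is a common strategy.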

Q: Provide an overview of MapReduce, including a discussion of its key components, features, and benefits.

MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary (i.e., data reduction) operation. A MapReduce system orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

MapReduce processes parallelizable problems across huge datasets using a large number of nodes in a cluster or grid as follows:

Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

Reduce step: The master node collects the answers from its worker nodes and combines them to form its output (i.e., the answer to the problem it was tasked with solving).

MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel (though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source). Similarly, a set of ‘reducers’ can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than high performance servers can usually handle (e.g., a large server farm of “commodity” machines can use MapReduce to sort a petabyte of data in only a few hours).
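The classic word-count example shows how the map and reduce operations fit together, including the requirement that all values sharing a key reach the same reducer. The sketch below runs in a single process and uses hypothetical function names; it illustrates the programming model only, not the distributed execution.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share a key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: summarize all values for one key."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

Each call to `map_words` depends only on its own line, and each call to `reduce_counts` depends only on its own key, which is what allows both phases to be spread across many machines.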

The parallelism in MapReduce also helps facilitate recovery from a partial failure of servers or storage during the operation (i.e., if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available). This is particularly important, since failures of single nodes are fairly common in multi-machine clusters. When there are thousands of disks spinning for a couple of hours, it’s not unlikely that one or two of them will fail at some point.

The key contributions of the MapReduce framework are not the map and reduce functions per se, but the scalability and fault-tolerance achieved for a variety of applications. It should be noted that a single-threaded implementation of MapReduce will usually not be faster than a traditional implementation. Its benefits are typically only realized when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerant features of the framework come into play.

As a side note, the name MapReduce originally referred to a proprietary Google technology but has since been genericized.

Q: Describe the MapReduce process in detail.

While not all big data software engineers will be familiar with the internals of the MapReduce process, those who are will be that much better equipped to leverage its capabilities and to thereby architect and design an optimal solution.

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. We therefore provide a Hadoop-centric description of the MapReduce process, described here both for Apache Hadoop MapReduce 1 (MR1) as well as for Apache Hadoop MapReduce 2 (MR2, a.k.a. YARN). For simplicity, we use an example that takes input from an HDFS and stores it back to an HDFS. Such a process would operate as follows:

The local Job Client prepares the job for submission and sends it to the Resource Manager (MR2) or Job Tracker (MR1).

The Resource Manager (MR2) / Job Tracker (MR1) schedules the job across the cluster. In the case of MR2, the Application Master is started, which will take care of managing the job and terminating it after completion. The job is sent to Node Managers (MR2) or Task Trackers (MR1).

An InputSplitter is used to distribute the map tasks across the cluster in the most efficient way.

Each of the Node Managers (MR2) / Task Trackers (MR1) spawns a map task, sending progress updates to an Application Master (MR2) / Job Tracker (MR1).

The output is partitioned, according to the reducers it needs to go to. An optional Combiner might be used as well to join values for the same local key.

If a map fails, it is retried (up to a defined maximum number of retries).

After a map succeeds, the local partitions with the output can be copied to respective reduce nodes (depending on the partition).

After a reduce task receives all relevant files, it starts a sort phase, in which it merges the map outputs, maintaining the natural order of keys.

For each of the keys in the sorted output, a reduce function is invoked and the output is stored to the desired sink (in our case, HDFS). Since the node where it’s being executed is usually also a DataNode, the first replica is written to the local disk.

If a reducer fails, it is run again.

After a successful chain of operations, temporary files are removed and the Application Master closes gracefully.
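The partitioning, combining, and merge-sort steps in this flow can be sketched in plain Python. The function names are hypothetical and this is not Hadoop's API; the CRC-based hash is an assumption chosen only because it is stable across processes, and the "combiner" here simply pre-aggregates counts within one map task.

```python
import heapq
import zlib
from collections import Counter
from itertools import groupby
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def partition(key: str, num_reducers: int) -> int:
    """Decide which reducer receives a key; every mapper must make the same choice."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(words: Iterable[str], num_reducers: int) -> List[List[Pair]]:
    """Count words locally (combiner), then split the output into sorted partitions."""
    combined = Counter(words)  # combiner: join values for the same local key
    partitions: List[List[Pair]] = [[] for _ in range(num_reducers)]
    for key, count in combined.items():
        partitions[partition(key, num_reducers)].append((key, count))
    for part in partitions:
        part.sort()  # each partition is handed over in key order
    return partitions


def run_reduce_task(sorted_runs: Iterable[List[Pair]]) -> Dict[str, int]:
    """Merge the sorted map outputs, then invoke the reduce logic once per key."""
    merged = heapq.merge(*sorted_runs)  # preserves the natural order of keys
    return {key: sum(count for _, count in group)
            for key, group in groupby(merged, key=lambda pair: pair[0])}
```

Because the partition function is deterministic, a given key from any map task always lands at the same reducer, so each reducer can produce the final value for its keys without consulting the others.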

Q: What are major types of NoSQL databases commonly used for big data?

In recent years, the growing prevalence and relevance of big data has dramatically increased the need to store and process huge quantities of data. This has caused a shift away from more traditional relational database systems to other approaches and technologies, commonly known as NoSQL databases. NoSQL databases are typically simpler than traditional SQL databases and typically lack ACID transaction capabilities, thereby making them more efficient and scalable.

A taxonomy of such databases can be drawn along a couple of different dimensions. Although the table below groups them by data model, in the context of big data, consistency model would be another pivotal feature to consider when evaluating options for these types of datastores.

NoSQL/BigData datastores

Distributed file system (e.g., HDFS)

Description: One of the simplest approaches for keeping data is just to use a hierarchical filesystem. It's extremely simple, and is also often very efficient.

Advantages: Optimal for storing huge objects. Simple model makes it easy to tune for efficiency. Distributing a file across many nodes (and using many replicas) facilitates scalability and throughput. Can be mounted directly (e.g., using FUSE).

Disadvantages: Only a very basic schema available (directory structure). Not suited for multiple small objects. No built-in indexing capabilities. Usually the whole file must be rewritten if a single change occurs.

Key Value (e.g., Riak, Amazon DynamoDB, Redis)

Description: Represents data as a collection of key-value pairs. More complex data models might be implemented on top of such a structure.

Advantages: Simple concept. Easy to partition data by keys. Very simple schema (value is often just a blob; some datastores, such as Redis, offer more complex value types, including sets or maps). Can be blazing fast, especially with memory-based implementations. Many implementations to choose from.

Disadvantages: Sometimes too simple. Makes grouping common data difficult. Cannot define sophisticated schema on database side.

Wide column (e.g., BigTable, Apache Cassandra, Apache HBase, Apache Accumulo)

Description: Inspired by Google's BigTable publication, stores data in records that are able to store a very large number of dynamic columns. In general, describes data using a somewhat loose schema, defining tables and the main columns (e.g., ColumnFamilies in the case of HBase) and being flexible regarding the data stored inside them, which are key-values.

Advantages: Very fast writes. Easy to scale. Organizes data into tables and columns using a flexible schema. Easy to store history of operations with timestamps.

Disadvantages: Some versions (e.g., HBase) require the whole (usually complex) Hadoop environment as a prerequisite. Often requires effort to tune and optimize performance parameters.

Document-oriented (e.g., MongoDB, CouchDB)

Description: Designed around the concept of a "document"; i.e., a single object encapsulating data in some standard format (such as JSON, XML, etc.). Instead of tables with rows, operates on collections of documents.

Advantages: Schema-free design. Simplicity gives significant gains in scalability and performance. Many real-world objects are in fact documents. Multiple implementations support indexing out-of-the-box.

Disadvantages: Tracking any kind of relation is difficult. No implied schema puts more effort on the client.

Q: What are Pig, Hive, and Impala? What are the differences between them?

Pig, Hive, and Impala are examples of big data frameworks for querying and processing data in the Apache Hadoop ecosystem. While a number of other systems have recently been introduced (notable mentions include Facebook’s Presto or Spark SQL), these are still considered “the big 3” when dealing with big data challenges. The table below provides a basic description and comparison of the three.

Comparative Overview: Hive, Pig, and Impala

Pig: Introduced in 2006 by Yahoo Research. Ad-hoc way of creating and executing MapReduce jobs on very large datasets. PigLatin, procedural, data flow oriented. Runs MapReduce jobs. Batch processing or ETL. Relatively easy to extend with user-defined functions (UDFs). High fault-tolerance of queries.

Hive: Introduced in 2007 by Facebook. Petabyte-scale data warehousing framework. HiveQL, SQL-like syntax, declarative. Runs MapReduce jobs. Batch processing, ad-hoc queries. Relatively harder to extend with UDFs. High fault-tolerance of queries.

Impala: Introduced in 2012 by Cloudera. Massively parallel processing SQL query engine on HDFS. SQL. Custom execution engine; does most of its operation in-memory. Interactive-like. Supports native (C++) and Hive UDFs. If a node fails, the query is aborted.

Q: What are Apache Tez and Apache Spark? What kind of problems do they solve (and how)?

One issue with MapReduce is that it’s not suited for interactive processing. Moreover, stacking one MapReduce job over another (which is a common real-world use case when running, for example, Pig or Hive queries) is typically quite inefficient. After each stage, the intermediate result is stored to HDFS, only to be picked up as an input by another job, and so on. It also requires a lot of shuffling of data during the reduce phases. While sometimes it is indeed necessary, in many cases the actual flow could be optimized, as some of the jobs could be joined together or could use some cache that would reduce the need for joins and in effect increase the speed significantly.

Addressing these issues was one of the important reasons behind creating Apache Tez and Apache Spark. They both generalize the MapReduce paradigm and execute the overall job by first defining the flow using a Directed Acyclic Graph (DAG). At a high level of abstraction, each vertex is a user operation (i.e., performing some form of processing on the input data), while each edge defines the relation with other steps of the overall job (such as grouping, filtering, counting, and so on). Based on the specified DAG, the scheduler can decide which steps can be executed together (and when) and which require pushing the data over the cluster. Additionally, both Tez and Spark offer forms of caching, minimizing the need to push huge datasets between the nodes.

There are some significant differences between the two technologies though. Apache Tez was created more as an additional element of the Hadoop ecosystem, taking advantage of YARN so that it can be easily included in existing MapReduce solutions. It can even be easily run with regular MapReduce jobs. Apache Spark, on the other hand, was built more as a new approach to processing big data. It adds some new concepts, such as RDDs (Resilient Distributed Datasets), provides a well-thought-out idiomatic API for defining the steps of the job (which has many more operations than just map or reduce, such as joins or co-groups), and has multiple cache-related features. The code is lazily evaluated and the Directed Acyclic Graph is created and optimized automatically (in contrast, in the case of Tez, the graph must be defined manually). Taking all of this into account, the code in Spark tends to be very concise. Spark is also a significant part of the Berkeley Data Analytics Stack (BDAS).
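The idea of lazy evaluation over a recorded graph of operations can be illustrated with a toy class. `LazyDataset` below is a hypothetical stand-in and not Spark's actual API; it records a simple chain of steps (a degenerate DAG) and only runs them when `collect()` is called, at which point consecutive steps are pipelined in a single pass with no intermediate results materialized.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class LazyDataset:
    """Toy lazily evaluated dataset: transformations are recorded, not executed."""

    def __init__(self, source: Optional[Iterable[Any]] = None,
                 parent: Optional["LazyDataset"] = None,
                 transform: Optional[Callable[[Iterator[Any]], Iterator[Any]]] = None,
                 name: str = "source") -> None:
        self.source = source
        self.parent = parent
        self.transform = transform
        self.name = name

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(parent=self, name="map",
                           transform=lambda rows: (fn(r) for r in rows))

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(parent=self, name="filter",
                           transform=lambda rows: (r for r in rows if predicate(r)))

    def lineage(self) -> List[str]:
        """Walk back through the recorded graph, from the source to this step."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]

    def _rows(self) -> Iterator[Any]:
        if self.parent is None:
            return iter(self.source)
        return self.transform(self.parent._rows())

    def collect(self) -> List[Any]:
        """Action: only now is the recorded chain of steps actually executed."""
        return list(self._rows())
```

A usage example: `LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)` does no work when it is defined; the squares are computed and filtered only when `collect()` is invoked.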

Q: What is a column-oriented database? When and why would you use one?

Column-oriented databases arrange data storage on disk by column, rather than by row, which allows more efficient disk seeks for particular operations. This has a number of benefits when working with large datasets, including faster aggregation-related queries, efficient compression of data, and optimized updating of values in a specific column across all (or many) rows.

Whether or not a column-oriented database will be more efficient in operation varies in practice. It would appear that operations that retrieve all the data for a single object would be slower, requiring numerous disk operations to collect data from multiple columns to build up the record. However, these whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. In a rolodex application, for instance, operations collecting the first and last names from many rows in order to build a list of contacts are far more common than operations reading all data for a single entity in the rolodex. This is even more true for writing data into the database, especially if the data tends to be “sparse” with many optional columns.

It is also the case that data organized by columns, being coherent, tends to compress very well.

For these reasons, column stores have demonstrated excellent real-world performance in spite of any theoretical disadvantages.
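The trade-offs above can be sketched with a small in-memory model. This is only an illustration under simplifying assumptions: the sample records are invented, every record is assumed to have the same fields, and the helper names (`to_columns`, `read_row`, `run_length_encode`) are hypothetical. Run-length encoding is used here as one simple example of a compression scheme that benefits from similar values sitting next to each other.

```python
from itertools import groupby
from typing import Any, Dict, List, Tuple

# Row-oriented layout: one record per entity (invented sample data).
rows: List[Dict[str, Any]] = [
    {"first": "Ann", "last": "Lee", "city": "London", "age": 30},
    {"first": "Bob", "last": "Ray", "city": "London", "age": 40},
    {"first": "Cat", "last": "Fox", "city": "Paris", "age": 50},
    {"first": "Dan", "last": "Kim", "city": "Paris", "age": 60},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot rows into one list per column (assumes identical fields)."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def read_row(columns: Dict[str, List[Any]], index: int) -> Dict[str, Any]:
    """Rebuild a whole record; this must touch every column."""
    return {name: values[index] for name, values in columns.items()}


def run_length_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregation reads a single column and ignores the rest.
average_age = sum(columns["age"]) / len(columns["age"])

# A contact list reads only two of the four columns.
contacts = list(zip(columns["first"], columns["last"]))

# Values within a column are similar, so they compress well.
encoded_city = run_length_encode(columns["city"])
```

Note that `read_row` is the operation that becomes more expensive in this layout, since it has to visit every column, which mirrors the theoretical disadvantage described above.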

Q: What is a Lambda Architecture and how might it be used to perform analytics on streams with real-time updates?

One of the common use cases for big data is real-time processing of huge volumes of incoming data streams. Depending on the actual problem and the nature of the data, multiple solutions might be proposed. In general, a Lambda Architecture approach is often an effective way to address this type of challenge.

The core concept is that the result is always a function of the input data (a lambda). The final solution should behave as such a lambda function, irrespective of the amount of data it has to process. To make this possible, the data is split into two parts; namely, raw data (which never changes and can only be appended to) and pre-computed data. The pre-computed data is further subdivided into two other groups; namely, the old data and the recent data (where “old” and “recent” are relative terms, the specifics of which depend on the operational context).

With this distinction in place, we can now build a system based on the lambda architecture consisting of the following three major building blocks:

Batch layer. Manages the master dataset (an immutable, append-only set of raw data) and pre-computes arbitrary query functions (called batch views).

Serving layer. Indexes the batch views to support ad-hoc queries with low latency.

Speed layer. Accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.

There are many tools that can be applied to each of these architectural layers. For instance, the raw data could simply be stored in HDFS and the batch views might be pre-computed with the help of Pig jobs. Similarly, to make it possible to query batch views effectively, they might be indexed using technologies such as Apache Drill, Impala, Elasticsearch, or many others.

The speed layer, though, often requires a non-trivial amount of effort. Fortunately, there are many new technologies that can help with that as well. One of the most commonly chosen is Apache Storm, which uses an abstraction of spouts and bolts to define data sources and the manipulations applied to them, allowing distributed processing of streaming data.

The final element is to combine the results of the speed layer and the serving layer when assembling the query response.

This architecture combines the batch processing power of Hadoop with real-time availability of results. If the underlying algorithm ever changes, it’s very easy to just recalculate all batch views as a batch process and then update the speed layer to take the new version into account.
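As a rough illustration of how the three layers fit together, here is a minimal single-process sketch in Python that counts page views. Everything in it is a simplifying assumption: the `LambdaPageViews` class, the `(timestamp, page)` event shape, and the method names are hypothetical, and the sketch assumes no events arrive while a batch run is in progress. A production system would distribute each layer across tools such as those mentioned above.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp, page): a hypothetical event shape


class LambdaPageViews:
    """Toy lambda architecture for page-view counts (illustrative only)."""

    def __init__(self) -> None:
        self.master: List[Event] = []          # append-only raw data
        self.batch_view: Counter = Counter()   # pre-computed by the batch layer
        self.speed_view: Counter = Counter()   # incremental, recent data only

    def ingest(self, event: Event) -> None:
        # Raw data is only ever appended; the speed layer updates
        # its view incrementally so recent events are visible at once.
        self.master.append(event)
        _, page = event
        self.speed_view[page] += 1

    def run_batch(self) -> None:
        # Batch layer: recompute the view from scratch over all raw data.
        # The speed view is then reset, because the batch view now covers
        # everything seen so far (assumes no ingestion during the run).
        self.batch_view = Counter(page for _, page in self.master)
        self.speed_view = Counter()

    def query(self, page: str) -> int:
        # Query response: merge the batch view with the speed view.
        return self.batch_view[page] + self.speed_view[page]


system = LambdaPageViews()
for event in [(1, "/home"), (2, "/docs"), (3, "/home")]:
    system.ingest(event)

before_batch = system.query("/home")  # answered entirely by the speed view

system.run_batch()                    # batch view now covers all three events
system.ingest((4, "/home"))           # a new event lands in the speed view

after_batch = system.query("/home")   # batch view plus speed view
```

Because `run_batch` always recomputes from the immutable master dataset, changing the counting logic only requires rerunning the batch and resetting the speed view, which is the property highlighted in the paragraph above.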

Wrap-up

Without a doubt, big data represents one of the most formidable and pervasive challenges facing the software industry today. From social networking, to marketing, to security and law enforcement, the need for large-scale solutions that can effectively handle and process big data is growing rapidly.

Real expertise and proficiency in this domain require far more than learning the ins and outs of a particular technology. They require an understanding, at the first-principles level, of the manifold technical challenges and complexities involved, as well as the most effective methodologies and problem-solving approaches to address these hurdles.

We hope you find the questions presented in this article to be a useful foundation for “separating the wheat from the chaff” in your quest for the elite few among Big Data engineers, whether you need them full-time or part-time.