Infrastructure is the cornerstone of Big Data architecture. Possessing the right tools for storing, processing and analysing your data is crucial in any Big Data project. In the last installment of “Understanding Big Data”, we provided a general overview of the technologies in the Big Data landscape. In this edition, we’ll be closely examining infrastructural approaches: what they are, how they work and what each approach is best used for.

Hadoop

To recap, Hadoop is essentially an open-source framework for processing, storing and analysing data. The fundamental principle behind Hadoop is that, rather than tackling one monolithic block of data in one go, it’s more efficient to break the data up and distribute it across many machines, so that different parts can be processed and analysed concurrently.

When hearing Hadoop discussed, it’s easy to think of Hadoop as one vast entity; this is a myth. In reality, Hadoop is a whole ecosystem of different products, largely presided over by the Apache Software Foundation. Some key components include:

HDFS- The Hadoop Distributed File System, the default storage layer

MapReduce- Executes a wide range of analytic functions by analysing datasets in parallel before ‘reducing’ the results. The “Map” step runs on each node against its local portion of the data and emits intermediate results; the “Reduce” step gathers those results and combines them into the final output.

YARN- Responsible for cluster management and scheduling user applications

Spark- Used on top of HDFS, and promises speeds up to 100 times faster than the two-step MapReduce function in certain applications. Allows data to be loaded in-memory and queried repeatedly, making it particularly apt for machine learning algorithms
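To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API (real Hadoop jobs are typically written in Java against the MapReduce libraries); the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and chosen only for illustration. Each "chunk" stands in for a block of data that a cluster would process on a separate node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different node in
    # parallel; here the chunks are simply processed one after another.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data needs big tools", "data tools for big data"])
```

The point of the structure is that `map_phase` never needs to see more than its own chunk, which is what allows the work to be spread across many machines.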

More information about Apache Hadoop add-on components can be found here.

The main advantages of Hadoop are its cost- and time-effectiveness. Cost, because as it’s open source, it’s free and available for anyone to use, and can run off cheap commodity hardware. Time, because it processes multiple ‘parts’ of the data set concurrently, making it a comparatively fast tool for retrospective, in-depth analysis. However, open source has its drawbacks. The Apache Software Foundation are constantly updating and developing the Hadoop ecosystem; but if you hit a snag with open-source technology, there’s no one go-to source for troubleshooting.

This is where Hadoop-on-Premium packages enter the picture. Hadoop-on-Premium services such as Cloudera, Hortonworks and Splice offer the Hadoop framework with greater security and support, with added system & data management tools and enterprise capabilities.

NoSQL

NoSQL, which stands for Not Only SQL, is a term used to cover a range of different database technologies. As mentioned in the previous article, unlike their relational predecessors, NoSQL databases are adept at processing dynamic, semi-structured data with low latency, making them better tailored to a Big Data environment.

The different strengths and uses of NoSQL and Hadoop are often described as “operational” and “analytical” respectively. NoSQL is better suited to “operational” tasks: interactive workloads based on selective criteria, where data can be processed in near real-time. Hadoop is better suited to high-throughput, in-depth retrospective analysis, where most or all of the data is harnessed. Since they serve different purposes, Hadoop and NoSQL products are sometimes marketed together. Some NoSQL databases, such as HBase, were designed primarily to work on top of Hadoop.
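The operational/analytical distinction can be illustrated with a toy example. The sketch below uses a plain Python dictionary as a stand-in for a document store; the `users` data, field names and both functions are invented for illustration and do not correspond to any particular product's API. Note that the records are semi-structured: one of them carries an extra `referrer` field that the others lack.

```python
# Hypothetical in-memory "document store": user id -> semi-structured record.
users = {
    "u1": {"name": "Ada", "country": "UK", "orders": [12.50, 30.00]},
    "u2": {"name": "Lin", "country": "DE", "orders": [8.00]},
    "u3": {"name": "Sam", "country": "UK", "orders": [], "referrer": "newsletter"},
}


def get_user(user_id):
    """Operational query: fetch a single record by key, touching no other data."""
    return users.get(user_id)


def revenue_by_country():
    """Analytical query: scan every record and aggregate across all of them."""
    totals = {}
    for doc in users.values():
        country = doc["country"]
        totals[country] = totals.get(country, 0.0) + sum(doc["orders"])
    return totals


profile = get_user("u2")
report = revenue_by_country()
```

`get_user` is the kind of selective, low-latency lookup that NoSQL stores are built for, while `revenue_by_country` has to read the whole dataset, which is the kind of workload usually handed to a batch system such as Hadoop.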

Some big names in the NoSQL field include Apache Cassandra, MongoDB, and Oracle NoSQL. Many of the most widely used NoSQL technologies are open source, meaning security and troubleshooting may be an issue. NoSQL databases also tend to place less focus on atomicity and consistency than on performance and scalability. Premium packages of NoSQL databases (such as DataStax for Cassandra) work to address these issues.

Massively Parallel Processing (MPP)

As the name might suggest, MPP technologies process massive amounts of data in parallel. Hundreds (or potentially even thousands) of processors, each with their own operating system and memory, work on different parts of the same programme.
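As a rough sketch of how such a system can divide one query among many processors, the Python below simulates a "shared-nothing" layout in a single process. Rows are assigned to nodes by hashing a key, each node totals only its own rows, and a coordinator merges the partial results. The node count, function names and sample data are all assumptions made for illustration; a real MPP database does this internally when you issue an SQL query, and does so on physically separate processors.

```python
from collections import Counter
from zlib import crc32

NUM_NODES = 4  # hypothetical cluster size


def node_for(key):
    """Pick the node that owns a key, using a deterministic hash."""
    return crc32(key.encode("utf-8")) % NUM_NODES


def distribute(rows):
    """Send each (region, amount) row to the node that owns its region."""
    nodes = [[] for _ in range(NUM_NODES)]
    for region, amount in rows:
        nodes[node_for(region)].append((region, amount))
    return nodes


def local_sum(partition):
    """Work done independently on one node: total the rows it holds."""
    totals = Counter()
    for region, amount in partition:
        totals[region] += amount
    return totals


def total_by_region(rows):
    """Coordinator: run local_sum on every node, then merge the partial results."""
    # The nodes are visited one after another here; an MPP system would run
    # them at the same time, each with its own memory.
    partials = [local_sum(partition) for partition in distribute(rows)]
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)


sales = [("north", 10), ("south", 5), ("north", 7), ("east", 3)]
totals = total_by_region(sales)
```

Because every row for a given region lands on the same node, no node needs to consult another while computing its totals, which is what lets the approach scale to hundreds of processors.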

As mentioned in the previous article, MPP usually runs on expensive data warehouse appliances, whereas Hadoop is most often run on cheap commodity hardware (allowing for inexpensive horizontal scale-out). MPP uses SQL, while Hadoop uses Java by default (although Apache Hive provides HiveQL, a SQL-like query language, to make using Hadoop slightly easier and less specialist). As with all the technologies in this article, MPP has crossovers with the others; Teradata, an MPP technology, has an ongoing partnership with Hortonworks (a Hadoop-on-Premium service).

Many of the major players in the MPP market have been acquired by technology vendor behemoths; Netezza, for instance, is owned by IBM, Vertica is owned by HP and Greenplum is owned by EMC.

Cloud

Cloud computing refers to a broad set of products that are sold as a service and delivered over a network. In other infrastructural approaches, when setting up your Big Data architecture you need to buy hardware and software for each person involved in processing and analysing your data. In cloud computing, your analysts only require access to one application: a web-based service where all of the necessary resources and programmes are hosted. Up-front costs are minimal, as you typically only pay for what you use and scale out from there; Amazon Redshift, for instance, allows you to get started for as little as 25 cents an hour. As well as lower cost, cloud computing can also deliver insights faster.
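To see what pay-as-you-go means in practice, here is a back-of-the-envelope calculation in Python using the 25-cents-an-hour figure quoted above. The usage patterns (8 hours a day for 22 working days, versus running around the clock for 30 days) and the function name are assumptions for illustration only; actual pricing depends on the provider, node type and region.

```python
HOURLY_RATE_USD = 0.25  # entry-level rate quoted in the article; real prices vary


def on_demand_cost(hours_used, nodes=1, hourly_rate=HOURLY_RATE_USD):
    """Cost of a pay-per-use cluster: you are billed only for the hours it runs."""
    return hours_used * nodes * hourly_rate


# One node used during office hours only: 8 hours x 22 working days = 176 hours.
business_hours_cost = on_demand_cost(8 * 22)

# The same node left running all month: 24 hours x 30 days = 720 hours.
always_on_cost = on_demand_cost(24 * 30)
```

Under these assumptions the office-hours pattern comes to 44 dollars for the month and the always-on pattern to 180 dollars, with no hardware purchased in either case.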

Of course, having your data hosted by a third party can raise questions about security; many choose to host their confidential information in-house, and use the cloud for less private data.

A lot of big names in IT offer cloud computing solutions. Google has a whole host of cloud computing products, including BigQuery, which is specifically designed for processing and managing Big Data. Amazon Web Services also has a wide range, including EMR for Hadoop, RDS for MySQL and DynamoDB for NoSQL. There are also vendors, such as Infochimps and Mortar, specifically dedicated to offering cloud computing solutions.

As you can see, these different technologies are by no means direct competitors; each has its own particular uses and capabilities, and complex architectures will make use of combinations of all of these approaches, and more. In the next “Understanding Big Data”, we will be moving beyond processing data and into the realm of advanced analytics; programmes specifically designed to help you harness your data and glean insights from it.

(Image credit: NATS Press Office)




Eileen McNulty-Holmes – Editor

Eileen has five years’ experience in journalism and editing for a range of online publications. She has a degree in English Literature from the University of Exeter, and is particularly interested in big data’s application in humanities. She is a native of Shropshire, United Kingdom.
