Data sets tend to grow rapidly as the number of information-sensing devices increases: CCTV cameras, system logs, mobile phones, aerial (remote sensing) platforms, wireless sensor networks, IoT devices and many more.

Data is growing so rapidly that an estimated 90% of all the data in the world was generated in just the past few years.

Gartner defines it like this: “Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

In short, Big Data refers to datasets so large or complex that they cannot be processed using traditional techniques.

The actual numbers out there today are beyond one’s imagination. Every 60 seconds, more than 470,000 tweets, 2.1 million snaps and 50,000 Instagram photos are posted. The numbers say it all.

Five V’s of Big Data

Ever wondered what it takes to process such huge amounts of data? What are the various types of data that need attention? How do we analyze this data? Where do we collect and store it? And so on…

To better understand and answer these questions, let’s quickly go through the five major concepts of Big Data, widely known as the “Five V’s of Big Data”.


Volume — “Volume” refers to the size of the data generated every second. According to Cisco, global Internet traffic crossed one zettabyte in 2016. Facebook alone deals with 10 billion messages, over 350 billion new pictures and 4.5 billion “Like” button presses. That’s huge! Big Data technologies make it possible to collect and process such humongous amounts of data.

Variety — Variety refers to the different types of data we can now use, broadly classified as structured and unstructured. While structured data (name, address, phone number, city, etc.) is easy to process with traditional methods, most of today’s data (~80%) is unstructured: photos, videos, social media updates, CCTV footage, logs and so on. Data today looks very different from data of the past.

Velocity — Velocity is the speed at which data gets generated, collected and analyzed. Data grows every second, and it must be processed really quickly so that real-time users can access and retrieve it. The number of active Internet users is believed to have grown by 75% between 2012 and 2017, reaching 3.8 billion people. Big Data technology can now analyze data while it is being generated, even before storing it.

Veracity — Just imagine what would happen if the data generated were not accurate. The world would go insane! Veracity refers to the accuracy of the data being generated, processed and analyzed; it ensures the quality and trustworthiness of the data.

Value — Value refers to turning “data into money”. Talking about Big Data without its worth or value would be an injustice. It is important to map the business model onto the data by understanding the costs associated with collecting and analyzing it.
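As a toy illustration of the veracity idea, the sketch below flags records that fail basic quality checks before they reach analysis. The field names and thresholds are invented for this example:

```python
# Toy veracity check: flag records that fail simple quality rules.
# Field names and thresholds are invented for illustration.

def is_trustworthy(record):
    """Return True if a record passes basic accuracy checks."""
    if record.get("user_id") is None:          # completeness check
        return False
    age = record.get("age")
    if age is None or not (0 <= age <= 120):   # plausibility check
        return False
    return True

records = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": -5},     # implausible value
    {"user_id": None, "age": 27},  # missing identifier
]

clean = [r for r in records if is_trustworthy(r)]
print(len(clean))  # 1
```

Real pipelines apply far richer rules, but the principle is the same: untrustworthy records must be caught before they distort the analysis.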

Big Data Tools

Big Data testing and development require special skills and setups, unlike traditional methods. Big Data deals with complex, voluminous data of many varieties, which brings a number of challenges. Analyzing and processing Big Data is not easy. One of the major challenges is converting unstructured data into a structured format so it can be analyzed further. Storing the data is another Big Data concern.


There are a number of tools available in the market to store and analyze Big Data. Here are the top ones categorized based on Storage and Analysis:

1. Hadoop

Apache Hadoop is a free, open source software framework widely used to store and process data at large scale. The “Hadoop Distributed File System (HDFS)” is Hadoop’s distributed storage layer. “Hadoop YARN” is the resource manager responsible for managing cluster resources and scheduling applications. Hadoop also provides the “MapReduce” programming model for processing data at scale.
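As a minimal sketch of the MapReduce model (in plain Python rather than on a Hadoop cluster), a word count can be expressed as a map phase that emits key-value pairs, a shuffle that groups values by key, and a reduce phase that aggregates each group:

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in every line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle phase: group emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values for each key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big tools", "data everywhere"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'tools': 1, 'everywhere': 1}
```

On a real cluster, Hadoop runs the map and reduce functions in parallel across many machines and handles the shuffle over the network; the programming model, however, is exactly this shape.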

2. NoSQL

NoSQL databases work well for Big Data applications and give better performance when storing huge amounts of data. A NoSQL database can store unstructured or semi-structured data, and many offer SQL-like query languages. There are many open source NoSQL databases available for Big Data analysis, such as MongoDB, Cassandra, CouchDB, Hypertable and Redis.
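To illustrate the document-store idea behind databases like MongoDB, here is a toy in-memory sketch (not a real driver; the class and method names are invented for illustration): documents in the same collection need not share a schema, and queries match on whatever fields a document happens to have:

```python
# Toy in-memory document store: illustrates schemaless storage and
# field-based matching. Not a real NoSQL client; the API is invented.

class Collection:
    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(dict(doc))

    def find(self, query):
        """Return documents whose fields match every key in `query`."""
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in query.items())]

users = Collection()
users.insert({"name": "Asha", "city": "Pune"})
users.insert({"name": "Ben", "city": "Oslo", "tags": ["admin"]})  # extra field
users.insert({"name": "Cara"})                                    # missing field

print(users.find({"city": "Oslo"}))
# [{'name': 'Ben', 'city': 'Oslo', 'tags': ['admin']}]
```

Note that the three documents have different fields; a relational table would force one schema on all of them, which is exactly what document stores avoid.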

3. Hive

Apache Hive is data warehouse software built on top of Hadoop, primarily used for data mining. It manages large datasets efficiently while supporting an SQL-like query language (HiveQL).
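HiveQL queries look much like standard SQL. As a local stand-in (using Python’s built-in sqlite3 instead of a Hive cluster, with an invented `page_views` table), the same style of GROUP BY aggregation looks like this:

```python
import sqlite3

# Local stand-in for a HiveQL-style aggregate; the `page_views`
# table and its columns are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", [
    ("asha", "/home"), ("asha", "/docs"), ("ben", "/home"),
])

# In Hive, a query of this shape would run as MapReduce (or Tez/Spark)
# jobs over tables backed by files in HDFS.
rows = conn.execute(
    "SELECT user, COUNT(*) FROM page_views GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('asha', 2), ('ben', 1)]
```

The key point is that analysts can keep writing familiar SQL while Hive translates it into distributed jobs behind the scenes.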

4. Sqoop

Sqoop is a command-line tool that transfers bulk data between Apache Hadoop and structured data stores (relational databases). It uses MapReduce to import and export the data, which provides parallelism as well as fault tolerance.

5. PolyBase

PolyBase integrates SQL Server Parallel Data Warehouse (PDW) with Hadoop. It is a technology for accessing data outside the database by querying across relational data stored in PDW and non-relational data stored in HDFS.

6. Excel

Microsoft Excel is one of the most popular tools for storing and accessing data, and it is widely used as an analytics tool. Excel provides many built-in functions to analyze large amounts of data quickly, and users can also pull in and work with external data sources.

When working with Big Data, a number of technologies and techniques can be applied to make these workflows successful.

7. Presto

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

8. Splunk

Founded in 2003, Splunk is a software platform that indexes machine data and turns it into accessible, actionable intelligence. The company’s motto is “any question, any data, one Splunk” — and they mean it.

9. Spark

Apache Spark™ is a unified analytics engine for large-scale data processing. Spark supports Java, Scala, Python, R and SQL, and handles both batch and real-time (streaming) processing. It can take input from multiple sources, and many data pipelines are much simpler to express in Spark.
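A core idea in Spark is that transformations such as map and filter are chained lazily and nothing executes until an action such as collect is called. The plain-Python toy below mimics that model (it is not the real PySpark API; the class name and behavior are simplified for illustration):

```python
# Toy stand-in for Spark's RDD model: lazy transformations, eager actions.
# Not the real PySpark API; method names mirror it for illustration only.

class ToyRDD:
    def __init__(self, data):
        self._data = list(data)
        self._ops = []  # pending (lazy) transformations

    def map(self, fn):
        self._ops.append(("map", fn))
        return self

    def filter(self, pred):
        self._ops.append(("filter", pred))
        return self

    def collect(self):
        """Action: run all pending transformations and return the results."""
        out = self._data
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

result = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16]
```

Real Spark partitions the data across a cluster and uses the recorded chain of transformations both to schedule work and to recompute lost partitions, which is where its fault tolerance comes from.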

10. Elasticsearch

Elasticsearch is a free and open source, Lucene-based enterprise search engine widely used for querying data and for Big Data analysis. It is very easy to install and comes with a built-in REST API. Elasticsearch can also be integrated with Hadoop to improve query performance.
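At the core of Lucene-based engines like Elasticsearch is an inverted index, which maps each term to the documents that contain it, so queries never have to scan every document. Here is a minimal pure-Python sketch of that data structure (not the Elasticsearch API):

```python
from collections import defaultdict

# Minimal inverted index: term -> set of document ids.
# A toy sketch of the structure behind Lucene, not a real client.
docs = {
    1: "big data needs big tools",
    2: "search engines index data",
    3: "tools for search",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Return ids of documents containing ALL the given terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

print(search("data"))             # [1, 2]
print(search("search", "tools"))  # [3]
```

Production engines add tokenization, relevance scoring and distributed shards on top, but term-to-document lookup is the reason full-text search stays fast at scale.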

Big Data Testing Challenges

Data is an integral part of our lives, and transforming unstructured data into structured form has become a huge testing concern. Big Data testing faces challenges around automation and virtualization. With such rapid growth, it is essential to have a clear test strategy in place for the product to succeed.

The industry demands technical expertise when it comes to automated testing of Big Data. Big Data testing itself unearths various problems because the available tools are often ill-equipped for the task.

Virtualization is crucial for Big Data testing, yet handling real-time data on virtual machines is very challenging, as their latency can create timing issues.

Big Data testing puts a heavy load on data storage, analysis and processing. Since it involves verifying huge datasets across various platforms, it is important to automate the testing effort as much as possible and to identify issues at an early stage to avoid risks and costs.

Read my blog on how to Get the Most out of Software Testing in an Agile Methodology.