







What is Big Data? We hear this question every now and then. The most common answers you might receive are along the lines of "any data that cannot fit into a single machine", or "data larger than 1 TB is big data", and so on. But are these the right answers to one of the most important questions of this data-driven era, "What is Big Data?"

NO

Facts related to Big Data

Over 2.5 quintillion bytes of data are generated worldwide every day

In 2016, 90% of the world's data had been created in the previous two years

Trillions of sensors monitor, track and communicate with each other, feeding the Internet of Things with real-time data

294 billion emails are sent every day

Over 1 billion Google searches are made every day

30+ petabytes of user-generated data are stored, processed and analyzed at Facebook

230+ million tweets are posted each day on Twitter

Characteristics of Big Data - 5 V’s of Big Data

















Volume: This refers to the amount of data that is being generated. It is huge, and many people interpret Big Data only in terms of this property.



Over 2.5 quintillion bytes of data are generated worldwide every day. We have better access to the internet, and the smartphone industry is growing rapidly along with online retailers like Amazon, Flipkart, eBay, etc.



All these developments have led to the generation of huge amounts of data. Twitter alone sees around 500 million tweets a day. We also have IoT nowadays, which brings sensor-collected data from almost any device or machine you can think of, and industries are adopting IoT together with Big Data Analytics to help grow their business.



All these developments lead to a huge amount of data, and it is being generated in such volume that traditional data processing systems cannot handle it. We need outputs faster, and traditional processes would take a long time to process even 10% of this data. This is where we need to think of Big Data frameworks and technologies.

Velocity: This refers to the rate at which data is being generated by different sources, and it is one of the main characteristics of Big Data. We have data being generated from new sources every day, and the speed at which it is generated is increasing exponentially. Social media giants like Facebook and Twitter have to process millions of posts and tweets per day, or even per minute!



In 2019, in every single minute:

Twitter users send more than 500,000 tweets

Instagram users post over 250,000 stories

Twitch users view 1 million videos

Tinder users swipe 1.4 million times

Since the Internet of Things (IoT) is very popular these days, it is one of the best examples with which I can explain velocity to you. With the advancements in IoT, businesses have found innovative ways to collect data on their processes so that they can analyze it and improve their systems. Sensors are being attached to machinery, to employees, to sportspersons, etc., to analyze their performance.



Now, this data is being generated continually. This is what we call streaming data: data generated continuously, in real time or near real time. Until a few years back, we didn't have any mature system to analyze such streaming or real-time data.



So nobody used such data; we were limited to analyzing only historical data in batch-processing fashion. But the potential of streaming data is immense in any industry.
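To make the batch-versus-streaming distinction concrete, here is a minimal sketch (all names and readings are invented for illustration; this is not a real streaming framework): the batch function must wait for the whole dataset, while the streaming version yields an up-to-date answer after every arriving reading.

```python
# Illustrative sketch: batch vs. streaming processing of sensor readings.

def batch_average(readings):
    # Batch style: wait until all historical data is collected, then process.
    return sum(readings) / len(readings)

def streaming_average(stream):
    # Streaming style: update the result as each reading arrives, so an
    # up-to-date answer is available at any moment.
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count  # running average after each new reading

readings = [21.0, 22.5, 23.0, 21.5]
print(batch_average(readings))            # one answer, after all data is in
print(list(streaming_average(readings)))  # an answer after every reading
```

Real streaming systems add windowing, fault tolerance and distribution on top of this idea, but the core difference is exactly this: results per event rather than per dataset.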



Velocity is one of the most important factors in Big Data processing. Active research is underway, and frameworks are being developed using various methodologies to reduce execution time. If an organization cannot keep up with the velocity of its data, it will probably have to rethink its data processing strategy.

Variety: This refers to the types of data that are being generated. We can categorize data into 3 main types:

Structured: Any data that can be organized into a table format is called structured data. Analysis of structured data is predominant in the traditional DBMS approach.

Unstructured: Data that cannot be organized into a table-like format. These are complex entities like images, tweets, emails, videos, etc.

Semi-Structured: These lie between the structured and unstructured forms of data. Such data is not completely unstructured; it is defined in specific formats with tags. Examples include JSON, XML, etc.

Now, an interesting fact: over 80% of the data in the digital universe is in unstructured format. Various industries, governments and research organisations are using Big Data frameworks and methodologies to harness useful information from it.
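A small sketch can show why JSON counts as semi-structured: the tags give each record structure, but records need not share a fixed schema the way table rows do. The sample document below is made up for illustration.

```python
import json

# Two "tweets" with different fields: no fixed schema, but tagged structure.
doc = '''
[
  {"user": "alice", "text": "Loving the new phone!", "likes": 12},
  {"user": "bob",   "text": "Big Data is everywhere",
   "location": {"city": "Pune", "country": "India"}}
]
'''

records = json.loads(doc)
for rec in records:
    # Fields vary per record, so we read them defensively with .get().
    print(rec["user"], "| likes:", rec.get("likes", 0),
          "| city:", rec.get("location", {}).get("city", "unknown"))
```

Forcing such records into one rigid table would mean many empty columns; semi-structured formats keep the flexibility while still being machine-parseable.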

The ability to harness insights from unstructured data has been a game changer over the past few years. Every industry, from retail giants to governments, is trying to make use of this data.

Veracity: This refers to the trustworthiness and quality of the data. Before you start to process and analyze data, you should always check whether the data is trustworthy.



Suppose you are analyzing Twitter data and have to do sentiment analysis of a particular hashtag. Before jumping into processing, you should ensure that the data provided to you is trustworthy and of reasonable quality. You have to check whether it is credible enough to provide any value.



Now, what is the point of wasting your effort and time analyzing something that is fake or corrupt? You may find a lot of tweets that carry your desired hashtag, but the tweets might not even be related!
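A basic veracity check of this kind can be sketched as a simple filter. The tweets and the `credible` helper below are hypothetical, invented for illustration; a real pipeline would add deeper relevance and spam checks, typically with NLP.

```python
# Hypothetical pre-processing step before sentiment analysis: keep only
# records that have usable text and really contain the target hashtag.

def credible(tweets, hashtag):
    kept = []
    for t in tweets:
        text = (t.get("text") or "").lower()   # tolerate missing/corrupt text
        if hashtag.lower() in text.split():    # hashtag must actually appear
            kept.append(t)
    return kept

tweets = [
    {"text": "#WorldCup what a final!"},
    {"text": "buy followers cheap"},       # spam, no hashtag
    {"text": None},                        # corrupt record
    {"text": "Unrelated rant #worldcup"},  # hashtag present, case differs
]
print(len(credible(tweets, "#WorldCup")))  # prints 2
```

Even this crude filter shows the point of veracity: cheap checks up front keep fake or corrupt records from polluting the expensive analysis downstream.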



Let's take another example. Suppose you are given GPS data from a group of individuals' cars and are supposed to compute some metrics from it. You find that a lot of data is missing because GPS connectivity was lost in some remote areas. These are the kinds of conditions where the veracity of data comes into the picture.

Value: This refers to how much value processing and analyzing the data provides to you or to a business. What is the use of analyzing large data-sets if it doesn't provide any useful insights?



There is an immense quantity of data out there, but choosing the right data is important. Always assess the value the data can provide before you start processing it.





Conclusion





Let's summarize a few key points from our discussion!





Big Data is not just about the size of data.

There are 3 main types of data:

Structured - data that can be confined to tables

Semi-Structured - JSON, XML

Unstructured - images, videos, social media, IoT

There are 5 important characteristics that identify a Big Data problem, called the 5 V's of Big Data:

Volume

Velocity

Variety - structured, unstructured, semi-structured

Veracity - trustworthiness / quality

Value

Now, one important point you always have to remember: Big Data is a problem rather than a technology. It is a big problem faced by the industry. Until a few years back, we didn't have the resources and budget to process this data.



It was like having a treasure chest in front of you without knowing how to open it! Have you ever wondered why there is suddenly such an explosion in the analytics domain? It's because we now have the resources and capabilities to innovate, process and analyze. 8 GB of RAM is now standard in almost all laptops. If we go back 5 to 10 years, we were running our laptops and PCs with 512 MB of RAM. Then came the semiconductor revolution, and here we are now.



There are endless possibilities in the field of Big Data Analytics. Those who have access to data are kings. As a Data Engineer / Analyst, your role is to help organizations find useful insights in their data. In coming articles we will discuss the various tools and frameworks that are used in the industry for Big Data Analytics.



I hope I was able to give you an understanding of Big Data and an answer to "What is Big Data?". Let me know your views on Big Data in the comments below. Keep learning and keep innovating!

The simple answer is: you cannot define Big Data just by putting some numbers on it. There are other factors that together form the characteristics of Big Data. Gartner famously defines Big Data as "high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation".

Let's look at this definition for a second and try to understand it. They say any data can be considered Big Data if it is larger in volume than what can be handled by traditional systems like RDBMS, is generated at high velocity from different sources, and can be in any form, i.e., images, text, videos, JSON, etc.

But the main point is the last part: Big Data processing demands a cost-effective and innovative approach; otherwise it may not be accessible to everybody. Processing such data demands either vertical scaling of existing hardware or an innovative approach that enables big data processing even on regular commodity hardware like your own PC. Vertical scaling is limited and not feasible for huge data volume and variety, as there is always a limit to how much we can upgrade or scale a particular system's hardware. So the solution was distributed, parallel computing, and frameworks like Hadoop MapReduce, Spark, etc.

Now, it is important to know why we are suddenly seeing such huge demand for Big Data Analytics in the digital world. There are a lot more facts than the ones listed above, but they should give you an idea of how important data is going to be in the coming years. Big Data is a field that is growing rapidly, but the available expertise and technologies to explore the ever-growing data are limited. This gap exists because mankind now doubles the volume of data it produces every two years, but processes, analyzes and comprehends only a part of it.

In Big Data, there are 5 important characteristics that you need to look at, as shown in the info-graphics above. These are generally called the 5 V's of Big Data. A few years back, there were only 3 V's, i.e., Volume, Velocity and Variety. Later, 2 more V's came into the list: Veracity and Value, which we explored in detail above.
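The distributed, parallel style behind frameworks like Hadoop MapReduce can be sketched in a few lines. This is a toy, single-process illustration of the map/reduce idea, not how Hadoop or Spark is actually implemented; the chunks and words are invented for the example.

```python
# Toy MapReduce-style word count: "map" each chunk independently (in a real
# cluster, each chunk lives on a different commodity machine and is mapped
# in parallel), then "reduce" by merging the partial results.
from collections import Counter
from functools import reduce

chunks = [
    "big data is a problem",        # imagine each chunk stored on a
    "big data needs big thinking",  # different commodity machine
]

def map_phase(chunk):
    return Counter(chunk.split())   # local word counts for one chunk

def reduce_phase(a, b):
    return a + b                    # merge two partial counts

word_counts = reduce(reduce_phase, map(map_phase, chunks))
print(word_counts["big"])  # "big" appears 3 times across chunks
```

The key design point is that the map phase touches only local data, so adding more cheap machines adds more parallel map work, which is exactly the horizontal-scaling alternative to upgrading a single server.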