Introduction to Big Data

Big Data refers to the technological wave that has emerged in recent years around the data and metadata that today’s servers allow us to store. The term comes from the fact that the data held by some companies and institutions has become so large that the traditional tools for managing and querying so-called structured databases, and for processing data, have become obsolete: loading the data is difficult, extraction and processing times become too long, and so on.

The impact of Big Data goes far beyond web or IT applications, and its effect on society is increasing. The capacity to store and analyze data allows new information processing and analysis technologies to emerge. Big Data also supports the expansion of research in medicine, pharmaceuticals, computing, banking and electromechanics (Internet of Things, automobiles, etc.).

Technological Evolution of Big Data

Three major technical developments have allowed the creation and growth of Big Data. The first is the evolution of storage hardware, with ever larger capacities in ever smaller devices, together with the evolution of storage models, from internal servers inside the enterprise to so-called “cloud” servers, which often offer much higher storage capacity than internal company servers. The second is the practice of chaining together small, replaceable servers to create a distributed system that is resistant to failures. This paradigm was popularized by Google in the early 2000s and is at the origin of the first open source Big Data framework, released 10 years ago: Hadoop. The third, which began in 2009, is the explosion of tools for analyzing, extracting and processing unstructured data, from NoSQL databases to new frameworks linked to the Hadoop ecosystem.

Traditional databases are no longer able to manage such high volumes of information, so big web players such as Facebook, Google, Yahoo or LinkedIn have created Big Data frameworks to manage and process large amounts of data through, for example, data lakes, where all data from various sources is stored. This data is then split so that it can be processed in parallel, which lightens the computation (in the old model, processing steps ran one after another), and the partial results are then reassembled to give the final result. It’s this technique that allows fast processing of large volumes of data. Originally developed by Google, it is now under the Apache flag and is called Map Reduce. Here is the breakdown of the process:

The aim of the algorithm above is to count how many times each word appears in a text. The Map Reduce algorithm distributes the data (here character strings, or words) across several nodes (splitting). Each node performs its calculations separately (counting the words it received: mapping and shuffling), and finally the “Reduce” step consolidates the results of each calculation to produce the final result. Smart!
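To make the four steps concrete, here is a minimal single-machine sketch of the word count in Python. It only imitates the structure of Map Reduce: the function names (split_input, map_chunk, shuffle, reduce_groups, word_count) and the sample text are illustrative assumptions, not part of Hadoop or any other framework, and the "nodes" are simply list slices processed one after another.

```python
from collections import defaultdict
from itertools import chain


def split_input(text, n_chunks):
    """Splitting: deal the lines of the text out to n_chunks simulated nodes."""
    lines = text.splitlines()
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_chunk(lines):
    """Mapping: emit a (word, 1) pair for every word this node received."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffling: group the emitted counts by word across all nodes."""
    groups = defaultdict(list)
    for word, count in chain.from_iterable(mapped_chunks):
        groups[word].append(count)
    return groups


def reduce_groups(groups):
    """Reduce: sum the grouped counts to get one total per word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(text, n_chunks=3):
    chunks = split_input(text, n_chunks)
    # On a real cluster each chunk would be mapped on a different machine,
    # in parallel; here the chunks are processed sequentially.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_groups(shuffle(mapped))


# Hypothetical sample input, three lines for three simulated nodes.
sample = "deer bear river\ncar car river\ndeer car bear"
counts = word_count(sample)
print(counts["car"])  # prints 3
```

In a real framework the map and reduce functions look much the same; what the framework adds is the distribution of chunks across machines, the network shuffle and the recovery from node failures.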

Big Data Use Cases

Big Data finds a wide variety of business applications, but some industries use Big Data more than others.

Insurance companies and banks hold records of customer data (who doesn’t have a bank account?). Big Data makes it possible to analyze this customer data (subscriptions, cancellations, geographical locations, cultural variables, gender and others) in order to predict which populations or typical client profiles are likely to leave the bank, and thus to take steps to reduce the churn rate in these populations.

Aviation, and mechanical industries in general, are another example. Saagie is currently conducting predictive maintenance for a client, predicting when sensitive parts will break or how the deterioration of parts essential to the proper functioning of an airplane, a car or a luggage conveyor will evolve. This makes it possible to raise an alert and inspect the part before it reaches the end of its life (prescriptive analysis).

In the pharmaceutical and medical fields, one example is the analysis of dormant data from the scans of patients suffering from cancer. Saagie Artificial Intelligence makes it possible to collect this data, analyze the medical opinions given for each case and implement an artificial intelligence that helps doctors in their decision-making. On the basis of hundreds of thousands of scans and the opinions of hundreds of doctors, the system can learn more precisely which opinion and treatment is likely to be most effective, based on high-resolution scan analysis.

Big Data also makes it possible to optimize, quantitatively and qualitatively, the profile of test patients in the pharmaceutical industry, by determining from previous data the typical profiles most likely to complete the drug trials. This can help reduce the time to market of some drugs that are held up by a lack of test patients.

The Future of Big Data

Because the big data technology industry is very recent, the systems for processing and storing big data are constantly evolving. Technologies appear and disappear at an impressive speed. The Map Reduce algorithm was created in 2004 by Google and is widely used, for example by Yahoo in its Nutch project. In 2008 it became an Apache product as part of Hadoop, but because of its “slowness”, even on modest-sized datasets, its use is progressively being abandoned.

Since the second version of Hadoop, the modular architecture accepts new computation modules alongside the Hadoop Distributed File System (HDFS) and Map Reduce. This is how Spark, an Apache project younger than Map Reduce, is overtaking it little by little. Spark can run on top of Hadoop and on top of many NoSQL databases. The project has developed quickly in recent years and has won the approval of a large part of the developer community.

The Main Actors of Big Data

Google and Facebook faced problems with the volume of their data very early on, which is why they quite naturally became two structuring actors of Big Data. They are the actors most capable of processing these volumes of data correctly and quickly. From the beginning, Big Data has interested the giants of the IT sector: software publishers and long-established integrators of software on company servers. The “early adopters” of Big Data include, for example, Oracle, SAP, IBM and Microsoft, which (given the potential of this market) certainly started a little later than Google and Facebook, but still benefit from the wave of growth of Big Data. Here is a closer look at the largest Big Data actors in 2016.

Hortonworks, Cloudera and MapR

These are the vendors of Big Data distributions. Cloudera counts among its team Doug Cutting, one of the creators of Hadoop. Hortonworks is a spin-off of Yahoo and has the most open source positioning. MapR takes another approach: the storage and computation engines were changed, but the Hadoop APIs were preserved in order to ensure compatibility with the existing ecosystem.

Google

Google remains the precursor and the mastodon of Big Data technologies, with the development of Map Reduce in 2004, for example. Google makes extensive use of its technology for search engine indexing algorithms, Google Translate or Google’s satellite imagery, relying on specific functions for load balancing, parallelization and recovery in case of server failures. Google uses Map Reduce less and less and is strongly oriented towards streaming (real-time processing). Google provided the open source version of Google Dataflow with Apache Beam.

Amazon

Amazon became one of the biggest actors of Big Data when, in 2009, it offered through Amazon Web Services a technology comparable to Google’s, called Elastic Map Reduce. This technology has the advantage of separating data exploration from the implementation, management and tuning of Hadoop clusters. The advent of cloud computing, launched by Amazon, makes the brand more powerful in the Big Data sector by massively democratizing it. However, the cost of migration, and of any exit strategy, is very high.

IBM

IBM, like the other big actors of the Web, started exploring Big Data by integrating some Hadoop and Map Reduce processing building blocks into its services.

ODPi

The Open Data Platform Initiative brings together Hortonworks, IBM and Pivotal in order to set standards for the implementation of Big Data platforms. The goal is to give users reversibility guarantees. It is not a success yet, however, because Cloudera and MapR did not join the movement.

Last but not least among the Big Data actors is Saagie. How does Saagie deal with Big Data?

Data is in Saagie’s blood

In the big jungle of Big Data, with its complicated terms (data lake, map reducing, HDFS, Mongo MySQL… what is that?) and programming languages like Python or Java, there is Saagie, the oasis able to manage all the technologies, from data collection to KPI summary dashboards. To do this better, Saagie keeps the spirit of open source by allowing you to import your existing work with just a little adaptation, and by guaranteeing reversibility to another platform.

Saagie simplifies Big Data whatever your solutions and however they are deployed (cloud, on-premise or hybrid). Saagie makes all your data workable on site with Saagie Su Appliance or in the cloud with Saagie Kumo.