The leaked Panama Papers present an excellent example of how rapidly scale changes and what sorts of new approaches are needed to handle the challenges of today and tomorrow. The V's of Big Data can be readily applied to these datasets.

Volume

The documents contain terabytes of information and are orders of magnitude larger than previously leaked document sets.

The scale of the leaked documents shown in gigabytes on a logarithmic scale

Such a large volume of data can no longer be processed entirely by hand because the costs quickly spiral out of control. At a rate of one document per minute, it would take a single person nearly a year to browse through the data once; looking for correlations or connections would take lifetimes.
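The claim above is easy to check with back-of-envelope arithmetic. The sketch below uses a hypothetical document count chosen purely for illustration (the full leak is far larger); it only shows how quickly round-the-clock manual review becomes infeasible.

```python
# Back-of-envelope estimate of manual review time.
# n_documents is a hypothetical count chosen for illustration only.
n_documents = 500_000          # hypothetical document count
minutes_per_document = 1       # one document per minute, around the clock

total_minutes = n_documents * minutes_per_document
days = total_minutes / 60 / 24
print(f"about {days:.0f} days for a single person, non-stop")
```

Half a million documents already costs a person nearly a year of uninterrupted reading; any pairwise comparison between documents grows quadratically from there.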

Variety

Another challenge in this case is the variety of the information available and how much of it is completely unstructured. Images in particular (both on their own and embedded in other documents such as PDFs) can be very difficult to analyze. A number of questions arise when dealing with images: what do they contain? If they contain text, what is its orientation and language, and how are the blocks of text connected to one another?

The make-up of the contents examined: a large portion is unstructured and difficult to search with standard approaches.
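A first, crude way to assess the variety of such a dump is simply to tally file types: anything image-like will need OCR or scene analysis rather than plain text search. A minimal sketch, with invented file paths standing in for the real collection:

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical sample of paths from a mixed document dump
paths = [
    "mail/2007/contract.pdf", "scans/passport_01.jpg",
    "db/clients.sqlite", "mail/2008/note.eml",
    "scans/receipt_17.tif", "mail/2008/invoice.pdf",
]

# Tally file extensions as a rough first measure of variety
counts = Counter(PurePosixPath(p).suffix.lstrip(".") for p in paths)
print(counts.most_common())
```

On a real leak the tally would be the first step in routing each file to the right pipeline: emails and databases to text indexing, scans and photographs to image analytics.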

Big Image Analytics

We use the term big image analytics for the combination of the latest technologies in Big Data and image processing. The synergy of these two approaches allows useful, actionable information to be extracted at an unprecedented scale.

Text Recognition

As many of the images are photographs of documents, text recognition and extraction can be used to convert them into searchable text.

A text clipping from a German-language newspaper

While this works well for simple documents, it breaks down as documents become more complicated (even receipts fall into this category). There, understanding the structure and relationships within the document becomes much more important than raw text extraction.

A standard receipt from a store
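A receipt illustrates the point: OCR alone yields a stream of lines, and only by pairing each item with the number printed on the same line do we recover the document's structure. A minimal sketch, using invented OCR output and a simple regular expression (real receipt layouts vary far more than this):

```python
import re

# Hypothetical OCR output for a simple store receipt
ocr_lines = [
    "SUPERMARKT AG",
    "Milk 1L           2.40",
    "Bread             3.10",
    "TOTAL             5.50",
]

# Pair each item name with the price on the same line:
# a run of 2+ spaces separates the two columns.
line_re = re.compile(r"^(?P<item>[A-Za-z][\w .]*?)\s{2,}(?P<price>\d+\.\d{2})$")
items = {m["item"]: float(m["price"])
         for line in ocr_lines if (m := line_re.match(line))}

# The recovered structure lets us sanity-check the document itself
line_sum = sum(v for k, v in items.items() if k != "TOTAL")
print(items, "consistent:", abs(line_sum - items["TOTAL"]) < 1e-9)
```

The useful output here is not the text but the table of item-price pairs, plus the ability to check that the line items actually add up to the printed total.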

Scene Recognition

For more standard image material, text recognition does not provide much useful information. To extract meaning from these images (photographs, video frames, etc.) it is important to interpret the scene, identify important objects, and understand the connections and similarities within large image collections. A simple example of this is what we call Image to Text, where images are converted into a short textual description.

An example of a few scenes with fully automatic captioning applied.

Beyond recognizing scenes, the latest research from Google has produced a network that can guess where a picture was taken. With such tools, scores of features can be extracted from each of the over one million images: instead of the rhetorical thousand words, each picture yields meaningful quantitative features.
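Once each image is reduced to a numeric feature vector, comparing images across the collection becomes simple arithmetic. A minimal sketch with invented 4-dimensional features (in practice the vectors would come from a pretrained network and have hundreds of dimensions), using cosine similarity:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical feature vectors for three images
features = {
    "office_scan_01": [0.9, 0.1, 0.0, 0.2],
    "office_scan_02": [0.8, 0.2, 0.1, 0.3],
    "beach_photo":    [0.0, 0.9, 0.8, 0.1],
}

# The two document scans score far more similar to each other
# than either does to the unrelated photograph.
s_docs = cosine(features["office_scan_01"], features["office_scan_02"])
s_mixed = cosine(features["office_scan_01"], features["beach_photo"])
print(round(s_docs, 3), round(s_mixed, 3))
```

The same comparison, run pairwise over a million feature vectors on a cluster, is what turns a pile of unrelated pictures into a searchable, groupable collection.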

Correlations

The final and simplest analysis approach is to examine and extract correlations within these enormous datasets. This involves comparing thousands of different pieces of information to find connections and anomalies. These sorts of challenges involve the same types of approaches we employed to understand the connectivity of mouse brains at the >50-terabyte scale.
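At its core this means computing a correlation for every pair of feature columns and flagging the strong ones. A minimal sketch with invented per-entity feature columns (the column names are hypothetical, chosen only to illustrate the idea):

```python
from statistics import mean
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two feature columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sqrt(sum((x - mx) ** 2 for x in xs)) * \
          sqrt(sum((y - my) ** 2 for y in ys))
    return cov / var

# Hypothetical feature columns extracted per entity from the documents
n_accounts   = [1, 2, 4, 8, 9]       # shell accounts per entity
n_signatures = [2, 3, 9, 15, 18]     # signature stamps found in its scans
n_pages      = [50, 10, 30, 20, 40]  # total scanned pages (unrelated)

print(round(pearson(n_accounts, n_signatures), 2))  # strongly positive
print(round(pearson(n_accounts, n_pages), 2))       # near zero
```

Scaled up, the same pairwise computation over thousands of extracted features is what surfaces the unexpected connections and anomalies, rather than any one feature on its own.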

4Quant specializes in delivering Big Image Analytics solutions using our analytics platform built on and tightly integrated with Apache Spark and Google TensorFlow.