Data Lakes, Data Ponds, and Data Droplets

Vendors of traditional Data Warehousing systems are afraid of Data Lakes, and they'd like you to be afraid of the challenge of getting data out of them. However, the traditional data warehousing model, in which corporate data is transformed into its final schema for processing as soon as it is collected, is untenable in an age of innovation. If you need to change your data warehouse schema every time you change your applications, either you'll never change your applications or your data warehouse will become obsolete. Yes, we need order to tame the chaos of real-world data, but the best way to get it is to be able to apply order after the fact, to continuously find new uses for the data we already have.

Whether or not you care for the word, Data Lakes owe a lot to the idea of the Hyper-converged architecture. Let's take a moment to compare this with the traditional architecture used in high-end enterprise servers in recent years: Data Warehousing and other high-performance systems have frequently depended on Storage Area Networks (SANs).

With a SAN, we can connect a large number of disks on one end and plug servers in on the other. We can apply RAID protection to the disk array and then carve the array up into multiple volumes, each of which looks like a very high-performing (and possibly very big) disk. If the SAN fabric contains battery-backed RAM, it can report the completion of transactions to the server right away (before a chunk of metal moves to write the data), meaning we get great performance for transactional databases.

The trouble with the SAN architecture, however, is that it does not scale. So long as we depend on a special piece of hardware to pretend that a disk array is just a disk, our performance is limited by that single piece of hardware. If we add more servers and more disks, we run into increasing costs, performance bottlenecks and other problems that eventually bring expansion to a halt.

The hyper-converged alternative does away with the central storage fabric: each server in the cluster carries its own directly attached disks, and the machines that store the data are the same machines that compute on it. The scalability story is as follows: if you want to enlarge the cluster, add more servers. Each additional server brings additional disk space, so the ratio between computing power and both disk capacity and disk bandwidth stays constant. There could still be other bottlenecks in the system, but the storage interconnect is not one of them. The trade-off, relative to the storage area network, is that the storage no longer looks like a single disk, so the content of the disks is not synchronized in time. We can't use the synchronization implied by fsync() to implement ACID transactions, and distributed systems frequently implement eventual consistency precisely because the cost of maintaining perfect synchronization and coherence increases with the size of the cluster.

The real benefits of this hardware organization come when you pair it with the right kind of software: in particular, since each storage volume is directly attached to a computer, each computer can perform the initial parsing step that extracts facts from the documents it holds without communicating with the other cluster members. (This is the "map" phase in the MapReduce programming model.) In most cases the extracted information is considerably smaller than the original documents, so when we gather together (or accumulate) the results, a much smaller amount of information is sent across the network.
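As a toy sketch of that map-and-gather pattern (plain Java, no Hadoop; the documents and the word-count "facts" are stand-ins of our own), consider:

import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.*;

public class MapGatherSketch {
    public static void main(String[] args) {
        // Each string stands in for a document stored on one cluster member.
        List<String> documents = List.of(
                "the quick brown fox", "the lazy dog", "the fox again");

        // "Map": each document is parsed where it lives, with no cross-node traffic.
        // "Gather": only the small extracted facts (here, word counts) are merged.
        Map<String, Long> wordCounts = documents.parallelStream()
                .flatMap(doc -> java.util.Arrays.stream(doc.split(" ")))
                .collect(groupingBy(word -> word, counting()));

        System.out.println(wordCounts);  // e.g. {the=3, fox=2, ...}
    }
}

The point of the sketch is the shape of the data flow: the expensive parsing happens against local data, and only the much smaller extracted facts cross the network to be combined.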

This approach is perfectly suited for Data Lakes, in which we are going for a panoptic view over a large document collection, because of the large computing capability coupled to the storage. Although we can, and should, use a number of mechanisms to avoid duplicated work, hyper-converged systems have the power to scan the complete data set in a reasonable amount of time, which means we can repeat the analysis at any time to reflect changes in either the code or the data.

Thus, to capture the bulk of the market, a data lake toolset needs to address projects of wildly disparate size. That's what scalability means: not just big, but able to function across a wide range of scales. People talk of at least five "V"s of Big Data:

Volume

Velocity

Variety

Veracity

Value

We need to start with the end, Value, in mind, but that value won't materialize if we don't address Veracity. Variety can be a serious problem long before a large Volume of data is accumulated. Velocity refers not just to the speed of data processing, but to the speed of analysis, software and product development -- these are closely related, and they support Veracity by letting you properly test the system and by freeing developers to think about creating Value. Real Semantics addresses all five V's for data sets of all sizes by applying a limited number of methods and strategies to common problems that appear at every scale.

Design for scalability

The illustration above is of a fractal antenna, a technology that has taken off in the last 15 years or so outside of the public eye, as such antennas are often hidden inside a cell phone or inside the sealed radome of a digital television antenna. Typically the size of an antenna (or part of an antenna) is on the scale of the wavelength of the radio waves it interacts with, which is inversely proportional to the frequency. Fractal antennas work at a wide range of frequencies because they contain both large and small structures. This fits the wide frequency range used for television transmissions (54-698 MHz) and the multiple cellular and WiFi bands supported by a modern cellular phone.

In many areas of science there is a characteristic scale for size, time, mass or some similar quantity. For instance, the atomic nucleus is roughly 1 femtometer (10⁻¹⁵ m) in size, and the electron shell of an atom is about 0.1 nanometers (10⁻¹⁰ m). Atoms don't vary much in size: a particularly small atom has a diameter of 0.1 nm and the largest has a diameter of about 0.6 nm. Many real-life quantities vary far more than that because they lack a natural scale. For instance, you can find human settlements as large as Tokyo (33 million people) and as small as Magdalena, NM (926 people). Many quantities that you find in practical data, such as:

The sizes of files in a large document collection

The number of emails written by different authors

The number of times a predicate in an RDF database is used

The number of times a class is used in an RDF database

The value of customer accounts

vary over many orders of magnitude; they cannot be modelled well with a normal distribution and are better modelled by a log-normal or power-law distribution. "Exceptional events" are the new normal, and as you increase the volume and variety of the data you work with, you'll find that these extreme distributions can be a source of bottlenecks that you will need to work around.
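To make that concrete, here is a toy Java sketch (all numbers are illustrative, not measurements) that samples "file sizes" from a log-normal distribution and shows the orders-of-magnitude spread that a normal distribution could never produce:

import java.util.Random;

public class HeavyTailDemo {
    public static void main(String[] args) {
        Random rng = new Random(42);
        double mu = Math.log(50_000);   // median around 50 KB (illustrative)
        double sigma = 2.5;             // heavy spread (illustrative)
        double min = Double.MAX_VALUE, max = 0, total = 0;
        int n = 100_000;
        for (int i = 0; i < n; i++) {
            // A log-normal sample: exponentiate a Gaussian.
            double size = Math.exp(mu + sigma * rng.nextGaussian());
            min = Math.min(min, size);
            max = Math.max(max, size);
            total += size;
        }
        // The largest "file" is typically millions of times bigger than the
        // smallest, and a few giants dominate the mean -- the bottleneck pattern
        // described above.
        System.out.printf("min=%.0f bytes, max=%.0f bytes, mean=%.0f bytes%n",
                min, max, total / n);
    }
}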

Real Semantics has a fractal structure because it is designed to reuse the same data structures at multiple scales. In particular, Real Semantics represents everything as RDF graphs, RDF datasets, or RDF result sets. We can store a small RDF graph in an in-memory Jena Model, a structure that can be used much the way programmers in a language like PHP, Perl or Javascript would use a hashtable. Large RDF graphs can be stored in a number of open-source and commercial SPARQL databases, such as Blazegraph, OpenLink Virtuoso, and AllegroGraph, to name a few. The existence of the RDF and SPARQL standards means you can choose different databases for work in different environments and at different scales. In Real Semantics we speak of "data droplets", which are typically small RDF graphs that represent the facts in a particular document, about a particular topic, or relevant to a particular decision. With a common toolbox that can be applied to data of all sizes, Real Semantics can roll with the punches, moving processing from one stage to another to break through bottlenecks and meet changing requirements without constant re-engineering.
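For instance, here is a minimal sketch (the namespace and property names are our own invention, not part of Real Semantics) of using an in-memory Jena Model in that hashtable-like style:

import org.apache.jena.rdf.model.*;

public class DropletDemo {
    public static void main(String[] args) {
        // An in-memory Jena Model, used much like a hashtable
        // keyed on (resource, property) pairs.
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.com/appliances#";  // hypothetical namespace

        Resource washer = model.createResource(ns + "washer1");
        Property capacity = model.createProperty(ns + "capacity");

        washer.addLiteral(capacity, 4.8);                     // the "put"
        double c = washer.getProperty(capacity).getDouble();  // the "get"
        System.out.println("capacity = " + c);

        model.write(System.out, "TURTLE");  // serialize the small graph
    }
}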

We'll look at one more example of fractal architecture to explain the benefits:

Plants are based on fractal principles because fractal designs scale in both space and time. If you were building a house, you'd have to finish perhaps 60% of the construction before you could live in it, but a plant captures energy from its very first leaves to grow by factors of thousands, even millions. It's simple for the plant, too, because the design is encoded in the rules that control cell division; armed with just a little information about their local environment, cells can make decisions that produce a complex-appearing shape that is sketched out only in the abstract in the plant's DNA.

For Real Semantics, RDF nodes and RDF triples are like atoms and atomic bonds. We put them together to make molecules and eventually droplets of data. The underlying graph model is universal: we can mirror conventional data structures quite directly with nodes and links between nodes. We apply semantics, or meaning, to the graph in a layer built on top of that basic model. That lets us combine operators that work on graphs as "graphs" with relational operators, logic, conventional Java code and the kinds of transformations used in programming language compilers.
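As a sketch of that universality (again a toy of our own, not Real Semantics code), here is how a nested record and an ordered list mirror directly into Jena nodes and links:

import org.apache.jena.rdf.model.*;

public class GraphMirrorDemo {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        String ns = "http://example.com/appliances#";  // hypothetical namespace

        // A nested structure: an anonymous (blank) node plays the role of a struct.
        Resource estimate = m.createResource()  // blank node
                .addProperty(m.createProperty(ns + "hotWaterSource"), "electric");

        // An ordered list maps to an RDF collection, the Turtle ( ... ) syntax.
        RDFList phases = m.createList(new RDFNode[] {
                m.createLiteral("Soak"), m.createLiteral("Wash"),
                m.createLiteral("Rinse"), m.createLiteral("Spin")});

        m.createResource(ns + "washer1")
                .addProperty(m.createProperty(ns + "energyCostEstimate"), estimate)
                .addProperty(m.createProperty(ns + "phases"), phases);

        m.write(System.out, "TURTLE");
    }
}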

What is a data droplet? A data droplet is a graph of nodes and relationships concerning a document, topic, situation or decision. Real Semantics uses RDF graphs and datasets to implement data droplets. The JSON-LD specification represents one view of data droplets, although we prefer the Turtle language when we write droplets by hand:

@prefix : <http://example.com/appliances> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

[ a :WashingMachine, :FrontLoadingWashingMachine ;
  :capacity 4.8 ;
  :supportedVoltages 120, 240 ;
  :phases ( "Soak" "Wash" "Rinse" "Spin" ) ;
  :energyCostEstimate
    [ :source dbpedia:United_States_Environmental_Protection_Agency ;
      :hotWaterSource "electric" ;
      :annualEstimatedCost 16.00 ] ,
    [ :source dbpedia:United_States_Environmental_Protection_Agency ;
      :hotWaterSource "natural gas" ;
      :annualEstimatedCost 14.00 ] ] .

The above is the kind of droplet you can write right off the cuff. It's a lot like a JSON document in that you can nest groups of properties inside the square braces []. Ordered lists are written with round parentheses (), and nested structures can be built into lists. Some more subtle details are in this example too. Note the numeric types: 120 and 240 are integers, while 4.8 and the cost figures like 16.00 are decimals rather than floating point. That distinction is important: if you use floating point to do financial calculations, you will someday cut somebody a check for the wrong amount. COBOL got it right on mainframes in the 60s, but ordinary JSON lacks a decimal data type. Another feature RDF adds over JSON is data types for times and dates. You'll need to formalize something if you want to exchange times and dates, and you might as well use the standards baked into the XML Schema Data Types.

The ordered lists shown here are conceptually the linked lists used in languages like LISP, and in fact you can write LISP S-expressions in Turtle exactly as you would in LISP (except for requiring spaces around the parentheses and a period at the end):

( <fn:numeric-add> 2 2 ) .

RDF: A New Slant is another chapter in this book that describes another take on Data Droplets. Just like the blind men describing an elephant, there are many viewpoints from which you can approach them. In the case of Real Semantics, which is written in Java, the problem of moving data between Java objects and RDF is the most urgent one, for several reasons: (i) many libraries for importing and exporting Java objects already exist, (ii) Real Semantics can most easily use Java language libraries to do things that it wants to do, such as launch cloud servers, read file metadata and initialize objects with Spring, and (iii) developers can extend Real Semantics by writing Java code.

The Java-to-RDF mapping in Real Semantics is a bit deeper than the data mappings you see in other frameworks such as object-relational mappers. It is best compared to the meta-object facility promoted by the OMG in that it addresses three basic requirements:

getting a property from an object

setting a property on an object

calling a method on an object

The first two of these are common in tools such as rdfBean, JAXB, Hibernate and Jackson, as well as the many imitators of those frameworks you will see in other languages. The third one is a bit less common, but it unlocks doors that the others do not. Often it looks like we are calling a function (static method) instead of an actual instance method, but it lets us write configuration and code in the RDF world that use the billions of classes in the Java world to specify what needs to be done.
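As a rough illustration of those three operations (the class and method names here are our own invention, not the Real Semantics API), plain Java reflection already covers all three:

import java.lang.reflect.Method;

public class MetaObjectDemo {
    // A hypothetical bean standing in for an object described by a droplet.
    public static class WashingMachine {
        private double capacity;
        public double getCapacity() { return capacity; }
        public void setCapacity(double capacity) { this.capacity = capacity; }
        public String startPhase(String phase) { return "running: " + phase; }
    }

    public static void main(String[] args) throws Exception {
        Object bean = new WashingMachine();
        Class<?> cls = bean.getClass();

        // 1. setting a property on an object
        cls.getMethod("setCapacity", double.class).invoke(bean, 4.8);

        // 2. getting a property from an object
        Object capacity = cls.getMethod("getCapacity").invoke(bean);
        System.out.println("capacity = " + capacity);

        // 3. calling a method on an object
        Object result = cls.getMethod("startPhase", String.class).invoke(bean, "Soak");
        System.out.println(result);
    }
}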
If you were developing a framework like Real Semantics in some other language, say Python, Ruby or Go, you'd make different decisions, and you would center them on the realities of your language. For instance, we make the most of static typing in Java: we see Java class definitions as an ontology that we can access through the Java reflection API. Together with rules that describe the Java Bean conventions and other common patterns, in most cases we can build a Java-to-RDF mapping without human input (a sketch of what that automatic discovery might look like appears at the end of this section). We use this to describe not only data in the conventional sense, but also the configuration that makes Real Semantics go.

JSON-LD defines a relation between data droplets and JSON. Relational tables (and the equivalent CSV files), as well as XML data, also translate directly to data droplets. Real Semantics avoids the "impedance mismatch" that is so painful in object-relational systems because a 1-1 mapping exists between an RDF graph and traditional data structures. Any reconciliation between other data models can be done in the RDF world, where we have the widest range of strategies on call to address semantic gaps. We will talk more about the relationship with relational data and XML in another chapter, but for a moment let's take the idea of data droplets and graph models to its logical conclusion.

We've got to be careful in what we say here, because the biggest marketing problem we have with Real Semantics is that more than 40 years of hype in the Artificial Intelligence, and now Machine Learning, space has made people deeply skeptical. Yet the reality is that RDF can represent the kinds of tree and graph structures that are useful in capturing the structure and content of natural language to a fine degree of granularity. We do not claim, like Cyc, to have a fleshed-out vocabulary for representing 100% of the knowledge in natural language documents, nor do we claim to have an automated system that maps natural language text into that representation. What we do claim is that there is a toolbox of modelling methods, such as the Situation calculus, that can be used together to capture critical knowledge from high-value documents, particularly specifications and standards. At this point in time the construction of such a knowledge base can be at most partially automated -- this functionality would need to be built out based on the needs of a particular application, but the data structure choices in Real Semantics do not pose any obstacle to such development.
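Here is the sketch promised above: a minimal, hypothetical illustration (the bean, names and URIs are ours, not the Real Semantics implementation) of how the Java Bean conventions let a mapper discover properties with java.beans.Introspector and emit triples without human input:

import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;

public class BeanToRdfSketch {
    public static class WashingMachine {
        public double getCapacity() { return 4.8; }
        public String getHotWaterSource() { return "electric"; }
    }

    public static void main(String[] args) throws Exception {
        Object bean = new WashingMachine();
        String subject = "<http://example.com/appliances#washer1>";  // hypothetical URI
        String ns = "http://example.com/appliances#";

        // The Java Bean conventions act as a small ontology: every readable
        // property the Introspector discovers becomes a predicate.
        BeanInfo info = Introspector.getBeanInfo(bean.getClass(), Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            Method read = pd.getReadMethod();
            if (read == null) continue;
            Object value = read.invoke(bean);
            System.out.printf("%s <%s%s> \"%s\" .%n", subject, ns, pd.getName(), value);
        }
    }
}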