Ontology2: the Real Semantics Book

Table of Contents | Ontology2 home page

RDF: a new slant

Too Many Data Formats

All the time, we need to feed a computer program a few facts, and traditionally that involves working with property files, XML, JSON, YAML, CSV, etc. Once a system gets complex and anywhere near "enterprise" scale, having many different file formats for configuration, reference data, and for specifying the work done by the system becomes a source of stress for its operators.

Once the job is specified, the program needs to consume and produce data which may be provided through various formats and APIs. The need to translate data from one format to another again and again gets in the way of using a wide range of powerful tools, such as logical inference, machine learning, and hybrid systems that seek and find the reality behind data in the real world.

Real Semantics talks all kinds of data formats, but it sees them all using a single data model called RDF/K, an extension of the RDF schema language. Let's suppose you want to state a few facts about the world -- we suggest you write a Turtle file. Turtle is unbeatable for writing facts by hand, and thinking through Turtle will help you make a mental model of what RDF/K is and what it can do.

Capturing a business record

While developing our Legal Entity Identifier site, we captured a data record from an XML file. The system wrote this data as a Turtle file, and we included this file in the unit tests so we can be certain this record is correctly processed no matter what changes are made to the code.

Using a single data model (RDF/K) means we need just a single mechanism to capture what facts the system believes at intermediate stages of inference, calculation, or decision. This extreme traceability, available when required, costless when not, is one of many unique features of Real Semantics that are necessary for compliance with the tough BCBS 239 standard in both normal times and crisis. (Who needs the stress of cleaning data in a crisis?)

This example is a simple business entity record from the Legal Entity Identifier system:

@prefix lei: <http://rdf.legalentityidentifer.info/vocab/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] a lei:LegalEntity , lei:ConformantIdentifier ;
   lei:BusinessRegistryIdentifier "" ;
   lei:BusinessRegistryName "N/A" ;
   lei:EntityLegalForm "OTHER" ;
   lei:EntityStatusCode "ACTIVE" ;
   lei:HeadquarterAddress1 "C/O C T Corporation System" ;
   lei:HeadquarterAddress2 "155 Federal Street" ;
   lei:HeadquarterAddress3 "Suite 700" ;
   lei:HeadquarterCity "Boston" ;
   lei:HeadquarterCountryCode "US" ;
   lei:HeadquarterPostalCode "02110" ;
   lei:HeadquarterRegion "US-MA" ;
   lei:LEIAssignmentDate "2013-05-24T09:30:20.883Z"^^xsd:dateTime ;
   lei:LEINextRenewalDate "2016-05-04T09:01:27.494Z"^^xsd:dateTime ;
   lei:LEIRecordLastUpdate "2015-05-07T01:52:22.058Z"^^xsd:dateTime ;
   lei:LEIStatusCode "ISSUED" ;
   lei:LOUID "EVK05KS7XY1DEII3R011" ;
   lei:LegalEntityIdentifier "549300I00FSB0O13VI67" ;
   lei:LegalJurisdiction "US" ;
   lei:RegisteredAddress1 "C/O C T Corporation System" ;
   lei:RegisteredAddress2 "155 Federal Street" ;
   lei:RegisteredAddress3 "Suite 700" ;
   lei:RegisteredCity "Boston" ;
   lei:RegisteredCountryCode "US" ;
   lei:RegisteredName "BlackRock Funds - BlackRock Short Obligations Fund" ;
   lei:RegisteredPostalCode "02110" ;
   lei:RegisteredRegion "US-MA" ;
   lei:ValidationSources "FULLY_CORROBORATED" .

If you squint while you look at it, you might see things in common with many popular data formats. You should, because Turtle is closely related to many common data formats.

Spreadsheets and Relational Databases: Like a relational database, TSV, or spreadsheet, this record holds a set of data fields that can be seen as columns. In this record, all of the properties except one (the a property) are used just once; once we've decided which table this row comes from, the conversion between RDF and relational is rote and mechanical. With the SPARQL query language, you can write SQL-like queries against data that is relational in structure.

JSON: Beyond the relational model, JSON can represent situations where there is more than one data value associated with a property and also where objects are nested inside other objects. In relational systems, conceptual entities, say, a customer, are represented as a number of relational tables such as customer, linked to multiple instances of order, in turn linked to multiple instances of line_item. It's a pain to write such joins manually, compared to working with nested JSON-style structures.

XML: In the big picture, JSON and XML documents are structurally similar. The major difference is that XML is more formalized in a number of ways. The good news is that RDF inherits, builds on, and improves upon many of the features of XML, including:

Namespaces: Some kind of namespace support is essential if we wish to combine facts, information, text and code from a wide range of sources, because otherwise it would be impossible to keep separate all of the different names used by different organizations for different things. With namespaces, we can make tools that manipulate data, driven by predicates and facts in a certain namespace, while remaining oblivious to facts in other namespaces. (In our case, namespaces are specified with the @prefix construction above.)

Literal data types: The XML Schema specification defines a rich set of data types, including the string and numeric data types used in JSON as well as date and time types that are missing from JSON. Most importantly, XML and RDF support the xsd:decimal type, which is suitable for monetary quantities, something that the floating point numbers in JSON are not. You can create your own primitive types, in your own namespace, in case you need something extra.

Object-oriented: The mapping between object-oriented languages and relational databases has been called "The Vietnam of Computer Science". The mapping between RDF and idiomatic Java works much better; this translation is made possible by the flexibility of RDF, and realized by RDF/K, a schema language similar to RDFS and OWL, but one that closes the gap with data representations frequently used in practice. In this chapter, we will demonstrate the connection between RDF data and Java objects, because this is essential to how Real Semantics is built.

Predicate Logic: The [] operator defines a blank node, which asserts that "there exists some ?x such that the following is true", also known as the existential quantifier. Easy access to blank nodes is core to the Turtle language, meaning you can fully build the structures you'd build in JSON, LISP and other languages. Logic also allows inference; at the very top of the record above you see that the record is a lei:ConformantIdentifier, which was inferred because logical rules established that the format of the identifier was correct.

Nested and Ordered Structures in RDF and Java

Here is another example:

@prefix : <http://example.com/appliances> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

[ a :WashingMachine , :FrontLoadingWashingMachine ;
  :capacity 4.8 ;
  :supportedVoltages 120, 240 ;
  :phases ( "Soak" "Wash" "Rinse" "Spin" ) ;
  :energyCostEstimate
    [ :source dbpedia:United_States_Environmental_Protection_Agency ;
      :hotWaterSource "electric" ;
      :annualEstimatedCost 16.00 ] ,
    [ :source dbpedia:United_States_Environmental_Protection_Agency ;
      :hotWaterSource "natural gas" ;
      :annualEstimatedCost 14.00 ] ] .

Note that the Turtle parser is not aware of schemas, vocabularies, and so forth. It takes the code you enter and turns it into a graph, without validating that you are using the correct facts in the right way. That's fine, because a schema for RDF/K, a K-Schema, expresses the allowable vocabulary and structures and can be used to validate data and/or groom it to a standard. For now, I'm not using a schema; I'm just coining names in the <http://example.com/> namespace because that's simple.

If you wanted to express these data in the Java programming language, you'd probably imagine a class that looks something like:

public class WashingMachine {
    Float capacity;
    Set<Integer> supportedVoltages;
    List<String> phases;
    Set<EnergyCostEstimate> energyCostEstimate;
}

It takes a tiny amount of metadata to connect that Java class with the RDF written above. One way to do it is to package the metadata with the class in the form of a few Java Annotations:

@Prefix(name = "", uri = "http://example.com/appliances")
public class WashingMachine {
    @Property Float capacity;
    @Property Set<Integer> supportedVoltages;
    @Property List<String> phases;
    @Property Set<EnergyCostEstimate> energyCostEstimate;
}

The only metadata in this case is (i) the default prefix to map Java names to, and (ii) an assertion that we want to transfer data between a field and RDF. With that in place, we can convert RDF statements above into Java data with the greatest of ease:

@Test
public void parseWash() throws InvocationTargetException, IllegalAccessException {
    WashingMachine m = new WashingMachine();
    Resource that = getOnly("http://example.com/appliances/WashingMachine");
    configurator.configure(m, that);
    assertEquals("Soak", m.phases.get(0));
    assertEquals("Wash", m.phases.get(1));
    assertEquals("Rinse", m.phases.get(2));
    assertEquals("Spin", m.phases.get(3));
}

In this test case, we find the only record of type :WashingMachine, convert it to a Java object (by creating the WashingMachine() instance and then applying configurator.configure), and then check that we got the phases of the cycle in the correct order.
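The configurator above belongs to Real Semantics itself, but the underlying trick is ordinary Java reflection. Below is a minimal, self-contained sketch of the idea; the @Property annotation and configure method are stand-ins written for this example (not the actual Real Semantics API), and a plain Map plays the role of the RDF resource:

```java
import java.lang.annotation.*;
import java.lang.reflect.Field;
import java.util.*;

public class MiniConfigurator {
    // stand-in for the Real Semantics @Property annotation
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Property {}

    public static class WashingMachine {
        @Property public Float capacity;
        @Property public List<String> phases;
    }

    // copy values from a property map (standing in for an RDF resource)
    // into every field marked with @Property, using reflection
    public static void configure(Object target, Map<String, Object> resource) {
        for (Field f : target.getClass().getFields()) {
            if (f.isAnnotationPresent(Property.class) && resource.containsKey(f.getName())) {
                try {
                    f.set(target, resource.get(f.getName()));
                } catch (IllegalAccessException e) {
                    throw new RuntimeException(e);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> that = Map.of(
            "capacity", 4.8f,
            "phases", List.of("Soak", "Wash", "Rinse", "Spin"));
        WashingMachine m = new WashingMachine();
        configure(m, that);
        System.out.println(m.phases.get(0)); // Soak
    }
}
```

The real system resolves names against the declared @Prefix namespace and handles type coercion, but the shape of the mechanism is the same.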

This method of annotation is useful for getting data in and out of Java classes that are written with Real Semantics in mind. The mapping is similar to JSON-LD, but it works better because List and Set are used widely and exposed through the static typing of the language, in contrast to JSON-LD, which adds new concepts for @list and @set. A strong advantage of doing it this way is that all of the parts are in one place, so there is no risk of updating the class without updating the metadata.

Real Semantics can also work with Java objects that are not aware of Real Semantics; in that case, it looks at the Java metadata and processes it with rules that recognize common idioms such as Java Beans, plus sometimes a little bit of additional metadata you supply. Like the Spring framework, Real Semantics can create arbitrary objects configured with data from RDF. RDF reasoning systems, such as the Jena rules language, can do the kind of reasoning that Spring does, but can also reason about that data in different ways to understand, validate or visualize the construction of a system. RDF also has the powerful SPARQL query language, which lets you immediately apply analytics to anything.

Viewing Java data in RDF

Here is an example of getting data into Real Semantics from classes that were written before Real Semantics. Like most large Java programs, Real Semantics is compiled by the open-source Apache Maven system, which expresses and documents the physical structure of the program. The documentation generator that builds this book repurposes this information; the fast track to it is to use the parser built into Maven to convert POM files into MavenProject objects inside Java.

The POM file is an XML document, and Real Semantics could ingest it directly as a tree; however, Maven interpolates parameters and implements specific forms of inheritance and inference that let us work with, not the surface structure, but the deep structure, the Effective POM, that actually controls Maven.

The documentation generator that built this book reads a map of the maven modules that comprise Real Semantics. This process has the following steps:

1. Real Semantics scans the Java introspection metadata for the MavenProject class and converts it into an RDF graph, which gives us a complete and lossless model of the classes, fields, and methods built into that class and selected classes it depends on.

2. Real Semantics applies one of several heuristic ruleboxes to automatically generate mappings from Java to RDF. In this case we use one that recognizes the "Java Beans" standard. If an existing rulebox is not satisfactory, rules can be overridden, or a few mappings can be added manually with a fact patch.

3. Real Semantics generates a set of stub classes and/or objects that implement the transformation.
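The first step leans on facilities that ship with the JDK. As a rough, self-contained illustration (not the actual Real Semantics code, and with a made-up Artifact class standing in for MavenProject), java.beans introspection can read the properties of any Bean-style class, which is the raw material for a Java-to-RDF mapping:

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.*;

public class BeanToTriples {
    // a toy Bean standing in for org.apache.maven.project.MavenProject
    public static class Artifact {
        public String getArtifactId() { return "javagate"; }
        public String getPackaging() { return "jar"; }
    }

    // read every Bean property of the object via introspection;
    // each (name, value) pair is the makings of one RDF triple
    public static Map<String, Object> describe(Object bean) {
        Map<String, Object> props = new TreeMap<>();
        try {
            PropertyDescriptor[] pds =
                Introspector.getBeanInfo(bean.getClass(), Object.class).getPropertyDescriptors();
            for (PropertyDescriptor pd : pds) {
                props.put(pd.getName(), pd.getReadMethod().invoke(bean));
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(describe(new Artifact()));
    }
}
```

In the real pipeline these property names would be qualified with a namespace (unq: in the output below) and the values converted to RDF nodes.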

Because of the intelligence under the hood, you can read the MavenProject object without thinking at all.

Stub<MavenProject> stub = new CreateStub().create(MavenProject.class);
MavenProject project = MavenProjectFactory.getMavenProject(pomfile);
Model that = stub.toModel(project);

The RDF output you get looks pretty natural:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix unq: <http://rdf.ontology2.com/unqualified/> .

[ unq:artifactId "javagate" ;
  unq:description """It will be necessary to import and export data between RDF and objects that were not developed with Real Semantics in mind. This can be done in a remarkably transparent way, given that there are conventions, such as Java Beans, that the system can take advantage of. Javagate converts class metadata from Java into an RDF graph, which can be queried and transformed with rules. From this, it creates a Stub object which does the Java to RDF transformation."""^^xsd:string ;
  unq:groupId "com.ontology2.rdf" ;
  unq:id "com.ontology2.rdf:javagate:jar:1.1-SNAPSHOT" ;
  unq:modelVersion "4.0.0" ;
  unq:version "1.1-SNAPSHOT" ;
  unq:name "javagate" ;
  unq:packaging "jar" ;
  unq:modules () ;
  unq:runtimeClasspathElements () ;
  ... many more facts ...
] .

What does that buy us?

Just above, we used a library in Java (our host language) to extract information from a POM and then imported that data into an RDF graph. Our project consists of a number of POM files, so we import them all into one RDF graph. What good is that?

For one thing, we can write queries in the SPARQL language, which is closely related to SQL. The POM files together form a map of the Real Semantics application that is built into this book. Data in hand, we can write the following SPARQL query:

prefix unq: <http://rdf.ontology2.com/unqualified/>

select ?id ?name ?description {
    ?project unq:id ?id .
    ?project unq:name ?name .
    OPTIONAL { ?project unq:description ?description . }
} ORDER BY ?id

This produces a SPARQL result set, much like a SQL result set, which we use to draw a map of Real Semantics, module by module. The full map is displayed on this page, and we'll show you a little sample here:

docminister: Generator and weaver of reports and documentation. Captures documentation about the input data, specifications and software. Takes advantage of mechanisms that already exist to express metadata and documentation. This applies both in the area of documenting software like Real Semantics itself, and in creating a bundle of reports, tied with a bow, that explain some range of natural or social phenomena.

henson: Henson is the Real Semantics component that configures, creates and snapshots virtual machines with software and data in a cloud environment. Ensures that Real Semantics can get any computing resources it needs to create decision-making data products.

java-annotations: Sometimes Real Semantics needs to read metadata about software components it uses. For instance, the mogrifier maintains a catalog of components exposed to end users. If code is written post-Real Semantics, it is practical to stick a few Java annotations on the new Java code to express class-level metadata. This keeps the metadata bundled with the code, which keeps the metadata in sync. This module also defines a few annotations for defining namespace prefixes, which are re-used in the rdfconfig-annotation package, which injects class data into RDF.

This is the simplest possible example, but it illustrates that once you get data in RDF format, you can (i) write queries against it and (ii) put multiple objects of various kinds in a single graph and write queries against that. Without a universal data model, you are stuck writing queries in different languages such as SQL and XQuery, if you are lucky enough to have a query language for a particular data format at all. With RDF, SPARQL, and Real Semantics, you can write queries against any kind of data.

Ordered Lists in RDF

Two kinds of collections of things are commonly used in computer programs: the List and the Set. The items of a list are in a definite order, like the authors of a book; other collections, such as the booksellers who sell the book, have no particular order. Usually, duplicates are eliminated from an unordered collection, which makes it a set.

Set properties have always been used extensively in RDF. In particular, you can quite easily make multiple statements that "some ?subject has ?property with value ?object":

@prefix : <http://example.com/> .

:Pool a :GameFamily .
:Pool :hasVariant :EightBall .
:Pool :hasVariant :NineBall .
:Pool :hasVariant :Straight .

and thus define a set of variants of the game of Pool. The graph above has four independent facts written out one at a time. Turtle has a convenient shorthand which is just

@prefix : <http://example.com/> .

:Pool a :GameFamily ;
    :hasVariant :EightBall , :NineBall , :Straight .

The semicolon lets you state another property of the same subject, while the comma lets you specify multiple objects that share the same subject and property. This collection has Set semantics because a fact can only appear in the graph once.
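The Set semantics of an RDF graph are easy to picture if you model a graph as a Java Set of triples. This is a toy sketch for illustration, not how any real triple store is implemented:

```java
import java.util.*;

public class TripleSetDemo {
    // a triple is just an (subject, property, object) tuple;
    // the record gives us value-based equals/hashCode for free
    public record Triple(String s, String p, String o) {}

    public static Set<Triple> poolGraph() {
        Set<Triple> graph = new HashSet<>();
        graph.add(new Triple(":Pool", "rdf:type", ":GameFamily"));
        graph.add(new Triple(":Pool", ":hasVariant", ":EightBall"));
        graph.add(new Triple(":Pool", ":hasVariant", ":NineBall"));
        graph.add(new Triple(":Pool", ":hasVariant", ":Straight"));
        // stating the same fact a second time changes nothing: Set semantics
        graph.add(new Triple(":Pool", ":hasVariant", ":EightBall"));
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(poolGraph().size()); // 4
    }
}
```

Four facts go in, and the duplicate assertion is silently absorbed, exactly as it would be in an RDF graph.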

Although ordered lists have been a part of the RDF standards from the very beginning, they have been a bit out of fashion in the age of "Linked Data", which involves large and complex datasets such as DBpedia. Ordered lists are missing from common SQL implementations, so, practically, generations of analysts have learned to work around the lack about 90% of the time; yet the special cases that require ordering hold back general solutions based on legacy technology. Real Semantics, through RDF/K and other features, makes ordered lists easy to work with.

You can write ordered Lists in Turtle exactly the same way you would in LISP.

@prefix : <http://example.com/> .

(:A :B :C)

This list exists, in itself, apart from any statements that involve it. Really, (:A :B :C) is a name for a blank node such that

@prefix : <http://example.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

(:A :B :C) rdf:first :A ; rdf:rest (:B :C) .
(:B :C) rdf:first :B ; rdf:rest (:C) .
(:C) rdf:first :C ; rdf:rest () .

we picture that graph here:

Note also that () is rdf:nil. This particular representation is called a "linked list", and it is quite similar to the LinkedList class in Java. Let's fill out a few of the things you can do with this notation.
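In Java terms, the rdf:first / rdf:rest chain is exactly a cons-cell linked list. The following sketch (with names invented for this example) builds such a chain and flattens it into a java.util.List, the way a reader of the Turtle above would:

```java
import java.util.*;

public class RdfListDemo {
    // a single sentinel object standing in for rdf:nil
    public static final Object NIL = new Object();

    // one cons cell: first() plays rdf:first, rest() plays rdf:rest
    public record Node(Object first, Object rest) {}

    // walk the chain until we hit rdf:nil, collecting values in order
    public static List<Object> toJavaList(Object node) {
        List<Object> out = new ArrayList<>();
        while (node != NIL) {
            Node n = (Node) node;
            out.add(n.first());
            node = n.rest();
        }
        return out;
    }

    public static void main(String[] args) {
        // the Turtle list (:A :B :C) as three chained cells
        Object abc = new Node(":A", new Node(":B", new Node(":C", NIL)));
        System.out.println(toJavaList(abc)); // [:A, :B, :C]
    }
}
```

The order survives because each cell points at the next; that is all the RDF list vocabulary does.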

You can state a fact about a list by using a list on the left side of the predicate, like so:

@prefix : <http://example.com/> .

("foo" 75 :something) a :RandomList .

At this point you might be asking, "What am I allowed to put into the list?" and the answer is "anything", at least any kind of RDF Node:

@prefix : <http://example.com/> .

( :hello [ :a :Person ; :named "John" ] (:goodbye 3 4) )

Here we see members that are URI Resources, such as :hello and :goodbye but in the middle there are some facts about a blank node and the end is another list. This means you can build the same kind of structures you would in JSON, and even write LISP code in Turtle!

( <fn:numeric-add> 2 2 ) .

given, of course, an implementation of the eval function that works on the list. Practically, Real Semantics tries as much as it can to hide the mechanics from you when it is moving data between RDF and some other format, such as Java.
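To make that idea concrete, here is a toy eval over nested Java lists in which the head of each list names the function to apply. Only fn:numeric-add over integers is implemented, and this is an invented sketch rather than anything Real Semantics ships:

```java
import java.util.*;

public class MiniEval {
    public static Object eval(Object form) {
        // anything that is not a list is a literal and evaluates to itself
        if (!(form instanceof List<?> list)) return form;
        // the head of the list names the function, LISP-style
        String op = (String) list.get(0);
        if (!op.equals("fn:numeric-add"))
            throw new IllegalArgumentException("unknown function: " + op);
        int sum = 0;
        // evaluate each argument recursively, so lists can nest
        for (Object arg : list.subList(1, list.size()))
            sum += (Integer) eval(arg);
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(eval(List.of("fn:numeric-add", 2, 2))); // 4
    }
}
```

Because arguments are themselves evaluated, a nested form such as (fn:numeric-add 2 (fn:numeric-add 1 1)) works the same way.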

Representing multiple data values in Java

From the viewpoint of Real Semantics, there are three kinds of Java type:

Primitive: this could be a String, a Java primitive wrapper (Integer, Boolean) or a type that looks like a primitive, such as an OffsetDateTime; it would typically be represented by an RDF literal of some kind.

Composite: this is a reference to another type which is known to the Real Semantics system; typically an instance of such a type would be described as a set of RDF properties centering around either a URI or a blank node resource.

Collection: certain generic types, such as Set<T> and List<T>, can be automatically mapped to appropriate RDF structures.

RDF, at a raw level, allows you to use a property any number of times. You are certainly welcome to apply a property exactly once to a subject, like this

@prefix : <http://example.com/> .

:Orange :red 200 ;
    :green 200 ;
    :blue 0 .

and if you want to assign that to a Java class that looks like

class Color {
    Integer red;
    Integer green;
    Integer blue;
}

you are in pretty good shape. The Turtle language lets you write:

@prefix : <http://example.com/> .

:Colour_out_of_Space :red 100 , 200 ;
    :green (200 200 200) ;
    :blue 0 .

in which case there are two values for the red property (without a specific order) and three values for the green property (ordered as a list of three elements). Either way, you can't stick multiple values into an Integer field, so Real Semantics gives an error message if you try to make a Color from this data. This is a behavior that RDF/K adds to the RDF standard.

The handling of Set<T> and List<T> is straightforward. By default, Real Semantics is permissive about what you can do. That's good, because we find people are often sloppy in choosing Lists vs. Sets (sometimes they use one where the other would do). If we had a class like

class PolyColor {
    Set<Integer> red;
    Set<Integer> green;
    Set<Integer> blue;
    Set<Integer> alpha;
}

Real Semantics can see the schema implied by the types and makes the natural transformation: red is the set [100,200] because order doesn't matter in a set; green is the set [200] because members of a set are unique; blue is the set [0] (we promote a single element to a set, or a list, that contains just that element); and alpha is the empty set [].
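The promotion rules just described are easy to state in code. This sketch covers only the Set<Integer> case, with an invented toSet helper standing in for what the real system does; raw RDF hands us zero or more values per property, in no particular order:

```java
import java.util.*;

public class PromotionDemo {
    // collapse the raw values for one property into Set semantics:
    // duplicates merge, a single value becomes a one-element set,
    // and an absent property becomes the empty set
    public static Set<Integer> toSet(List<Integer> rawValues) {
        return new LinkedHashSet<>(rawValues);
    }

    public static void main(String[] args) {
        System.out.println(toSet(List.of(100, 200)));      // both values kept
        System.out.println(toSet(List.of(200, 200, 200))); // duplicates collapse
        System.out.println(toSet(List.of(0)));             // single value promoted
        System.out.println(toSet(List.of()));              // absent property -> empty set
    }
}
```

A List-typed field would keep the raw values in their stated order instead of deduplicating them.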

If you assign to a List, something similar happens:

class OrderedColor {
    List<Integer> red;
    List<Integer> green;
    List<Integer> blue;
    List<Integer> alpha;
}