A Relational View of the Semantic Web

March 14, 2007

Andrew Newman

As people are increasingly coming to believe, Web 2.0 and the Semantic Web have a lot in common: both are concerned with allowing communities to share and reuse data. In this way, the Semantic Web and Web 2.0 can both be seen as attempts at providing data integration and presenting a web of data or information space. As Tim Berners-Lee wrote in Weaving the Web [1]:

If HTML and the Web made all the online documents look like one huge book, RDF, schema and inference languages will make all the data in the world look like one huge database.

RDF is at the core of W3C's Semantic Web architectural layers. It is the standard specifically designed to provide a way to produce and consume data on the Web. It sits on top of standards such as XML, URIs, and Unicode and is used as a basis for schemas and ontologies. It consists of a set of statements that are composed of a subject, predicate, and object that form propositions of fact [7].

How are queries performed on this "one huge database"? Until recently, manipulating or retrieving RDF data has been done through vendor-specific query languages or imperatively through APIs in languages such as Java, PHP, and Ruby. The W3C's proposed standard, SPARQL, is set to provide a declarative language to query and manipulate Semantic Web data [8].

SPARQL consists of operations that are reasonably similar to those found in existing and mature technologies such as SQL or relational algebra, including join, union, left outer join (SPARQL's OPTIONAL), and comparison operators (SPARQL's FILTER) such as equal to, less than, and greater than [8].

The current suite of existing technologies, such as SQL and the relational model, was devised without the specific requirements of disparate, uncontrolled, large-scale integration. It is unclear whether they are flexible enough to adapt to this new set of requirements in order to enable this idea of a global database.

Advantages of Loose Structure

Before attempting to define SPARQL and RDF in relational terms, it's useful to explore some of the reasons why you would store data in this manner.

One of the difficulties in creating this shared information space is to agree on a schema for the data. Traditional databases require an agreement on a schema, which must be made before data can be stored and queried. One of the great strengths of the RDF model is that it allows data to be stored and queried without first requiring a schema. This decoupling of schema and data also allows the schema to change independently of the data without requiring any existing data to be thrown away or padded with NULLs. It also allows a schema to be automatically generated by looking at relationships between imported instance data.

RDF also allows database design and management to be much more agile, similar to agile software development, where a schema can be designed incrementally, after the data has been collected, and can evolve over time as new requirements are encountered. It allows data that is structured slightly differently to be stored together in the lowest common denominator of an RDF statement (subject, predicate, and object). It eliminates the need to weigh good design against performance in order to store data that might be slightly different in structure. For example, it allows suppliers without cities and names to be stored alongside suppliers with that information.

This lack of padding (not needing NULLs) removes one of the most debated topics in SQL and the relational model’s use of it (see "Much Ado About Nothing" [5]). The argument has generally revolved around the possibly confusing uses of NULLs and what a NULL value actually means. This becomes especially important when one of the main tasks of the Semantic Web is to integrate data from many different sources. A NULL value can mean different things from different data sources and may have been produced as a result of different types of queries from different database implementations. This lack of context, which is often lost in traditional databases too, means it becomes prohibitively costly and difficult to retain the specific meaning of NULL values from the wide variety of sources available on the Semantic Web.

Removing the use of NULLs also has a positive impact when you consider the inconsistent handling that occurs across various SQL database implementations. It can also simplify aggregate functions where a NULL value is considered when counting rows but not when performing other operations such as averaging values.
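The difference between counting rows and averaging values can be seen in a small sketch using Python's built-in sqlite3 module. The supplier table and its three rows are illustrative assumptions (S3 is given a NULL status), and the behavior shown is SQLite's; other implementations may differ in other details of NULL handling.

```python
import sqlite3

# In-memory database with a hypothetical supplier table; S3 has no status.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier (sno TEXT, sname TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO supplier VALUES (?, ?, ?)",
    [("S1", "Smith", 20), ("S2", "Jones", 10), ("S3", "Blake", None)],
)

# COUNT(*) counts every row; COUNT(status) and AVG(status) skip the NULL.
row_count, status_count, status_avg = conn.execute(
    "SELECT COUNT(*), COUNT(status), AVG(status) FROM supplier"
).fetchone()

print(row_count, status_count, status_avg)  # 3 2 15.0
conn.close()
```

The same column therefore gives three rows when counted with COUNT(*) but only two values when averaged, which is the kind of mismatch that disappears when unbound values are simply absent.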

RDF Using the Relational Model

An RDF statement or proposition seems fairly abstract but it is actually familiar to most developers in the form of database management systems (DBMS) and the most popular relational language SQL. These databases provide a way to represent statements of facts or propositions and to ask questions (queries) as to whether a given proposition is true or not.

For the purposes of storing propositions and answering queries, it's possible to represent RDF in an SQL or relational database and vice versa. The advantage of storing RDF using these existing models is that previous work, such as the formalization of query operations and query optimization, can be applied to SPARQL.

It should be made clear that the work conducted here does not concern itself with specific ways of storing RDF but merely using previous models as examples of what can be applied to RDF. There are many different approaches to creating efficient RDF stores including more efficient table structures, manipulating RDF data so that it can be stored more efficiently, and creating databases (not based on SQL) specifically designed to efficiently store RDF (which is narrow, regular, and requires many joins).

In order to describe a relational model of RDF, a familiar example is used throughout: the supplier and parts tables as used by C. J. Date [4]. Table 1 shows a typical set of data from the supplier table. It consists of a table heading, where the columns (or attributes) consist of a name and type, and a body that consists of rows with values for each of these columns. The first row of the body in Table 1 is a proposition that represents, "A supplier 'S1', has a name called 'Smith', a status of '20' and a city of 'London'".

SNO   SNAME     STATUS    CITY
sno   name      integer   char
---   -------   -------   --------
S1    "Smith"   20        "London"
S2    "Jones"   10        "Paris"
S3    "Blake"   30        "Paris"

Table 1. Example of a Supplier Table

Figure 1 shows the mapping of this data (containing the same propositions) represented as an RDF graph. This representation takes the table headings (columns) as arrows to connect the values and their data types to an identifier ("_1", "_2", "_3"). These are RDF identifiers, called blank nodes, which are a placeholder for the other properties and values to be associated to one another, similar to a table row. The blank nodes represent the existence of a supplier but do not describe any properties of the supplier.

Figure 1. Example of a Supplier Graph

An alternative mapping could take a primary key (SNO is a likely candidate) as being the center of all the values. However, this limits the possibility of representing suppliers without a known supplier number or duplicate rows (ones that typically occur in SQL tables that don't have uniqueness constraints applied). Representing a supplier without a required attribute may not initially seem sensible to those used to creating data models in closed environments. However, on the Web or any large distributed system, agreement on what is a required attribute may not be reachable ahead of time, or an authority required to create unique identifiers may not be reachable at the time the data is stored. Similarly, detecting duplicates is something that may have to occur after the data is recorded.

This choice between a blank node or unique identifier is similar to the surrogate vs. natural key in relational databases. The difference is that blank nodes cannot be searched on by value in the same way a numeric surrogate key can. The advantage is that blank nodes can be created locally and distributed globally without requiring an authority to generate them.

An RDF graph is a lot less structured than the given typical relational table, but it still has a fixed structure of the RDF statement (subject, predicate, and object). Because this structure is fixed, it's therefore possible to represent it relationally. This is given in Table 2 using the data represented in Table 1 and Figure 1.

s1        p1          o1
subject   predicate   object
-------   ---------   -----------------
_1        #sno        "S1"^^#sno
_1        #sname      "Smith"^^#name
_1        #status     "20"^^#integer
_1        #city       "London"^^#char
_2        #sno        "S2"^^#sno
_2        #sname      "Jones"^^#name
_2        #status     "10"^^#integer
_2        #city       "Paris"^^#char
_3        #sno        "S3"^^#sno
_3        #sname      "Blake"^^#name
_3        #status     "30"^^#integer
_3        #city       "Paris"^^#char

Table 2. The Supplier Data as RDF Triples in a Relation

The types of the columns are RDF's node types (subject, predicate, and object), and the columns are named "s1", "p1", and "o1" respectively. An RDF subject can be a blank node or URI, a predicate can be a URI, and an object can be a URI, blank node, or literal. The use of hashes ("#") is merely a convention used to represent URIs whose namespace is unimportant. Literal values are composed of a value and a datatype (which is also a URI) that is preceded by two carets ("^^"). So the literals "Smith" and "20" are of type "name" and "integer" respectively.
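The mapping from a table row to a set of triples around a blank node can be sketched in a few lines of Python. This is only an illustration of the convention described above: the function name rows_to_triples and the use of plain strings for blank nodes, URIs, and typed literals are assumptions, not part of any RDF library.

```python
def rows_to_triples(heading, rows):
    """Turn table rows into (subject, predicate, object) triples.

    heading: list of (column name, type name) pairs.
    rows: list of value tuples; None marks a value that is not known.
    """
    triples = []
    for n, row in enumerate(rows, start=1):
        subject = f"_{n}"  # one blank node per row
        for (column, type_name), value in zip(heading, row):
            if value is None:
                continue  # no statement is made, so no NULL is needed
            predicate = f"#{column.lower()}"
            literal = f'"{value}"^^#{type_name}'
            triples.append((subject, predicate, literal))
    return triples


heading = [("SNO", "sno"), ("SNAME", "name"), ("STATUS", "integer"), ("CITY", "char")]
rows = [
    ("S1", "Smith", 20, "London"),
    ("S2", "Jones", 10, "Paris"),
    ("S3", "Blake", 30, "Paris"),
]
triples = rows_to_triples(heading, rows)
for triple in triples:
    print(*triple)
```

Each row of Table 1 contributes four statements, giving twelve triples in total. A row with an unknown value would simply contribute fewer statements.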

This view of RDF as a relational structure is not that unique and was described in the early stages of RDF's development by Tim Berners-Lee [2].

RDF without NULL

RDF does not have the concept of a NULL value. Similarly, the relational model as defined by Date dismisses the need for a NULL value too. RDF can be stored using this version of the relational model and hence NULL values can be avoided.

This is best demonstrated by looking at the data from Table 2 and considering what would happen if suppliers S2 and S3 didn't have a city and S3 also lacked a status. What would a flexible view of the data look like if you didn't need to worry about agreeing on one table structure and didn't use NULLs? Tables 3, 4, and 5 show three relations, each with a different number of columns (different types), and Table 6 shows the merging of these relations into one, as an untyped relation. An untyped relation is a relation whose tuples may bind values to only a subset of the heading's attributes. To return the untyped relation to a typed relation, a simple project on the required columns can be performed. There are no NULLs; there are only tuples in which some attributes are unbound, that is, they do not return a value for the given column (attribute).

SNO   SNAME     STATUS    CITY
sno   name      integer   char
---   -------   -------   --------
S1    "Smith"   20        "London"

Table 3. Suppliers with a name, status and city.

SNO   SNAME     STATUS
sno   name      integer
---   -------   -------
S1    "Smith"   20
S2    "Jones"   10

Table 4. Suppliers with a name and status.

SNO   SNAME
sno   name
---   -------
S1    "Smith"
S2    "Jones"
S3    "Blake"

Table 5. Suppliers with a name.

SNO   SNAME     STATUS    CITY
sno   name      integer   char
---   -------   -------   --------
S1    "Smith"   20        "London"
S2    "Jones"   10
S3    "Blake"

Table 6. Example of a Supplier Table

As shown in Tables 3-6 relations of different types can be represented by a single untyped relation. While this may seem like a shift away from the traditional relational approach it is actually just a convenient way of representing relations of different types in one data structure. This is especially useful when relations of different types are expected to occur frequently when integrating data from different sources such as those found in the Semantic Web. The traditional approach to relations and relational algebra can still be used but it requires many equally typed relations to be used both as input to operations and as their outputs. The use of untyped relations reduces the total number of relations to be handled and with the use of modified relational operations allows processing to be performed once over these untyped relations. For example, the supplier table given in Table 6 when joined with a parts table would require three operations and results. In an untyped system, only a single operation is performed producing a single untyped relation.
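One way to picture an untyped relation in code is as a list of dictionaries in which a tuple simply omits the attributes it does not bind. The sketch below is written under that assumption. The project function is a hypothetical helper that keeps only the tuples binding every requested attribute, which is one reading of how Table 6 is returned to the typed relations of Tables 3, 4, and 5.

```python
def project(relation, attributes):
    """Return the typed relation over `attributes`.

    Tuples that leave any requested attribute unbound are dropped, so
    the result never needs a NULL. Duplicate tuples are removed.
    """
    result = []
    for t in relation:
        if all(a in t for a in attributes):
            projected = {a: t[a] for a in attributes}
            if projected not in result:
                result.append(projected)
    return result


# Table 6 as an untyped relation: unbound attributes are simply absent.
suppliers = [
    {"SNO": "S1", "SNAME": "Smith", "STATUS": 20, "CITY": "London"},
    {"SNO": "S2", "SNAME": "Jones", "STATUS": 10},
    {"SNO": "S3", "SNAME": "Blake"},
]

table3 = project(suppliers, ["SNO", "SNAME", "STATUS", "CITY"])
table4 = project(suppliers, ["SNO", "SNAME", "STATUS"])
table5 = project(suppliers, ["SNO", "SNAME"])

print(len(table3), len(table4), len(table5))  # 1 2 3
```

The single untyped relation carries all three typed relations at once, and each one is recovered by a projection rather than being stored separately.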

Relational SPARQL Operations

Given that RDF can be represented using a flexible, untyped relational model what modifications to relational operations are needed and how do they relate to SPARQL operations? A subset of SPARQL operations will be covered including: JOIN ("."), UNION, and OPTIONAL and a modified relational algebra will be given to support these operations.

The first modification required, one suggested by Richard Cyganiak [3], is an untyped join (SPARQL's JOIN). An untyped join allows tuples in relations to be successfully joined except if a value in one relation conflicts with the value in the other. If a value is unbound in one tuple but is bound in another then the bound value is added to the result tuple.

A formal definition of an untyped JOIN (based on Date's definition of Join [4]):

Let r and s have attributes X1,X2,...,Xm, Y1,Y2,...,Yn, Z1,Z2,...,Zp, where the Y's are the common attributes, the X's are the other attributes of r, and the Z's are the other attributes of s. The untyped JOIN of r and s is a relation t with a heading that is the set theoretic union of the headings of r and s, {X, Y, Z}, and a body that consists of the set of all tuples {X x, Y y, Z z} such that a tuple appears in r with X value x (or no value for X) and Y value y, and a tuple appears in s with Y value y and Z value z (or no value for Z). Y values for r and s may both be unbound, or either may be unbound; on its own this does not lead to a successful join. A successful join occurs if at least one Y value y is bound in both r and s and is equal in both.

This is different to SQL and some definitions of relational algebra where NULL values (NULL being considered equivalent to an unbound value) cause join failure. This behavior of joining shared attributes in r and s is shown in Table 7. An example of an untyped join of relation r (Table 8) and relation s (Table 9) is shown in Table 10.

Values of Shared Attributes   Typed Join   Untyped Join
---------------------------   ----------   ----------------
r{Y = y}, s{Y = y}            Joined       Joined
r{Y = y}, s{Y = x}            Rejected     Rejected
r{Y = {}}, s{Y = y}           Rejected     Y = y if Joined
r{Y = y}, s{Y = {}}           Rejected     Y = y if Joined
r{Y = {}}, s{Y = {}}          Rejected     Y = {} if Joined

Table 7. Results of a Shared Attribute (Y) of Two Relations r and s

SNO   SNAME
sno   name
---   -------
S1    "Smith"
S2    "Jones"
S3    "Blake"

Table 8. Relation r

SNO   SNAME      STATUS    CITY
sno   name       integer   char
---   --------   -------   --------
S1    "Smith"    20        "London"
S2    "Jones"    10
S3    "George"

Table 9. Relation s

SNO   SNAME     STATUS    CITY
sno   name      integer   char
---   -------   -------   --------
S1    "Smith"   20        "London"
S2    "Jones"   10

Table 10. Result of Untyped Join of r and s
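The untyped join defined above can be sketched in Python, assuming relations are lists of dictionaries in which unbound attributes are simply absent. The function name untyped_join is an illustrative choice, and the sketch follows the rules of Table 7 rather than any particular SPARQL implementation.

```python
def untyped_join(r, s):
    """Join tuples that agree on every attribute bound in both.

    A pair is rejected if a shared bound attribute conflicts, or if the
    two tuples bind no attribute in common. Values bound on only one
    side are carried into the result tuple.
    """
    result = []
    for tr in r:
        for ts in s:
            shared = tr.keys() & ts.keys()  # attributes bound in both tuples
            if not shared:
                continue  # nothing bound in common: not a successful join
            if any(tr[a] != ts[a] for a in shared):
                continue  # conflicting values: rejected
            merged = {**tr, **ts}
            if merged not in result:
                result.append(merged)
    return result


# Relation r (Table 8) and relation s (Table 9).
r = [
    {"SNO": "S1", "SNAME": "Smith"},
    {"SNO": "S2", "SNAME": "Jones"},
    {"SNO": "S3", "SNAME": "Blake"},
]
s = [
    {"SNO": "S1", "SNAME": "Smith", "STATUS": 20, "CITY": "London"},
    {"SNO": "S2", "SNAME": "Jones", "STATUS": 10},
    {"SNO": "S3", "SNAME": "George"},
]

joined = untyped_join(r, s)
for t in joined:
    print(t)
```

The two S3 tuples are rejected because "Blake" conflicts with "George", while S2 joins even though its city is unbound, reproducing Table 10.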

The second untyped operation takes the proposal by César Galindo-Legaria [6] for an outer union operator and its use in the definition of left outer join (which is analogous to SPARQL's UNION and OPTIONAL respectively). OUTER UNION provides the same semantics as SPARQL's UNION operation while being formally defined and grounded in the relational model. Furthermore, SPARQL's OPTIONAL operation can be composed of outer union and set difference, project, and untyped join.

A formal definition of OUTER UNION:

The outer union of relations r and s is a relation whose heading is the set theoretic union of the headings of r and s and whose body consists of all tuples t such that t appears in r or s or both. Unlike the regular relational union, it does not require that r and s have the same attributes (types).

Table 11 shows the result of performing an outer union of relations r and s from Tables 8 and 9.

SNO   SNAME      STATUS    CITY
sno   name       integer   char
---   --------   -------   --------
S1    "Smith"    20        "London"
S2    "Jones"    10
S3    "Blake"
S1    "Smith"
S2    "Jones"
S3    "George"

Table 11. Result of Outer Union of r and s
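As a minimal sketch, outer union over untyped relations reduces to collecting the tuples of both inputs and the attributes of both headings. The list-of-dictionaries representation and the helper names heading and outer_union are assumptions made for illustration.

```python
def heading(relation):
    """Union of the attributes bound by any tuple in the relation."""
    return set().union(*(t.keys() for t in relation))


def outer_union(r, s):
    """All tuples that appear in r or s or both, without duplicates."""
    result = []
    for t in list(r) + list(s):
        if t not in result:
            result.append(t)
    return result


# Relation r (Table 8) and relation s (Table 9).
r = [
    {"SNO": "S1", "SNAME": "Smith"},
    {"SNO": "S2", "SNAME": "Jones"},
    {"SNO": "S3", "SNAME": "Blake"},
]
s = [
    {"SNO": "S1", "SNAME": "Smith", "STATUS": 20, "CITY": "London"},
    {"SNO": "S2", "SNAME": "Jones", "STATUS": 10},
    {"SNO": "S3", "SNAME": "George"},
]

union = outer_union(r, s)
print(len(union), sorted(heading(union)))
```

All six tuples survive because no tuple of r is identical to a tuple of s; a tuple that binds fewer attributes is a different tuple, not a duplicate.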

A formal definition of LEFT OUTER JOIN:

The left outer join of relations r and s is the outer union of the join of r and s and the antijoin of r and s. Or formally:

R1 ⟕ R2 := (R1 ⋈ R2) ⊎ (R1 ▷ R2)

Antijoin is composed of difference and semijoin. Semijoin is composed of join and project. The fully expanded version can therefore be expressed as:

R1 ⟕ R2 := (R1 ⋈ R2) ⊎ (R1 − π_R1(R1 ⋈ R2))

Where: "⟕" denotes left outer join, "⋈" denotes join, "⊎" denotes outer union, "▷" denotes antijoin, "−" denotes difference, and "π_R1" denotes project onto the attributes of R1.

The use of antijoin is significant from the point of view of distributing the queries efficiently across multiple sites, something that is important in SPARQL implementations. The difference and project operations are the standard relational versions. Table 12 displays the results of performing a left outer join with relations r and s from Tables 8 and 9. Left outer join is order dependent: if the left outer join of s and r is performed instead, the last tuple of the result will have the name "George", not "Blake".

SNO   SNAME     STATUS    CITY
sno   name      integer   char
---   -------   -------   --------
S1    "Smith"   20        "London"
S2    "Jones"   10
S3    "Blake"

Table 12. Result of Left Outer Join of r and s
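The composition of left outer join from join, project, difference, and outer union can be sketched as follows. The representation (lists of dictionaries with unbound attributes omitted) and all function names are assumptions for illustration. The semijoin step also assumes, as in Table 8, that the tuples of the left relation bind every attribute of its heading.

```python
def heading(relation):
    return set().union(*(t.keys() for t in relation))

def untyped_join(r, s):
    result = []
    for tr in r:
        for ts in s:
            shared = tr.keys() & ts.keys()
            # Join only if something is bound in common and nothing conflicts.
            if shared and all(tr[a] == ts[a] for a in shared):
                merged = {**tr, **ts}
                if merged not in result:
                    result.append(merged)
    return result

def project(relation, attributes):
    result = []
    for t in relation:
        p = {a: t[a] for a in attributes if a in t}
        if p not in result:
            result.append(p)
    return result

def difference(r, s):
    return [t for t in r if t not in s]

def outer_union(r, s):
    result = []
    for t in list(r) + list(s):
        if t not in result:
            result.append(t)
    return result

def left_outer_join(r, s):
    joined = untyped_join(r, s)
    semijoin = project(joined, heading(r))  # tuples of r that found a match
    antijoin = difference(r, semijoin)      # tuples of r that did not
    return outer_union(joined, antijoin)

r = [{"SNO": "S1", "SNAME": "Smith"}, {"SNO": "S2", "SNAME": "Jones"},
     {"SNO": "S3", "SNAME": "Blake"}]
s = [{"SNO": "S1", "SNAME": "Smith", "STATUS": 20, "CITY": "London"},
     {"SNO": "S2", "SNAME": "Jones", "STATUS": 10},
     {"SNO": "S3", "SNAME": "George"}]

r_then_s = left_outer_join(r, s)
s_then_r = left_outer_join(s, r)
print(r_then_s[-1], s_then_r[-1])
```

With r on the left, the unmatched tuple is S3 "Blake", as in Table 12. Swapping the operands leaves S3 "George" instead, which shows the order dependence.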

Another operation defined by Galindo-Legaria is the minimum union operator (⊕), which has the same effect as performing outer union with the results of the antijoin of r and s.

A formal definition of MINIMUM UNION:

The minimum union of relations r and s is the outer union of r and s followed by the removal of subsumed tuples. Tuple subsumption is defined as follows: t1 subsumes t2 if t1 has more bound values than t2 and every value that is bound in t2 is equal to the corresponding value in t1. The removal of subsumed tuples in R is denoted as R↓.

Table 13 shows the result of minimum union performed on relations r and s from Tables 8 and 9.

SNO   SNAME      STATUS    CITY
sno   name       integer   char
---   --------   -------   --------
S1    "Smith"    20        "London"
S2    "Jones"    10
S3    "Blake"
S3    "George"

Table 13. Result of Minimum Union of r and s

Another definition of LEFT OUTER JOIN can then be given using minimum union, where "⟕" denotes left outer join and "⋈" denotes join:

R1 ⟕ R2 := (R1 ⋈ R2) ⊕ R1

This definition returns the same results as given in Table 12 and has the advantage over the previous definition of requiring fewer operations.
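Minimum union and the shorter definition of left outer join can be sketched in the same style. As before, the list-of-dictionaries representation and the function names are illustrative assumptions rather than an established API.

```python
def untyped_join(r, s):
    result = []
    for tr in r:
        for ts in s:
            shared = tr.keys() & ts.keys()
            # Join only if something is bound in common and nothing conflicts.
            if shared and all(tr[a] == ts[a] for a in shared):
                merged = {**tr, **ts}
                if merged not in result:
                    result.append(merged)
    return result

def outer_union(r, s):
    result = []
    for t in list(r) + list(s):
        if t not in result:
            result.append(t)
    return result

def subsumes(t1, t2):
    """t1 binds more values than t2 and agrees with every value t2 binds."""
    return len(t1) > len(t2) and all(
        a in t1 and t1[a] == v for a, v in t2.items()
    )

def minimum_union(r, s):
    union = outer_union(r, s)
    # Drop every tuple that some other tuple in the union subsumes.
    return [t for t in union if not any(subsumes(o, t) for o in union)]

def left_outer_join(r, s):
    return minimum_union(untyped_join(r, s), r)

r = [{"SNO": "S1", "SNAME": "Smith"}, {"SNO": "S2", "SNAME": "Jones"},
     {"SNO": "S3", "SNAME": "Blake"}]
s = [{"SNO": "S1", "SNAME": "Smith", "STATUS": 20, "CITY": "London"},
     {"SNO": "S2", "SNAME": "Jones", "STATUS": 10},
     {"SNO": "S3", "SNAME": "George"}]

min_union = minimum_union(r, s)
loj = left_outer_join(r, s)
print(len(min_union), len(loj))  # 4 3
```

The minimum union keeps both S3 tuples because neither binds more values than the other, matching Table 13, while the left outer join built from it matches Table 12 without needing project or difference.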

Bagging SPARQL

The use of the relational model to query RDF provides lessons that have yet to be applied to the design of SPARQL. One of the main criticisms that can be leveled at SPARQL is its use of multisets (bags): SPARQL has a DISTINCT operator that removes duplicates, whereas RDF is set based. It is often seen as a good property of a query language to retain the same data model as the data it queries; this consistency increases both ease of use and ease of implementation.

In SQL, one of the uses of duplicates is to provide a way to perform aggregate functions. That is, being able to ask questions such as: "What is the sum of all salaries?" (using "SELECT SUM(salaries)…"). This query is typically performed on a table representing employees and their salaries within an organization's database. Using set-based semantics the same query only returns the distinct salary values to be totaled, not all of them. To get this query to work using a set-based query language a distinct entity, such as an employee, is required in combination with their salary in order to get the desired result.
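The effect of bag versus set semantics on a sum can be shown with a small sketch. The employee names and salary figures are invented for illustration only.

```python
# Bag semantics: duplicates are kept, so every salary is counted.
salaries_bag = [50000, 50000, 60000]
bag_total = sum(salaries_bag)

# Set semantics over salaries alone: the duplicate 50000 collapses.
salaries_set = set(salaries_bag)
set_total = sum(salaries_set)

# Set semantics with a distinct entity: each (employee, salary) pair is
# unique, so no salary is lost even though two values are equal.
employee_salaries = {("alice", 50000), ("bob", 50000), ("carol", 60000)}
paired_total = sum(salary for _, salary in employee_salaries)

print(bag_total, set_total, paired_total)  # 160000 110000 160000
```

Pairing each salary with the employee it belongs to recovers the correct total without relying on duplicates.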

The use of a set-based language requires that the results be paired with their relevant contextual information, such as the combination of employee, salary, and organization. This contextual information becomes vital when the query is performed on the larger web of data. Asking the entire web for the sum of salaries is unlikely to return the results required. The query has to include this contextual information so that the association of salaries with employees, and of employees with a specific organization or other group, is retained. These are the parts of the query that are usually implicit locally and that need to be made explicit globally. Using consistent set-based semantics will retain this context and allow a query to return results correctly irrespective of what it is being queried against.

Another issue is one of answer closure. Closure allows the outputs of a function to be used as the inputs to the next. Currently, the results of a SPARQL SELECT query cannot be used as input for further querying. While SPARQL provides a CONSTRUCT query to return an RDF graph it is a new graph (new blank nodes are generated, for example) and is not restricted to only returning statements from the original. When querying a web of data it is useful to be able to feed the result of one query into another with each query being re-executed as needed. Ideally, the assignment of a variable to the result of a SPARQL SELECT query could be used within the SPARQL query language much like Date's relvar [4]. This provides a way to build up more powerful queries based on others and is another way to dynamically provide context that subsequent queries can be performed against.

Conclusion

One of the goals of the Semantic Web is to be able to achieve querying of disparate data sources across the web. The proposed standard for querying the Semantic Web, SPARQL, can be seen as an extension of an existing formalization, the relational model. The use of the relational model provides a way to use previous work in query distribution, optimization, and formulation. The standard relational model is not sufficient, however, and must be extended to support untyped relations and operations in order to integrate these data sources.

Bibliography

[1] T. Berners-Lee, Weaving the Web, Orion Publishing Group, Ltd, London, United Kingdom, 1999, p. 201.

[2] T. Berners-Lee, Relational Databases on the Semantic Web, 1998; http://www.w3.org/DesignIssues/RDB-RDF.html

[3] R. Cyganiak, A Relational Algebra for SPARQL, Digital Media Systems Laboratory, HP Laboratories Bristol, Tech. Rep, 2005; http://www.hpl.hp.com/techreports/2005/HPL-2005-170.html

[4] C. J. Date, Database in Depth, Relational Theory for Practitioners, O'Reilly Media, Inc, Sebastopol, California, 2005, pp. 11, 17-20, 86-93.

[5] C. J. Date, Relational Database Writing 1991-1994, Addison Wesley Publishing Company, Inc, Reading, MA, 1995, pp. 341-362.

[6] C. Galindo-Legaria, "Outerjoins as Disjunctions," Proceedings of the 1994 ACM-SIGMOD Int. Conference on Management of Data, 1994, pp. 348-358.

[7] P. Hayes, RDF Semantics, World Wide Web Consortium (W3C) Recommendation, 2004; http://www.w3.org/TR/rdf-mt/

[8] E. Prud'hommeaux and A. Seaborne, SPARQL Query Language for RDF, World Wide Web Consortium (W3C) Candidate Recommendation, 2006; http://www.w3.org/TR/2006/CR-rdf-sparql-query-20060406/