E Pluribus Unum, or, Star Schema Meets Cluster

In the Cost of Freedom article, we looked at the implications of doing standard analytics without a schema. Here we will look at scalability.

One could claim that a 30GB scale is trivial. Here, we multiply the scale by 10 and move from one to two servers, going from shared-memory multicore to distributed-memory scale-out with partitioned data, i.e. each machine holds a distinct fragment of the database.

Again, we run the same workload in SQL and in SPARQL. The RDB schema is the same as before, so for SQL tables only primary keys are declared, there are no indices or special declarations about data placement, and the table partitioning is on the first part of the primary key. The RDF data for the SPARQL runs is quads with the default index scheme, partitioned on subject or object, whichever is first in key order. The test system is two machines, each with dual Xeon E5-2630 and 192GB RAM. The queries are from warm cache.

There are 1.79 billion line-order rows, giving a total of 32.5 billion RDF triples.

Comparative Results, 300GB, in seconds

Virtuoso Cluster, SQL Vs. SPARQL Query Virtuoso

SQL Virtuoso

SPARQL SPARQL-to-SQL

ratio Q1 2.285 7.767 3.4x Q2 1.530 3.535 2.3x Q3 1.237 1.457 1.2x Q4 3.459 6.978 2.0x Q5 3.065 8.710 2.8x Q6 2.901 8.454 2.9x Q7 5.733 15.939 2.8x Q8 2.267 6.759 3.0x Q9 1.773 4.217 2.4x Q10 1.440 4.342 3.0x Q11 5.031 12.608 2.5x Q12 4.464 15.497 3.5x Q13 2.807 4.467 1.6x Total 37.992 100.730 2.7x

The SQL run takes 38s, and the SPARQL run takes 101s. Comparing to the single server, we see a little better than linear scaling, i.e., double the gear is a little over 2x faster. In the present case, 10x the data takes a little under 5x the time.

This is due to slightly better load balancing. The single server splits the workload into chunks based on a run-time approximation; the cluster stores the partitions separately and uses this to determine the parallelism. In the latter case, the chunks are of more equal size.

The SPARQL penalty is here 2.7x, essentially the same as in the single server case.

It is no secret that a star schema is "embarrassingly parallel." In other words, when there is one big table (fact table) that references many smaller tables (dimension tables), and query conditions are expressed on properties of the dimension tables, the correct query plan nearly always consists of putting the interesting foreign key values into hash tables, and then scanning the fact table from beginning to end and picking the rows where the values in the foreign key columns are found in the hash tables. This is called a selective hash join or invisible hash join if the hash join operation is merged in the table scan itself. Daniel Abadi’s well known thesis explains this matter. In the case of a cluster, supposing all tables are partitioned, identical hash tables are made on all participating servers, and after this each server gets to scan its fraction of the fact table independently of any other. This nearly always works because the dimension tables are typically orders of magnitude smaller than the fact table.

The complexity of the queries is close to linear to the data size. The factor that makes this deviate from linear is the fact that as hash tables get larger, they will miss the CPU cache more frequently; hence they get slower to probe. We here assume that a bigger fact table means bigger dimension tables. This is often the case, i.e., the more sales records there are, the larger the number of distinct customers or distinct items in the catalogue is likely to be. This is not always so, though, as the number of days in the history does not scale in the same way.

Any decent analytics oriented RDBMS with scale out will give near-linear performance with a star schema, at least up to the point where the hash tables can no longer be replicated on all servers, or the hash join must do multiple passes over the fact table.

Doing the same with schema-less data is harder, even though the principle is exactly the same. The difficulty lies again in detecting that a scan filtered by selective hash joins will give the most locality in access pattern, visiting each cache-line-worth of data once at most and doing so almost always in sequential order. This triggers memory prefetching on any modern CPU and significantly reduces memory latency.

As the scale grows there are some details of query plan that become significant. For example, in Q7 of SSB, there is a report on sales between Asian customers and Asian suppliers, country by country. In SQL:

SELECT d_year, c_nation, s_nation, SUM (...), ... FROM lineorder, customer, supplier, dwdate WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND c_region = 'ASIA' AND s_region = 'ASIA' ...

In this case, one builds a hash table from c_custkey to c_nation and from s_suppkey to s_nation , including only suppliers and customers where the region = 'ASIA' .

In SPARQL, the situation is the same, except that we say:

SELECT * WHERE { li rdfh:lo_custkey ?lo_custkey ; rdfh:lo_suppkey ?lo_suppkey . ?lo_custkey rdfh:c_nation ?c_nation ; rdfh:c_region "ASIA" . ?lo_suppkey rdfh:s_nation ?s_nation ; rdfh:s_region "ASIA" . }

This is an equivalent expression, or near enough. We note that now it makes sense to build the hash table from the join of two patterns, i.e., the rdfh:s_nation and rdfh:s_nregion , which together are the same as the supplier table in the SQL variant. But now, a supplier might have more than one rdfh:s_nation triple; hence the hash table is no longer known to be a priori unique -- i.e., its key is the URI of a supplier, but it is only known that suppliers usually have one nation; it is not known that they never have two.

However, if we build the hash table from only one pattern, i.e., { ?s_suppkey rdfh:s_region "ASIA" } , and the graph is specified, then we know that the subject will be unique, as stating that X has region "ASIA" twice has no effect beyond stating it once. However, if we do this, then there needs to be another hash table built for mapping the supplier to its s_nation . There is no way to know that this is unique. Making two hash tables instead of one has been seen to slow down the query by a factor of two at scale 300GB whereas the effect is hardly noticeable at 30GB.

Thus recognizing these special cases with SPARQL is crucial if one is to come anywhere near the performance the SQL world attains just by following the schema. What used to be basic becomes trickier. It is true that the same or similar tricks are also needed in pure SQL workloads, but then not within the SSBM queries. What is a star in SQL is quite often a snowflake in SPARQL. A star schema has a single table-per-dimension, e.g., customer , whereas a snowflake has more structure, e.g., customer , the customer’s country and the country’s region , all in different tables.

The message to the SPARQL public is that now, for the first time, if there is natural parallelism and locality in the data, you know that the database will exploit this correctly and derive the same benefits from this as any SQL-only system would. Of course, there have been and are parallel RDF databases, including previous versions of Virtuoso, which do distributed index-based operations with various levels of concurrency and distributed coordination, but experience shows that these query strategies are not very good for queries that touch a large fraction (over 3%) of the database.

For basic data warehouse workloads, whether in SQL or SPARQL, Virtuoso offers linear scaling where clusters pay off from the start. While a handful of relational column stores have offered such capabilities for SQL for some time, now the same is also available for schema-less data: Entirely declarative querying; no explicit data partitioning; no schema restrictions; no map-reduce programs or the like.

The cluster evolution path is clear: as with single server, we strive for more speed on a broader range of operations. The star and snowflake functionality discussed here is the core piece for any analytics, so if this is not right, the rest is also compromised. The other side of cluster is operational, i.e., flexibility of deployment -- for example, flexibly resizing cloud-based databases.

This is the qualitative jump. Incremental performance gains will follow for both SQL and SPARQL.

Stars and snowflakes are common but the world does not end with these. As we will see in subsequent articles, there is a whole world of graphs, e.g., social media analytics, as well as more complex relational schemas.