One of the most fundamental functions of a programming language is to decide if two things are “the same” or are “different”.

The determination of “sameness” can be quite tricky, and introduce subtle software errors or require a significant amount of code to check many cases.

As a simple example, imagine two separate database queries, one for all people with “John” in their name, and another for all people with “Smith” in their name — how to tell that a “John Smith” from the first query is the same as a “John Smith” in the second query, without custom code?

The Vital Development Kit (VDK) includes domain specific language bindings (a DSL) specific to data comparison, inference, and manipulation. Recently we introduced a new feature called Semantic Equality to the VDK.

With the VDK and Semantic Equality, Data Scientists and Developers can write less code, have fewer bugs, and more easily work with large amounts of diverse data.

Background

The VDK, using the VitalSigns component, manages the domain models of your application and generates code to interact with different types of databases, data predictive components, and user interfaces. This makes it easy to combine different components into a unified application: such as NoSQL Databases, SQL Databases, Apache Spark, and JavaScript Web Applications.

Since the JVM compares objects “by reference” — the “reference” is a pointer to the bit of memory used to store the object — the following code will typically not be true if the objects were loaded or created at different times (like the “John Smith” objects mentioned before:

/* true only if they have the same memory reference */ if(object1 == object2) { }

To mitigate this, it’s common to write custom code to override the “equals” function in the JVM so that objects can be compared by their data values. Frameworks such as Object-Relational Mapping tools often include generating such “equals” methods, but this only covers application to database interactions, and even more custom code needs to be written to incorporate other components like machine learning.

The VDK takes a more general approach.

Each VDK data object object has a globally unique identifier, called a URI, associated with it. So determining if one object refers to the same identical thing as another object is as simple as:

/* they refer to the same thing! */ if(object1.URI == object2.URI) { }

This is universally true, regardless of the source of the data or the types of objects being compared.

But, what if you want to compare data fields of the objects, like determining if two people have the same birthday?

/* they have the same birthday! */ if(person1.birthday == person2.birthday) { }

This works when the values of the “birthday” fields match because the VDK handles the “equals” methods.

Terminology note: we call the “birthday” data field a “property” of the “Person” class. The “Person” class and properties like “birthday” are specified in an external data model, with code generated for the JVM (or JavaScript) using vitalsigns.

What if we tried to do:

/* the dates match! */ if(person1.hireDate == person2.birthday) { }

If the values were the same, this would be true, but it looks like it might be a programming error as we’re comparing hiring dates with birthdays — apples to oranges instead of apples to apples.

Bugs such as this can be difficult enough with developer created code, but it gets much worse in data analysis and machine learning with comparisons like:

/* data driven action */ /* such as increase likelihood of customer retention */ if(property537 == property675) { }

which are typically generated through an automated process where it is very difficult to track the meaning of the many thousand properties being analyzed.

Adding Semantics and Semantic Equality

In the VDK, both classes like “Person” and properties like “birthday” have a semantic marker to specify what they “mean”. So in addition to “birthday” being associated with the “Date” data type, it also has a semantic marker like:

This URI places “birthday” into a domain model, which can then be used to see if comparisons are “compatible” with another property. Using such logic we can compare fields like “birthday” and “age” since we can convert one such property to another.

Implementation Note: we use the “trait” language capability of the JVM (Java/Groovy/Scala) to semantically “mark” objects. Some documentation about the Groovy implementation of traits can be found here: http://docs.groovy-lang.org/next/html/documentation/core-traits.html

With these URIs associated with properties, we modified “==” to take into account whether two properties (or classes) can be compared semantically.

We call the redefined “==” symbol: Semantic Equality.

For example, let’s say we have a property “name” and a subproperty “nickName” and another subproperty of “name” for “familyName”, so a property hierarchy like:

name

+—- nickName

+—- familyName

Then we can have:

/* this could be true */ if(person1.name == person2.nickName) { } /* this could be true */ if(person1.name == person2.familyName) { } /* this can't be true! */ if(person1.nickName == person2.familyName) { }

The last case can’t be true because familyName is not an ancestor of nickName or vice versa according to the property hierarchy.

This helps us catch bugs like:

/* now is always false! */ if(person1.hireDate == person2.birthday) { }

by making them never evaluate to true because “birthday” and “hireDate” are not semantically compatible. In the same way, your favorite food can not be “armchair” because “armchair” is not a food.

The Semantic Equality operator is similar to the JavaScript “===”, except stronger. In JavaScript, the “==” operator will try to convert one type to another, like a number to a string so that two things can be compared with a common type, so it is forgiving of type differences, i.e. weakly typed. This can be handy, but often leads to bugs. The JavaScript “===” operator on the other hand, does not do type conversion, so it “strongly” enforces data type comparison. The VDK Semantic Equality adds one more “level” to this by enforcing that the compared data is semantically compatible.

Comparing Values without Semantics

Now, let’s say we really want to compare the values and not take into account the semantics of the properties.

We introduced an operator for this case, “^=“, by redefining the XOR assignment operator. Mainly we don’t do XOR assignments often, but also the caret “^” is sometimes used for boolean NOT operations, so we thought it would be a good match for when the semantics are “NOT” a match.

So, if we really want to compare the values of birthday and hireDate, we can do:

/* true if values match */ if(person1.hireDate ^= person2.birthday) { }

which is true when the hireDate and birthday values match, ignoring the semantics of the properties. This is analogous to the JavaScript difference between “===” (strong) and “==” (weak) except in JavaScript is it either enforcing datatypes (or not) and with the VDK we are enforcing semantics (or not).

VDK Groovy Language Extensions

Semantic Equality is part of the language extensions and DSL (domain specific language) incorporated into the Groovy JVM language with the Vital Development Kit (VDK) to make it easier to work with diverse data.

Feedback!

I hope you have enjoyed learning about the Semantic Equality feature of the VDK! Please post your comments and questions here, or follow up with us at Vital AI via info@vital.ai.