Semantic Types help you reduce bugs and improve maintainability by letting the compiler ensure consistency in your code. This article shows how this works and how to create Semantic Types with minimal overhead.

Introduction

Static typing is a great help in keeping your code bug free and maintainable. Take for example:

public int Mymethod(Person person) { ... }

A few good things are happening here:

Documentation: You know right away that this method takes a Person and returns an integer.

Machine checking: The compiler has been told as well. That means that this is not just documentation that can get out of date. The compiler actually makes sure that what you're reading here is true.

Tooling: Finally, Visual Studio has been told too, enabling you to quickly find out how Person is defined.

The problem is that out of the box, C# only gives you types based on the physical representation of your data in computer memory. Integers are 32 bit numbers, strings are collections of characters, etc. So the compiler won't even give you a warning when you wind up with this:

    double d = GetDistance();
    double t = GetTemperature();

    // ... many complicated lines further ...

    // Adding a temperature to a distance doesn't make sense,
    // but the compiler won't warn you.
    double probablyWrong = d + t;

OK, you could use better naming here: totalDistance instead of d, surfaceTemperature instead of t. But the compiler still isn't going to warn you, because it still doesn't know that totalDistance is a distance, not just a double.

Another example:

    /// <summary>
    /// Sends an email.
    /// </summary>
    /// <param name="emailAddress">
    /// Hopefully this is a valid email address.
    /// But there is no way to be sure. We could be getting anything here really.
    ///
    /// If someone passes a phone number by mistake, the compiler will
    /// happily compile this, and we'll get a run time exception. Happy debugging.
    /// </param>
    /// <param name="message">
    /// Message to send.
    /// </param>
    public void SendEmail(string emailAddress, string message)
    {
    }

The problem is that we're telling the compiler that the method can take any string as the email address, while actually it can only take a valid email address, which is very different.

The solution to these issues is to inform the compiler about the various kinds of values in our domain - distances, temperatures, email addresses, etc. - even if they could be represented in memory by some built in type such as double or int. That way, it can catch more bugs for us. This is where semantic typing comes in.

Semantic Types

Imagine that C# included a type EmailAddress that can only contain a valid email address:

    // Constructor throws exception if passed in email address is invalid
    var validEmailAddress = new EmailAddress("kjones@megacorp.com");
    var validEmailAddress2 = new EmailAddress("not a valid email address"); // throws exception

Now we can guarantee that we only pass valid email addresses to the SendEmail method:

    // emailAddress will always contain a valid email address
    public void SendEmail(EmailAddress emailAddress, string message)
    {
    }

    ...

    SendEmail(validEmailAddress, "message"); // can only pass a valid email address

To prevent needless exception handling, we need a static IsValid method that checks whether an email address is valid:

    bool isValidEmailAddress = EmailAddress.IsValid("kjones@megacorp.com"); // true
    bool isValidEmailAddress2 = EmailAddress.IsValid("not a valid email address"); // false

Finally, we need a Value property to retrieve the underlying string value. This is read-only, to ensure that after the EmailAddress has been created, it is immutable (cannot be changed).

    var validEmailAddress = new EmailAddress("kjones@megacorp.com");
    string emailAddressString = validEmailAddress.Value; // "kjones@megacorp.com"

Such an EmailAddress type is an example of a semantic type:

Type based on meaning, not on physical storage: An EmailAddress is physically still a string. What makes it different is the way we think of that string - as an email address, not as a random collection of characters.

Type safe: Having a distinct EmailAddress type enables the compiler to ensure you're not using some common string where a valid email address is expected - just as the compiler stops you from using a string where an integer is expected.

Guaranteed to be valid: Because you can't create an EmailAddress based on an invalid email address, and you can't change it after it has been created, you know for sure that every EmailAddress represents a valid email address.

Documentation: When you see a parameter of type EmailAddress, you know right away it contains an email address, even if the parameter name is unclear.

Besides an EmailAddress type, you could have a ZipCode type, a PhoneNumber type, a Distance type, a Temperature type, etc.

Semantic typing is obviously useful, but many people do not use this approach because they fear that introducing semantic types involves lots of typing and boilerplate.

The rest of this article shows first how to implement a semantic type, and then how to factor out all the common code to make creating a new semantic type nice and quick.

Creating a semantic type, first take

Before seeing how to create semantic types in general, let's create a specific semantic type: EmailAddress.

Seeing that an EmailAddress is physically a string, you might be tempted to inherit from string:

    // Doesn't compile
    public class EmailAddress : string
    {
    }

However, this doesn't compile, because string is sealed, so you cannot derive from it. The same goes for int, double, etc. You can't even inherit from DateTime.

So, we'll store the string value inside the EmailAddress class. Note that the setter is private. That way, code outside the class cannot change the value:

    public class EmailAddress
    {
        public string Value { get; private set; }
    }

Add a static IsValid method that returns true if the given string is a valid email address:

    using System.Text.RegularExpressions;

    public class EmailAddress
    {
        public string Value { get; private set; }

        public static bool IsValid(string emailAddress)
        {
            return Regex.IsMatch(emailAddress,
                @"^(?("")("".+?(?<!\\)""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))" +
                @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$",
                RegexOptions.IgnoreCase);
        }
    }

Add the constructor. This takes a string with hopefully a valid email address. If it isn't an email address, throw an exception.

    using System; // for ArgumentException
    using System.Text.RegularExpressions;

    public class EmailAddress
    {
        public string Value { get; private set; }

        public static bool IsValid(string emailAddress)
        {
            return Regex.IsMatch(emailAddress,
                @"^(?("")("".+?(?<!\\)""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))" +
                @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$",
                RegexOptions.IgnoreCase);
        }

        public EmailAddress(string emailAddress)
        {
            if (!IsValid(emailAddress))
            {
                throw new ArgumentException(string.Format("Invalid email address: {0}", emailAddress));
            }
            Value = emailAddress;
        }
    }

That gives us the basics. Note that with this implementation, an EmailAddress cannot be changed after it has been created - it is immutable. If you want a new email address, you have to create a new EmailAddress object - and the constructor will ensure that your new email address is valid as well.
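When the input comes from a user, the IsValid check and the throwing constructor can be combined into a Try-style helper, so callers never need a try/catch. The sketch below is an illustration only: the TryCreate method is a hypothetical addition that is not part of the class above, and the simplified validity check (exactly one '@' with text on both sides) is an assumption standing in for the full regular expression.

```csharp
using System;

public class EmailAddress
{
    public string Value { get; private set; }

    // Simplified stand-in for the article's regular expression (assumption):
    // requires exactly one '@' with at least one character on each side.
    public static bool IsValid(string emailAddress)
    {
        if (string.IsNullOrEmpty(emailAddress)) { return false; }
        int at = emailAddress.IndexOf('@');
        return at > 0
            && at < emailAddress.Length - 1
            && at == emailAddress.LastIndexOf('@');
    }

    public EmailAddress(string emailAddress)
    {
        if (!IsValid(emailAddress))
        {
            throw new ArgumentException(
                string.Format("Invalid email address: {0}", emailAddress));
        }
        Value = emailAddress;
    }

    // Hypothetical helper: reports failure through the return value
    // instead of throwing, and sets result to null on invalid input.
    public static bool TryCreate(string candidate, out EmailAddress result)
    {
        result = IsValid(candidate) ? new EmailAddress(candidate) : null;
        return result != null;
    }
}
```

A caller would then write something like `if (EmailAddress.TryCreate(input, out address)) { SendEmail(address, "message"); }` and handle the invalid case in the else branch.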

However, there is one more thing to implement: equality. When you use simple strings to store email addresses, you expect to be able to compare them by value:

    string emailAddress1 = "kjones@megacorp.com";
    string emailAddress2 = "kjones@megacorp.com";
    bool equal = (emailAddress1 == emailAddress2); // true

Because of this, we'll want the same behaviour with EmailAddresses:

    var emailAddress1 = new EmailAddress("kjones@megacorp.com");
    var emailAddress2 = new EmailAddress("kjones@megacorp.com");
    bool equal = (emailAddress1 == emailAddress2); // true

Because EmailAddress is a reference type, by default the equality operator only checks whether the two EmailAddresses are the same object in memory. However, we want to compare the underlying email addresses.
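The default behaviour is easy to see with a wrapper that has no equality code at all. The class PlainEmailAddress below is a hypothetical stand-in used only for this illustration:

```csharp
using System;

// Hypothetical wrapper without any equality code, for illustration only.
public class PlainEmailAddress
{
    public string Value { get; private set; }
    public PlainEmailAddress(string value) { Value = value; }
}

public static class ReferenceEqualityDemo
{
    public static void Run()
    {
        var a = new PlainEmailAddress("kjones@megacorp.com");
        var b = new PlainEmailAddress("kjones@megacorp.com");
        var c = a;

        Console.WriteLine(a == b);             // False: two distinct objects
        Console.WriteLine(a.Equals(b));        // False: Object.Equals compares references too
        Console.WriteLine(a == c);             // True: same object
        Console.WriteLine(a.Value == b.Value); // True: strings compare by value
    }
}
```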

To make this happen, we have to implement the System.IEquatable<T> interface, override the Object.Equals and Object.GetHashCode methods, and overload the == and != operators. The result is this:

    using System;
    using System.Text.RegularExpressions;

    public class EmailAddress : IEquatable<EmailAddress>
    {
        public string Value { get; private set; }

        public EmailAddress(string emailAddress)
        {
            if (!IsValid(emailAddress))
            {
                throw new ArgumentException(string.Format("Invalid email address: {0}", emailAddress));
            }
            Value = emailAddress;
        }

        public static bool IsValid(string emailAddress)
        {
            return Regex.IsMatch(emailAddress,
                @"^(?("")("".+?(?<!\\)""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))" +
                @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$",
                RegexOptions.IgnoreCase);
        }

        #region equality

        public override bool Equals(Object obj)
        {
            // Check for null and compare run-time types.
            if ((obj == null) || (!(obj is EmailAddress)))
            {
                return false;
            }
            return (Value.Equals(((EmailAddress)obj).Value));
        }

        public override int GetHashCode()
        {
            return Value.GetHashCode();
        }

        public bool Equals(EmailAddress other)
        {
            if (other == null) { return false; }
            return (Value.Equals(other.Value));
        }

        public static bool operator ==(EmailAddress a, EmailAddress b)
        {
            // If both are null, or both are same instance, return true.
            if (System.Object.ReferenceEquals(a, b))
            {
                return true;
            }

            // If one is null, but not both, return false.
            // Have to cast to object, otherwise you recursively call this == operator.
            if (((object)a == null) || ((object)b == null))
            {
                return false;
            }

            // Return true if the fields match:
            return a.Equals(b);
        }

        public static bool operator !=(EmailAddress a, EmailAddress b)
        {
            return !(a == b);
        }

        #endregion
    }

Factoring out the boilerplate

Obviously, the EmailAddress class as it stands has lots of boilerplate that is not specific to email addresses. We'll factor this out into a base class SemanticType. This can then be used to quickly define lots of semantic types.

Here is what EmailAddress will look like once we're done:

    public class EmailAddress : SemanticType<string>
    {
        public static bool IsValid(string value)
        {
            return (Regex.IsMatch(value,
                @"^(?("")("".+?(?<!\\)""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))" +
                @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$",
                RegexOptions.IgnoreCase));
        }

        // Constructor, taking an email address. The base constructor handles validation
        // and storage in the Value property.
        public EmailAddress(string emailAddress) : base(IsValid, emailAddress) { }
    }

Here we only specify what is EmailAddress specific, leaving the boilerplate to a base class SemanticType (which we'll get to in the next section):

The SemanticType base class will be storing the underlying value, so it needs to be generic, and have a type parameter with the type of the underlying value - in this case string.

The IsValid method is specific to EmailAddress, so it cannot be factored out.

It is the SemanticType constructor that stores the value, so it needs to know how to validate it. To make that happen, simply pass the IsValid method as a parameter. If no validation is needed, pass in null.

Another example is a BirthDate semantic type. This is a DateTime, except that birth dates must be in the past (unless you take advance bookings for a kindergarten) and they can't be more than say 130 years in the past (unless you store dead people's details).

    public class BirthDate : SemanticType<DateTime>
    {
        // Oldest person ever died at 122 years and 164 days
        // http://en.wikipedia.org/wiki/List_of_the_verified_oldest_people
        // To be safe, reject any age over 130 years.
        const int maxAgeForHumans = 130;
        const int daysPerYear = 365;

        public static bool IsValid(DateTime birthDate)
        {
            TimeSpan age = DateTime.Now - birthDate;
            return (age.TotalDays >= 0) && (age.TotalDays < daysPerYear * maxAgeForHumans);
        }

        public BirthDate(DateTime birthDate) : base(IsValid, birthDate) { }
    }

Creating the SemanticType base class

Let's start with the bare bones declaration:

public class SemanticType<T> { }

Value property

Add the Value property that will be used to store the underlying value. Note that it is of type T, the type of the underlying value:

    public class SemanticType<T>
    {
        public T Value { get; private set; }
    }

Constructor

Now for the constructor. This acts as a gatekeeper by throwing an exception when the passed in value is invalid, thereby ensuring that if you have a semantic type, it is always valid. Note that:

It doesn't allow null as a value. If you did allow null, there would be confusion between a null EmailAddress and an EmailAddress that has a null value.

It uses the IsValid static method that was passed in via the isValidLambda parameter to do the validation.

It uses the type of the derived class, retrieved with this.GetType(), to create a more meaningful exception message.

    using System;

    public class SemanticType<T>
    {
        public T Value { get; private set; }

        protected SemanticType(Func<T, bool> isValidLambda, T value)
        {
            if ((Object)value == null)
            {
                throw new ArgumentException(
                    string.Format("Trying to use null as the value of a {0}", this.GetType()));
            }

            if ((isValidLambda != null) && !isValidLambda(value))
            {
                throw new ArgumentException(
                    string.Format("Trying to set a {0} to {1} which is invalid", this.GetType(), value));
            }

            Value = value;
        }
    }
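To see the gatekeeper at work, the following sketch derives two types from the base class as developed so far. Percentage, Comment and the ConstructorDemo helper are hypothetical names introduced only for this illustration; Comment passes null for the validation method, so only the null check applies to it.

```csharp
using System;

// Base class as developed so far (Value property and constructor only).
public class SemanticType<T>
{
    public T Value { get; private set; }

    protected SemanticType(Func<T, bool> isValidLambda, T value)
    {
        if ((Object)value == null)
        {
            throw new ArgumentException(
                string.Format("Trying to use null as the value of a {0}", this.GetType()));
        }
        if ((isValidLambda != null) && !isValidLambda(value))
        {
            throw new ArgumentException(
                string.Format("Trying to set a {0} to {1} which is invalid", this.GetType(), value));
        }
        Value = value;
    }
}

// Hypothetical type with a validation rule: percentages run from 0 to 100.
public class Percentage : SemanticType<double>
{
    public static bool IsValid(double value) { return value >= 0 && value <= 100; }
    public Percentage(double value) : base(IsValid, value) { }
}

// Hypothetical type without validation: any non-null string is accepted.
public class Comment : SemanticType<string>
{
    public Comment(string value) : base(null, value) { }
}

public static class ConstructorDemo
{
    // Returns the exception message, or null if construction succeeded.
    public static string Try(Action create)
    {
        try { create(); return null; }
        catch (ArgumentException e) { return e.Message; }
    }
}
```

Calling `ConstructorDemo.Try(() => new Percentage(150))` returns a message that names the Percentage type, because this.GetType() reports the derived class rather than SemanticType<T>. `new Comment(null)` is rejected as well, even though Comment has no validation method of its own.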

Equality related code

Now we can implement the equality related code. First override the Equals and GetHashCode methods inherited from Object.

    using System;

    public class SemanticType<T>
    {
        public T Value { get; private set; }

        protected SemanticType(Func<T, bool> isValidLambda, T value)
        {
            if ((Object)value == null)
            {
                throw new ArgumentException(
                    string.Format("Trying to use null as the value of a {0}", this.GetType()));
            }

            if ((isValidLambda != null) && !isValidLambda(value))
            {
                throw new ArgumentException(
                    string.Format("Trying to set a {0} to {1} which is invalid", this.GetType(), value));
            }

            Value = value;
        }

        public override bool Equals(Object obj)
        {
            // Check for null and compare run-time types.
            if (obj == null || obj.GetType() != this.GetType())
            {
                return false;
            }
            return (Value.Equals(((SemanticType<T>)obj).Value));
        }

        public override int GetHashCode()
        {
            return Value.GetHashCode();
        }
    }
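One practical payoff of overriding Equals and GetHashCode together is that semantic types behave sensibly in hash-based collections such as HashSet<T> and Dictionary<TKey, TValue>. The sketch below uses a trimmed-down base class (validation is left out to keep it short) and two hypothetical types, CustomerId and OrderId:

```csharp
using System;
using System.Collections.Generic;

// Trimmed-down base class: validation omitted to keep the sketch short (assumption).
public class SemanticType<T>
{
    public T Value { get; private set; }
    protected SemanticType(T value) { Value = value; }

    public override bool Equals(Object obj)
    {
        // Check for null and compare run-time types.
        if (obj == null || obj.GetType() != this.GetType()) { return false; }
        return Value.Equals(((SemanticType<T>)obj).Value);
    }

    public override int GetHashCode() { return Value.GetHashCode(); }
}

// Hypothetical semantic types used only for this illustration.
public class CustomerId : SemanticType<int>
{
    public CustomerId(int value) : base(value) { }
}

public class OrderId : SemanticType<int>
{
    public OrderId(int value) : base(value) { }
}

public static class HashingDemo
{
    public static int CountDistinct(IEnumerable<CustomerId> ids)
    {
        // HashSet uses GetHashCode to pick a bucket and Equals to confirm a match.
        return new HashSet<CustomerId>(ids).Count;
    }
}
```

With these definitions, `HashingDemo.CountDistinct(new[] { new CustomerId(1), new CustomerId(1), new CustomerId(2) })` yields 2, and `new CustomerId(1).Equals(new OrderId(1))` is false because the run-time types differ.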

Implement IEquatable

Now we can implement the IEquatable interface, by implementing its Equals method.

The difference between IEquatable.Equals and Object.Equals is that IEquatable.Equals is strongly typed. This has the following advantages:

You get better type checking by the compiler.

It makes testing for equality a bit more efficient, because the argument does not have to be type checked and cast from Object. For types that are themselves value types, it also prevents boxing.

    using System;

    public class SemanticType<T> : IEquatable<SemanticType<T>>
    {
        public T Value { get; private set; }

        protected SemanticType(Func<T, bool> isValidLambda, T value)
        {
            if ((Object)value == null)
            {
                throw new ArgumentException(
                    string.Format("Trying to use null as the value of a {0}", this.GetType()));
            }

            if ((isValidLambda != null) && !isValidLambda(value))
            {
                throw new ArgumentException(
                    string.Format("Trying to set a {0} to {1} which is invalid", this.GetType(), value));
            }

            Value = value;
        }

        public override bool Equals(Object obj)
        {
            // Check for null and compare run-time types.
            if (obj == null || obj.GetType() != this.GetType())
            {
                return false;
            }
            return (Value.Equals(((SemanticType<T>)obj).Value));
        }

        public override int GetHashCode()
        {
            return Value.GetHashCode();
        }

        public bool Equals(SemanticType<T> other)
        {
            if (other == null) { return false; }
            return (Value.Equals(other.Value));
        }
    }

== and != operators

Finally, overload the == and != operators:

    using System;

    public class SemanticType<T> : IEquatable<SemanticType<T>>
    {
        public T Value { get; private set; }

        protected SemanticType(Func<T, bool> isValidLambda, T value)
        {
            if ((Object)value == null)
            {
                throw new ArgumentException(
                    string.Format("Trying to use null as the value of a {0}", this.GetType()));
            }

            if ((isValidLambda != null) && !isValidLambda(value))
            {
                throw new ArgumentException(
                    string.Format("Trying to set a {0} to {1} which is invalid", this.GetType(), value));
            }

            Value = value;
        }

        public override bool Equals(Object obj)
        {
            // Check for null and compare run-time types.
            if (obj == null || obj.GetType() != this.GetType())
            {
                return false;
            }
            return (Value.Equals(((SemanticType<T>)obj).Value));
        }

        public override int GetHashCode()
        {
            return Value.GetHashCode();
        }

        public bool Equals(SemanticType<T> other)
        {
            if (other == null) { return false; }
            return (Value.Equals(other.Value));
        }

        public static bool operator ==(SemanticType<T> a, SemanticType<T> b)
        {
            // If both are null, or both are same instance, return true.
            if (System.Object.ReferenceEquals(a, b))
            {
                return true;
            }

            // If one is null, but not both, return false.
            // Have to cast to object, otherwise you recursively call this == operator.
            if (((object)a == null) || ((object)b == null))
            {
                return false;
            }

            // Return true if the fields match:
            return a.Equals(b);
        }

        public static bool operator !=(SemanticType<T> a, SemanticType<T> b)
        {
            return !(a == b);
        }
    }
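The null handling in the operators is the part that is easiest to get wrong, so it is worth exercising. The sketch below repeats the equality members with validation left out for brevity, and uses a hypothetical PhoneNumber type:

```csharp
using System;

// Equality members as developed above; validation omitted to keep the sketch short.
public class SemanticType<T> : IEquatable<SemanticType<T>>
{
    public T Value { get; private set; }
    protected SemanticType(T value) { Value = value; }

    public override bool Equals(Object obj)
    {
        if (obj == null || obj.GetType() != this.GetType()) { return false; }
        return Value.Equals(((SemanticType<T>)obj).Value);
    }

    public override int GetHashCode() { return Value.GetHashCode(); }

    public bool Equals(SemanticType<T> other)
    {
        if (other == null) { return false; }
        return Value.Equals(other.Value);
    }

    public static bool operator ==(SemanticType<T> a, SemanticType<T> b)
    {
        if (System.Object.ReferenceEquals(a, b)) { return true; }
        // Cast to object to avoid calling this operator recursively.
        if (((object)a == null) || ((object)b == null)) { return false; }
        return a.Equals(b);
    }

    public static bool operator !=(SemanticType<T> a, SemanticType<T> b)
    {
        return !(a == b);
    }
}

// Hypothetical semantic type for the illustration.
public class PhoneNumber : SemanticType<string>
{
    public PhoneNumber(string value) : base(value) { }
}

public static class OperatorDemo
{
    public static void Run()
    {
        var a = new PhoneNumber("555-0100");
        var b = new PhoneNumber("555-0100");
        PhoneNumber none = null;

        Console.WriteLine(a == b);       // True: same underlying value
        Console.WriteLine(a != b);       // False
        Console.WriteLine(a == none);    // False: only one side is null
        Console.WriteLine(none == null); // True: both sides are null
    }
}
```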

ToString

ToString is implemented by every Object, that is, every single type in .NET, including value types such as int and double.

By default, this simply returns the name of the type. However, you'll usually want the string representation of the underlying value. This adds little for, say, EmailAddress, where the underlying value is already a string, but when it is for example a DateTime, this comes in handy.

Implementing ToString is pretty trivial:

    public class SemanticType<T> : IEquatable<SemanticType<T>>
    {
        ...

        public override string ToString()
        {
            return this.Value.ToString();
        }
    }

IComparable

Say you just converted your code to use EmailAddress for email addresses, rather than strings. The issue is that strings can be ordered with, say, List<T>.Sort (a@abc.com comes before b@abc.com, etc.). However, out of the box, you can't do this with plain objects.

The solution is that all .NET classes concerned with ordering objects check whether an object implements the IComparable<T> interface. To implement that interface, you have to add a CompareTo method that compares the object with another object of the same class.

IComparable<T> has a non generic counterpart, IComparable. This is a holdover from the dark and long past days when there were no generics. I decided not to support this, because it goes against the idea of using strong typing to catch bugs at compile time.

Implementing IComparable<T> in SemanticType<T> is simple - just compare the underlying values:

    // Does not compile
    public class SemanticType<T> : IEquatable<SemanticType<T>>, IComparable<SemanticType<T>>
    {
        ...

        public int CompareTo(SemanticType<T> other)
        {
            if (other == null) { return 1; }
            return this.Value.CompareTo(other.Value);
        }
    }

There is one problem here: this code doesn't compile. The compiler hasn't been told that type T (the type of the underlying value) actually implements CompareTo. There are a few options to fix this:

1. Check at run time whether T implements IComparable<T> using Type.IsAssignableFrom. If it does, cast to IComparable<T>. If it doesn't, throw an exception.

2. Add a constraint on T to ensure it implements IComparable<T>.

Option 1 defers checking whether T implements IComparable<T> to run time, while in option 2 this is done at compile time. Option 2 is also a bit simpler. This makes option 2 far preferable to me:

    public class SemanticType<T> : IEquatable<SemanticType<T>>, IComparable<SemanticType<T>>
        where T : IComparable<T>
    {
        ...

        public int CompareTo(SemanticType<T> other)
        {
            if (other == null) { return 1; }
            return this.Value.CompareTo(other.Value);
        }
    }
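As a quick check that the constrained version orders values as expected, here is a minimal sketch with a trimmed-down base class (validation and equality are left out) and a hypothetical Age type. It passes an explicit comparison to List<T>.Sort, which keeps the example independent of how the default comparer is resolved for derived types.

```csharp
using System;
using System.Collections.Generic;

// Trimmed-down base class: only what ordering needs (assumption for brevity).
public class SemanticType<T> : IComparable<SemanticType<T>>
    where T : IComparable<T>
{
    public T Value { get; private set; }
    protected SemanticType(T value) { Value = value; }

    public int CompareTo(SemanticType<T> other)
    {
        // By convention, any instance sorts after null.
        if (other == null) { return 1; }
        return this.Value.CompareTo(other.Value);
    }
}

// Hypothetical semantic type for the illustration.
public class Age : SemanticType<int>
{
    public Age(int years) : base(years) { }
}

public static class SortingDemo
{
    public static List<int> SortedYears(params int[] years)
    {
        var ages = new List<Age>();
        foreach (int year in years) { ages.Add(new Age(year)); }

        // Order by the strongly typed CompareTo defined on the base class.
        ages.Sort((x, y) => x.CompareTo(y));

        var result = new List<int>();
        foreach (Age age in ages) { result.Add(age.Value); }
        return result;
    }
}
```

For example, `SortingDemo.SortedYears(30, 5, 18)` returns the values in the order 5, 18, 30.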

What about the rare cases where the underlying value does not implement IComparable<T>? Maybe you want to wrap some legacy type into a semantic type.

To cater for this, in the Semantic Types NuGet package I introduced a class UncomparableSemanticType<T> - a version of SemanticType<T> that does not implement IComparable<T>. If you have a look at that code, you'll find that the common bits of these classes have been factored out to a common base class. Because this is pretty trivial, I haven't discussed that here.
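For readers who want a picture of that factoring without opening the package, the following is one possible shape. It is a guess for illustration, not the package's actual code: the base class name SemanticTypeBase<T> and the example types Thumbnail and Quantity are assumptions, and the equality members are left out to keep it short.

```csharp
using System;

// Hypothetical common base (the name is an assumption): storage and validation only.
public abstract class SemanticTypeBase<T>
{
    public T Value { get; private set; }

    protected SemanticTypeBase(Func<T, bool> isValidLambda, T value)
    {
        if ((Object)value == null)
        {
            throw new ArgumentException(
                string.Format("Trying to use null as the value of a {0}", this.GetType()));
        }
        if ((isValidLambda != null) && !isValidLambda(value))
        {
            throw new ArgumentException(
                string.Format("Trying to set a {0} to {1} which is invalid", this.GetType(), value));
        }
        Value = value;
    }
}

// No constraint on T and no IComparable<T>: for values that cannot be ordered.
public class UncomparableSemanticType<T> : SemanticTypeBase<T>
{
    protected UncomparableSemanticType(Func<T, bool> isValidLambda, T value)
        : base(isValidLambda, value) { }
}

// Adds ordering on top of the shared base; T must itself be comparable.
public class SemanticType<T> : SemanticTypeBase<T>, IComparable<SemanticType<T>>
    where T : IComparable<T>
{
    protected SemanticType(Func<T, bool> isValidLambda, T value)
        : base(isValidLambda, value) { }

    public int CompareTo(SemanticType<T> other)
    {
        if (other == null) { return 1; }
        return this.Value.CompareTo(other.Value);
    }
}

// Hypothetical examples: byte[] has no natural ordering, int does.
public class Thumbnail : UncomparableSemanticType<byte[]>
{
    public Thumbnail(byte[] bytes) : base(null, bytes) { }
}

public class Quantity : SemanticType<int>
{
    public static bool IsValid(int value) { return value >= 0; }
    public Quantity(int value) : base(IsValid, value) { }
}
```

Declaring `class Thumbnail : SemanticType<byte[]>` would be rejected by the compiler, because byte[] does not satisfy the IComparable<T> constraint. That is exactly the case the unconstrained class is meant to cover.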

Taming the physical world

Having simple semantic types that essentially just wrap a value works well for email addresses, phone numbers and other simple bits of data. Things get more interesting however when applying this to lengths, areas, weights and other physical units.

Us humans are inconsistent with our units

Let's go back to the bit of code we saw in the beginning:

    double d = GetDistance();
    double t = GetTemperature();

    // ... many complicated lines further ...

    // Adding a temperature to a distance doesn't make sense,
    // but the compiler won't warn you.
    double probablyWrong = d + t;

We can easily introduce semantic types Distance and Temperature here, so the compiler will catch our mistake:

    Distance d = GetDistance();
    Temperature t = GetTemperature();

    // ... many complicated lines further ...

    // Adding a temperature to a distance doesn't make sense,
    // and now the compiler will catch our mistake.
    double probablyWrong = d + t; // doesn't compile

But this code raises a new question: is that distance in meters? Kilometers? Feet? Inches? And the temperature: degrees Celsius? Fahrenheit? Kelvin?

An improvement would be to add the unit to the variable names:

    Distance distanceMeters = GetDistanceInMeters();
    Temperature temperatureCelsius = GetTemperatureInCelsius();

But that gets very clunky, and can easily get outdated. And what if your site has users in the US, Europe and the UK? You're now dealing with feet and meters, pounds and kilograms, and possibly much more.

Even if your site is just feet right now, your marketing department is probably already eying some market where people use meters. Going through all numeric variables and methods to make your site handle both feet and meters won't be much fun.

The problem is that you would have to keep track of the unit of each length, weight etc. in some separate variable that can easily get out of sync. Plus you'll be writing lots of conversion methods - MetersToInches, InchesToFeet, etc. A clear invitation for complexity, weird bugs, pain and frustration.

The solution is to stop putting meters, feet, inches, kilograms, pounds, etc. in your variables. Instead, think simply in terms of Lengths, Weights, etc. Remember, the length of a real world object is the same, regardless of whether you speak inches or meters.

A Length object would look like this:

    public class Length
    {
        // Store lengths internally as meters
        public double Value { get; private set; }

        public Length(double value) { Value = value; }

        // Get length in feet
        public double Feet { get { return Value / 0.3048; } }

        // Get length in meters
        public double Meters { get { return Value; } }

        // Create length based on number of feet
        public static Length FromFeet(double feet) { return new Length(feet * 0.3048); }

        // Create length based on number of meters
        public static Length FromMeters(double meters) { return new Length(meters); }
    }

While a weight object would go like so:

    public class Weight
    {
        // Store weights internally as kilograms
        public double Value { get; private set; }

        public Weight(double value) { Value = value; }

        // Get weight in pounds
        public double Pounds { get { return Value / 0.45359237; } }

        // Get weight in kilograms
        public double Kilograms { get { return Value; } }

        // Create weight based on number of kilograms
        public static Weight FromKilograms(double kilograms) { return new Weight(kilograms); }

        // Create weight based on number of pounds
        public static Weight FromPounds(double pounds) { return new Weight(pounds * 0.45359237); }
    }

Now you can write:

    Length userHeight = Length.FromMeters(height_entered_by_european);
    Weight userWeight = Weight.FromKilograms(weight_entered_by_european);

    ....

    Length userHeight = Length.FromFeet(height_entered_by_american);
    Weight userWeight = Weight.FromPounds(weight_entered_by_american);

    ....

    // Calculate Body Mass Index, a measure of obesity.
    // BMI is always calculated in kilograms and meters, including in the US and UK.
    double bmi = userWeight.Kilograms / (userHeight.Meters * userHeight.Meters);

Now it is always clear whether some number of type double is supposed to be a length in meters, a weight in pounds, etc. And there is no more worrying whether userWeight is in kilograms or pounds. If you need the weight in kilograms, you simply retrieve it in kilograms.

If this sounds good to you, but you don't want to code lots of classes with conversions, etc., have a look at the NuGet package Units.NET. It has dozens of units, all with mathematical and comparison operators, ToString, etc. A very complete package.

A length times a length is no longer a length

If you use a package such as Units.NET, you probably have the usual arithmetic operators defined on your unit classes:

    public static Length operator +(Length left, Length right)
    {
        return new Length(left.Value + right.Value);
    }

    public static Length operator -(Length left, Length right)
    {
        return new Length(left.Value - right.Value);
    }

Things get more complicated when it comes to multiplication and division. A length of 2 meters plus another length of 3 meters is a length of 5 meters. But a length of 2 meters times a length of 3 meters is an area of 6 square meters:

    public static Area operator *(Length left, Length right)
    {
        return new Area(left.Value * right.Value);
    }

An Area times a Length is a Volume. A Length divided by a time period is a Speed, etc. It's up to you where you decide to draw the line here.
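The same pattern extends to those other combinations. The sketch below is a minimal illustration, not the Units.NET API: the classes, their internal units (meters, square meters, cubic meters, meters per second) and the use of TimeSpan for the time period are all assumptions made for the example.

```csharp
using System;

public class Length
{
    // Stored internally as meters.
    public double Value { get; private set; }
    public Length(double meters) { Value = meters; }

    // Length x Length = Area
    public static Area operator *(Length left, Length right)
    {
        return new Area(left.Value * right.Value);
    }

    // Length / time period = Speed
    public static Speed operator /(Length distance, TimeSpan duration)
    {
        return new Speed(distance.Value / duration.TotalSeconds);
    }
}

public class Area
{
    // Stored internally as square meters.
    public double Value { get; private set; }
    public Area(double squareMeters) { Value = squareMeters; }

    // Area x Length = Volume
    public static Volume operator *(Area area, Length height)
    {
        return new Volume(area.Value * height.Value);
    }
}

public class Volume
{
    // Stored internally as cubic meters.
    public double Value { get; private set; }
    public Volume(double cubicMeters) { Value = cubicMeters; }
}

public class Speed
{
    // Stored internally as meters per second.
    public double Value { get; private set; }
    public Speed(double metersPerSecond) { Value = metersPerSecond; }

    public double KilometersPerHour { get { return Value * 3.6; } }
}
```

With these definitions, `new Length(2) * new Length(3) * new Length(4)` has type Volume, and `new Length(100) / TimeSpan.FromSeconds(10)` has type Speed. Trying to add an Area to a Length does not compile, because no such operator exists. Note that the division does not guard against a zero-length time period.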

Conclusion

You saw how Semantic Types help you prevent bugs by getting the compiler to find them for you at compile time. They also make it easier to understand your code by letting you specify that something is an email address, distance, temperature, etc. rather than just some string or double.

You also saw how to create a SemanticType base class that makes it easy to create new semantic types without getting bogged down in lots of boilerplate.