In his continuing quest for productivity and performance in the Java programming language, Brian Goetz, Java Language Architect at Oracle, introduced an experimental concept of data classes that has the potential to someday be integrated into the language. His research demonstrates a natural fit between data classes and upcoming features such as value types and pattern matching. But there is much work to be done before this concept is ready to become part of the Java language. Goetz explored the problems and tradeoffs of data classes on the premise that sometimes "data is just data."

Motivation

Java classes typically require lots of boilerplate code regardless of how simple or how complex those classes may be. This has led to Java's reputation for being "too verbose." Goetz explains:

To write a simple data carrier class responsibly, we have to write a lot of low-value, repetitive code: constructors, accessors, equals(), hashCode(), toString(), etc. And developers are sometimes tempted to cut corners such as omitting these important methods, leading to surprising behavior or poor debuggability, or pressing an alternate but not entirely appropriate class into service because it has the "right shape" and they don't want to define yet another class. IDEs will help you write most of this code, but writing code is only a small part of the problem. IDEs don't do anything to help the reader distill the design intent of "I'm a plain data carrier for x, y, and z" from the dozens of lines of boilerplate code. And repetitive code is a good place for bugs to hide; if we can, it is best to eliminate their hiding spots outright.

Scala (case), Kotlin (data) and C# (record) each provide a compact class declaration, and a Java class could likewise serve as a plain data carrier with a minimum of overhead. Without a formal definition of a plain data carrier, however, most Java developers would most likely be unable to recognize one. And while the Java community would indeed welcome a data class mechanism in the language, individual interpretations of a plain data carrier could be vastly different. Goetz used the parable of the blind men and an elephant to explain:

Algebraic Annie will say "a data class is just an algebraic product type." Like Scala's case classes, they come paired with pattern matching, and are best served immutable (and for dessert, Annie would order sealed interfaces). Boilerplate Billy will say "a data class is just an ordinary class with better syntax", and will likely bristle at constraints on mutability, extension, or encapsulation (Billy's brother, JavaBean Jerry, will say "these must be for JavaBeans -- so of course I get getters and setters too." And his sister, POJO Patty, remarks that she is drowning in enterprise POJOs, and reminds us that she'd like these to be proxyable by frameworks like Hibernate). Tuple Tommy will say "a data class is just a nominal tuple" -- and may not even be expecting them to have methods other than the core Object methods -- they're just the simplest of aggregates (he might even expect the names to be erased, so that two data classes of the same "shape" can be freely converted). Values Victor will say "a data class is really just a more transparent value type." All of these personae are united in favor of "data classes" -- but have different ideas of what data classes are, and there may not be any one solution that makes them all happy.

Understanding the Problem

The concept of data classes goes beyond reduction in boilerplate code, which Goetz maintains is "just a symptom of a deeper problem" in which the cost of encapsulation is shared among all Java classes. The object-oriented principles of abstraction and encapsulation allow Java developers to write robust and safe code across various boundaries:

Maintenance boundaries

Security and trust boundaries

Integrity boundaries

Versioning boundaries

For a class such as SocketInputStream, these boundaries are essential because of its inherent complexity. But does a class that is a plain data carrier for, say, two integer components (such as the example declared below) really need to be concerned with such boundaries?

record Point(int x,int y) { ... }

Goetz explains:

Since the cost of establishing and defending these boundaries (how constructor arguments map to state, how to derive the equality contract from state, etc.) is constant across classes, but the benefit is not, the cost may sometimes be out of line with the benefit. This is what Java developers mean by "too much ceremony" -- not that the ceremony has no value, but that they're forced to invoke it even when it does not offer sufficient value. The encapsulation model that Java provides -- where the representation is entirely decoupled from construction, state access, and equality -- is just more than many classes need. Classes that have a simpler relationship with their boundaries can benefit from a simpler model where we can define a class as a thin wrapper around its state, and derive the relationship between state, construction, equality, and state access from that. Further, the costs of decoupling representation from API goes beyond the overhead of declaring boilerplate members; encapsulation is, by its nature, information-destroying.

Requirements for Data Classes

Using the Point declaration above, consider its "de-sugared" definition as a plain data carrier:

final class Point extends java.lang.DataClass {
    public final int x;
    public final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // destructuring pattern for Point(int x, int y)
    // state-based implementations of equals(), hashCode(), and toString()
    // public read accessors x() and y()
}
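To make the elided members concrete, here is a minimal sketch of how the accessors and the state-based Object methods might be written by hand in today's Java. It is an illustration, not Goetz's generated code: the wrapper class PointCarrierDemo and the toString() format are assumptions, the hypothetical java.lang.DataClass supertype is left out, and the destructuring pattern has no hand-written counterpart here.

```java
import java.util.Objects;

// Hand-written approximation of the de-sugared Point (illustrative only).
// The hypothetical java.lang.DataClass supertype is deliberately omitted.
class PointCarrierDemo {

    static final class Point {
        public final int x;
        public final int y;

        public Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        // Public read accessors named after the state components.
        public int x() { return x; }
        public int y() { return y; }

        // State-based equality: equal when every component matches.
        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point other = (Point) o;
            return x == other.x && y == other.y;
        }

        // Hash derived from the same components that equals() compares.
        @Override
        public int hashCode() {
            return Objects.hash(x, y);
        }

        // The output format is an assumption made for this sketch.
        @Override
        public String toString() {
            return "Point[x=" + x + ", y=" + y + "]";
        }
    }

    public static void main(String[] args) {
        Point a = new Point(1, 2);
        Point b = new Point(1, 2);
        System.out.println(a.equals(b));                   // true
        System.out.println(a.hashCode() == b.hashCode());  // true
        System.out.println(a);                             // Point[x=1, y=2]
    }
}
```

Every line of the nested class is derivable from the two-component state description, which is the repetition a data class declaration would remove.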

To further study the design of plain data carriers, Goetz defined a set of requirements (or constraints) to "safely and mechanically generate the boilerplate for constructors, pattern extractors, accessors, equals(), hashCode(), and toString() -- and more." He writes:

We say a class C is a transparent carrier for a state vector S if:

There is a function ctor : S -> C which maps an instance of the state vector to an instance of C (the constructor may reject some state vectors as invalid, such as rational numbers whose denominator is zero).

There is a total function dtor : C -> S which maps an instance of C to a state vector S in the domain of ctor.

For any instance c of C, ctor(dtor(c)) is equal to c, according to the equals() contract for C.

For two state vectors s1 and s2, if each of their components is equal to the corresponding component of the other (according to the component's equals() contract), then either ctor(s1) and ctor(s2) are both undefined, or they are equal under the equals() contract for C.

For equivalent instances c and d, invoking the same operation produces equivalent results: c.m() equals d.m(). Moreover, after the operation, c and d should still be equivalent.

These invariants are an attempt to capture our requirements; that the carrier is transparent, and that there is a simple and predictable relationship between the class's representation, its construction, and its destructuring -- that the API is the representation.
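The round-trip requirement ctor(dtor(c)) equals c can be checked mechanically. The following is a rough sketch under stated assumptions: Rational, dtor() and roundTrip() are hypothetical names chosen to echo the rational-number example in the quote, and the state vector is modelled as a plain int array because today's Java has no built-in notion of one.

```java
// Illustrative check of the round-trip invariant for a hypothetical carrier.
class TransparentCarrierDemo {

    // Carrier for the state vector (int num, int den).
    static final class Rational {
        final int num;
        final int den;

        // ctor : S -> C, rejecting invalid state vectors.
        Rational(int num, int den) {
            if (den == 0) {
                throw new IllegalArgumentException("denominator is zero");
            }
            this.num = num;
            this.den = den;
        }

        // dtor : C -> S, total: every instance yields a vector ctor accepts.
        int[] dtor() {
            return new int[] { num, den };
        }

        // Equality derived purely from the state components.
        @Override
        public boolean equals(Object o) {
            return o instanceof Rational
                && ((Rational) o).num == num
                && ((Rational) o).den == den;
        }

        @Override
        public int hashCode() {
            return 31 * num + den;
        }
    }

    // Rebuilds an instance from its own state vector: ctor(dtor(c)).
    static Rational roundTrip(Rational c) {
        int[] s = c.dtor();
        return new Rational(s[0], s[1]);
    }

    public static void main(String[] args) {
        Rational half = new Rational(1, 2);
        System.out.println(roundTrip(half).equals(half)); // true
    }
}
```

Because equality here is derived from the components alone, this sketch treats 1/2 and 2/4 as different instances. A class that normalized its fraction in the constructor would still have to return the normalized components from dtor() to keep the invariant.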

Data Classes and Pattern Matching

A plain data carrier has the advantage, as Goetz states, of being able "to freely convert a data class instance back and forth between its aggregate form and exploded state." This would work well with pattern matching. In his pattern matching paper, Goetz discussed destructuring and improvements to the switch construct. With this in mind, it could be possible to write the following code:

interface Shape { ... }
record Point(int x, int y) { ... }
record Rect(Point p1, Point p2) implements Shape { ... }
record Circle(Point center, int radius) implements Shape { ... }
...
switch (shape) {
    case Rect(Point(var x1, var y1), Point(var x2, var y2)): ...
    case Circle(Point(var x, var y), int radius): ...
}

Any concrete instance of Shape could easily be destructured within the switch statement. This could also be useful for externalization such as serialization, marshalling to/from JSON and XML, and database mapping.
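For comparison, here is a minimal sketch of the manual work the proposed switch would replace: a type test, a cast, and component-by-component extraction. The shape classes are ordinary hand-written classes rather than records, and DestructuringDemo and describe() are hypothetical names introduced only for this illustration.

```java
// Manual "destructuring" in current Java, for contrast with the proposed switch.
class DestructuringDemo {

    interface Shape { }

    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static final class Rect implements Shape {
        final Point p1;
        final Point p2;
        Rect(Point p1, Point p2) { this.p1 = p1; this.p2 = p2; }
    }

    static final class Circle implements Shape {
        final Point center;
        final int radius;
        Circle(Point center, int radius) { this.center = center; this.radius = radius; }
    }

    // Test the type, cast, then pull out each nested component by hand.
    static String describe(Shape shape) {
        if (shape instanceof Rect) {
            Rect r = (Rect) shape;
            int x1 = r.p1.x, y1 = r.p1.y;
            int x2 = r.p2.x, y2 = r.p2.y;
            return "Rect " + Math.abs(x2 - x1) + "x" + Math.abs(y2 - y1);
        } else if (shape instanceof Circle) {
            Circle c = (Circle) shape;
            return "Circle r=" + c.radius + " at (" + c.center.x + "," + c.center.y + ")";
        }
        throw new IllegalArgumentException("unknown shape");
    }

    public static void main(String[] args) {
        System.out.println(describe(new Rect(new Point(0, 0), new Point(3, 4)))); // Rect 3x4
        System.out.println(describe(new Circle(new Point(1, 2), 5)));             // Circle r=5 at (1,2)
    }
}
```

Each branch repeats the type name three times and spells out every field access, which is the ceremony that nested deconstruction patterns aim to collapse into a single case label.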

Refining the Design Space

Goetz noted that the requirements for being a plain data carrier come with trade-offs. He explains:

The simplest -- and most draconian -- model for data classes is to say that a data class is a final class with public final fields for each state component, a public constructor and deconstruction pattern whose signature matches that of the state description, and state-based implementations of the core Object methods, and further, that no other members (or explicit implementations of the implicit members) are allowed. This is essentially the strictest interpretation of a nominal tuple. This starting point is simple and stable -- and nearly everyone will find something to object to about it. So, how much can we relax these constraints without giving up on the semantic benefits we want? Let's look at some directions in which the draconian starting point could be extended, and their interactions.

These directions cover a wide array of design elements and related issues:

Interfaces and additional methods: Risk violating the "nothing but the state" rule.

Overriding implicit members: Risk violating the requirements of a plain data carrier.

Additional constructors: Ensure the object state and state description are equivalent.

Additional fields: Risk violating "the state, the whole state, and nothing but the state" rule.

Extension: Issues related to extension between data classes and regular classes.

Mutability: Question the rationale of allowing data classes to be mutable.

Field encapsulation and accessors: Ensure that encapsulated fields remain readable.

Arrays and defensive copies: Defensive copies violate the invariant that destructuring and then reconstructing an instance yields an equal instance.

Thread safety: Question how mutability in data classes can be thread safe.
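The arrays and defensive copies item can be illustrated with a short sketch. It assumes that equality is derived mechanically by comparing each component with that component's own equals(), which for arrays means reference identity. Samples and DefensiveCopyDemo are hypothetical names, and this shows one way the conflict can surface, not necessarily the exact scenario Goetz analyzes.

```java
import java.util.Arrays;
import java.util.Objects;

// Illustrative only: defensive copies versus component-wise derived equality.
class DefensiveCopyDemo {

    // Hypothetical carrier with a single array component.
    static final class Samples {
        private final int[] values;

        Samples(int[] values) {
            this.values = values.clone();   // defensive copy on the way in
        }

        int[] values() {
            return values.clone();          // defensive copy on the way out
        }

        // Component-wise equality; arrays compare by reference identity.
        @Override
        public boolean equals(Object o) {
            return o instanceof Samples
                && Objects.equals(values, ((Samples) o).values);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(values);
        }
    }

    public static void main(String[] args) {
        Samples c = new Samples(new int[] { 1, 2, 3 });
        Samples rebuilt = new Samples(c.values());   // ctor(dtor(c))

        System.out.println(rebuilt.equals(c));                           // false
        System.out.println(Arrays.equals(rebuilt.values(), c.values())); // true
    }
}
```

The rebuilt instance holds the same contents but a different array object, so the round trip no longer produces an equal instance unless equality is defined over array contents instead.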



Summary

Java had an excellent year in 2017 and there is much excitement about the language this year. However, as Goetz told InfoQ, data classes are still considered a "half-baked" idea that requires more work to fully understand how this concept can someday be a reality.

In summary, Goetz explains:

The key question in designing a facility for "plain data aggregates" in Java is identifying which degrees of freedom we are willing to give up. If we try to model all the degrees of freedom of classes, we just move the complexity around; to gain some benefit, we must accept some constraints. We think that the sensible constraints to accept are disavowing the use of encapsulation for decoupling representation from API, and for mediating read access to state; in turn, this provides significant syntactic and semantic benefits for classes which can accept these constraints.

Vicente Romero, principal member of the technical staff at Oracle, recently posted an "initial public push" on the development of data classes that can be found on the datum branch of the Project Amber repository.

Goetz spoke to InfoQ about his data classes research:

InfoQ: What kind of community response have you received since publishing your paper?

Brian Goetz: The expected response: some highly positive comments about the idea, and a variety of suggestions (mostly mutually inconsistent) for how it could be "improved." Which is to say, people like the idea, but, as expected, many people would want us to move the design center in one direction, or another, to suit their personal preferences. As a highly subjective feature, this was to be expected.

InfoQ: Do you envision a data class mechanism to someday be integrated in the Java programming language? If so, what kind of effort will be necessary to address all the concerns you discussed in your paper?

Goetz: It is going to require "bake time." With language design, your first idea, no matter how carefully thought out, is going to be wrong. As will your second. Many language features require half a dozen iterations or more before you ultimately discover the right place to land. So we'll be experimenting, prototyping, gathering feedback, iterating, and iterating again. Until we feel we've gotten to the right place.

InfoQ: Is it a goal or a non-goal to promote the rebasing of non-Java languages' implementations of compact classes (e.g., Scala case classes) on top of data classes?

Goetz: Every language is going to have its own surface syntax. However, data classes connect with other language features, such as pattern matching, and we hope (as happened with Lambda) that other languages will target the runtime support of these features, and gain interoperability benefits.

InfoQ: As far as you may know, did the architects of Scala, Kotlin, and C# face similar challenges in implementing a more compact class declaration?

Goetz: Indeed so, though both Kotlin and Scala were able to take this on much closer to the beginning of their projects than C# did, so had fewer constraints to navigate. And each settled in a slightly different point in the design space.

InfoQ: What is the single most important take-home message you would like our readers to know about data classes?

Goetz: That data classes are about data, not about syntactic concision. They are about providing a natural means to model pure data in the object model. And not all classes are carriers for pure data, even if they would like the concision benefits that data classes offer.

InfoQ: What's on the horizon for your data classes research?

Goetz: Breaking the features that data classes need into finer-grained features, that might be usable by all classes. For example, even in classes that are clearly not just data carriers, constructors are full of error-prone repetition, which could be replaced by making a higher-level correspondence between constructor parameters and representation. This way, data classes become simpler (just sugar for other language features), and more classes can get the benefit of the feature without trying to shoehorn them into data classes.
