Python data classes

The reminder that the feature freeze for Python 3.7 is coming up fairly soon (January 29) was met with a flurry of activity on the python-dev mailing list. Numerous Python enhancement proposals (PEPs) were updated or newly proposed; other features or changes have been discussed as well. One of the updated PEPs is proposing a new type of class, a "data class", to be added to the standard library. Data classes would serve much the same purpose as structures or records in other languages and would use the relatively new type annotations feature to support static type checking of the use of the classes.

PEP 557 ("Data Classes") came out of a discussion on the python-ideas mailing list back in May, but its roots go back much further than that. The attrs module, which is aimed at reducing the boilerplate code needed for Python classes, is a major influence on the design of data classes, though it goes much further than the PEP. attrs is not part of the standard library, but is available from the Python Package Index (PyPI); it has been around for a few years and is quite popular with many Python developers. The idea behind both attrs and data classes is to automatically generate many of the "dunder" methods (e.g. __init__(), __repr__()) needed, especially for a class that is largely meant to hold various typed data items.

Python's named tuples are another way to easily create a class with named data items, but they suffer from a number of shortcomings. For one, they are still tuples, so two named tuples with the same set of values will compare as equal even if they have different names for the "fields". In addition, they are always immutable (like tuples) and their values can be accessed by indexing (e.g. nt[2]), which can lead to confusion and bugs.
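Those shortcomings are easy to reproduce. The following is a minimal sketch; the Point and Size types are hypothetical names chosen for illustration and do not come from the PEP:

```python
from collections import namedtuple

# Two unrelated record types that happen to hold the same values.
# (Point and Size are hypothetical names used only for this sketch.)
Point = namedtuple("Point", ["x", "y"])
Size = namedtuple("Size", ["width", "height"])

p = Point(3, 4)
s = Size(3, 4)

# Named tuples are still tuples, so field names play no part in comparison.
same = (p == s)
also_plain = (p == (3, 4))

# Fields are reachable by position as well as by name.
first = p[0]

# Like tuples, they cannot be modified in place.
try:
    p.x = 10
    mutable = True
except AttributeError:
    mutable = False
```

Here same and also_plain both end up true even though a point and a size are conceptually different things, and mutable ends up false.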

As the "Rationale" section of the PEP notes, there are various descriptions out there of ways to support data classes in Python, along with people posting questions about how to do that kind of thing. For many, attrs provides what they need (Twisted developer Glyph Lefkowitz championed the module in a blog post in 2016, for example), but it also provides more than some need. Beyond that, discussion in the GitHub repository for the data classes project indicated that attrs moves too quickly and has extra features that make it not suitable for standard library inclusion. Data classes are meant to be a simpler, standard way to have some of that functionality:

Data Classes are not, and are not intended to be, a replacement mechanism for all of the above libraries. But being in the standard library will allow many of the simpler use cases to instead leverage Data Classes. Many of the libraries listed have different feature sets, and will of course continue to exist and prosper.

Eric V. Smith picked up the suggestion of writing a PEP from the python-ideas thread and posted the first version to python-dev back in September. The first set of comments was the inevitable bikeshedding over the name, which continued even after Guido van Rossum asked that it stop. Van Rossum is satisfied with the "data classes" name, though others like "record", "struct", and the like. There were some more technical comments made in that thread, which Smith incorporated into the revision he posted about in late November.

The overall goal is to reduce the boilerplate that needs to be written for a class with typed data fields. To that end, there is an @dataclass decorator (in the dataclasses module) that processes the class definition to find typed fields. It then generates the various dunder methods and attaches them to the class, which is then returned by the decorator. It would look something like the following example from the PEP:

    @dataclass
    class InventoryItem:
        '''Class for keeping track of an item in inventory.'''
        name: str
        unit_price: float
        quantity_on_hand: int = 0

        def total_cost(self) -> float:
            return self.unit_price * self.quantity_on_hand

The total_cost() method was put into the example to help show that a data class is simply a regular class and can have its own methods, be subclassed, and so on. From the above declaration, InventoryItem would automatically get a properly type-annotated __init__() method, along with a __repr__() that produces a descriptive string and a bunch of comparison operators (e.g. __eq__(), __lt__(), __ge__()). None of those need to be written or maintained by the developer.
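A short sketch of what those generated methods do in practice follows. It assumes the dataclasses module as it eventually shipped in the standard library, where the default decorator generates __init__(), __repr__(), and __eq__(), while the ordering methods have to be requested explicitly:

```python
from dataclasses import dataclass


@dataclass
class InventoryItem:
    '''Class for keeping track of an item in inventory.'''
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand


# The generated __init__() accepts the fields in declaration order,
# positionally or by keyword, and honors the declared default.
item = InventoryItem("widget", 2.5, quantity_on_hand=4)
default_item = InventoryItem("gadget", 1.0)

# The generated __repr__() names each field and its value.
text = repr(item)

# The generated __eq__() compares field by field.
equal = (item == InventoryItem("widget", 2.5, 4))

# Hand-written methods work as usual.
cost = item.total_cost()
```

The values used here are made up for illustration; the point is only that none of the dunder methods being exercised appear in the class body.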

More fine-grained control over the generated methods is available using parameters passed to the dataclass() decorator. There is a handful of boolean flags that determine whether certain methods are generated (init, repr, eq, compare); the latter two allow only generating equality methods (__eq__() and __ne__()) or generating the full set of comparison methods. These methods all test objects as if they were a tuple of the fields in the order specified in the class definition.
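The tuple-like comparison can be sketched as follows. Note the assumption: in the module as it was eventually released, the flag that requests the full set of ordering methods is spelled order rather than compare, so that is what this example uses. Version and Quiet are hypothetical classes invented for the sketch:

```python
from dataclasses import dataclass


# order=True is the released spelling of the "full comparison" flag.
@dataclass(order=True)
class Version:
    major: int
    minor: int
    patch: int = 0


a = Version(1, 2)
b = Version(1, 10)

# Instances compare like (major, minor, patch) tuples, field by field,
# in the order the fields were declared.
a_before_b = a < b
as_tuples = (1, 2, 0) < (1, 10, 0)

ordered = sorted([Version(2, 0), Version(1, 5, 1), Version(1, 5)])


# Turning a flag off leaves the corresponding method alone; with
# repr=False the class keeps the default object representation.
@dataclass(repr=False)
class Quiet:
    value: int


quiet_text = repr(Quiet(1))
```

Because the comparison is numeric per field, Version(1, 2) sorts before Version(1, 10), which is not what a string comparison of "1.2" and "1.10" would give.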

There was some discussion of how to handle comparisons between objects that have different types. Obviously, comparing unrelated objects should fail (the generated methods return NotImplemented, which leads Python to raise a TypeError for the ordering operators), but for subclasses that don't add any fields, an argument could be made that the comparison should be done. Smith considered using an isinstance() check, but ended up taking the lead from attrs and sticking with strict type checks for all of the comparison operators. This GitHub issue has a bit more discussion, including that attrs is actually only strict for the equality operators, something attrs author Hynek Schlawack called an oversight.
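The effect of the strict type check can be seen with a subclass that adds no fields. This is a sketch against the released module (so order=True is assumed as the flag name); Base and Child are hypothetical names:

```python
from dataclasses import dataclass


@dataclass(order=True)
class Base:
    x: int


class Child(Base):
    pass  # adds no fields, inherits the generated methods


base, child = Base(1), Child(1)

# Same field values, but the classes differ, so the generated __eq__()
# declines the comparison and Python falls back to identity.
equal_across_types = (base == child)

# For ordering there is no fallback, so Python raises TypeError.
try:
    base < child
    ordering_allowed = True
except TypeError:
    ordering_allowed = False

# Instances of exactly the same class compare normally.
same_class_equal = (Child(1) == Child(1))
```

An isinstance()-based check would have made the first comparison succeed; the strict check treats Base and Child as unrelated for comparison purposes.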

There are two other flags for dataclass() that govern whether the class is "frozen" (emulating immutability by raising an exception when any field is assigned to) and whether a __hash__() method will be generated (thus allowing objects to be used as dictionary keys). The two are somewhat intertwined (and interact with the eq flag as well), so the flag interpretations reflect that:

If eq and frozen are both true, Data Classes will generate a __hash__ method for you. If eq is true and frozen is false, __hash__ will be set to None, marking it unhashable (which it is). If eq is false, __hash__ will be left untouched meaning the __hash__ method of the superclass will be used (if the superclass is object, this means it will fall back to id-based hashing).

Although not recommended, you can force Data Classes to create a __hash__ method with hash=True. This might be the case if your class is logically immutable but can nonetheless be mutated. This is a specialized use case and should be considered carefully.

That all seems a little clunky, but it is likely to be a fairly fringe feature that will not see much use.
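The three combinations described in the PEP text can be checked directly. The sketch below runs against the released module and uses hypothetical class names; it does not exercise the forcing flag, which ended up being called unsafe_hash rather than hash in the released version:

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)          # eq defaults to True
class FrozenPoint:
    x: int
    y: int


@dataclass                       # eq=True, frozen=False
class MutablePoint:
    x: int
    y: int


@dataclass(eq=False)
class IdentityPoint:
    x: int
    y: int


# Frozen instances reject assignment to any field.
fp = FrozenPoint(1, 2)
try:
    fp.x = 5
    assigned = True
except FrozenInstanceError:
    assigned = False

# eq and frozen both true: a value-based __hash__() is generated,
# so equal instances can find each other as dictionary keys.
lookup = {FrozenPoint(1, 2): "found"}
found = lookup[FrozenPoint(1, 2)]

# eq true, frozen false: __hash__ is set to None (unhashable).
unhashable = MutablePoint.__hash__ is None

# eq false: __hash__ is left untouched, so the class body gains no
# __hash__ of its own and object's id-based hashing is inherited.
untouched = "__hash__" not in IdentityPoint.__dict__
```

Trying hash(MutablePoint(1, 2)) would raise a TypeError, which is the same behavior a hand-written class gets when it defines __eq__() without __hash__().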

Fields can be specified using the type annotation syntax (as in the example above), but more control is available using the field() function. That allows fields to be removed from the generated methods using the init, repr, compare, and hash flags. It also provides a way to set the default value, since using field() precludes the usual way to set a default, as an example in the PEP shows:

    @dataclass
    class C:
        x: int
        y: int = field(repr=False)
        z: int = field(repr=False, default=10)
        t: int = 20

Beyond that, there can be a default_factory passed to create new empty objects (e.g. dict, list) for the field, since using [] or {} directly would result in all objects sharing the same list or dictionary. There is also a metadata parameter that can set some user-specific metadata on the Field objects that are created for each field in a data class (and can be retrieved using the fields() function in dataclasses).
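A sketch pulling the field() options together follows. The Order class, its field names, and the "unit" metadata key are all hypothetical. One detail reflects the released module rather than the text above: it refuses a bare [] default outright with a ValueError instead of letting instances share the list.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Order:
    customer: str
    # A fresh list is created for every instance.
    items: list = field(default_factory=list)
    # Left out of both the repr and the comparisons.
    internal_id: int = field(default=0, repr=False, compare=False)
    # Arbitrary user data attached to the Field object.
    weight: float = field(default=0.0, metadata={"unit": "kg"})


a = Order("alice")
b = Order("bob")
a.items.append("widget")
separate_lists = (b.items == [])

text = repr(Order("carol", internal_id=7))
same = (Order("dave", internal_id=1) == Order("dave", internal_id=2))

# fields() returns the Field objects, metadata included.
units = {f.name: f.metadata.get("unit") for f in fields(Order)}

# The released module rejects a mutable default such as [].
try:
    @dataclass
    class Broken:
        items: list = []
    rejected = False
except ValueError:
    rejected = True
```

The metadata mapping is not interpreted by dataclasses at all; it is there for third-party code that wants to annotate fields.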

There are some other module-level helper functions, such as asdict() and astuple(), to convert a data class to a dict or tuple; isdataclass() allows checking to see if an object is a data class instance. There is more to the data class specification, but the summary above hits most of the high notes.
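The helpers can be tried out as shown below. This sketch assumes the released module, in which the checking function is spelled is_dataclass(); Point and Line are hypothetical classes:

```python
from dataclasses import dataclass, asdict, astuple, is_dataclass


@dataclass
class Point:
    x: int
    y: int


@dataclass
class Line:
    start: Point
    end: Point


line = Line(Point(0, 0), Point(3, 4))

# Both converters recurse into nested data classes.
as_dict = asdict(line)
as_tuple = astuple(line)

# is_dataclass() accepts either an instance or the class itself.
instance_check = is_dataclass(line)
class_check = is_dataclass(Line)
other_check = is_dataclass(42)
```

The nested conversion makes asdict() a convenient first step toward serializing a data class, for example with the json module.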

So far, there have been few real objections to the idea. Given that Van Rossum has been actively participating in the threads (and suggested writing a PEP), it would seem highly likely that he will accept the PEP for 3.7. There is working code in the GitHub repository, so there should be little that stands in its way.

The process followed here is an excellent example of how Python development works. Something was posted to python-ideas that was not particularly "Pythonic", it was discussed and a path forward was identified, a PEP was written and has been reviewed by many, changes were made, and we are on the cusp of seeing it in a release. All of that took roughly half a year, though much of the groundwork was laid some time ago. Clearly not all features have such a smooth path—or even any path—into Python, but ideas whose time has come can be adopted fairly rapidly.