[This article is based on our recently accepted paper “Smelly relations: Measuring and Understanding Database Schema Quality” at ICSE 2018. Interested readers may find the preprint here.]

Kent Beck coined “code smells” metaphor to indicate the presence of quality problems in a software system. The metaphor has been extended to other similar domains such as configuration management, spreadsheets, and presentations. A detailed taxonomy of software smells can be found here.

Databases are an integral element of enterprise applications. The effective use of database affects vital quality parameters, such as performance and maintainability, of these applications. Similar to code, SQL statements can also indicate smells. Bill Karwin documents a catalog of database anti-patterns. In this context, we present our study on mining database smells in production-quality systems including both the industrial as well as the open-source software. We analyze SQL statements to measure schema quality of relational databases with a focus on performance and maintainability quality attributes.

A Catalog of Database Schema Smells

We studied wide variety of resources including books, research literature [1, 2, 3], industrial white-paper, and discussions on question-answer sites. We summarize the result of our exploration in the form of a catalog of database schema smells.

Compound attribute This smell arises when a column is used to store a non-atomic attribute. For instance, storing comma-separated lists for an attribute to avoid creating an intersection table for a many-to-many relationship or storing a JSON file which is not used atomically.

Rationale: Each attribute value must be stored and retrieved atomically. If a table does not adhere to this practice, the resultant schema introduces multiple problems. For instance, a user has to write more complex queries (using pattern-matching expressions) to retrieve data from this table. Such complex queries are prone to inaccurate results. Also, such queries cannot exploit available indexes. Even further, these queries are not portable due to vendor specific support to pattern-matching expressions. Adjacency list The smell occurs when an attribute in a table refers another row in the same table i.e., a table has a recursive relationship to model hierarchical structure.

Rationale: Querying a tree with adjacency list is quite difficult and error-prone. Specifically, deleting a node from a tree which is modelled using adjacency list is non-trivial and prone to introduce errors in the database. Superfluous key This smell arises when an unnecessary superfluous pseudo key is defined in a table where other attribute(s) in the table may serve as a primary key.

Rationale: Choosing an appropriate primary key is an essential requirement for a table. A pseudo key could be defined when the present set of attributes could not serve as a primary key. However, a pseudo key is unnecessary and even erroneous (leads to duplicate rows) when the existing set of attributes of the table could be used as a primary key. Missing constraints This smell arises when constraints for a foreign key are missing from a schema definition.

Rationale: Referential integrity is an essential property of relational databases. Values referenced in a foreign key column must exist in the columns of primary or unique keys of the parent table. It can be easily achieved by defining constraints on foreign keys. However, when such constraints are missing for a foreign key it leads to compromized referential integrity of the database. Metadata as data This smell occurs when metadata is stored as data in the form of EAV (Entity-Attribute-Value) pattern.

Rationale: In a relational table, all the attributes are equally applicable for all the rows in the table. It is tempting to implement EAV pattern when a subset of attributes applicable for a subset of rows and the rest of attributes for rest of the rows. However, this arrangement introduces many deficiencies in the database; for example, one can’t use native SQL data types (leading to invalid data), enforce referential integrity, or make up attribute names. Polymorphic association This smell occurs when a table uses a multi-purpose foreign key.

Rationale: Relational database schema does not allow us to declare polymorphic association. However, many times developers define an additional column in a table as a tag to realize a polymorphic association. This arrangement makes it difficult to query the table and compromises readability and understandability. Multicolumn attribute This smell arises when multiple serial columns are created for an attribute.

Rationale: In cases when an attribute may have one or more values, it is tempting to create multiple columns for the attribute in a table. However, such a schema design makes querying the table very difficult and verbose. Clone tables This smell occurs when a table is split horizontally in multiple tables using some criterion (for example, year) to achieve scalability.

Rationale: This smell not only makes the querying difficult but also introduces problems managing data integrity. Values in attribute definition This smell arises when specific values are defined in an attribute definition to restrict possible values of the attribute.

Rationale: Specifying all possible values for an attribute in schema definition mixes metadata with data which is not recommended. This smell makes it difficult to extend or modify the list of accepted values for an attribute. Index abuse This smell arises when the indexes are used poorly. This smell has the following variants: Missing indexes Insufficient indexes (indexes must be prepared at least for primary and foreign keys), and Unused indexes Rationale: Creating effective indexes is not trivial; it requires judicious planning. A database with a deficient plan for indexes performs poorly. God table This smell arises when a table contains excessive number of attributes.

Rationale: Excessive number of attributes tend to violate the principles of normalization which in turn introduce a variety of problems. Additionally, it impacts maintainability of the database. Meaningless name This smell occurs when a table or an attribute name is cryptic or meaningless.

Rationale: Meaningless or cryptic names hamper readability of the database’s schema. Overloaded attribute names This smell occurs when two or more attributes are defined with identical names but as distinct data types in different tables.

Rationale: Identical names with different data types create confusion and could lead to subtle bugs in queries.

Developers’ Survey on Database Smells

We carried out an online survey targeting software developers to understand their perspective about the significance of various database schema smells. We asked the participants to read the description of each potential smell presented (total 13 questions based on the above smells catalog) and asked the respondents whether they consider the presented practice as a database schema smell, a recommended practice, both a smell and a recommended practice depending on the context, or neither a smell nor a recommended practice. We shared the survey to all online social media channels and sought participation from the developer community. We received 52 complete responses with completion rate 38%.

Findings from the Developers Survey

Most of the respondents belong to experienced developer groups. The figure above shows the distribution of respondents’ experience in terms of number of years and number of database applications they have developed.

We summarize our findings from the survey below.

A large majority of 88% agrees (42% strongly agree and 46% agree) that the awareness and knowledge of database smells is crucial for software developers to develop high quality applications. None of the respondents marked disagree or strongly disagree options.

The figure above shows a consolidated perspective provided by the respondents. Based on the responses we infer that some practices, such as meaningless name (83%) and missing constraints (77%), are clearly marked as database smells. However, we found that practices such as values in attribute definition and adjacency list are more context-sensitive.

The respondents had the option to add their views either in terms of smells that we have not included but they have seen in practice as well as their feeling, objection, or reservation on the presented smells. A few respondents underline the subjectivity involved in database smells detection. For instance, one respondent said that “… database smells in general depend much more on an assessment of the need and end use of data…”. Similarly, another respondent shared an instance of duplicating values in a table (which is a smell) to avoid querying 60 tables to load a single record. Yet another respondent provided his/her opinion on index abuse smell: “… the proper use of indexes is dependent on many things and without regular profiling it’s not possible to decide whether indexes are actually being misused.”

As a conclusion of our survey, developers seem to acknowledge the need for detecting database smells. However, their systematic identification remains an open problem. This points to the need for a tool that automatically detects the database smells. Developers may then, considering the context of the smell, decide whether the detected smells are indeed quality issues or serving a required purpose.

Quantitative Analysis

We analyzed 357 industrial projects from Singular Logic (Athens, Greece) and Software Improvement Group (Amsterdam, The Netherlands) as well as 2568 open-source projects containing SQL statements. We analyzed all the SQL statements from all the repositories using our open-source tool DbDeo to detect database schema smells. We analyzed more than 629 million lines of code containing 393,989 SQL statements from 2925 repositories (on average approximately 135 SQL statements per repository).

Summary of Results

The smell index abuse is the most frequently occurring database smell in both industrial and open-source projects. A high number of index abuse smell instances indicate the need of an effective index management in database systems for higher performance and maintainability. On the other hand, values in attribute definition in industrial projects and adjacency list in open-source projects are the least frequently occurring smells.

Some database smells, such as adjacency list, are more prone to occur in industrial projects compared to open-source projects by a significant margin.

The size of the host application has no impact on density of database smells; however, smell density shows positive correlation with the size of the database.

Application type (i.e., Mobile, Desktop, or Web) doesn’t have a significant impact on database smell density.

The usage of an ORM framework makes it easier to work with a database; however, use of an ORM framework doesn’t prevent database smells.

Our co-occurrence analysis reveals that clone table for industrial projects and values in attribute definition for open-source projects show highest co-occurrence with other smells. It implies that whenever a clone table in an industrial project or values in attribute definition smell in an open-source project gets spotted, it is very likely to find other database smells in the project.

Feel free to use/clone/fork/star DbDeo – our open-source tool that we used in this study.

Star Fork

If you find the study interesting, you may find more details in the preprint of the paper. Even better, join my talk at ICSE 2018 in SEIP track in Gothenburg, Sweden.