Soft-deletes are bad, m'kay? Thursday, February 19, 2009

Have you ever run into database tables with a field used to mark whether a row has been 'deleted'? Probably. These fields are used to implement 'soft-deletes'. For the soft-delete impaired, a quick introduction: soft-deletes are row deletes which don't really happen. Instead of removing the row from the database table, a field in the row is set to a value which signals that the row is officially 'deleted'. Every SELECT statement on that table then filters on that field's value, so only rows which aren't marked as 'deleted' are returned (as the deleted data, semantically, is no longer there).
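In practice the pattern looks like this; a minimal sketch using SQLite from Python, with a hypothetical Northwind-style Customers table and an IsDeleted bit field:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Customers (
    CustomerID TEXT PRIMARY KEY,
    CompanyName TEXT,
    IsDeleted INTEGER NOT NULL DEFAULT 0)""")
conn.execute("INSERT INTO Customers VALUES ('ALFKI', 'Alfreds Futterkiste', 0)")
conn.execute("INSERT INTO Customers VALUES ('ANATR', 'Ana Trujillo', 0)")

# A 'soft-delete' is just an UPDATE, not a DELETE:
conn.execute("UPDATE Customers SET IsDeleted = 1 WHERE CustomerID = 'ANATR'")

# ...and from now on, every SELECT must filter on the flag:
live = conn.execute(
    "SELECT CustomerID FROM Customers WHERE IsDeleted = 0").fetchall()
print(live)  # [('ALFKI',)]
```

Note that the 'deleted' row is still an ordinary row in the table; only the queries pretend it's gone.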

If this sounds rather awkward, it is. However, there are people who insist on having soft-deletes instead of real deletes, because it allows them to go back in time and look at the data that was deleted, as all data is, well... , still there. A small group of those people even believes that soft-deletes allow them to roll back to deleted data, a kind of 'undo' facility.

The truth is, soft-deletes using status fields in rows is a bad solution to the real problem. Fortunately there are alternatives.

First, let's look at why people would want soft-deletes at all. In general there are two reasons, both already mentioned above: to be able to look at deleted data, and to be able to roll back to deleted data. Let's discuss the second reason first: rolling back to deleted data.

Roll-back to deleted data is hard.

Let's use Northwind as our example database. Say we don't delete rows from that database, but flag them as 'deleted' using a new field, IsDeleted (bit), added to every table. If you want to roll back a deleted Order row, it looks as simple as setting the IsDeleted field back to 0, right? But what if that Order row refers to a deleted Customer row through its CustomerID foreign key? For the RDBMS, everything is fine, as the 'deleted' Customer row is still in the Customers table; it just has its IsDeleted field set to 1. However, a SELECT statement which fetches the just-recovered Order row together with its Customer row runs into a problem: semantically, the Customer isn't there; the mandatory IsDeleted filter prevents the Customer row from showing up. The only solution is to also recover the deleted Customer row. The Order might have had OrderDetail rows as well, which requires those OrderDetail rows to be recovered too.
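The cascade can be made concrete; a sketch with SQLite, using Northwind-style tables trimmed to the relevant columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID TEXT PRIMARY KEY,
                        IsDeleted INTEGER NOT NULL DEFAULT 0);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY,
                     CustomerID TEXT REFERENCES Customers,
                     IsDeleted INTEGER NOT NULL DEFAULT 0);
INSERT INTO Customers VALUES ('ALFKI', 0);
INSERT INTO Orders VALUES (10643, 'ALFKI', 0);
""")

# 'Delete' the customer and its order, then try to undo just the order:
conn.execute("UPDATE Customers SET IsDeleted = 1 WHERE CustomerID = 'ALFKI'")
conn.execute("UPDATE Orders SET IsDeleted = 1 WHERE OrderID = 10643")
conn.execute("UPDATE Orders SET IsDeleted = 0 WHERE OrderID = 10643")  # the 'roll-back'

# The recovered order still doesn't show up with its customer:
query = """SELECT o.OrderID, c.CustomerID
           FROM Orders o JOIN Customers c ON o.CustomerID = c.CustomerID
           WHERE o.IsDeleted = 0 AND c.IsDeleted = 0"""
print(conn.execute(query).fetchall())  # [] -- the customer must be recovered too

conn.execute("UPDATE Customers SET IsDeleted = 0 WHERE CustomerID = 'ALFKI'")
print(conn.execute(query).fetchall())  # [(10643, 'ALFKI')]
```

Only after walking the whole dependency graph does the recovered data become visible again.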

In short, recovering a row is not what this is all about: it's about recovering a graph. Recovering graphs instead of table rows is much more complicated, due to the dependencies between the involved entity instances (rows). Writing roll-back code for this is therefore a complex affair, especially if you want to make sure the data integrity of the working set stays 100% correct (so all rows involved in the recovered graph indeed have their IsDeleted flag set to 0 and are part of the working set). In the end you'll run into situations where rows have to be merged, similar to source-control systems (e.g. when a row is deleted several times in different graphs). So rolling back graphs is not likely to be implemented in the average system, and it's therefore not the main reason for soft-deletes.

On to the first reason, looking back at old data.

Old data is old for a reason.

When a system deletes rows from a table, it hopefully does so inside a transaction, so when something goes wrong, the delete is undone. When the transaction completes, the delete is final and the data is gone. If you ever want to look at the old data (the data you just deleted) again, you can't: it's gone. Let's ignore that some people can't throw anything away ("I might want to look at it in 2 years, then what!? <wild panic>") and focus on what 'delete' actually means: it means that the data is considered 'no longer needed' and is therefore removed from the working set. If the data is still necessary after the delete, don't execute the delete in the first place; it's the same with your file system, really.

However, 'data is no longer needed' doesn't mean the data is totally worthless in all situations: there's one situation where old data could be useful: reports on the history of an entity, when what happened, etc. But how often will those reports be run? Every 5 minutes? Will the average user of the system look at historic data all day long, or work with the actual working set? The answer to these questions is likely the same: functionality consuming historic (deleted) data is rarely used compared to functionality consuming working-set (not deleted) data.

Let it be clear that it is sometimes required for legal reasons not to toss away data; however, in other situations the same requirement has led to a different solution: old email is moved into archives and not kept in the in-box. Is that solution useful for our situation as well?

Implications of using soft-deletes

So the main reasons why some people want this are clear, but what are the implications when soft-deletes are used? Below I've mentioned a couple. I'm sure there are more, but I think the list below is already convincing enough to look at another solution instead.

Queries become very complex. Make no mistake: once you introduce a field to signal whether a row has been deleted, you have to make sure every table reference is accompanied by a filter on that IsDeleted status field. Every table reference. If you forget one, it's over: your data is then officially not correct. Maintainability becomes more cumbersome, and as most time on a software project is spent on maintenance, it's something which will hurt the project, hard.
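How easily one forgotten filter corrupts results can be shown with a small SQLite sketch (hypothetical Northwind-style tables again):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID TEXT PRIMARY KEY,
                        IsDeleted INTEGER NOT NULL DEFAULT 0);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY,
                     CustomerID TEXT,
                     IsDeleted INTEGER NOT NULL DEFAULT 0);
INSERT INTO Customers VALUES ('ALFKI', 1);  -- a 'deleted' customer
INSERT INTO Orders VALUES (10643, 'ALFKI', 0);
""")

# Correct query: a filter on *every* table reference.
correct = conn.execute("""
    SELECT o.OrderID FROM Orders o
    JOIN Customers c ON o.CustomerID = c.CustomerID
    WHERE o.IsDeleted = 0 AND c.IsDeleted = 0""").fetchall()

# Forget the filter on Customers and 'deleted' data silently leaks back in:
buggy = conn.execute("""
    SELECT o.OrderID FROM Orders o
    JOIN Customers c ON o.CustomerID = c.CustomerID
    WHERE o.IsDeleted = 0""").fetchall()

print(correct, buggy)  # [] [(10643,)]
```

The buggy query doesn't fail; it just quietly returns rows that semantically no longer exist, which is exactly why forgetting a single filter is so dangerous.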

Queries become slower over time. Every 'deleted' row is still there, and for the database system the row is just an ordinary row like any other. This means DML operations on the table become slower, but especially SELECT statements become slower over time: the percentage of 'live' rows among the total number of rows in the table keeps shrinking, as more and more rows become 'deleted'. This baggage hurts in the long run, especially in tables with a lot of inserts/deletes: the working set might stay the same size (e.g. 10K rows) but the total set of rows grows every day, possibly into the millions. When you have millions of rows in the table while the set of 'not deleted' rows is a small percentage of that, it influences the performance of queries dramatically. Compare that to a table containing only the rows you actually have to work with.

Using constraints (UC, FK) is impossible. Using a unique constraint (UC) is not really possible, as the RDBMS takes the 'deleted' rows into account as well. So a value might be unique within the working set, but not within the total set of rows in the table, and the update or insert fails. In the example above with Customer and Order, we've also seen that foreign key constraints no longer really work: they don't protect us from 'deleting' the PK side while keeping the FK side, because for the RDBMS both rows are still there and everything is fine (which is not the case).
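The unique-constraint clash in particular is easy to demonstrate; a sketch with SQLite, using a hypothetical unique Email column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Customers (
    CustomerID INTEGER PRIMARY KEY,
    Email TEXT UNIQUE,              -- unique over ALL rows, deleted or not
    IsDeleted INTEGER NOT NULL DEFAULT 0)""")
conn.execute("INSERT INTO Customers (Email) VALUES ('a@example.com')")

# Soft-delete the row; to the RDBMS it's still an ordinary row:
conn.execute("UPDATE Customers SET IsDeleted = 1 WHERE Email = 'a@example.com'")

# Re-using the email would be unique within the working set,
# but the constraint still sees the 'deleted' row and rejects the insert:
try:
    conn.execute("INSERT INTO Customers (Email) VALUES ('a@example.com')")
    failed = False
except sqlite3.IntegrityError:
    failed = True
print(failed)  # True
```

To make this work you'd have to drop the constraint and re-implement uniqueness checks in application code, which is precisely the kind of integrity the RDBMS was supposed to guarantee for you.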

A better solution to these requirements

There's a better solution, and I've already mentioned it briefly: archiving. RDBMSs (I assume you're using a proper, professional, solid, ACID-compliant database, not some toy RDBMS) usually sport a feature called triggers. Triggers are neat things: they get called when something happens. You can compare them to event handlers, really: an event occurs (e.g. a row gets deleted from a table) and the trigger responsible for handling that event is called. Additionally, a trigger runs in the same transaction as the code which fired it, so all actions taken inside the trigger are rolled back when that transaction is rolled back.

If you look closely at the two main reasons for soft-deletes, you'll recognize that both can be satisfied by simply keeping the data deleted by DELETE statements around somewhere else: looking back at old, deleted data is possible wherever that old data is located, and rolling back deleted data doesn't become any less complex when the data stays in the main tables, as the complexity is in the graph roll-back mechanism, not in the location of the data to roll back to.

This leads to the solution of an archiving database. An archiving database is a catalog (or schema if you wish) which contains the same table definitions as the real database, though perhaps without the UC and FK constraints, as data integrity is implied by the integrity of the source data (after all, it only archives data that was already consistent). Every table of which you want to keep deleted data around now gets a DELETE trigger in the real database which simply grabs the deleted row(s) and inserts them into the same table in the archiving database. If the transaction fails, the inserted rows in the archiving database roll back too; if the transaction succeeds, the data is successfully archived and still available. Additionally, you could add date/time fields to the rows in the archiving database to store exactly when each row was deleted; the trigger can fill in these values when it inserts the deleted row.
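A minimal sketch of such a trigger, here with SQLite (where the 'archive' is a second table rather than a separate catalog, and all names are hypothetical); the same idea carries over to full-blown RDBMSs with cross-catalog triggers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID TEXT PRIMARY KEY, CompanyName TEXT);
-- Same shape as the live table, plus a deletion timestamp; no constraints needed.
CREATE TABLE Customers_Archive (CustomerID TEXT, CompanyName TEXT, DeletedAt TEXT);

CREATE TRIGGER trg_Customers_Archive
AFTER DELETE ON Customers
BEGIN
    INSERT INTO Customers_Archive (CustomerID, CompanyName, DeletedAt)
    VALUES (OLD.CustomerID, OLD.CompanyName, datetime('now'));
END;

INSERT INTO Customers VALUES ('ALFKI', 'Alfreds Futterkiste');
""")

# A real DELETE: the working set stays clean...
conn.execute("DELETE FROM Customers WHERE CustomerID = 'ALFKI'")
remaining = conn.execute("SELECT COUNT(*) FROM Customers").fetchone()[0]

# ...while the trigger has copied the row to the archive, in the same transaction.
archived = conn.execute(
    "SELECT CustomerID, CompanyName FROM Customers_Archive").fetchall()
print(remaining, archived)  # 0 [('ALFKI', 'Alfreds Futterkiste')]
```

The application code issues a plain DELETE and never needs to know the archive exists; the trigger does the bookkeeping.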

This makes sure the data is still available, so the reasons why people want soft-deletes are still met, yet it no longer pollutes the working set for the application, and the implications of soft-deletes are gone. The only thing you have to be sure of is that the triggers and the archive database are maintained together with the real database (so schema changes in the real database are applied to the archive database as well, or you could go overboard and add a new archive database!). However, that's a small price to pay compared to the overly complicated queries one otherwise has to write (even with O/R mappers) and work with, queries which also have to be maintained and documented for as long as the application is in use. Using the triggers and the archive database, the application can be written normally, can be tested normally, and no data is thrown away. Ever. One could extend this system with an UPDATE trigger as well, so updates are also tracked and value deletes at the field level can be tracked too.

So do yourself a favor: next time someone tells you to use soft-deletes, discuss the implications and offer this alternative. Everyone will be better off: you, the customer, and the group of people who will maintain the system for the next 20 years.