A requirement arises in many systems to update multiple rows in a SQL database. For small numbers of rows this can adequately be done with one UPDATE statement per row. But when many rows need updating, the overhead of issuing a large number of UPDATE statements can make the operation as a whole take a long time to complete.

The traditional advice for improving the performance of multiple UPDATE statements is to “prepare” the required query once, and then “execute” the prepared query once for each row requiring an update. But in many cases this provides only a modest improvement, as each UPDATE still requires a round-trip communication with the database server. Where the application server and database server are on different hosts, the round trip incurs network latency as well. The dominant factor in the time taken to complete the overall operation tends to be the “admin” work of conveying the application’s intention to the database server, rather than the actual updates to the database.

A more effective solution to this problem is to attempt to reduce the number of UPDATE statements. Let us start with a simple table:

staff

  name  | salary
 -------+--------
  Bob   |   1100
  Jane  |   1150
  Frank |   1050
  Susan |   1125
  John  |   1100

Let us say we want to apply some pay rises to all staff. We could update each row using an UPDATE statement:

UPDATE staff SET salary = 1200 WHERE name = 'Bob';
UPDATE staff SET salary = 1200 WHERE name = 'Jane';
UPDATE staff SET salary = 1200 WHERE name = 'Frank';
UPDATE staff SET salary = 1200 WHERE name = 'Susan';
UPDATE staff SET salary = 1200 WHERE name = 'John';

That would require five UPDATE statements. Can we do any better? Yes; in this example, all staff are getting the same salary, so we can make those changes in a single UPDATE statement:

UPDATE staff SET salary = 1200 WHERE name IN ('Bob', 'Jane', 'Frank', 'Susan', 'John');

That reduces the number of queries from five to one. But this approach only works when all the rows are receiving the same update to the same columns. What if not everyone were getting the same salary?

UPDATE staff SET salary = 1200 WHERE name = 'Bob';
UPDATE staff SET salary = 1250 WHERE name = 'Jane';
UPDATE staff SET salary = 1200 WHERE name = 'Frank';
UPDATE staff SET salary = 1250 WHERE name = 'Susan';
UPDATE staff SET salary = 1200 WHERE name = 'John';

We are no longer setting all the salary fields to the same value, so we can’t collapse it into a single statement. But we can group the updates according to the value being set, and then do one UPDATE statement per distinct value. In this case there are two distinct values, so we can do it in two UPDATE statements:

UPDATE staff SET salary = 1200 WHERE name IN ('Bob', 'Frank', 'John');
UPDATE staff SET salary = 1250 WHERE name IN ('Jane', 'Susan');

So we can reduce the number of queries from five to two. Not quite as good, but still an improvement.
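The grouping step can be sketched in a few lines of Python, using the standard sqlite3 module as a stand-in database. This is an illustration of the technique only, not DBIx::MultiRow’s implementation:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT PRIMARY KEY, salary NUMERIC)")
conn.executemany("INSERT INTO staff VALUES (?, ?)",
                 [("Bob", 1100), ("Jane", 1150), ("Frank", 1050),
                  ("Susan", 1125), ("John", 1100)])

# The per-row updates we want to apply: name -> new salary.
updates = {"Bob": 1200, "Jane": 1250, "Frank": 1200,
           "Susan": 1250, "John": 1200}

# Group the target rows by the value being SET: one UPDATE per distinct value.
by_value = defaultdict(list)
for name, salary in updates.items():
    by_value[salary].append(name)

for salary, names in by_value.items():
    placeholders = ", ".join("?" for _ in names)
    conn.execute(f"UPDATE staff SET salary = ? WHERE name IN ({placeholders})",
                 [salary, *names])

print(len(by_value))  # 2 — two distinct values, so only two UPDATE statements
```

Note that the number of statements issued is the number of distinct SET values, regardless of how many rows each one touches.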

But the benefit of this approach diminishes as the proportion of distinct SET values goes up. Eventually every SET value is different, for example:

UPDATE staff SET salary = 1125 WHERE name = 'Bob';
UPDATE staff SET salary = 1200 WHERE name = 'Jane';
UPDATE staff SET salary = 1100 WHERE name = 'Frank';
UPDATE staff SET salary = 1175 WHERE name = 'Susan';
UPDATE staff SET salary = 1150 WHERE name = 'John';

Since every SET clause is now different, we can’t coalesce any queries using the approaches mentioned before, so we still have one UPDATE statement for every row requiring changes. Is there anything we can do?

What if we had a database table which contained a representation of the updates we wanted to make, e.g.

updates

  name  | salary
 -------+--------
  Bob   |   1125
  Jane  |   1200
  Frank |   1100
  Susan |   1175
  John  |   1150

and we could persuade the database server to apply those updates to the target table? This is in fact entirely possible in many database systems. In the case of PostgreSQL, it can be done using a FROM clause with UPDATE, like this:

UPDATE staff SET salary = updates.salary FROM updates WHERE updates.name = staff.name;

This does an INNER JOIN between the tables “staff” and “updates” where the column “name” matches. The SET clause then takes the “salary” field from the “updates” table and uses it to update the “salary” field of the “staff” table.

We can easily contrive for an “updates” table to exist by creating a temporary table and populating it. It is relatively straightforward to populate a table with multiple rows with just one query (or at least, far fewer queries than the number of rows desired).
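For instance, a single INSERT with a multi-row VALUES list populates the table in one query. A sketch using Python’s sqlite3 module (illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TEMPORARY TABLE updates (name TEXT, salary NUMERIC)")

rows = [("Bob", 1125), ("Jane", 1200), ("Frank", 1100),
        ("Susan", 1175), ("John", 1150)]

# One INSERT statement with a multi-row VALUES list: one query, five rows.
placeholders = ", ".join("(?, ?)" for _ in rows)
params = [value for row in rows for value in row]
conn.execute(f"INSERT INTO updates (name, salary) VALUES {placeholders}", params)

print(conn.execute("SELECT COUNT(*) FROM updates").fetchone()[0])  # 5
```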

So, given a list of updates to apply we could effect them using the following steps:

1. Use CREATE TEMPORARY TABLE to create a temporary table to hold the updates
2. Use INSERT to populate the temporary table with the updates
3. Use UPDATE … FROM to update the target table using the updates in the temporary table
4. Use DROP TABLE to drop the temporary table

So in the example above we can reduce five statements to four. This isn’t a significant improvement in this case. But now the number of statements is no longer directly dependent on the number of rows requiring updates.

Even if we wanted to update a thousand rows with different values, we could still do it with four statements.
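The four-step sequence can be sketched end-to-end with Python’s sqlite3 module as a stand-in. One caveat: UPDATE … FROM as shown above is the PostgreSQL spelling, so the sketch uses the equivalent correlated-subquery form, which works everywhere:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT PRIMARY KEY, salary NUMERIC)")
conn.executemany("INSERT INTO staff VALUES (?, ?)",
                 [("Bob", 1100), ("Jane", 1150), ("Frank", 1050),
                  ("Susan", 1125), ("John", 1100)])

new_salaries = [("Bob", 1125), ("Jane", 1200), ("Frank", 1100),
                ("Susan", 1175), ("John", 1150)]

# Step 1: create a temporary table to hold the updates.
conn.execute("CREATE TEMPORARY TABLE updates (name TEXT, salary NUMERIC)")

# Step 2: populate it with a single multi-row INSERT.
placeholders = ", ".join("(?, ?)" for _ in new_salaries)
conn.execute(f"INSERT INTO updates (name, salary) VALUES {placeholders}",
             [value for row in new_salaries for value in row])

# Step 3: apply the updates to the target table. On PostgreSQL this would be:
#   UPDATE staff SET salary = updates.salary FROM updates
#   WHERE updates.name = staff.name;
conn.execute("""
    UPDATE staff
    SET salary = (SELECT salary FROM updates WHERE updates.name = staff.name)
    WHERE name IN (SELECT name FROM updates)
""")

# Step 4: drop the temporary table.
conn.execute("DROP TABLE updates")
```

Four statements, however many rows are in `new_salaries`.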

OK, that’s great, we have some theoretical approaches for reducing the number of queries, now what?

The code required to implement the above logic is sufficiently fiddly that we would probably not want to have to repeat it. So we could think in terms of creating a re-usable module which would implement that logic.

This is the intention of DBIx::MultiRow.

The input to multi_update would be a list of updates which the caller desires to be made. Each update should contain two things:

- An indication of which row should be updated
- New values for one or more fields

Going back to our first example:

UPDATE staff SET salary = 1200 WHERE name = 'Bob';
UPDATE staff SET salary = 1200 WHERE name = 'Jane';
UPDATE staff SET salary = 1200 WHERE name = 'Frank';
UPDATE staff SET salary = 1200 WHERE name = 'Susan';
UPDATE staff SET salary = 1200 WHERE name = 'John';

To do this using DBIx::MultiRow, we would write:

use DBIx::MultiRow 'multi_update';

multi_update($dbh, \%table_info,
    key_columns   => [ 'name' ],
    value_columns => [ 'salary' ],
    updates => [
        [ 'Bob',   1200 ],
        [ 'Jane',  1200 ],
        [ 'Frank', 1200 ],
        [ 'Susan', 1200 ],
        [ 'John',  1200 ],
    ]
);

“key_columns” specifies the columns which will be used to identify rows which need to be updated (using WHERE).

“value_columns” specifies the columns which will be updated (using SET).

“updates” specifies the updates to be made. Each element is a two-element array. The first element provides the value of the column (specified by “key_columns”) to identify the row to be updated. The second element provides the value to be SET in the column specified by “value_columns”.

In this case, multi_update will spot that the target values are all 1200, and will effect the desired changes using a single UPDATE statement as described above. (It will use placeholders and parameter binding if it thinks it’s appropriate.)

If given our second example with two distinct values, multi_update will spot that there are two distinct values, 1200 and 1250, and will effect this with two UPDATE statements as described above.

Optimising the number of UPDATEs by grouping the distinct SET values can be done in a way which is compatible with most common SQL databases. But the temporary table / UPDATE … FROM approach requires knowledge of the specific SQL database being used.

So DBIx::MultiRow is structured so that approaches which are generic across different SQL databases are expressed in a base class, and approaches which only work for specific SQL databases are expressed in a subclass. An object of the relevant class is instantiated when the call is made, and control then passed to the implementation relevant to the database in use.

If there is no database-specific subclass for the database in use, then DBIx::MultiRow will just use the base class which implements approaches that should work for any SQL database.
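That dispatch arrangement might look something like the following Python sketch. The class and driver names here are hypothetical stand-ins; DBIx::MultiRow itself is Perl and its actual class names may differ:

```python
class MultiRowBase:
    """Strategies that work on any SQL database:
    collapsing identical SET values, grouping by distinct SET value."""

class MultiRowPg(MultiRowBase):
    """Adds PostgreSQL-specific strategies, such as the
    temporary table / UPDATE ... FROM approach."""

# Map driver names to database-specific subclasses (hypothetical mapping).
SUBCLASS_FOR_DRIVER = {"Pg": MultiRowPg}

def make_helper(driver_name):
    # Fall back to the generic base class when there is no
    # database-specific subclass for the database in use.
    cls = SUBCLASS_FOR_DRIVER.get(driver_name, MultiRowBase)
    return cls()

print(type(make_helper("Pg")).__name__)      # MultiRowPg
print(type(make_helper("SQLite")).__name__)  # MultiRowBase
```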

At the time of writing, the only database-specific subclass is for PostgreSQL. So if the caller has a PostgreSQL database, and calls multi_update with data representing our third example (where the target values are all unique), then the PostgreSQL-specific subclass will effect the updates using the temporary table / UPDATE … FROM approach.

There are a few more details worth mentioning. Let’s expand the original table a bit:

staff

  company  | first_name | last_name | salary | bonus | last_updated
 ----------+------------+-----------+--------+-------+-------------------------------
  WidgetCo | Bob        | Jones     |   1100 |   100 | 2013-01-01 12:30:06.815182+01
  WidgetCo | Jane       | Smith     |   1150 |   105 | 2013-01-01 12:30:06.815182+01
  WidgetCo | Jane       | Austin    |   1150 |   110 | 2013-01-01 12:30:06.815182+01
  WidgetCo | Frank      | Johnson   |   1050 |    90 | 2013-01-01 12:30:06.815182+01
  WidgetCo | John       | Grey      |   1100 |   100 | 2013-01-01 12:30:06.815182+01
  Frobnitz | Bob        | Jones     |   1100 |   110 | 2013-01-01 12:30:06.815182+01
  Frobnitz | Susan      | Black     |   1125 |   115 | 2013-01-01 12:30:06.815182+01

“name” has now been split into “first_name” and “last_name”. To match on names we now need to match on two columns. So whereas previously we specified the matching value using a scalar, when there is more than one column to match, we use an ArrayRef of values.

Similarly, let’s say we want to update two fields, salary and bonus. Whereas previously we specified the new value using a scalar, when there is more than one column to update, we use an ArrayRef.
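As an illustration of one portable way to match a composite key — expanding the key tuples into an OR of per-key AND conditions — here is a sketch using Python’s sqlite3 module. The SQL that DBIx::MultiRow actually generates may differ (PostgreSQL can also match row values directly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE staff
                (company TEXT, first_name TEXT, last_name TEXT,
                 salary NUMERIC, bonus NUMERIC)""")
conn.executemany("INSERT INTO staff VALUES (?, ?, ?, ?, ?)", [
    ("WidgetCo", "Bob", "Jones", 1100, 100),
    ("WidgetCo", "Jane", "Smith", 1150, 105),
    ("Frobnitz", "Bob", "Jones", 1100, 110),
])

key_columns = ["first_name", "last_name"]
keys = [("Bob", "Jones"), ("Jane", "Smith")]

# Expand the composite keys into (col1 = ? AND col2 = ?) OR (...) form.
one_key = "(" + " AND ".join(f"{c} = ?" for c in key_columns) + ")"
where_keys = " OR ".join(one_key for _ in keys)

conn.execute(
    f"UPDATE staff SET salary = ?, bonus = ? WHERE company = ? AND ({where_keys})",
    [1200, 120, "WidgetCo", *[v for key in keys for v in key]])
```

Only the WidgetCo rows matching the listed name pairs are updated; the Frobnitz “Bob Jones” is untouched.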

Also, let’s say we want to make updates only for people in the company “WidgetCo”. “Bob Jones” appears in both companies, so we need to be careful to identify the company, but we don’t want to have to specify it repeatedly for each row. The ‘where’ parameter can be used to specify a global condition, e.g. where => { company => 'WidgetCo' }.

Another thing which we may want to do is update a column with the same value for all rows being updated. So here we could update the “last_updated” column. Rather than have to specify the same value over and over again, we can use the ‘set’ parameter.

The ‘set’ parameter takes a HashRef whose keys are the columns to be updated. Values specified as scalars are applied as-is, while values specified as ScalarRefs specify literal SQL. In our example, we’ll set the “last_updated” column to the result of calling the SQL function NOW().
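The scalar-versus-ScalarRef distinction can be mimicked with a small marker class. This Python sketch (hypothetical names, not the module’s code) shows how a SET clause might be assembled from such a parameter:

```python
class Literal:
    """Marks a value as literal SQL rather than a bound parameter
    (the analogue of passing a ScalarRef in Perl)."""
    def __init__(self, sql):
        self.sql = sql

def build_set_clause(set_params):
    fragments, bind_values = [], []
    for col, val in set_params.items():
        if isinstance(val, Literal):
            fragments.append(f"{col} = {val.sql}")  # spliced in as-is
        else:
            fragments.append(f"{col} = ?")          # bound parameter
            bind_values.append(val)
    return "SET " + ", ".join(fragments), bind_values

clause, binds = build_set_clause({"salary": 1200, "last_updated": Literal("NOW()")})
print(clause)  # SET salary = ?, last_updated = NOW()
print(binds)   # [1200]
```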

So to effect these changes, we could call multi_update like this:

multi_update($dbh, \%table_info,
    key_columns   => [qw{ first_name last_name }],
    value_columns => [qw{ salary bonus }],
    updates => [
        [ [ 'Bob',   'Jones'   ], [ 1125, 110 ] ],
        [ [ 'Jane',  'Smith'   ], [ 1200, 130 ] ],
        [ [ 'Jane',  'Austin'  ], [ 1200, 130 ] ],
        [ [ 'Frank', 'Johnson' ], [ 1125, 110 ] ],
        [ [ 'John',  'Grey'    ], [ 1150, 110 ] ],
    ],
    where => { company => 'WidgetCo' },
    set   => { last_updated => \'NOW()' }
);

In addition to supporting explicit global conditions (using the ‘where’ parameter), DBIx::MultiRow will analyse the updates to look for any cases where the values to be matched in a column are all the same, and it will factorise these out into global conditions in the WHERE clause. So if there happen to be any implicit global conditions, then these will be identified and simplified anyway.

Similarly, DBIx::MultiRow will analyse the updates to look for any cases where the values to be SET for a column are all the same, and it will factorise these out into simple sub-clauses in the SET clause.
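That factorisation is straightforward to sketch: a column whose value is identical across every update can be lifted out into a global condition (or a common SET sub-clause). A Python illustration, not the module’s actual code:

```python
def factor_common(columns, rows):
    """Split per-row values into (common, remaining) parts: a column whose
    value is identical in every row is factored out as a global condition."""
    common = {}
    for i, col in enumerate(columns):
        values = {row[i] for row in rows}
        if len(values) == 1:
            common[col] = values.pop()
    remaining = [c for c in columns if c not in common]
    return common, remaining

# 'company' has the same value in every update, so it can move into a
# global WHERE clause, leaving only 'name' to vary per row.
key_columns = ["company", "name"]
key_rows = [("WidgetCo", "Bob"), ("WidgetCo", "Jane"), ("WidgetCo", "Frank")]
print(factor_common(key_columns, key_rows))
# ({'company': 'WidgetCo'}, ['name'])
```

The same routine applied to the SET values yields the common sub-clauses of the SET clause.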

One more thing which deserves an explanation is the unnamed \%table_info parameter. In order to be able to reliably create temporary tables, and to do SQL queries where the types of bound parameters cannot be inferred, DBIx::MultiRow requires the caller to provide information about the target table. So in the case of the six-column table above, %table_info might be:

(
    quoted_table_name => 'staff',
    columns => {
        'company'      => { dbi_type => SQL_VARCHAR,  sql_type => 'TEXT' },
        'first_name'   => { dbi_type => SQL_VARCHAR,  sql_type => 'TEXT' },
        'last_name'    => { dbi_type => SQL_VARCHAR,  sql_type => 'TEXT' },
        'salary'       => { dbi_type => SQL_NUMERIC,  sql_type => 'NUMERIC(9,2)' },
        'bonus'        => { dbi_type => SQL_NUMERIC,  sql_type => 'NUMERIC(9,2)' },
        'last_updated' => { dbi_type => SQL_DATETIME, sql_type => 'TIMESTAMP WITH TIME ZONE' }
    }
)

One upshot of doing things this way is that DBIx::MultiRow does not guarantee the order in which rows will be updated, and so in the general case the row update lock acquisition order is not known. If deadlock avoidance is a concern for the caller, then this would have to be handled separately; for example, under PostgreSQL this may mean first doing a separate SELECT … ORDER BY … FOR UPDATE statement to acquire the row update locks in a defined order.

Having implemented code to do UPDATE on multiple rows, it was then only a small step to do a similar thing for DELETE. So DBIx::MultiRow includes another function multi_delete which can be used to delete multiple rows:

multi_delete($dbh, \%table_info,
    key_columns => [qw{ first_name last_name }],
    deletes => [
        [ 'Bob',   'Jones'   ],
        [ 'Jane',  'Smith'   ],
        [ 'Jane',  'Austin'  ],
        [ 'Frank', 'Johnson' ],
    ],
    where => { company => 'WidgetCo' },
);