Introduction

Relational databases can be tricky sometimes.

I am currently working on an application that deals with some assets owned by our clients. Those assets are created in our platform and from time to time they have some of their information updated.

Let’s take Vehicles for example. I had a table structure that looked like:

CREATE TABLE `Vehicles` (

`id` varchar(50) NOT NULL,

`organizationId` varchar(50) NOT NULL,

`plate` char(7) NOT NULL,

`vehicleInfo` json DEFAULT NULL,

`createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,

`updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,

PRIMARY KEY (`id`),

UNIQUE KEY `unq_Vehicles_orgId_plate_idx` (`organizationId`,`plate`) USING BTREE,

KEY `Vehicles_createdAt_idx` (`createdAt`),

);



CREATE TABLE `VehicleUpdates` (

`id` varchar(50) NOT NULL,

`organizationId` varchar(50) NOT NULL,

`vehiclePlate` char(7) NOT NULL,

`status` varchar(15) NOT NULL,

`createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,

`updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,

PRIMARY KEY (`id`),

KEY `VehicleUpdates_orgId_vhclPlt_createdAt_idx` (`organizationId`,`vehiclePlate`,`createdAt`) USING BTREE

);

The relationship of both tables is M:N and I can find all updates of a given vehicle using (organizationId, vehiclePlate) .

Now my application has a new requirement that when returning vehicles, I must also bring along its latest update information.

After digging for a while, I found out that my new requirement had a name: “Group-wise Max”.

By the time of this writing, my production MySQL server was running version 5.7.23 on Amazon RDS. Vehicles table had 15.832 rows and VehicleUpdates had 271.910 rows.

Available Solutions

I stumbled upon this blog written by Rick James. He describes the problem and proposes some solutions, taking into account their performance. I picked two that made more sense to me.

To be fair, Mr. James’ proposals solve only a part of my problem. I must use them within another LEFT JOIN to be able to fulfill my requirements.

The uncorrelated sub-query

According to Rick James, this solution is O(M) , where M is the number of output rows. Since I’m using 5.7 , already have the proper index and cannot have duplicates, this approach would work just fine.

I wrote a query like this:

SELECT SQL_NO_CACHE vu1.*

FROM VehicleUpdates AS vu1

JOIN

( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt

FROM VehicleUpdates

where organizationId = @orgId

GROUP BY organizationId, vehiclePlate

) AS vu2 USING (organizationId, vehiclePlate, createdAt);

The average execution time for this query was 25 ms . That seemed promising!

Then I added the Vehicles table to the query to solve my actual problem:

SELECT SQL_NO_CACHE v.*, vu1.*

FROM Vehicles AS v

LEFT JOIN (

VehicleUpdates AS vu1

JOIN

( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt

FROM VehicleUpdates

GROUP BY organizationId, vehiclePlate

) AS vu2 ON vu1.organizationId = vu2.organizationId

and vu1.vehiclePlate = vu2.vehiclePlate

and vu1.createdAt = vu2.createdAt

)

ON vu1.organizationId = v.organizationId

AND vu1.vehiclePlate = v.plate;

Unfortunately, this led the average execution time to 2,200 ms . A total deal breaker for me 😞.

Well, I’ll never run this full table scan on production. I’ll at least have a condition on the organizationId column and probably also set a LIMIT clause. Let’s give it a try!

Then I proceeded to:

SELECT SQL_NO_CACHE v.*, vu1.*

FROM Vehicles AS v

LEFT JOIN (

VehicleUpdates AS vu1

JOIN

( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt

FROM VehicleUpdates

GROUP BY organizationId, vehiclePlate

) AS vu2 ON vu1.organizationId = vu2.organizationId

and vu1.vehiclePlate = vu2.vehiclePlate

and vu1.createdAt = vu2.createdAt

)

ON vu1.organizationId = v.organizationId

AND vu1.vehiclePlate = v.plate

WHERE v.organizationId = '<SOME ORGANIZATION ID>'

LIMIT 100;

And the results? 🥁🥁🥁… I got a 320 ms average. Way better, but still a no go for my application 😢.

I wasn’t ready to give up yet! After examining the output of the EXPLAIN for the query above, I noticed that the inner-most JOIN was being made using range , while all others were using ref .

Turns out that that inner-most JOIN query was being inefficient, because it was scanning the whole VehicleUpdates table, only to have the results filtered out by the ON clause in top-level LEFT JOIN .

Since the MySQL engine wouldn’t optimize this for me, I had to do it myself, by adding a WHERE clause in the inner-most query to restrict the number of searched rows to only those with an specific organizationId :

SELECT SQL_NO_CACHE v.*, vu1.*

FROM Vehicles AS v

LEFT JOIN (

VehicleUpdates AS vu1

JOIN

( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt

FROM VehicleUpdates

WHERE organizationId = '<SOME ORGANIZATION ID>'

GROUP BY organizationId, vehiclePlate

) AS vu2 ON vu1.organizationId = vu2.organizationId

and vu1.vehiclePlate = vu2.vehiclePlate

and vu1.createdAt = vu2.createdAt

)

ON vu1.organizationId = v.organizationId

AND vu1.vehiclePlate = v.plate

WHERE v.organizationId = '<SOME ORGANIZATION ID>'

LIMIT 100;

This gave me an average execution time of 50 ms . Now we’re talking 😊! Running EXPLAIN again showed me that all joins where using ref .

However, this query is quite of a mouthful. The need to add a WHERE clause into the inner-most sub-query wouldn’t allow me to create a view to simplify it 😔.

The LEFT JOIN approach

Rick James states that this approach is O(2N-3N) depending on join_buffer_size . At least in theory, it couldn’t be as fast as the uncorrelated sub-query.

I ended up with a query like this:

SELECT vu1.*

FROM VehicleUpdates AS vu1

LEFT JOIN VehicleUpdates AS vu2

ON vu1.organizationId = vu2.organizationId

AND vu1.vehiclePlate = vu2.vehiclePlate

AND vu2.createdAt > vu1.createdAt

WHERE vu2.id IS NULL;

The average execution time was about 40 ms . Not as good as the equivalent uncorrelated sub-query approach, but still acceptable 😅.

Okay, so proceeding to the real problem, I got:

SELECT SQL_NO_CACHE v.*, vu1.*

FROM Vehicles AS v

LEFT JOIN VehicleUpdates AS vu1

ON v.plate = vu1.vehiclePlate

AND v.organizationId = vu1.organizationId

LEFT JOIN VehicleUpdates AS vu2

ON vu1.organizationId = vu2.organizationId

AND vu1.vehiclePlate = vu2.vehiclePlate

AND vu2.createdAt > vu1.createdAt

AND vu2.id IS NULL;

Without a LIMIT clause, on average, this query runs for 12,000 ms . That’s a lot 😲😲😲!!!

Okay, then I wrote what would be the actual production query:

SELECT SQL_NO_CACHE v.*, vu1.*

FROM Vehicles AS v

LEFT JOIN VehicleUpdates AS vu1

ON v.plate = vu1.vehiclePlate

AND v.organizationId = vu1.organizationId

LEFT JOIN VehicleUpdates AS vu2

ON vu1.organizationId = vu2.organizationId

AND vu1.vehiclePlate = vu2.vehiclePlate

AND vu2.createdAt > vu1.createdAt

WHERE v.organizationId = '<SOME ORGANIZATION ID>'

AND vu2.id IS NULL

LIMIT 100;

The results? 🥁🥁🥁… I got the same 40 ms as the query without the Vehicles table 🎉🎉🎉!

Conclusion

Turns out that the LEFT JOIN approach had a slightly better performance, contrary to what I would expect from the article on Rick James’ blog.

The difference is very small and might be due to statistical noise, because I didn’t use any proper benchmark tool or a strict methodology. I ran the queries against organizationId s which had the less and the most associated vehicles about 20 times each and then calculated the arithmetic average on the execution times. I’d be more comfortable saying that both have similar performances.

However, I ended up choosing the LEFT JOIN approach because it allows me to create a view to simplify the final query:

CREATE VIEW LatestVehicleUpdates AS

SELECT vu1.id, vu1.organizationId, vu1.vehiclePlate, vu1.status, vu1.createdAt, vu1.updatedAt

FROM VehicleUpdates AS vu1

LEFT JOIN VehicleUpdates AS vu2

ON vu1.organizationId = vu2.organizationId

AND vu1.vehiclePlate = vu2.vehiclePlate

AND vu2.createdAt > vu1.createdAt

WHERE vu2.id IS NULL;

Then I can run the following query to solve my original problem:

SELECT v.*, lvu.*

FROM Vehicles

LEFT JOIN LatestVehicleUpdates as lvu

ON v.organizationId = lvu.organizationId

AND v.plate = lvu.vehiclePlate

WHERE organizationId = '<SOME ORGANIZATION ID>'

LIMIT 100;

That’s all folks!