Warming cache

When cache is slower than one query

I found myself refactoring a lot of our ETL code due to poorly designed cache and I want to share an abstract of my learnings.

tl;dr; Pre-load all the data that the program is likely going to require instead of querying and caching data when it is needed.

Tuning our ETL process like it was a race car.

Input data

First, let me setup the scene. You are aggregating data from thousands of cinemas (this is what we do at http://applaudience.com/). This data includes seats for every showtime, e.g.

{

"name": "A1",

"state": "SOLD",

"type": "TIER_1_PREMIUM",

},

{

"name": "A2",

"state": "SOLD",

"type": "TIER_1_PREMIUM",

},

{

"name": "A3",

"state": "SOLD",

"type": "TIER_2_PREMIUM",

}

{

"name": "A4",

"state": "SOLD",

"type": "TIER_2_PREMIUM",

},

{

"name": "A5",

"state": "SOLD",

"type": "TIER_1_PREMIUM",

},

{

"name": "A6",

"state": "SOLD",

"type": "TIER_1_PREMIUM",

}

The requirement is to map foreign entities to local entities.

An example entity type is a seat type ( type in the above example), i.e. cinema describes seat types using arbitrary identifiers (e.g. TIER_1_PREMIUM , TIER_2_PREMIUM ) and the program that receives data must identify if it already knows of this foreign seat type:

If program recognizes the foreign seat type identifier, then it must return the associated local identifier. If program does not recognize the foreign seat type identifier, then it must create a record in the local database and return the associated local identifier.

This pattern appears whenever dealing with any external data sources that needs to be synchronised with a local database.

In order to synchronise our database with cinema feed, we:

first identify all attributes of the data provided by the cinema check if last known state is different from the current state record new state if it is different from the last state

The flawed cache strategy was in the step #1.

Foreign attribute identification

In order to identify all attributes, we need to iterate through all seats and update or insert (upsert) their identifiers, which is what we do in this code:

for (const seat of seatingPlan.seats) {

const cinemaForeignSeatNameId = await upsertCinemaForeignSeatName(

connection,

cinemaId,

seat.name

);

const cinemaForeignSeatStateId = await upsertCinemaForeignSeatState(

connection,

cinemaId,

seat.state

);

const cinemaForeignSeatTypeId = await upsertCinemaForeignSeatType(

connection,

cinemaId,

seat.type

); seatIdentifiers.push({

cinemaForeignSeatNameId,

cinemaForeignSeatStateId,

cinemaForeignSeatTypeId,

});

}

All the upsert* methods are implemented using Slonik upsert utility, which will first attempt to SELECT a resource and if resource is not found then will attempt to INSERT the resource (update does not come into play because the lookup columns consist only of columns that make up theunique constraint).

const upsertCinemaForeignSeatType = async (

connection: DatabaseConnectionType,

cinemaId: DatabaseRecordIdType,

foreignSeatType: string

): Promise<DatabaseRecordIdType> => {

return upsert(

connection,

'cinema_foreign_seat_type',

{

cinemaId,

foreignSeatType,

},

[

'cinema_id',

'foreign_seat_type',

]

);

};

upsertCinemaForeignSeatType returns a local seat type identifier if it can find a record in a cinema_foreign_seat_type table with a matching (cinema_id, foreign_seat_type) .

Using cache

As you have probably already deduced from the sample input, state and type values repeat often in the same cinema feed. Therefore, it makes sense to cache the query, e.g.

const MINUTE = 60 * 1000; const memoizeQuery = (query, max = 5000, maxAge = 60 * MINUTE) => {

return memoizee(query, {

max,

maxAge,

normalizer: (args) => {

return JSON.stringify([].slice.call(args, 1));

},

promise: true,

});

}; const upsertCinemaForeignSeatStateUsingCache = memoizeQuery(upsertCinemaForeignSeatState);

const upsertCinemaForeignSeatTypeUsingCache = memoizeQuery(upsertCinemaForeignSeatType); for (const seat of seatingPlan.seats) {

const cinemaForeignSeatNameId = await upsertCinemaForeignSeatName(

connection,

cinemaId,

seat.name

);

const cinemaForeignSeatStateId = await upsertCinemaForeignSeatStateUsingCache(

connection,

cinemaId,

seat.state

);

const cinemaForeignSeatTypeId = await upsertCinemaForeignSeatTypeUsingCache(

connection,

cinemaId,

seat.type

);

seatIdentifiers.push({

cinemaForeignSeatNameId,

cinemaForeignSeatStateId,

cinemaForeignSeatTypeId,

});

}

This change means that state and type upsert operations are going to perform their routines at most 1 time for each unique (cinema_id, state) and (cinema_id, type) combination. Considering that each auditorium has 200–300 seats, this is already a pretty good improvement.

The problem with this approach is that it assumes that state and type cardinality is small. What if state and type values are unique to each seat? That would mean that we need to execute upsert for each seat or ~600 queries per cinema. This is exactly what happened in our case. We found that several major cinemas are feeding us with unique values state/ type values for each seat resulting in over 10M unnecessary queries per day.

I should note that 10M queries is not necessary a problem. We can easily handle that – in fact, PostgreSQL Shared Buffer Cache will mostly make these queries unnoticeable. The primary problem is log flood and unnecessary network time.

Pre-loading likely to be used data

There is a better way to do it though – we can pre-load all cinema data in a single query, i.e. instead of using (cinema_id, type) parameters to lookup existence of cinema_foreign_seat_type , we can lookup all cinema_foreign_seat_type for cinema_id , construct a dictionary and use dictionary to lookup foreign seat type , e.g.

const createUpsertCinemaForeignSeatType = async (

connection: DatabaseConnectionType,

cinemaId: DatabaseRecordIdType,

) => {

const dictionary = {}; const cinemaForeignSeatTypes = await connection.any(sql`

SELECT fuid, id

FROM cinema_foreign_seat_type

WHERE cinema_id = ${cinemaId}

`); for (const cinemaForeignSeatType of cinemaForeignSeatTypes) {

dictionary[cinemaForeignSeatType.fuid] = cinemaForeignSeatType.id;

} return (fuid: string): Promise<DatabaseRecordIdType> => {

if (dictionary[fuid]) {

return dictionary[fuid];

} return upsert(

connection,

'cinema_foreign_seat_type',

{

cinemaId,

fuid,

},

[

'cinema_id',

'fuid',

]

);

};

};

Using this technique, we execute only one query to retrieve all already known seat types (which are going to remain unchanged most of the time), i.e. we only need to execute upsert routine when the input data contains a previously unseen seat type. The rest of the code remains unchanged and no cache is needed.

The benefit of this approach is that we reduce the number of network requests that are needed to recognize the data and we also limit our assumptions about the state of the database to a single transaction (the dictionary object). This pattern works whenever you know some attributes of the data that you need to retrieve (e.g. cinema_id in this case).

There is one gotcha, though – what if there are millions of records that are identified just by cinema_id ? We worked around this by adding last_seen_at attribute to cinema_foreign_seat_type . We update this attribute once a day and we only retrieve records that have been seen over the last 30 days.

To sum up, with minimal code changes, we have drastically reduced the number of queries that we need to make when identifying foreign data and we got rid of cache.