
In the past year, we’ve migrated tens of thousands of diners from Eat24, Foodler, and OrderUp onto our platforms. We were lucky to have a lot of past experience to draw on when we planned these recent migrations of large amounts of diner data. Diner data is essential, and it has to be converted accurately: get it wrong and you anger your users. Users hate change, any change, even one that is demonstrably better. So getting the data migration right is really important.

But it wasn’t always so easy. Our first major migration was following the merger of Seamless and Grubhub. When the two companies merged, their technology was running on two completely independent platforms. Diner accounts, restaurant accounts, financial reporting, restaurant tablets, and everything in-between had a separate implementation in the respective platforms. Seamless was .NET and Grubhub was Java. Both of them were monoliths running separate sets of data, each out of a single data center.

The Seamless platform had been around longer and definitely showed its age. We had a graph we called the “Doomsday Graph,” which predicted the date — our doomsday — when the Microsoft SQL server powering it all would no longer sustain the projected traffic and order growth. Instead of pouring more money into two separate platforms, we decided to sunset the legacy Seamless software stack and migrate to the Grubhub stack. The old Grubhub platform’s days were numbered, too. At the same time, we had the foresight to spend effort designing and building a platform for the future — a distributed, scalable, and fault-tolerant platform in the cloud.

Seamless pre-merger

When I first started at Grubhub, I remember being very impressed at the depth and breadth of complexity involved in taking and fulfilling an online order. There was a lot of stuff going on. It would have been inconceivable to move to a new platform in one go. Instead, we decided to start with a few core pieces of the platform and migrate all of the consumer Seamless users over to a mix of the new platform and the Grubhub platform.

To start, we built a routing layer and a diner registration, authentication, and access control service. The idea was to have clients (web, iOS, and Android) log in using the new service, but still give them access to the monolith Grubhub APIs for search, menus, checkout, payments, and everything else. The authentication service would need to return two access tokens to the client — one that was compatible with the monolith APIs and one compatible with new distributed versions of search, menus, checkout, and payments services on the near roadmap. The routing layer enabled us to run multiple versions of our services and send larger and larger percentages of traffic to the newer versions until we were confident we didn’t have any issues with a version. This percent routing was a key feature in allowing us to move quickly and with confidence.
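The percent-routing idea above can be sketched in a few lines. This is a minimal illustration, not Grubhub’s actual router: it assumes each client carries a stable session identifier, and hashes it into a bucket so a given session sticks to the same version as the rollout percentage grows.

```python
import hashlib

# Hypothetical sketch of percentage-based version routing. The function
# and parameter names are illustrative, not a real Grubhub API.
def pick_version(session_id: str, new_version_pct: int) -> str:
    """Route a stable percentage of sessions to the new service version."""
    # Hash the session id into a bucket from 0-99 so the same session
    # always lands on the same version across requests.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "v2" if bucket < new_version_pct else "v1"
```

Because the bucket is derived from the session id rather than a random draw per request, raising the percentage only ever moves sessions from the old version to the new one, never back and forth.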

Migrating diners

But with this new platform in development, we still had to deliver food to hungry diners. We weren’t afforded any downtime for the migration. Nor would we be able to cut over all diners at once. Instead, we would run an extensive open beta with the new site, so we needed to plan for users to be potentially active on both the classic and new versions of the site. Complicating things even more, we were building new versions of the native mobile apps. Back in those days, new version adoption was a pretty slow process. We had Blackberry users continuing to use the classic app years after the migration was completed.

Because all ordering functions were to be executed on the classic Grubhub platform, the user’s account information needed to be available there, too. We built three separate systems to deal with these requirements. Of course, we needed to keep a copy of a user’s account on each of these systems — Seamless classic, Grubhub classic, and the new account service.

On-demand migration

With these three sets of data, we didn’t want to constantly look for changes and move data around. Diners could use whichever system they liked — they could create an account and log in through either the Seamless classic site or the new beta site, but were not allowed to create a new account or log in with their Seamless account from the Grubhub classic site. To ensure a user’s data remained consistent across systems, we managed the synchronization between accounts entirely within the new service. Whenever a user tried to create an account on the new platform, we would first check if we already had an account registered for that user. If we didn’t find an account, we would make a call out to the classic Seamless services to attempt to create the account there. If the account creation succeeded, we could create the account on the new platform, and then call out to the Grubhub system to create the account there.
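The creation flow above can be sketched as follows. This is a simplified illustration under invented names: the in-memory `ClassicSystem` stands in for the real remote calls to the classic platforms, and passwords are handled in plain text only for brevity.

```python
# Minimal sketch of the on-demand account creation flow across the
# three systems. Every name here is illustrative, not an actual API.

class ClassicSystem:
    """Stand-in for a remote classic platform (Seamless or Grubhub)."""
    def __init__(self):
        self.accounts = {}

    def create(self, email, password):
        self.accounts[email] = password

def create_account(email, password, new_db, seamless, grubhub):
    # 1. Refuse if the new platform already knows this account.
    if email in new_db:
        raise ValueError("account already exists")
    # 2. Create on classic Seamless first; a failure here aborts the
    #    flow before any local state is written.
    seamless.create(email, password)
    # 3. Record the account on the new platform.
    new_db[email] = password
    # 4. Propagate to classic Grubhub so ordering keeps working there.
    grubhub.create(email, password)
```

Ordering the steps this way means the system of record for existing Seamless users (the classic side) is consulted before the new platform commits anything locally.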

We took a similar approach to authentication for existing users. If a user tried to authenticate from the new platform, we would first check if we had the account in our system already. If so, we would verify the password matched what we had on record. If the account wasn’t found on the new platform, we would make an authentication call out to the Seamless platform. If the authentication succeeded, we would migrate the user’s data onto the new platform and then onto the Grubhub platform. Similarly, if a user’s password didn’t match on the new platform, we would make an authentication call out to the Seamless platform. If that authentication succeeded, we could assume that the user changed their password on the Seamless side and give them access to the new site.
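The login fallback can be sketched like this. Again, the `SeamlessStub` and the plain-text password comparison are deliberate simplifications for illustration; the real systems compared password hashes over remote calls.

```python
# Sketch of login with fallback to classic Seamless, migrating (or
# refreshing) the account on the new platform after a successful
# classic-side authentication. All names are illustrative.

class SeamlessStub:
    """Stand-in for the classic Seamless authentication call."""
    def __init__(self, accounts):
        self.accounts = accounts

    def check(self, email, password):
        return self.accounts.get(email) == password

def authenticate(email, password, new_db, seamless):
    stored = new_db.get(email)
    if stored is not None and stored == password:
        return True  # known account, password matches locally
    # Unknown account, or a password that may have been changed on the
    # classic side: fall back to authenticating against Seamless.
    if seamless.check(email, password):
        new_db[email] = password  # migrate / refresh on the new platform
        return True
    return False
```

Note that the same fallback path covers both cases the text describes: an account that has never been migrated, and a password changed on the Seamless side after migration.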

With any networked system, there are bound to be errors. The new platform was built on a transaction-less data store, and we didn’t implement a remote transaction/rollback strategy for the numerous API calls out to the classic Grubhub and Seamless systems. With that in mind, an error in any of the steps could leave an account in an incomplete state. To counteract this problem, the authentication, account creation, and account update systems all needed to be able to check the state of an account, make repairs if necessary, and continue through the remaining steps.
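The repair-on-access idea reduces to re-running the creation steps while skipping any that already succeeded. A rough sketch, with dict-backed stand-ins for the three systems:

```python
# Illustrative repair pass for a half-migrated account: check each
# system in the same order as the creation flow and fill in only the
# missing steps. The dicts stand in for the real remote systems.

def repair_account(email, password, seamless, new_db, grubhub):
    """Complete any creation steps that a failed earlier attempt skipped."""
    for system in (seamless, new_db, grubhub):
        if email not in system:
            system[email] = password  # only fills the gap, never overwrites
```

Because each step is a no-op when the record already exists, the same function can safely run on every authentication, creation, or update request.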

Making the creation and update APIs idempotent was also a very important strategy. Our service router built retries in from the start, so we needed to make sure that any given request would get the same response if replayed multiple times. A request might outright fail with an intermittent error, or it might time out because the node it was routed to was overloaded. In both cases, the request may have been partially or completely successful despite failing to return a successful response. Under these conditions, the service router would retry the request. For the retry to succeed, the underlying service needed to return a success whether the original request had fully succeeded, had left a record that needed repairing, or had failed outright.

Take, for example, the request to create an account. Most systems won’t allow you to create the same account twice. Grubhub’s account system is no different, but it must also allow retries. We achieved this by returning an identical success response to a create account request when it is the same as a previous successful create request — email and password match the previous record created. This idempotency of the API has the same benefit when the client executes the retry — either the user manually retries, or the client library executes it. Instead of getting an error stating the account already exists, the user is told the account creation was successful and is able to start ordering food. Of course, if the user tried to recreate an existing account using a password that didn’t match what was on file, they would get the standard account creation error.
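The idempotent create described above can be sketched as a small state machine over the stored record. The plain-text password comparison is a simplification; a real implementation would compare hashes.

```python
# Sketch of an idempotent account-creation handler: a replayed request
# that matches the existing record returns the same success response,
# while a mismatched password is a genuine conflict. Names invented.

def create_account_idempotent(store, email, password):
    existing = store.get(email)
    if existing is None:
        store[email] = password
        return {"status": "created", "email": email}
    if existing == password:
        # Same request replayed (e.g. a router or client retry):
        # return the identical success response, not an error.
        return {"status": "created", "email": email}
    # Existing account with a different password: real conflict.
    return {"status": "error", "reason": "account_exists"}
```

This is what lets both the service router and the user safely retry: a replay of a successful create is indistinguishable, from the caller’s point of view, from the original success.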

Real-time migration

The on-demand migration system gave us a lot of breathing room and flexibility on how we rolled out the new site, but it had its limitations. Only account credential data was synchronized. Order history, saved addresses and other account information were not synchronized by the service. In our new microservice architecture, each service only handles a small set of tasks. The real-time migration service didn’t have access to any data other than account data — by design.

To get this data onto the new Grubhub platform, we built a migration service that would listen for changes occurring on the classic Seamless system. Whenever an address was created or an order placed, the real-time migrator would copy that data to the Grubhub platform. If the user associated with the content hadn’t been migrated yet, the migrator would move the user over, too.
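A sketch of that event handling, assuming the classic system emits change events carrying a user id, a record type, and a payload (the event shape and names here are invented for illustration):

```python
# Illustrative handler for the real-time migrator: migrate the owning
# user first if needed, then copy the changed record across.

def handle_change(event, migrated_users, target):
    user = event["user_id"]
    # Migrate the user first so the copied record has an account to
    # attach to on the new platform.
    if user not in migrated_users:
        migrated_users.add(user)
    # Copy the changed record (saved address, placed order, ...) over.
    target.setdefault(user, []).append((event["type"], event["payload"]))
```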

Bulk migration

Finally, to make sure that Grubhub and the new platform had all the historic Seamless data — including the order history, saved addresses and other account information not synchronized in previous migrations — we performed one massive bulk migration. We extracted the data into CSV files and encrypted sensitive fields. We then sent the files through the bulk migrator.

What Grubhub looked like back in 2014

The bridge

In hindsight, one of the best decisions we made early on was to build bridge services into both the Grubhub classic and Seamless classic data centers. These services bypassed the legacy code and interacted directly with the respective databases to lookup, create, and update users and passwords. This let us focus on getting the migration right without spending time paying down tech debt. At the time we made it, however, it was somewhat controversial.

There were a number of risks to taking this approach. We already had a code base and libraries that could manipulate the data, all of it field tested and closely monitored. Several veteran engineers expressed concerns about this bridge:

The new service wouldn’t treat the data the same as the classic system, possibly corrupting data, introducing bugs, and upsetting diners when their account data didn’t match up.

By bypassing all the classic APIs and their controls, we might overwhelm the database.

The operations crew’s tools wouldn’t paint a full picture of what was going on.

Even if we were going to build a new service, we should at least take on the core libraries to manipulate the data and avoid some of these problems.

These were all good points, and we did take them seriously. But our main concern was the complexity of the existing code base. Ten years of continuous development in the same code base will produce a lot of good code, but also a lot of not so good code. All this complexity made the code difficult to work with and test. Keeping the old code base meant we would not be able to effect the change we were looking for.

Ultimately, we opted for simplicity and mitigated many of the concerns by creating an extensive automated test suite. This suite ran a number of scenarios where a user was created from either the classic Seamless API, the classic Grubhub API, or the new API and then logged in to each of the other services, then verified that the data was correctly synchronized across them.

As we continued to grow and extend the new platform and replace legacy functionality with new services, the bridge service would play a vital role in synchronizing data between the two platforms.

Beyond Seamless

Once the Seamless migration was complete and we had moved the majority of diner traffic to the new platform, we began moving Grubhub users and their data onto the new platform as well. We took a very similar approach to migrating Grubhub data as we did Seamless: on-demand migration, real-time migration, and a bulk migration process all played heavily in the process.

One decision that made all these migrations easier was to make our password hashing algorithm pluggable. When we wrote a hashed password to the database, we would identify the hash function that was used to hash it. When someone logged in, we would hash their incoming password with the same function defined in their account record. Originally, we decided to do this to hedge our bets on a choice of hashing algorithm. We wanted to be able to swap out and plug in hashing functions if we determined our function was no longer considered secure.

The pluggable hashing function gave us an unintended benefit. When we migrated user data from Seamless, Grubhub, and from subsequent diner acquisitions, we never had the password in the clear. You should never have a clear text password, of course. The problem is that each service had its own password hashing function, and each was different from our new one. We got around this by using our pluggable hash function. When we felt the legacy hash function was as strong as ours, we would import the hashed data directly and indicate the function that was used. When we wanted to add a bit more strength, we would apply our hash on top of the migrated hashed password. We would indicate that the record was double hashed and which functions were used, and then apply those functions at login time.
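The tag-the-algorithm idea can be sketched like this. The specific algorithms, record shape, and function names are illustrative (and real password storage would use a salted, slow hash such as bcrypt or scrypt; plain SHA digests appear here only to keep the sketch self-contained):

```python
import hashlib

# Sketch of pluggable, taggable password hashing: the stored record
# names the hash function(s) applied, so legacy imports and wrapped
# ("double hashed") imports verify through the same code path.
HASHERS = {
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
    "sha512": lambda data: hashlib.sha512(data).hexdigest(),
}

def hash_with(algorithms, password: str) -> str:
    """Apply each named hash in order; a two-item list is a double hash."""
    value = password.encode("utf-8")
    for name in algorithms:
        value = HASHERS[name](value).encode("utf-8")
    return value.decode("utf-8")

def verify(record, password: str) -> bool:
    # The record tells us exactly how the stored digest was produced,
    # whether natively, imported as-is, or imported and wrapped.
    return hash_with(record["algorithms"], password) == record["digest"]
```

At login, a record tagged with a single legacy algorithm verifies directly, while a record tagged with the legacy algorithm plus the new one verifies by replaying both hashes in order, exactly as the migration applied them.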

More recently, we migrated diners from Eat24, OrderUp, and Foodler to the new Grubhub platform. We used many of the same approaches of bulk, real-time, and on-demand migrations. We even built a bridge API into the Eat24 platform.

At the end of the day, our ideas worked out, we averted the doomsday predicted by our graph, and we took away some good learnings that have come in handy for our recent migrations.