One of the first truly challenging tasks I had to face as a new developer was contributing to the implementation of a data migration wizard for the database of a legacy CRM.

Having no previous experience whatsoever with data migration and its best practices meant I had to improvise at several points, partly because the work done so far covered only part of the migration logic that had to be implemented, and partly because there wasn't much help online about the specific issues that arose along the way. In the end, though, it was a really interesting and educational experience.

The infrastructure

The legacy database was stored on a SQL Server instance, and we had access to the most recent dump to migrate it. Our RDBMS was MySQL, and initially, the migrator was implemented as a C# console application. The goal now was to have a fully functional and configurable wizard in a Windows Forms application, so I had to transfer (and partly refactor) some of the console code to the form application.

There were many entities to migrate, and the wizard had to be able to transfer them independently so we could test the validity of the transferred data.

The challenges

There were two of us working on the project, and we encountered many difficulties during its development.

The schema of our database followed a different logic than the existing one, so we somehow had to map the existing entities and their relations to ours. The hardest part, though, wasn't transforming them but figuring out how they mapped to the new business logic.

The database was old, and it had already undergone some integration with an even older ERP system. Consequently, about half the tables had cryptic three-letter names, and we had to guess their purpose from either their contents or from an ancient .chm file that very briefly described the most important tables and their fields. Needless to say, that file wasn't really helpful, and no one from the company that had originally designed the database was available to help us.

After more than a few queries and a fair amount of guesswork, we eventually worked out which tables contained the data we needed, how they related to each other and, most importantly, how they related to our schema. But the challenges didn't end there. As it turned out, for lack of specialized functionality (which we had implemented in the new system), the users had been using some of its features (and, consequently, the corresponding tables) for purposes other than their intended use.

That effectively meant that a single table mapped to more than one table in our database, depending on the contents of several fields, which were often code strings with inconsistent formats whose values we had to track down and decipher. Sometimes there was no valid mapping to our database at all for a record, as its data was ambiguous.
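To give a flavor of what that routing looked like, here is a minimal, hypothetical sketch (the code prefixes and table names are invented for illustration, not the actual ones) of deciding which new table a legacy row belongs to based on such a code string:

```csharp
using System;
using System.Diagnostics;

static class LegacyRouter
{
    // Hypothetical: decide which new table a legacy row maps to, based on
    // a code string whose format was inconsistent in the source data.
    public static string ResolveTargetTable(string code)
    {
        var normalized = (code ?? "").Trim().ToUpperInvariant();
        if (normalized.StartsWith("CUS")) return "customers";
        if (normalized.StartsWith("SUP")) return "suppliers";
        return null; // ambiguous record: flag for manual review instead of guessing
    }
}
```

Returning null for the ambiguous cases, rather than forcing a guess, is what let us collect the unmappable records and deal with them by hand.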

The implementation

The core concept was to retrieve rows from SQL Server, perform any necessary processing on them, and insert the results into MySQL in chunks. After a few tries, we settled on a chunk size of 1,000 as a good trade-off between retrieval and insertion time. Even though this procedure worked well for independent entities, it had to be extended for related ones.
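A minimal sketch of the chunking idea, with a hypothetical customers table (the real code read from SQL Server and wrote through a MySQL connection with parameterized commands; here the batched INSERT statements are just built and returned):

```csharp
using System.Collections.Generic;
using System.Linq;

static class ChunkedInserts
{
    const int ChunkSize = 1000; // the trade-off we settled on

    // Builds one multi-row INSERT per chunk instead of one round-trip per row.
    public static IEnumerable<string> Build(IEnumerable<string[]> rows)
    {
        var chunk = new List<string>(ChunkSize);
        foreach (var row in rows)
        {
            // Quote and escape values; the real code used parameterized commands.
            chunk.Add("(" + string.Join(", ", row.Select(v => "'" + v.Replace("'", "''") + "'")) + ")");
            if (chunk.Count == ChunkSize)
            {
                yield return "INSERT INTO customers (name, email) VALUES " + string.Join(", ", chunk) + ";";
                chunk.Clear();
            }
        }
        if (chunk.Count > 0) // flush the final, partial chunk
            yield return "INSERT INTO customers (name, email) VALUES " + string.Join(", ", chunk) + ";";
    }
}
```

Batching 1,000 rows per statement keeps the number of round-trips to MySQL low without building a single enormous query string.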

The solution we used was mapping the inserted entity keys to the retrieved ones in a Dictionary, so that in the processing step of the related entities we could add the proper foreign keys directly. Even that solution, though, couldn't cover every case we came across.
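In essence, the key map was just a dictionary from legacy primary keys to the keys MySQL assigned on insert; something along these lines (types and names are illustrative):

```csharp
using System.Collections.Generic;

class KeyMap
{
    // legacy primary key -> key assigned by MySQL on insert
    readonly Dictionary<int, long> _legacyToNew = new Dictionary<int, long>();

    public void Record(int legacyId, long newId) => _legacyToNew[legacyId] = newId;

    // Resolve a legacy foreign key to the new one; null means the legacy
    // record had no valid mapping (e.g. ambiguous data, as described above).
    public long? Resolve(int legacyForeignKey) =>
        _legacyToNew.TryGetValue(legacyForeignKey, out var newId) ? newId : (long?)null;
}
```

One such map per parent entity was enough to let the dependent entities fill in their foreign keys during their own processing step.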

There were several cases where we had to split a table into multiple ones, as well as join data from different tables into the same one. For example, the main table that contained information about the customers had a lot of fields that not only weren’t used at all but were completely irrelevant to our system. So we had to search for any other tables containing relevant data and find out how they were joined.

On a more technical level, in order to process all the customer fields uniformly, I handled the read/write logic in an abstract base class. The processing itself lived in a method of a derived class, which was called for each entry and appended the processed result to the pending insert query. Then, once a user-defined chunk size was reached, the insert query would be executed and the query string reset. I applied this solution to most of the transferable data, although there were some particularities specific to the customer fields that had to be addressed separately.
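The shape of that design, heavily simplified and with hypothetical table and row handling, was roughly:

```csharp
using System.Collections.Generic;
using System.Text;

// The read/write plumbing lives in the base class; each entity only
// defines its target table and how one source row becomes a VALUES tuple.
abstract class EntityMigrator
{
    readonly StringBuilder _query = new StringBuilder();
    int _pending;

    public int ChunkSize { get; set; } = 1000; // user-configurable in the wizard

    protected abstract string TargetTable { get; }
    protected abstract string ProcessRow(IReadOnlyList<object> sourceRow);

    // In the real application this executed the SQL against MySQL.
    protected virtual void ExecuteInsert(string sql) { }

    public void Append(IReadOnlyList<object> sourceRow)
    {
        _query.Append(_pending == 0 ? $"INSERT INTO {TargetTable} VALUES " : ", ");
        _query.Append(ProcessRow(sourceRow));
        if (++_pending == ChunkSize) Flush(); // execute once the chunk is full
    }

    public void Flush()
    {
        if (_pending == 0) return;
        ExecuteInsert(_query.ToString());
        _query.Clear();
        _pending = 0;
    }
}

class CustomerMigrator : EntityMigrator
{
    protected override string TargetTable => "customers"; // hypothetical
    protected override string ProcessRow(IReadOnlyList<object> row) => $"('{row[0]}', '{row[1]}')";
}
```

Keeping the chunking and flushing in the base class meant each new entity only had to supply its own ProcessRow, which is what made the approach reusable for most of the transferable data.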

Some entities could be transferred independently of the others, so we delegated that work to different threads, but in some cases an entity had to wait for the transfer of another to complete before it could be processed. At this point, I saw in action several thread-safety (and multi-threading in general) issues that until then I had only encountered in theory at university.
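The dependency handling can be sketched with task continuations (the entity names are hypothetical, and our actual synchronization was hand-rolled around threads rather than this tidy):

```csharp
using System;
using System.Threading.Tasks;

static class TransferScheduler
{
    // Independent entities run in parallel; dependent ones are chained
    // so they start only after their prerequisite has completed.
    public static Task RunAll(Action<string> transfer)
    {
        Task customers = Task.Run(() => transfer("customers"));
        Task products  = Task.Run(() => transfer("products"));           // independent of customers
        Task sales     = customers.ContinueWith(_ => transfer("sales")); // needs the customer key map
        return Task.WhenAll(products, sales);
    }
}
```

The continuation guarantees the sales transfer only sees a fully populated customer key map, which is exactly the ordering constraint we had to enforce by hand.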

Perhaps one of the trickiest parts to migrate was the customers' transactions. The fact that the ERP tables contained all sorts of invoicing data irrelevant to our system made it even harder to figure out where exactly the information we were looking for was. The ERP system would also continue to be used (which meant we had to integrate with its existing API in order to sync future transactions), so we wouldn't handle invoicing in our system, but we needed a particular part of the customers' transactions for various screens and reports.

I was responsible for migrating these parts of the data, and after several unsuccessful tries, I managed to find the sales invoices we wanted and how to connect them to each customer. More importantly, I figured out how the line items of each invoice or receipt related to it (with several LEFT JOINs on seemingly unrelated intermediate tables).

The existing database structure was a bit chaotic to my inexperienced eyes, and I still don't understand some of the design decisions, but all's well that ends well.

Conclusion

All in all, I can safely say I learned a lot from this project, both from a software engineering standpoint and from a data modeling/database design perspective. The next challenge was to integrate the existing APIs with our new system to keep the migrated data up to date, but that's an entirely different story.