Running migrations is hard, especially when you’re dealing with an ever-changing, sophisticated data model. Or, better put, running migrations can be hard. Having dealt with every type of migration problem imaginable here at VTS, over the years we’ve developed a system that actually makes migrating one of the smoothest parts of our deployment process.
It’d be selfish to keep this knowledge entirely to ourselves, so here it is. Below are three very simple principles that will save you 95% of the hassle associated with database migrations. And yes, you’re right - I made up the 95% figure. In reality, I’m sure it’s much higher.
Make all data migrations asynchronous background jobs
In order to successfully deploy an app without bringing it down for too long, it’s essential that the db:migrate process executes quickly.
Imagine the following scenario: right before the deploy, one of your engineers opens a pull request that renames a heavily used column in one of your models. She replaces all the references to the old name in the code with the new one, and she adds a migration that changes the actual name. Simple enough - rename migrations run very quickly, so the app should be good to go in a matter of seconds, right? You merge the pull request and deploy.
Well, suppose that earlier that day another engineer created a different, much longer migration. Let’s say he added a column that stores some computed value so you don’t have to waste time calculating it every time. In addition to the add_column line of code, he included a little script that iterates through all the rows in the table and populates the newly added column. Say this script takes thirty minutes to run.
Now when you deploy, the code change will be made and the db:migrate process will start, but it’ll be stuck on the data migration for thirty minutes before it gets to actually renaming the column. For that half hour, the app will be essentially unusable! The code will keep trying to reference the new column name, but the database will still have the old one, resulting in a whole lot of errors being thrown (and some very unhappy users).
I don’t have to tell you that this could turn into a very serious problem. Thirty minutes is a rather optimistic case - we’ve deployed migrations that took hours to run. Fortunately, there is a very simple solution. All you have to do is create a background worker (we use Sidekiq) that will perform the data migration. That way you can add the computed value column, quickly kick off the background job, and keep going.
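As a sketch of what that split might look like, assuming Sidekiq and ActiveRecord - the Lease model, the monthly_total column, and the compute_monthly_total method are all hypothetical names for illustration:

```ruby
# Hypothetical backfill worker. It only touches rows where the new
# column is still nil, so it can be safely re-run.
class PopulateMonthlyTotals
  include Sidekiq::Worker

  def perform
    Lease.where(monthly_total: nil).find_each do |lease|
      lease.update_column(:monthly_total, lease.compute_monthly_total)
    end
  end
end

# The schema migration itself stays fast: it adds the column and
# kicks off the background job, then returns immediately.
class AddMonthlyTotalToLeases < ActiveRecord::Migration
  def up
    add_column :leases, :monthly_total, :decimal
    PopulateMonthlyTotals.perform_async
  end

  def down
    remove_column :leases, :monthly_total
  end
end
```

With this shape, db:migrate finishes in seconds; the thirty-minute backfill happens off the deploy path, and any migrations queued behind it aren’t blocked.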
What’s interesting about this particular scenario is that the data migration wasn’t too important (it was just a cached value), so it’s likely the engineer wasn’t too concerned about it. But you never know what might come after your migration - that’s why we make it a rule here at VTS to put all data migrations into background jobs, regardless of importance or estimated duration.
All background jobs should be idempotent
“Idempotent” is a fancy word describing a very simple concept. A process is said to be idempotent if running it several times has the same result as running it just once. For example, the line x += 1 is not idempotent - running it three times will result in a different value of x than if you were to run it just once. A line like x = 5 would be considered idempotent.
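In plain Ruby, the difference looks like this:

```ruby
# Not idempotent: every additional run changes the result.
x = 0
3.times { x += 1 }
puts x  # => 3, not 1

# Idempotent: the result is the same whether it runs once or many times.
y = 0
3.times { y = 5 }
puts y  # => 5 either way
```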
Background jobs blow up - a lot. That’s why Sidekiq has a ‘retry’ feature, which, when a job fails, waits a bit of time and runs it again. The reasons for failure vary tremendously - you may have made a syntax error, the database could be locked by another thread, or your script may have made some assumptions about the data that weren’t true in every case. Even if your code is absolutely perfect (which never happens), it could be that another Sidekiq job has a memory leak, forcing the entire process to restart.
While trying to minimize these kinds of failures is a noble pursuit that you might wanna spend some time on, the fact of the matter is that your jobs will often fail and retry - it’s just part of life. That’s why it’s imperative that you make every background job, and especially every data migration, completely idempotent.
Doing this is usually fairly simple. Are you going through all the rows of some table and updating a value? Find a way to keep track of which rows you’ve already completed, and in the case of failure, start where you last left off. If that’s too tricky, don’t worry - just start from scratch and make sure the actual operation is idempotent. For instance, recalculate the entire value instead of just adding to it. It may take a bit longer, but it’s nothing compared to the time you’d waste trying to fix up butchered data.
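Here’s an in-memory sketch of that pattern - the rows and the total_price calculation are made up, but the shape is the same as a real backfill job: skip rows that are already done, and recompute values from scratch rather than incrementing them.

```ruby
# Hypothetical rows standing in for a database table.
rows = [
  { id: 1, unit_price: 10, quantity: 2, total_price: nil },
  { id: 2, unit_price: 5,  quantity: 4, total_price: nil },
  { id: 3, unit_price: 7,  quantity: 1, total_price: 7 },  # already done
]

def backfill_totals(rows)
  rows.each do |row|
    next unless row[:total_price].nil?  # resume: skip completed rows
    # Recompute the whole value instead of adding to it, so a partial
    # earlier run can't double-count anything.
    row[:total_price] = row[:unit_price] * row[:quantity]
  end
end

backfill_totals(rows)
backfill_totals(rows)  # a retry is harmless
rows.map { |r| r[:total_price] }  # => [20, 20, 7]
```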
Don’t delete obsolete columns or tables until a week later
If you’re building an application with a sophisticated data model (like VTS), you will inevitably have to delete columns, or even entire tables. Perhaps you simplified your model and were able to get rid of a column you didn’t really need (yay!). More often, as the app grows, what used to be a column gets broken out into a table of its own.
Either way, you make a pull request with all the necessary changes: it creates the new tables and columns and migrates the data over. Then you delete the old column. Satisfied with your work, you deploy…
… and roll back the next day. Maybe the feature had bugs, maybe the users didn’t like it, or maybe the platform ended up just not being ready. Whatever it may be, rolling back is as much a part of our daily lives as those three cups of coffee we down before lunch.
Deleting a column or a table makes rolling back very hard. In some cases, you might be able to reverse engineer what was in it, but why bother? The cost of keeping the obsolete data around for another week is trivial compared to the convenience it provides.
One week is an arbitrary choice - you can do one month, you can do a couple of days. You can vary it based on the specifics of the situation. What’s important is that you only delete the column or table once you’re confident you won’t be rolling back.
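One way to stage this in Rails is to ship the code change first and the schema change later. The ignored_columns API shown here exists in Rails 5 and up (earlier versions can simply stop referencing the column); the Building model and legacy_address column are hypothetical names:

```ruby
# Week 1: ship the code that stops using the column, but leave the
# schema alone. Telling ActiveRecord to ignore the column means no
# code path can accidentally keep depending on it.
class Building < ActiveRecord::Base
  self.ignored_columns = ["legacy_address"]
end

# Week 2 (or whenever you're confident you won't roll back): drop the
# column in its own tiny migration. Passing the type makes it reversible.
class RemoveLegacyAddressFromBuildings < ActiveRecord::Migration[5.0]
  def change
    remove_column :buildings, :legacy_address, :string
  end
end
```

If you do have to roll back during week one, the data is still sitting in the database untouched - reverting is just a code deploy.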