Migrating an existing database into a new application can be a daunting task, especially if it's coming from another development team or platform. The database may contain multitudes of data, ensuring a lengthy migration - one that can't be repeated over and over if deadlines loom. It may have poor data integrity, or none at all. Your new, shiny Rails app may have a dozen validations on each table, but those same validations might exclude half the contents of the legacy database. And let's not even get into broken associations.
Fortunately, ActiveRecord makes it incredibly easy to map a second set of database models, and migrating data from one ActiveRecord source to another is faster than you might think - as long as you follow some best practices, of course. Even if you have an extremely large data set, don't assume that raw SQL is the only way to go. As with Rails usage in general, the saved development time (not to mention accurate validations and data conversions) will more than make up for any speed loss.
Migration Schedule
Throughout your migration design, keep in mind that you'll more than likely be running multiple migrations. You will need to test individual model migrations many times (especially if you have data integrity issues), and the final migration may span at least two executions if you're attempting a seamless transition from old to new. If people are still adding new data while you're running the final migration, it probably won't be "final" after all - you'll need to capture those last additions once you've switched over, and that may be a lot of data if your migrations take a long time to process.
Map Legacy Tables With ActiveRecord
In order to set up your legacy tables in ActiveRecord, you'll want to create a legacy base class in order to DRY up your database connection and any other common migration logic. It's a good idea to throw all of this into a directory outside of app/models, too, since it really doesn't have to do with your application logic (it can go anywhere, since you'll be requiring it directly from your migration task).
In your legacy base class, add an establish_connection call with your legacy database specifics, and you'll be ready to roll:
class LegacyBase < ActiveRecord::Base establish_connection( :adapter => "mysql", :host => "localhost", :username => "username", :password => "password", :database => "app_legacy" ) end
Once you have your base class, just create a new class for each legacy table that you need to migrate, preferably with "Legacy" or something similar prepended to the class name (it will help avoid class name collisions and help keep things organized). You'll need to use the set_table_name method if your class names are different than your table names:
class LegacyUser < LegacyBase set_table_name "users" end
Migrate Self
The most important part of the migration, of course, is the actual code that turns old into new. With Ruby at your disposal, it's easy to contain integrity issues and transform values as you need. Take some time to examine your validations and even pore through a bit of the old data to see what kind of issues you'll run into (and prepare for the worst!). In all likelihood, you'll have to do a lot of string processing and nil checking in order to get things transferred correctly. For example, if you have uniqueness validations, you might need to append random numbers to any values marked as non-unique.
It's best to put the actual migration code for each model in its corresponding legacy model class, if possible:
class LegacyUser < LegacyBase set_table_name "users" def migrate_me! user = User.create( :name => self.name, # Some fields can be directly ported :state => self.state.downcase, # Some fields may need string processing, but watch out for nils! (self.state || '').downcase is safer :permalink => self.permalink || "user#{self.id}" # Your app probably validates_presence_of, but don't assume that the old app did! ) # Compensate for uniqueness issues - it's better to have a generic row than no row at all (your associations think so, at least) if user.errors.on(:permalink) user.permalink = "user#{self.id}" user.save end end end
Also, check out the validates_existence_of plugin. It will ensure you know about broken associations, which you can expect a lot of if your predecessors weren't careful with dependency deletions - or if they saved non-validated rows to the database (it happens way more than you want to believe).
Keep Your IDs (And Your Sanity)
If it's possible to maintain IDs from the old database to the new, you should consider doing so. Some old databases will have funky primary keys spanning multiple columns and such, but hopefully yours uses more Rails-friendly numeric IDs. There are a lot of benefits to maintaining them: you won't have to create a lookup hash of old to new every time you create an associated model, you'll be able to easily cross-check old rows to new rows while you're debugging, and so forth. Certainly your migration will blaze along quicker if you don't have to do several association finds for every row.
If you do persist your IDs, be sure to initialize said IDs after creating each model (but before saving) - Rails will not accept an id parameter in an ActiveRecord new/create hash:
# Bad User.create :id => 42, :name => "Zaphod" # Still bad user = User.new :id => 42, :name => "Zaphod" user.save # Good user = User.new :name => "Zaphod" user.id = 42 user.save
Also, you may want to manually alter the auto-increment of your new tables if you're maintaining IDs. It's not a bad idea to give the increment value some extra headroom, especially if you'll need to do a follow-up migration once your new application is live. That way, you won't have conflicting IDs if data has been added to the new application before you've imported the stragglers from the old database.
It's simple enough to run an alter command in your Rails script:
new_increment_value = LegacyUser.find(:first, :order => 'id DESC').id + 10000 ActiveRecord::Base.connection.execute("ALTER TABLE users AUTO_INCREMENT = #{new_increment_value}")
Avoid Validation Slowness
Weeding out and/or correcting bad legacy data with Rails validations is one of the benefits of using ActiveRecord to migrate, but it won't feel like much of a benefit if it takes five seconds to check each row. Before you hit "go" on your million-row migration, look over your validations and ensure that you have properly set up corresponding indexes in your destination database. Methods like validates_uniqueness_of can be especially time-consuming if MySQL has to filesort a million rows by permalink, email address, etc. for each single migrated row.
Iterate Wisely
Collection processing is usually pretty simple in Rails: select your models with ActiveRecord and then use #each or a for loop to step through. If your legacy tables are small enough, you can most likely get away with this in your migration. But if your rows number in the hundreds of thousands or greater, you might find that keeping all of those model objects in memory makes Ruby rather unhappy.
Fortunately, there are ways to iterate more efficiently, no matter how large the table. For starters, we can dramatically reduce our memory cost by selecting only one row at a time. The only real issue becomes how to select each row in turn.
ActiveRecord's offset parameter provides a simple way to get the nth row of a table, which lends itself easily to a loop:
Model.count.times do |row| Model.find(:first, :offset => row).migrate_me! end
The offset parameter translates to a SQL LIMIT call:
SELECT * FROM models LIMIT 3, 1
Unfortunately, using LIMIT in MySQL slows dramatically as row counts increase. You might not notice it at first - the delay is due to parsing up to the limit point, so LIMIT 100, 1 is much quicker than LIMIT 100000, 1 - but eventually it will happen (probably a short while after you've gone to sleep dreaming of a finished migration the next morning).
To speed things up, we can select by ID instead of offset - all we need is a way to iterate through IDs. The most obvious solution is to select all of the IDs first, which is actually not a bad idea if your data is merely very large (as opposed to tremendous, gargantuan, or ludicrous). An array of a million integers may occupy a hundred megabytes of memory or so, but that's small potatoes on a production server.
We can use ActiveRecord::Base's connection object to get an array of integers without dealing with any modeling:
ActiveRecord::Base.connection.select_values('SELECT id FROM models').each do |id| Model.find(id).migrate_me! end
For performance geeks and folks with truly incomprehensibly large data sets, there is a better solution yet: select your first ID and use simple SQL to get successive results (if your legacy data doesn't have unique integer IDs, you still might be able to use this method on a timestamp column or something similar, but otherwise you'll have to get more creative).
last_row_id = -1 current_record = Model.first # Or whichever model you want to start with end_id = Model.last.id # Or whichever model you want to end with while last_row_id < end_id do current_record = Model.find(:first, :conditions => ['id > ?', last_row_id]) current_record.migrate_me! last_row_id = current_record.id end
Rake Is Your Friend
It's best to wrap your migration script in a rake task, which can be called for any environment. If you haven't done so before, simply add a .rake file to lib/tasks, and add something like this:
namespace :legacy do desc "Migrate legacy data" task :migrate => :environment do # The => :environment does the work of loading your Rails environment so you can use ActiveRecord, etc. require 'db/legacy/migrater' # Require your migration class # You can pass values from the command line rake call (e.g. rake legacy:migrate MODEL=LegacyUser START_ROW=500) # as a hash to your migration script using the ENV variable migrater = Migrater.new(ENV) migrater.go! end end
Thereafter, you can merely call RAILS_ENV=test rake legacy:migrate and so forth from the command line in order to process your migrations. Add some conditions for selecting start/end IDs and limiting by model and you'll be all set!
Some Status, Please
Nothing is worse than running a migration and not having any idea when it will finish - or if it will finish. Do yourself a favor and add some informative output to your script, ideally with error-per-model output as well (so you can track down conversion issues). A nice trick to use is print '.' for each model migrated, with the occasional numeric status update (#500 of 100000, etc.) That way, your screen won't scroll to infinity and beyond.
Hey, That Was Easy!
I hope these tips save you some time the next time you migrate a legacy database. Migrations aren't as fun and flashy as the latest AJAX trick, but it's definitely cathartic to watch those status updates march towards completion once you're done!

