Birds

Migrating an existing database into a new application can be a daunting task, especially if it's coming from another development team or platform. The database may contain multitudes of data, ensuring a lengthy migration - one that can't be repeated over and over if deadlines loom. It may have poor data integrity, or none at all. Your new, shiny Rails app may have a dozen validations on each table, but those same validations might exclude half the contents of the legacy database. And let's not even get into broken associations.

Fortunately, ActiveRecord makes it incredibly easy to map a second set of database models, and migrating data from one ActiveRecord source to another is faster than you might think - as long as you follow some best practices, of course. Even if you have an extremely large data set, don't assume that raw SQL is the only way to go. As with Rails usage in general, the saved development time (not to mention accurate validations and data conversions) will more than make up for any speed loss.

Migration Schedule

Throughout your migration design, keep in mind that you'll more than likely be running multiple migrations. You will need to test individual model migrations many times (especially if you have data integrity issues), and the final migration may span at least two executions if you're attempting a seamless transition from old to new. If people are still adding new data while you're running the final migration, it probably won't be "final" after all - you'll need to capture those last additions once you've switched over, and that may be a lot of data if your migrations take a long time to process.

Map Legacy Tables With ActiveRecord

In order to set up your legacy tables in ActiveRecord, you'll want to create a legacy base class in order to DRY up your database connection and any other common migration logic. It's a good idea to throw all of this into a directory outside of app/models, too, since it really doesn't have to do with your application logic (it can go anywhere, since you'll be requiring it directly from your migration task).

In your legacy base class, add an establish_connection call with your legacy database specifics, and you'll be ready to roll:

class LegacyBase < ActiveRecord::Base
  establish_connection(
    :adapter => "mysql",
      :host => "localhost",
      :username => "username",
      :password => "password",
      :database => "app_legacy"
    )
end

Once you have your base class, just create a new class for each legacy table that you need to migrate, preferably with "Legacy" or something similar prepended to the class name (it will help avoid class name collisions and help keep things organized). You'll need to use the set_table_name method if your class names are different than your table names:

class LegacyUser < LegacyBase
  set_table_name "users"
end

Migrate Self

The most important part of the migration, of course, is the actual code that turns old into new. With Ruby at your disposal, it's easy to contain integrity issues and transform values as you need. Take some time to examine your validations and even pore through a bit of the old data to see what kind of issues you'll run into (and prepare for the worst!). In all likelihood, you'll have to do a lot of string processing and nil checking in order to get things transferred correctly. For example, if you have uniqueness validations, you might need to append random numbers to any values marked as non-unique.

It's best to put the actual migration code for each model in its corresponding legacy model class, if possible:

class LegacyUser < LegacyBase
  set_table_name "users"

  def migrate_me!
    user = User.create(
      :name => self.name,   # Some fields can be directly ported
      :state => self.state.downcase,   # Some fields may need string processing, but watch out for nils! (self.state || '').downcase is safer
      :permalink => self.permalink || "user#{self.id}"   # Your app probably validates_presence_of, but don't assume that the old app did!
    )
    # Compensate for uniqueness issues - it's better to have a generic row than no row at all (your associations think so, at least)
    if user.errors.on(:permalink)
      user.permalink = "user#{self.id}"
      user.save
    end
  end
end

Also, check out the validates_existence_of plugin. It will ensure you know about broken associations, which you can expect a lot of if your predecessors weren't careful with dependency deletions - or if they saved non-validated rows to the database (it happens way more than you want to believe).

Keep Your IDs (And Your Sanity)

If it's possible to maintain IDs from the old database to the new, you should consider doing so. Some old databases will have funky primary keys spanning multiple columns and such, but hopefully yours uses more Rails-friendly numeric IDs. There are a lot of benefits to maintaining them: you won't have to create a lookup hash of old to new every time you create an associated model, you'll be able to easily cross-check old rows to new rows while you're debugging, and so forth. Certainly your migration will blaze along quicker if you don't have to do several association finds for every row.

If you do persist your IDs, be sure to initialize said IDs after creating each model (but before saving) - Rails will not accept an id parameter in an ActiveRecord new/create hash:

# Bad
User.create :id => 42, :name => "Zaphod"

# Still bad
user = User.new :id => 42, :name => "Zaphod"
user.save

# Good
user = User.new :name => "Zaphod"
user.id = 42
user.save

Also, you may want to manually alter the auto-increment of your new tables if you're maintaining IDs. It's not a bad idea to give the increment value some extra headroom, especially if you'll need to do a follow-up migration once your new application is live. That way, you won't have conflicting IDs if data has been added to the new application before you've imported the stragglers from the old database.

It's simple enough to run an alter command in your Rails script:

new_increment_value = LegacyUser.find(:first, :order => 'id DESC').id + 10000
ActiveRecord::Base.connection.execute("ALTER TABLE users AUTO_INCREMENT = #{new_increment_value}")

Avoid Validation Slowness

Weeding out and/or correcting bad legacy data with Rails validations is one of the benefits of using ActiveRecord to migrate, but it won't feel like much of a benefit if it takes five seconds to check each row. Before you hit "go" on your million-row migration, look over your validations and ensure that you have properly set up corresponding indexes in your destination database. Methods like validates_uniqueness_of can be especially time-consuming if MySQL has to filesort a million rows by permalink, email address, etc. for each single migrated row.

Iterate Wisely

Collection processing is usually pretty simple in Rails: select your models with ActiveRecord and then use #each or a for loop to step through. If your legacy tables are small enough, you can most likely get away with this in your migration. But if your rows number in the hundreds of thousands or greater, you might find that keeping all of those model objects in memory makes Ruby rather unhappy.

Fortunately, there are ways to iterate more efficiently, no matter how large the table. For starters, we can dramatically reduce our memory cost by selecting only one row at a time. The only real issue becomes how to select each row in turn.

ActiveRecord's offset parameter provides a simple way to get the nth row of a table, which lends itself easily to a loop:

Model.count.times do |row|
  Model.find(:first, :offset => row).migrate_me!
end

The offset parameter translates to a SQL LIMIT call:

SELECT * FROM models LIMIT 3, 1

Unfortunately, using LIMIT in MySQL slows dramatically as row counts increase. You might not notice it at first - the delay is due to parsing up to the limit point, so LIMIT 100, 1 is much quicker than LIMIT 100000, 1 - but eventually it will happen (probably a short while after you've gone to sleep dreaming of a finished migration the next morning).

To speed things up, we can select by ID instead of offset - all we need is a way to iterate through IDs. The most obvious solution is to select all of the IDs first, which is actually not a bad idea if your data is merely very large (as opposed to tremendous, gargantuan, or ludicrous). An array of a million integers may occupy a hundred megabytes of memory or so, but that's small potatoes on a production server.

We can use ActiveRecord::Base's connection object to get an array of integers without dealing with any modeling:

ActiveRecord::Base.connection.select_values('SELECT id FROM models').each do |id|
  Model.find(id).migrate_me!
end

For performance geeks and folks with truly incomprehensibly large data sets, there is a better solution yet: select your first ID and use simple SQL to get successive results (if your legacy data doesn't have unique integer IDs, you still might be able to use this method on a timestamp column or something similar, but otherwise you'll have to get more creative).

last_row_id = -1
current_record = Model.first   # Or whichever model you want to start with
end_id = Model.last.id   # Or whichever model you want to end with

while last_row_id < end_id do
  current_record = Model.find(:first, :conditions => ['id > ?', last_row_id])
  current_record.migrate_me!
  last_row_id = current_record.id
end

Rake Is Your Friend

It's best to wrap your migration script in a rake task, which can be called for any environment. If you haven't done so before, simply add a .rake file to lib/tasks, and add something like this:

namespace :legacy do
  desc "Migrate legacy data"
  task :migrate => :environment do   # The => :environment does the work of loading your Rails environment so you can use ActiveRecord, etc.
    require 'db/legacy/migrater'   # Require your migration class

    # You can pass values from the command line rake call (e.g. rake legacy:migrate MODEL=LegacyUser START_ROW=500) 
    # as a hash to your migration script using the ENV variable
    migrater = Migrater.new(ENV)
    migrater.go!
  end
end

Thereafter, you can merely call RAILS_ENV=test rake legacy:migrate and so forth from the command line in order to process your migrations. Add some conditions for selecting start/end IDs and limiting by model and you'll be all set!

Some Status, Please

Nothing is worse than running a migration and not having any idea when it will finish - or if it will finish. Do yourself a favor and add some informative output to your script, ideally with error-per-model output as well (so you can track down conversion issues). A nice trick to use is print '.' for each model migrated, with the occasional numeric status update (#500 of 100000, etc.) That way, your screen won't scroll to infinity and beyond.

Hey, That Was Easy!

I hope these tips save you some time the next time you migrate a legacy database. Migrations aren't as fun and flashy as the latest AJAX trick, but it's definitely cathartic to watch those status updates march towards completion once you're done!



Honeycomb

This is the third article in a series about best practices for creating Rails forms. Be sure to check out Pretty Data, Pretty Code and Modeling All Form Data for more related techniques.

Revisiting Our Form

Continuing with the example from my last article, I have this simple form:

<% form_for :product, :url => products_path do |f| %>
  <div class="form_item text_field">
    <label for="product[name]">Name:</label>
    <%= f.text_field :name %>
  </div>
  <div class="form_item text_field">
    <label for="product[price]">Price:</label>
    <%= f.text_field :price %>
  </div>
  <div class="form_item text_field">
    <label for="product[features]">Features (1 per line):</label>
    <%= f.text_area :features %>
  </div>
  <div class="form_item submit_button">
    <%= f.submit "Create Product" %>
  </div>
<% end %>

Note: I changed the original form's wrapper tags from <p> to <div>. The reason is that, if you're using the built-in Rails error message helpers, by default it will wrap erroneous fields in <div> tags. Since <div> tags can't nest within <p> tags, this can cause serious layout-breaking problems.

There's a lot of duplication here in terms of the wrapper code. Each field is wrapped in a <div> with a class of form_item as well as a descriptor of what kind of field it contains. I use this information for CSS styling of specific kinds of inputs. The labels are also fairly repetitive as well, and these repetitions would become more obnoxious were this a longer form.

Removing Repetition

What if we could generate this same form code while removing the repetition? It's possible, and Rails gives us a great hook for just this scenario by allowing us to build custom Form Builders. A custom Form Builder is simply a subclass of ActionView::Helpers::FormBuilder that can alter and extend the abilities of the regular Form Builder. In our current form, the f block variable is an instance of FormBuilder, so methods like text_field and submit are all instance methods of the FormBuilder class. Let's override these methods to not only output the field markup, but to also output the wrapper div:

# in /helpers/application_helper.rb
class WrapperFormBuilder < ActionView::Helpers::FormBuilder
  METHODS_TO_OVERRIDE = %w{text_field text_area password_field file_field date_select datetime_select submit}

  METHODS_TO_OVERRIDE.each do |method_name|
    src =<<END_SRC
      def #{method_name}_with_wrapper(field, options={})
        # allow explicit setting of label text with options[:label]
        field_label = if '#{method_name}' == 'submit'
          '' # no label for submit inputs
        elsif options[:label]
          label(field, options.delete(:label))
        else
          label(field) + ":" # Adds colon as default
        end

        # get unwrapped field
        field_markup = #{method_name}_without_wrapper(field, options)

        # return wrapped field (@template gives us access to helper methods in this class)
        @template.content_tag(:div, field_label + field_markup, :class => "form_item #{method_name}")
      end
    END_SRC
    class_eval src, __FILE__, __LINE__
    alias_method_chain method_name.to_sym, :wrapper
  end

end

For our form, we can now implement our new class like this:

<% form_for :product, :url => products_path, :builder => WrapperFormBuilder do |f| %>

Another possibility is to create a new method to replace form_for:

# in helpers/application_helper.rb
def wrapper_form_for(name, object=nil, options={}, &proc)
  form_for(name, object, options.merge(:builder => WrapperFormBuilder), &proc)
end

Using this new helper method, our form now looks like this:

<% wrapper_form_for :product, :url => products_path do |f| %>
  <%= f.text_field :name %>
  <%= f.text_field :price %>
  <%= f.text_area :features, :label => "Features (1 per line):" %>
  <%= f.submit "Create Product" %>
<% end %>

If you need to create a form field without a wrapper (perhaps to use a non-conforming wrapper), you can still access the original methods like f.text_field_without_wrapper or f.submit_without_wrapper.

I have found custom Form Builders to be powerful tools for speeding up form development, DRYing up code and keeping consistency between forms and developers. This is a fairly basic example, but the sky is the limit for what you can implement using these techniques.



Rose

In my last article I wrote about using data modeling to clean up form-related code and to take advantage of powerful helpers like form_for and error_messages_for. This solves the significant problem of isolating business logic into a model class, but another problem remains — how can we make our form data pretty without trashing our model's code with view logic?

Beautifying the Data

I have a simple Product model with rows for name, price and features. Setting and displaying the price field is tricky because I need to remove currency formatting before storing it in the database as a decimal, and I want to reformat it when displaying the current price in an 'Edit' form. The features field is also tricky. I want the user to enter each product feature ("Slices and Dices", "Purees Anything", etc.) on a separate line of the textarea and for that to be split into a serialized array that I will store in the database.

My first instinct is to create virtual attributes in my model to handle the logic of deformatting/reformatting this data. Here's how it looks:

class Product < ActiveRecord::Base
  serialize :features, Array
  
  validates_presence_of :name
  validates_numericality_of :price
  
  # need this for formatting
  def helpers
    ActionController::Base.helpers
  end
  
  def price_field=(p)
    # expecting p to be something like "$4,000.00", set price to 4000.00
    p.gsub!(/[^0-9.]/, '')
    self.price = p.to_f
  end
  
  def price_field
    helpers.number_to_currency(self.price)
  end
  
  def features_field=(str)
    # expecting string with features separated by newlines
    self.features = str.split("\n").collect {|f| f.strip }
  end
  
  def features_field
    self.features.join("\n")
  end
end
<%# the product form %>
<% form_for :product, url => products_path do |f| %>
  <p class="form_item text_field">
    <label for="product[name]">Name:</label>
    <%= f.text_field :name %>
  </p>
  <p class="form_item text_field">
    <label for="product[price_field]">Price:</label>
    <%= f.text_field :price_field %>
  </p>
  <p class="form_item text_field">
    <label for="product[features_field]">Features (1 per line):</label>
    <%= f.text_area :features_field %>
  </p>
  <p class="form_item submit_button">
    <%= f.submit "Create Product" %>
  </p>
<% end %>

The good news is that we have a very simple, straightforward looking view with no logic stuffed in it. Our controller is also completely vanilla, so I didn't bother to even show it.

Our model, however, is getting cluttered. What once was a haven for business logic is now filled with both business and view logic. I'm also a little concerned about having to use the #price_field and #features_field methods, since it adds complexity to what should be a simple object API. Will this cause confusion with my fellow programmers?

Beautifying the Code

What if we extracted the view logic into its own object? By using a presenter object as an adapter between our Product model and our form we can isolate the view logic from the business logic. Here's how it looks:

class Product < ActiveRecord::Base
  serialize :features, Array
  
  validates_presence_of :name
  validates_numericality_of :price
end
class ProductPresenter
  attr_reader :product
  
  def initialize(product)
    @product = product
  end
  
  # need this for formatting
  def helpers
    ActionController::Base.helpers
  end
  
  # for mass assignment
  def attributes=(hash)
    hash.each_pair do |key, val|
      self.send("#{key}=".to_sym, val)
    end
  end
  
  def price=(p)
    # expecting p to be something like "$4,000.00", set price to 4000.00
    p.gsub!(/[^0-9.]/, '')
    @product.price = p.to_f
  end
  
  def price
    helpers.number_to_currency(@product.price)
  end
  
  def features=(str)
    # expecting string with features separated by newlines
    @product.features = str.split("\n").collect {|f| f.strip }
  end
  
  def features
    @product.features.join("\n")
  end
  
  # proxy all other methods to @product
  def method_missing(method_name, *args, &block)
    @product.send(method_name, *args, &block)
  end
end
class ProductsController < ApplicationController
  def new
    @product = ProductPresenter.new(Product.new)
  end
  
  def edit
    product = Product.find(params[:id])
    @product = ProductPresenter.new(product)
  end
  
  def create
    @product = ProductPresenter.new(Product.new)
    @product.attributes = params[:product]
    if @product.save
      # success
    else
      render :action => 'new'
    end
  end
  
  def update
    p = Product.find(params[:id])
    @product = ProductPresenter.new(p)
    @product.attributes = params[:product]
    if @product.save
      # success
    else
      render :action => 'new'
    end
  end
end
<%# the product form %>
<% form_for :product, url => products_path do |f| %>
  <p class="form_item text_field">
    <label for="product[name]">Name:</label>
    <%= f.text_field :name %>
  </p>
  <p class="form_item text_field">
    <label for="product[price]">Price:</label>
    <%= f.text_field :price %>
  </p>
  <p class="form_item text_field">
    <label for="product[features]">Features (1 per line):</label>
    <%= f.text_area :features %>
  </p>
  <p class="form_item submit_button">
    <%= f.submit "Create Product" %>
  </p>
<% end %>

Our business and view logic have been effectively separated, our form code is clearer and the controller is only slightly more complicated. Because the ProductPresenter is acting as a proxy object to its Product object, we can simply treat it like a product thanks to 'duck typing'. While this example is a little contrived, using presenter classes can make or break your code base as you create complex model objects with equally complex visual representations.




RSS Feed


CATEGORIES


ARCHIVES


BOOKMARKED


Add to Technorati Favorites