I’m working on a Rails app (Ruby 1.9.2 / Rails 3.0.3) that keeps track of people and their memberships in different teams over time. I’m having trouble coming up with a scalable way to combine duplicate Person objects. By ‘combine’ I mean deleting all but one of the duplicate Person objects and updating all references to point to the remaining copy of that Person. Here’s some code:
Models:
Person.rb
class Person < ActiveRecord::Base
  has_many :rostered_people, :dependent => :destroy
  has_many :rosters, :through => :rostered_people
  has_many :crews, :through => :rosters

  # Returns the crew this person was rostered on in the given year, or nil.
  def crew(year = Time.now.year)
    all_rosters = RosteredPerson.find_all_by_person_id(id).collect { |t| t.roster_id }
    r = Roster.find_by_id_and_year(all_rosters, year)
    r and r.crew
  end
end
Crew.rb
class Crew < ActiveRecord::Base
  has_many :rosters
  has_many :people, :through => :rosters
end
Roster.rb
class Roster < ActiveRecord::Base
  has_many :rostered_people, :dependent => :destroy
  has_many :people, :through => :rostered_people
  belongs_to :crew
end
RosteredPerson.rb
class RosteredPerson < ActiveRecord::Base
  belongs_to :roster
  belongs_to :person
end
Person objects can be created with just a first and last name, but they have one truly unique field called iqcs_num (think of it like a social security number), which can optionally be set in either the create or update action.
So within the create and update actions, I would like to implement a check for duplicate Person objects, delete the duplicates, then update all of the crew and roster references to point to the remaining Person.
Would it be safe to use .update_all on each model? That seems kind of brute force, especially since I will probably add more models in the future that depend on Person and I don’t want to have to remember to maintain the find_duplicate function.
Thanks for the help!
The ‘scalable’ way to deal with this is to make the de-duplication process part of the app’s normal function – whenever you save a record, make sure it’s not a duplicate. You can do this by adding a callback to the Person model. Perhaps something like this:
You’ll want to make sure that you index the table you store Person objects in on the iqcs_num column, so that this lookup stays efficient as the number of records grows – it’s going to be performed every time you update a Person record, after all.
I don’t know that you can get out of keeping the callback up-to-date – it’s entirely likely that different sorts of associated objects will have to be moved differently. On the other hand, it only exists in one place, and it’s the same place you’d be adding the associations anyway – in the model.
Finally, to make sure your code is working, you’ll probably want to add a validation on the Person model that prevents duplicates from existing. Something like: