I have a sqlite3 database that contains corrupt data. I define “corrupt” as follows:
Data in the name, telephone, latitude, and longitude columns is corrupt if the value is NULL or “” or its length is < 2.
Data in the address column is corrupt if the value is NULL or “”, or if the number of words is < 2 and the length of the word is < 2.
To test this I wrote the following script in Ruby:
require 'sqlite3'
db = SQLite3::Database.new('development.sqlite3')
db.results_as_hash = true;
#Checks for empty strings in name, address, telephone, latitude, longitude
#Also checks length of strings is valid
rows = db.execute(" SELECT * FROM listings WHERE LENGTH('telephone') < 2 OR LENGTH('fax') < 2 OR LENGTH('address') < 2 OR LENGTH('city') < 2 OR LENGTH('province') < 2 OR LENGTH('postal_code') < 2 OR LENGTH('latitude') < 2 OR LENGTH('longitude') < 2
OR name = '' OR address = '' OR telephone = '' OR latitude = '' OR longitude = '' ")
rows.each do |row|
=begin
db.execute("INSERT INTO missing (id, name, telephone, fax, suite, address, city, province, postal_code, latitude, longitude, url) VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", row['id'], row['name'], row['telephone'], row['fax'], row['suite'], row['address'], row['city'], row['province'],
row['postal_code'], row['latitude'], row['longitude'], row['url'] )
=end
id_num = row['id']
puts "Id = #{id_num}"
corrupt_name = row['name']
puts "name = #{corrupt_name}"
corrupt_address = row['address']
puts "address = #{corrupt_address}"
corrupt_tel = row['telephone']
puts "tel = #{corrupt_tel}"
corrupt_lat = row['latitude']
puts "lat = #{corrupt_lat}"
corrupt_long = row['longitude']
puts "long = #{corrupt_long}"
puts '===end===='
end
#After inserting the records into the new table delete them from the old table
=begin
db.execute(" DELETE * FROM listings WHERE LENGTH('telephone') < 2 OR LENGTH('fax') < 2 OR LENGTH('address') < 2 OR
LENGTH('city') < 2 OR LENGTH('province') < 2 OR LENGTH('postal_code') < 2 OR LENGTH('latitude') < 2 OR LENGTH('longitude') < 2
OR name = '' OR address = '' OR telephone = '' OR latitude = '' OR longitude = '' ")
=end
This works, but I'm new to Ruby and DB programming, so I would welcome any suggestions to make this query better.
My ultimate goal is to run a script on my database that tests the validity of the data in it; any rows that are not valid should be copied to a different table and deleted from the first table.
Also, I would like to add to this query a test to check for duplicate entries.
I consider an entry a duplicate if more than one row shares the same name, address, telephone, latitude, and longitude.
I came up with this query, but I'm not sure if it's optimal:
SELECT *
FROM listings L1, listings L2
WHERE L1.name = L2.name
AND L1.telephone = L2.telephone
AND L1.address = L2.address
AND L1.latitude = L2.latitude
AND L1.longitude = L2.longitude
Any suggestions, links, or help would be greatly appreciated.
Your first query doesn't have any significant performance problem. It will run as a sequential scan, evaluating your “is corrupt” predicate for each row. The check for = '' is redundant with length(foo) < 2, since length('') is less than 2. You have a bug where you quoted the field names in your length() calls, so you're evaluating the length of the literal field name instead of the value of the field. You have also failed to test for NULL, which is a value distinct from ''. You can use the coalesce function to convert NULL to '' and so catch NULLs with the length check. You also don't seem to have addressed the special word-based rule for address. The latter is trouble unless you extend SQLite with a regexp function. I suggest approximating it with LIKE or GLOB.
Your find-duplicates query doesn't work, because every row matches itself when you self-join on equality, so every record is reported. You need to exclude the record under test on one side of the join. Typically this is done by excluding on the primary key. You haven't mentioned whether the table has a primary key, but IIRC SQLite can give you a proxy for one with ROWID.
BTW, while you stressed efficiency in your question, it’s important to make your code correct before you worry about efficiency.