I am writing a Python script that uses data of dubious quality. The data is being stored in an SQLite database.
I would like a compact way to specify constraints on the data. The constraints are of two types:
- Data errors: an error message will be issued. For example:
  - “Column A must be an integer in the range 0-10”
  - “Column B must be a non-blank string”, and so forth.
- Data quality warnings (“are you sure this is right?”): a warning message will be issued. The constraints would be things like:
  - “warn if Column C has a default value of 0” (are you sure the typist didn’t miss an entry?)
  - “warn if the number in Column D is unusually large (> 1000)”.
Ideally, I would like to express my constraints in a human-readable format like:
'kV' MUST BE float IN RANGE 0-10
'Rating' SHOULD NOT BE DEFAULT 1.0
'Description' SHOULD NOT BE DEFAULT ""
… but I’ll take any improvement on my current approach (below). I would be happy to accept a solution that involves enforcing the constraints in either Python or a SQLite schema.
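To make the idea concrete, here is a minimal sketch of how rules in that format might be parsed into check functions. It is only an illustration under stated assumptions: the grammar covers just the two rule shapes shown above, and the names `parse_rule`, `RANGE_RULE`, `DEFAULT_RULE` and the "error"/"warning" labels are hypothetical, not part of any existing library.

```python
import ast
import re

# Hypothetical grammar: only the two rule shapes from the examples are recognised.
RANGE_RULE = re.compile(
    r"^'(?P<column>[^']+)' MUST BE (?P<type>int|float) "
    r"IN RANGE (?P<lower>-?\d+(?:\.\d+)?)-(?P<upper>-?\d+(?:\.\d+)?)$"
)
DEFAULT_RULE = re.compile(
    r"^'(?P<column>[^']+)' SHOULD NOT BE DEFAULT (?P<default>.+)$"
)
TYPES = {"int": int, "float": float}


def parse_rule(line):
    """Return (column, severity, check); check(value) gives a message or None."""
    line = line.strip()

    match = RANGE_RULE.match(line)
    if match:
        expected = TYPES[match.group("type")]
        lower = float(match.group("lower"))
        upper = float(match.group("upper"))

        def check_range(value):
            # Strict type test, since SQLite may hold text in a numeric column.
            if type(value) is not expected:
                return "not of type %s" % expected.__name__
            if not lower <= value <= upper:
                return "out of range [%g-%g]" % (lower, upper)
            return None

        return match.group("column"), "error", check_range

    match = DEFAULT_RULE.match(line)
    if match:
        # literal_eval turns the text 1.0 into a float and "" into an empty string.
        default = ast.literal_eval(match.group("default"))

        def check_default(value):
            if value == default:
                return "default value of %r - make sure this is what you want." % (default,)
            return None

        return match.group("column"), "warning", check_default

    raise ValueError("unrecognised rule: %r" % line)


rules = """
'kV' MUST BE float IN RANGE 0-10
'Rating' SHOULD NOT BE DEFAULT 1.0
'Description' SHOULD NOT BE DEFAULT ""
"""
parsed = [parse_rule(line) for line in rules.strip().splitlines()]
```

Each entry in `parsed` can then be applied to a row by looking up the column and calling the check; a `None` result means the value passed.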
Here’s what I’m using at the moment:
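import sqlite3


def is_number_in_range(number, expected_type, lower, upper):
    # Strict type check on purpose: SQLite may have stored a string in a numeric column.
    if type(number) != expected_type:
        return "not of type %s" % expected_type.__name__
    elif (number < lower) or (number > upper):
        # Report the bounds in lower-upper order.
        return "%s out of range [%s-%s]." % (expected_type.__name__, lower, upper)
    else:
        return "OK"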
def not_default(value, expected_type, default_value):
    if type(value) != expected_type:
        return "not of type %s" % expected_type.__name__
    elif value == default_value:
        return "default value of %s - make sure this is what you want." % default_value
    else:
        return "OK"
def Check_Cable_Lib(db_conn):
    res = db_conn.execute("SELECT * FROM Lib_Cable LIMIT 1")
    # Each constraint is a (column name, validation function) pair.
    constraints = (
        ('kV', lambda x: is_number_in_range(x, float, 0, 1000)),
        ('kA1', lambda x: is_number_in_range(x, float, 0, 10)),
        ('kA1', lambda x: not_default(x, float, 1.0)),
    )
    for cable_type in res:
        for constraint_variable, constraint_function in constraints:
            constraint_data = cable_type[constraint_variable]
            validation_message = constraint_function(constraint_data)
            print("%s = %s : %s" % (constraint_variable, constraint_data, validation_message))
stage1_db_path = "stage1.sqlite3"
db_conn = sqlite3.connect(stage1_db_path)
# sqlite3.Row lets columns be looked up by name, e.g. cable_type['kV'].
db_conn.row_factory = sqlite3.Row
Check_Cable_Lib(db_conn)
Example output:
kV = 11.0 : OK
kA1 = 1.0 : OK
kA1 = 1.0 : default value of 1.0 - make sure this is what you want.
EDIT: I’m aware it’s impolite to explicitly check types in Python. However, for the sake of the code that uses the data, I need to check that SQLite hasn’t stored unexpected things in the columns (“hello world” in an INT column, etc.). Remember that the data is of dubious quality and SQLite will happily put any type of data in any column. Catching these data entry errors is one of the objectives of this code.
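The type problem can also be seen from the SQL side. The sketch below uses SQLite's built-in `typeof()` function to list rows whose stored value is not of the expected storage class. The in-memory table is a made-up miniature of `Lib_Cable` (the real schema is not shown here), and the helper name `rows_with_wrong_type` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row

# Assumed miniature of the Lib_Cable table, for illustration only.
conn.execute("CREATE TABLE Lib_Cable (Name TEXT, kV REAL, kA1 REAL)")
conn.executemany(
    "INSERT INTO Lib_Cable VALUES (?, ?, ?)",
    [("ok", 11.0, 1.0), ("bad", "hello world", 2.5)],
)


def rows_with_wrong_type(conn, table, column, expected):
    """Return rows where typeof(column) differs from the expected storage class.

    Table and column names cannot be bound as parameters, so they are
    interpolated here and assumed to come from trusted code, not user input.
    NULL values are reported too, because typeof(NULL) is 'null'.
    """
    sql = (
        'SELECT rowid AS row_id, "{col}" AS value, typeof("{col}") AS actual '
        'FROM "{table}" WHERE typeof("{col}") != ?'
    ).format(col=column, table=table)
    return conn.execute(sql, (expected,)).fetchall()


bad = rows_with_wrong_type(conn, "Lib_Cable", "kV", "real")
for row in bad:
    print("row %s: kV holds %r stored as %s" % (row["row_id"], row["value"], row["actual"]))
```

The string could not be converted to a number, so SQLite stored it as text even though the column was declared `REAL`; that is exactly the kind of entry the strict type checks are meant to catch.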
Combining @onedaywhen’s idea to use SQL to check the constraints, and @ABS’s idea to define the constraints in a more readable way, here’s what I’ve come up with.
Wrapping it up in a class probably isn’t particularly useful (as used in the example it’s a glorified wrapper around the check() function), but it means I can bake some slightly nicer output formatting into it later.
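For readers who want something concrete, here is a minimal sketch of how the two ideas might fit together: readable rule definitions on the Python side, with the actual checking done by SQL. It is not the code referred to above. The class name `ConstraintChecker`, the rule tuples, and the in-memory table are all assumptions made for illustration.

```python
import sqlite3


class ConstraintChecker(object):
    """Hypothetical sketch: each rule is
    (severity, description, SQL condition that is true for a violating row)."""

    def __init__(self, table, rules):
        self.table = table
        self.rules = rules

    def check(self, conn):
        """Return a list of (severity, description, violating_row_count)."""
        results = []
        for severity, description, violation_sql in self.rules:
            # Rule text is interpolated into the query, so rules must come
            # from trusted code, never from user input.
            sql = 'SELECT COUNT(*) FROM "%s" WHERE %s' % (self.table, violation_sql)
            count = conn.execute(sql).fetchone()[0]
            if count:
                results.append((severity, description, count))
        return results


cable_rules = [
    ("ERROR", "kV must be a float in the range 0-1000",
     "typeof(kV) != 'real' OR kV NOT BETWEEN 0 AND 1000"),
    ("WARNING", "kA1 should not be the default 1.0",
     "kA1 = 1.0"),
]

# Assumed miniature of the Lib_Cable table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Lib_Cable (kV REAL, kA1 REAL)")
conn.executemany(
    "INSERT INTO Lib_Cable VALUES (?, ?)",
    [(11.0, 1.0), ("hello world", 2.5), (5000.0, 3.0)],
)

results = ConstraintChecker("Lib_Cable", cable_rules).check(conn)
for severity, description, count in results:
    print("%s: %s (%d row(s))" % (severity, description, count))
```

Keeping the description next to the SQL condition means the rule list doubles as documentation, and the severity label keeps errors and warnings in a single table of rules.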