I have a database containing three tables:
- practices – 8 fields
- patients – 47 fields
- exacerbations – 11 fields
The majority of the fields in these tables are stored as varchar; the remaining fields are integers, doubles and dates.
I have to transform this data into numerically classified data so that a statistician can use it to look for patterns. To achieve this I will have to convert each varchar field into an integer that represents the class its string value belongs to. An example is ‘Severity’, which has the following possible string values:
- Mild
- Moderate
- Severe
- Very Severe
This field in the patients table has a finite list of possible string values. Other fields have an open-ended set of string values that cannot be classified until they are encountered in my database (unless I implement some form of intelligent approach).
For the time being I am just trying to work out the best approach to converting each varchar field, for all entries in each of the three tables, to numeric values. The pseudocode I have in my head so far is as follows (it’s not complete):
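To make the two cases concrete, here is a minimal sketch in Python (chosen only for brevity; the same idea carries over to PHP). The integer codes, the `classify_severity` helper and the `OpenClassifier` class are all hypothetical names and values introduced for illustration. A fixed dictionary handles a closed list such as Severity, while an open-ended field is given the next free integer each time a new string is encountered.

```python
# Assumed ordinal coding for the closed list; the numbers are illustrative only.
SEVERITY_CODES = {"Mild": 1, "Moderate": 2, "Severe": 3, "Very Severe": 4}


def classify_severity(value):
    """Return the integer class for a Severity string, or None if unknown."""
    if value is None:
        return None
    return SEVERITY_CODES.get(value.strip())


class OpenClassifier:
    """Assigns the next free integer to each new string as it is encountered."""

    def __init__(self):
        self.codes = {}  # string value -> integer class

    def classify(self, value):
        if value is None:
            return None
        key = value.strip()
        if key not in self.codes:
            # First time this string is seen: give it the next unused code.
            self.codes[key] = len(self.codes) + 1
        return self.codes[key]
```

For the open-ended case the mapping would need to be persisted between runs (for example in a lookup table) so that a string keeps the same code from month to month.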
function profileDatabase
    for each table in database
        for each field that is of type varchar
            select all distinct values and insert into classification table for that field
        end for
    end for
end function
function classifyDatabase
    for each table in database
        for each field that is of type varchar
            // do something efficient to build an insert string to place into new table
        end for
    end for
end function
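As a rough illustration of the profiling step, the sketch below uses Python with an SQLite database standing in for the real one (the actual database engine is not stated here, so that is an assumption). The `class_<table>_<field>` naming scheme and the helper names are hypothetical. Each varchar field gets its own lookup table, and re-running the profile only adds values that have not been seen before, so existing codes stay stable. The classification step then becomes a single join per field rather than a row-by-row loop.

```python
import sqlite3


def varchar_fields(conn, table):
    """List columns of `table` whose declared type looks like varchar/text."""
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [r[1] for r in rows if "CHAR" in r[2].upper() or "TEXT" in r[2].upper()]


def profile_table(conn, table):
    """Create or extend one lookup table per varchar field of `table`."""
    for field in varchar_fields(conn, table):
        lookup = f"class_{table}_{field}"  # hypothetical naming scheme
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{lookup}" '
            "(id INTEGER PRIMARY KEY, value TEXT UNIQUE)"
        )
        # Values already present are skipped, so their ids never change.
        conn.execute(
            f'INSERT OR IGNORE INTO "{lookup}" (value) '
            f'SELECT DISTINCT "{field}" FROM "{table}" WHERE "{field}" IS NOT NULL'
        )


def classified_rows(conn, table, key, field):
    """Return (key, class id) pairs for one field using a single join."""
    lookup = f"class_{table}_{field}"
    return conn.execute(
        f'SELECT t."{key}", c.id FROM "{table}" t '
        f'LEFT JOIN "{lookup}" c ON c.value = t."{field}" '
        f'ORDER BY t."{key}"'
    ).fetchall()
```

Table and field names are interpolated directly into the SQL here, which is only reasonable because they come from the schema itself and not from user input.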
Can someone suggest the best way of performing this process so that it is efficient, given that there are currently in excess of 100 practices, 15,000 patients and 55,000 exacerbations in the system? I have no need to implement this in PHP, but I would prefer to do so. Any pointers as to how to structure this would be great, as I am not sure my approach is the best approach.
This process will have to run every month for the next two years as the database grows to have a total of 100,000 patients.
I have managed to build my own solution to this problem, which runs in reasonable time. For anyone interested, or anyone who may encounter a similar issue, here is the approach I have used:
The solution is a PHP script that is run as a cron job by calling php scriptName.php [database-name]. The script builds a classified table for each table in the database (excluding the lookup tables used by this process). Setting up each classification creates a new table that mirrors the structure of the base table but allows NULL values in every field. It then creates a blank row for each row found in the base table. The process then analyses each table field by field, updating each row with the correct class for that field.
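The steps described above can be sketched roughly as follows. This is not the PHP script itself; it is a small Python illustration against SQLite, and the `build_classified` function, the `<table>_classified` naming and the way codes are assigned (sorted order of the distinct values) are all assumptions made to keep the example short. It creates the nullable mirror table, inserts one blank row per base row, and then fills in one field at a time.

```python
import sqlite3


def build_classified(conn, table, key, text_fields):
    """Build "<table>_classified" field by field; assumes text_fields is non-empty."""
    target = f"{table}_classified"  # hypothetical naming scheme
    cols = ", ".join(f'"{f}" INTEGER' for f in text_fields)

    # Mirror table: same key, every classified field nullable.
    conn.execute(f'DROP TABLE IF EXISTS "{target}"')
    conn.execute(f'CREATE TABLE "{target}" ("{key}" INTEGER PRIMARY KEY, {cols})')

    # One blank row per base row; the class columns start out as NULL.
    conn.execute(f'INSERT INTO "{target}" ("{key}") SELECT "{key}" FROM "{table}"')

    for field in text_fields:
        values = [
            r[0]
            for r in conn.execute(
                f'SELECT DISTINCT "{field}" FROM "{table}" '
                f'WHERE "{field}" IS NOT NULL ORDER BY "{field}"'
            )
        ]
        # Illustrative coding only: 1, 2, 3... in sorted order of the values.
        for code, value in enumerate(values, start=1):
            conn.execute(
                f'UPDATE "{target}" SET "{field}" = ? WHERE "{key}" IN '
                f'(SELECT "{key}" FROM "{table}" WHERE "{field}" = ?)',
                (code, value),
            )
    conn.commit()
    return target
```

Issuing one UPDATE per distinct value, rather than one per row, keeps the number of statements tied to the number of classes instead of the number of patients. Codes derived from sorted order would shift when new values appear, so a real run would read them from the persistent lookup tables instead.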
I am sure I can optimise this function to improve on the current complexity, but for now I shall use this approach until the run time of the script falls outside an acceptable range.
Script code: