As part of a data import process, we need to “massage” text to have it conform to certain standards. The process involves downloading an XML file from a remote server, inserting the data into “working” tables for processing, then moving the data from the “working” tables to the live tables.
Case in point: there are instances of the slanted quote character (’) that we want to replace with the straight quote character (').
We also want this to be easy to extend. Adding new replacements or deletions should not require a rebuild of the import process project.
There are two schools of thought on our team:
- Perform the massaging in code. Have an XML file in the project that lists the various characters we want to replace or remove. Whenever we need to add new replacements or deletions, we can update the file.
- Perform the massaging in SQL. When we transfer the data from the “working” tables to the “live” tables, run each field through a SQL function that performs the replacements and deletions, which we can edit at any time.
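To make the first option concrete, here is a minimal sketch of what we have in mind, written in Python purely for illustration. The XML layout (`<replacements>` containing `<rule find="..." replace="..."/>` elements) and the names `load_rules` and `massage` are hypothetical, and the sketch assumes every rule targets a single character.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file content. In practice this would live on disk next to
# the importer so it can be edited without rebuilding the project.
RULES_XML = """
<replacements>
    <rule find="&#8217;" replace="'"/>
    <rule find="&#8216;" replace="'"/>
    <rule find="&#173;" replace=""/>
</replacements>
"""


def load_rules(xml_text):
    """Parse the rules XML into a {code point: replacement} table for str.translate."""
    root = ET.fromstring(xml_text)
    table = {}
    for rule in root.iter("rule"):
        find = rule.get("find", "")
        replace = rule.get("replace", "")  # an empty replacement deletes the character
        if len(find) != 1:
            # Assumption of this sketch: each rule matches exactly one character.
            raise ValueError(f"expected a single character, got {find!r}")
        table[ord(find)] = replace
    return table


def massage(text, table):
    """Apply every single-character rule in one pass over the text."""
    return text.translate(table)


rules = load_rules(RULES_XML)
cleaned = massage("It\u2019s a \u2018test\u2019", rules)
```

Multi-character replacements would need a different mechanism (for example a chain of `str.replace` calls or a regular expression), which is part of what we are trying to weigh up.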
Is one method “better” than the other? Will the SQL method be faster? Are there things we can do more easily, or at all, in code that would be difficult or impossible in SQL?
Thanks in advance.
If there is a lot of data, I would consider using SQL, as this approach can be optimised to scale more efficiently once you understand the input data and the most common replacements or cleaning functions. If you perform the massaging in code, you will almost certainly need to take an iterative approach to the replacements, so the run time will grow in line with the data volume.
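As a rough sketch of the SQL route, the example below applies each rule to a whole column with one `UPDATE` statement instead of touching rows one at a time. It uses SQLite through Python only as a stand-in for whatever database you actually run. The table names `working` and `replacement_rules`, the column names, and the `clean_column` helper are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical "working" table plus a rules table that can be edited at any
# time without redeploying the import process.
conn.execute("CREATE TABLE working (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    "CREATE TABLE replacement_rules (find TEXT NOT NULL, replace_with TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO working (title) VALUES (?)",
    [("It\u2019s here",), ("plain",), ("\u2018quoted\u2019",)],
)
conn.executemany(
    "INSERT INTO replacement_rules (find, replace_with) VALUES (?, ?)",
    [("\u2019", "'"), ("\u2018", "'")],
)


def clean_column(conn, table, column):
    """Run one UPDATE per rule; each statement processes the whole column."""
    rules = conn.execute("SELECT find, replace_with FROM replacement_rules").fetchall()
    for find, replace_with in rules:
        # table and column are interpolated, so they must come from trusted code,
        # never from user input. The WHERE clause skips rows the rule cannot affect.
        conn.execute(
            f"UPDATE {table} SET {column} = REPLACE({column}, ?, ?) "
            f"WHERE INSTR({column}, ?) > 0",
            (find, replace_with, find),
        )
    conn.commit()


clean_column(conn, "working", "title")
titles = [row[0] for row in conn.execute("SELECT title FROM working ORDER BY id")]
```

Whether this beats a pass in code depends on your database engine and data volumes, so it is worth measuring both against a realistic import.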
If the amount of data to process is small enough that performance is not an issue, then doing the cleaning in code will probably give you greater flexibility.