I have a big fat query, built dynamically, that integrates some data. Basically it queries some tables, joins some others, transforms the data, and then inserts it into a final table.
The problem is that there’s a lot of data, and we can’t really trust the sources, because they may contain erroneous or inconsistent data.
For example, I spent almost an hour hunting down an error while developing against a customer’s database, because somewhere in the middle of my big fat query a varchar failed to convert to datetime. It turned out they had some sales dated ‘2009-02-29’, an out-of-range date.
And yes, I know: why was that stored as varchar? Well, the source database has three separate columns for dates: ‘Month’, ‘Day’ and ‘Year’. I have no idea why it’s like that, but it is.
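One way to spot those impossible dates before any conversion blows up is `ISDATE`, which is available on SQL Server 2005 (`TRY_CONVERT` only arrived in 2012). A minimal sketch, assuming a hypothetical source table `dbo.SourceSales` with a `SaleID` key and the varchar `Year`/`Month`/`Day` columns:

```sql
-- ISDATE is sensitive to the session date format, so pin it down first;
-- with ymd, a 'yyyy-mm-dd' string is interpreted unambiguously.
SET DATEFORMAT ymd;

-- Find every row whose assembled date is not a real date
-- (e.g. '2009-02-29'), instead of letting CONVERT fail mid-query.
SELECT SaleID, [Year], [Month], [Day]
FROM dbo.SourceSales
WHERE ISDATE([Year] + '-' + [Month] + '-' + [Day]) = 0;
```

Running this as a pre-check returns exactly the records the user would need to correct.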
But how the hell would I handle that, if the source isn’t trustworthy?
I can’t just swallow the exceptions; I really need the error to bubble up to the next level with its original message. But I’d like to attach some extra info, so that the user can at least try to fix the problem before calling us.
So I thought about showing the user the row number, or some ID that would at least give him an idea of which record to correct. That’s also hard, because sometimes the integration will process up to 80,000 records.
And in an 80,000-record integration, a bare error message like ‘The conversion of a varchar data type to a datetime data type resulted in an out-of-range datetime value’ means nothing at all.
So any ideas would be appreciated.
Oh, I’m using SQL Server 2005 with Service Pack 3.
EDIT:
OK, so from what I’ve read in the answers, the best thing to do is to check each column that could be critical in raising errors, and when one fails the check, raise an error myself with the most descriptive message I can, adding information stored beforehand in a separate table or in some variables, for example the ID of the offending row or some other identifying information.
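That check-then-raise approach can be sketched like this on SQL Server 2005 (table and column names such as `dbo.SourceSales` and `SaleID` are placeholders for whatever the real schema uses):

```sql
SET DATEFORMAT ymd;

-- Collect the IDs of every row that would break the datetime conversion
-- into one comma-separated string.
DECLARE @badIds varchar(4000);

SELECT @badIds = COALESCE(@badIds + ', ', '') + CAST(SaleID AS varchar(10))
FROM dbo.SourceSales
WHERE ISDATE([Year] + '-' + [Month] + '-' + [Day]) = 0;

-- Raise one descriptive, user-actionable error instead of the generic
-- conversion message; severity 16 makes it a normal user error that
-- still propagates to the caller.
IF @badIds IS NOT NULL
    RAISERROR('Invalid sale dates in record(s): %s. Please correct these records and run the integration again.',
              16, 1, @badIds);
```

With 80,000 records the concatenated list can overflow the 4,000-character buffer, so in practice you’d likely cap it (e.g. report the first few IDs and a total count) or log the full list to a separate error table.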
This sounds like a standard ETL issue: Extract, Transform, and Load. (Unless you have to run this query over and over again against the same set of data, in which case you’d pretty much do the same thing, only over and over again. So how critical is performance?)
What kind of error handling and/or “reporting of bad data” are you allowed to provide? If you have everything as “one big fat query”, your options become very limited — either the query works or it doesn’t, and if it doesn’t I’m guessing you get at best one RAISERROR message to tell the caller what’s what.
In a situation like this, the general framework I’d try to set up is:

1. Load the raw source data into a staging table, with every suspect column typed loosely (varchar).
2. Validate the staged rows, flagging anything that won’t convert or is otherwise inconsistent, and keep the IDs of the failures.
3. Transform and load only the valid rows into the final table.
4. Report the flagged rows, and why they failed, back to the caller.
Done this way, you should always be able to return (or store) a valid data set… even if it is empty. The trick will be in determining when the routine fails — when is the data too corrupt to process and produce the desired results, so you return a properly worded error message instead?
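A minimal sketch of that staging-and-validation approach, with all object names (`dbo.SourceSales`, `dbo.FinalSales`, `SaleID`) as placeholder assumptions:

```sql
SET DATEFORMAT ymd;

-- 1) Stage the raw data with loose types plus validation bookkeeping.
CREATE TABLE #Staging (
    SaleID   int,
    [Year]   varchar(4),
    [Month]  varchar(2),
    [Day]    varchar(2),
    IsValid  bit DEFAULT 1,
    ErrorMsg varchar(255) NULL
);

INSERT INTO #Staging (SaleID, [Year], [Month], [Day])
SELECT SaleID, [Year], [Month], [Day]
FROM dbo.SourceSales;

-- 2) Flag rows that would break the conversion instead of letting
--    them fail somewhere in the middle of the big query.
UPDATE #Staging
SET IsValid = 0,
    ErrorMsg = 'Out-of-range or malformed date'
WHERE ISDATE([Year] + '-' + [Month] + '-' + [Day]) = 0;

-- 3) Transform and load only the valid rows into the final table.
INSERT INTO dbo.FinalSales (SaleID, SaleDate)
SELECT SaleID,
       CONVERT(datetime, [Year] + '-' + [Month] + '-' + [Day])
FROM #Staging
WHERE IsValid = 1;

-- 4) Report the rejects; an empty result set means a clean run.
SELECT SaleID, ErrorMsg
FROM #Staging
WHERE IsValid = 0;
```

The key design choice is that validation happens as data, not as exceptions: the final table always receives a consistent (possibly smaller) data set, and the reject report gives the user the exact record IDs to fix.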