A dataset I receive for routine refresh purposes contains a date field that’s actually VARCHAR.
As this will be an indexed/searched field, I’m left with…
1) Converting the field to DATETIME and validating and normalizing the data values when refreshing
or…
2) Leaving the data as-is and forming my queries to accommodate various valid date formats, i.e.,
WHERE DateField = ‘CCYYMMDD’ OR DateField = ‘MM/DD/CCYY’ OR ….
The refresh would be on a monthly basis; “cleaning” the data would add about 35% time to the ETL cycle. My queries on the date field would all be equalities; I do not need to range search.
Also, I’m a one man shop, so the more hands-off the overall solution the better.
So which scenario am I better off doing? All opinions appreciated.
I think this is a great question. Here’s my opinion:
I’m a big believer in the idea that in the long run you’ll save more time and have fewer headaches by using data types for the purpose for which they were intended. That means dates in date fields, characters in character fields, etc. If you go with option 2 you’ll need to remember to code for all the various possible date formats every time you query the table. If you set this down and come back a year from now, are you going to remember?
By contrast, if you use a date field and do the upfront work in the ETL process of dealing with the dates properly, you will always know just how to interact with the field. And I’m not even going into performance implications.
And in this case, I’m not sure you’ll even see a short-term benefit. If there are, for example 5 different possible date formats in the source data, you’ll need to account for those one way or another; either in the ETL or in the output queries. The code to transform those 5 formats in ETL is not materially more complicated than the code to manage those 5 formats in the output queries.
And if the data could literally arrive in an infinite number of formats, you have big problems either way. Either your ETL will break or your queries will break. It is, to a certain extent, an irreducible complexity.
I would suggest that you take the time to code the proper transforms into your ETL. But do yourself a favor and code a preprocessing step that identifies dates in formats that won’t properly transform and alerts you to them. If you see patterns; i.e., if any format shows up more than once, code a transform for it. Over time you’ll be left manually cleaning fewer and fewer of those nasty dates. With luck, your 35% will drop to 5% or less.
Good luck!