Here’s the deal; the issue isn’t with getting the CSV into SQL Server, it’s getting it to work how I want it… which I guess is always the issue 🙂
I have a CSV file with columns like: DATE, TIME, BARCODE, etc. I use a derived column transformation to concatenate the DATE and TIME into a DATETIME for my import into SQL Server, and I import all data into the database. The issue is that we only get a new .CSV file every 12 hours, and for example's sake we will say the .CSV is updated four times in a minute.
If we run the job every 15 minutes, we will get a ton of overlapping data. I imagine I will use a variable, say LastCollectedTime, which can be pulled from my SQL database using MAX(READTIME). My problem is that I only want to collect rows with a readtime more recent than that variable.
Destination table structure:
ID, ReadTime, SubID, ...datacolumns..., LastModifiedTime where LastModifiedTime has a default value of GETDATE() on the last insert.
Any ideas? Remember, our readtime is a Derived Column, not sure if it matters or not.
Here is one approach that you can make use of:
Let’s assume that your destination table in SQL Server is named BarcodeData.
1. Create a staging table (say BarcodeStaging) in your database that has the same column structure as your destination table BarcodeData, into which the CSV data is imported.
2. In the SSIS package, add an Execute SQL Task before the Data Flow Task to truncate the staging table BarcodeStaging.
3. Import the CSV data into the staging table BarcodeStaging and not into the actual destination table.
4. Use the MERGE statement (I assume that you are using SQL Server 2008 or a higher version) to compare the staging table BarcodeStaging and the actual destination table BarcodeData, using the DateTime column as the join key. If there are unmatched rows, then copy the rows from the staging table and insert them into the destination table.
TechNet link to the MERGE statement: http://technet.microsoft.com/en-us/library/bb510625.aspx
Hope that helps.
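A rough T-SQL sketch of the staging table, the truncate step and the MERGE step might look like the following. The dbo schema, the Barcode column and all data types are assumptions for illustration, since the question only lists ID, ReadTime, SubID and LastModifiedTime. It also assumes that ID is an identity column and that LastModifiedTime is filled by its GETDATE() default, so neither appears in the insert list.

```sql
-- Hypothetical column names and types; adjust them to the real BarcodeData schema.

-- One-time setup: staging table holding the columns loaded from the CSV.
CREATE TABLE dbo.BarcodeStaging (
    ReadTime DATETIME    NOT NULL,  -- DATE + TIME combined by the derived column
    SubID    INT         NOT NULL,
    Barcode  VARCHAR(50) NOT NULL
);

-- Statement for the Execute SQL Task that runs before the Data Flow Task.
TRUNCATE TABLE dbo.BarcodeStaging;

-- Statement for an Execute SQL Task that runs after the Data Flow Task.
-- Rows whose ReadTime is not yet present in BarcodeData are inserted;
-- rows that already match are left untouched.
MERGE dbo.BarcodeData AS d
USING dbo.BarcodeStaging AS s
    ON d.ReadTime = s.ReadTime
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ReadTime, SubID, Barcode)
    VALUES (s.ReadTime, s.SubID, s.Barcode);
```

If several rows can share the same ReadTime, the ON clause would need more columns (for example SubID) to identify a row, otherwise new rows with an already-seen timestamp would be skipped.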