I am looking for options to efficiently populate a relational SQL Server database from XML files. I picture a three-step process:
- Read the XML from a public URL
- Populate a SQL database whose schema is similar to the XML schema
- Populate the target relational SQL database
I am not sure whether mapping the XML directly to the target database (i.e. skipping step 2) is easily achievable, but my feeling is that it would make the process somewhat more complicated.
The XML would be read from a public URL, something like http://www.abc.com/xmlfeed.xml, so a nightly routine is needed to fetch the file and make it available for processing. Would something like Windows Task Scheduler do, or is there a better way?
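The nightly fetch could be sketched as a small script that any scheduler can invoke. This is only an illustration: the function name `download_feed` and the local file name are assumptions, and the URL is the placeholder from the question. The file is written under a temporary name first, so a failed download never leaves a half-written feed for the import step.

```python
import shutil
import urllib.request
from pathlib import Path


def download_feed(url: str, target: Path, timeout: float = 30.0) -> Path:
    """Download the XML feed at `url` to `target` and return the path.

    `download_feed` is a hypothetical helper, not part of any library.
    """
    # Write to a temporary ".part" file, then rename it into place.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url, timeout=timeout) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

A scheduled job would then call something like `download_feed("http://www.abc.com/xmlfeed.xml", Path("xmlfeed.xml"))` before the import runs.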
I have only two days to make this work, so I would prefer anything that is quick to implement with little coding effort. However, the method does need to be maintainable, as I will receive new XML data every day with the same schema. If the schema changes slightly, I would like tweaking the routine to be hassle-free.
I thought that migrating legacy data to SQL Server would take only a few minutes, given how common the requirement is, but to my surprise there is very little discussion or comparison of the different XML migration techniques on the internet. I am really unsure which route to take: a pure SQL Server solution like SSIS, or something based on XML parsers.
As I read through your post, my very first idea was SSIS, and at the end you mentioned it yourself. I recommend it, especially if you are familiar with it. You can implement such a solution in two days.
After you have implemented the ETL process, you can create a SQL Server Agent job that schedules your SSIS package to run at the time you want. It supports running packages stored in SQL Server or on the file system.
EDIT
Regarding your example: it is fully possible to implement such a solution in SSIS. Here are some screenshots of a sample project that processes your XML structure.
The first image shows that the SSIS package consists of three control flow steps, each of which is a Data Flow Task. It processes the manufacturers first, then the models, then the cars.
I implemented only the manufacturers part. This is shown in images #2 and #3. (They overlap a little.) In #2 I read the XML content (XML Source), aggregate it by manufacturer (Aggregate), and then sort the result by manufacturer name (Sort). On the other side I read the manufacturers that already exist in the SQL database (OLE DB Source) and sort them as well, because the Merge Join in the next step requires both of its inputs to be sorted.
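For readers who want to see the same logic outside SSIS, the read, aggregate, and sort steps can be sketched in Python. The XML layout used here (`<cars>` containing `<car>` elements with `<manufacturer>` and `<model>` children) is an assumption made for illustration, since the real feed structure is not reproduced in this post, and `distinct_manufacturers` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def distinct_manufacturers(xml_text: str) -> list[str]:
    """Return the distinct manufacturer names in the feed, sorted by name."""
    root = ET.fromstring(xml_text.strip())
    # A set plays the role of the aggregation step (one entry per manufacturer).
    names = {
        (car.findtext("manufacturer") or "").strip()
        for car in root.iter("car")
    }
    names.discard("")  # ignore cars without a manufacturer element
    return sorted(names)  # the sort step


# Hypothetical sample feed, not the real one.
sample = """
<cars>
  <car><manufacturer>Ford</manufacturer><model>Focus</model></car>
  <car><manufacturer>Audi</manufacturer><model>A4</model></car>
  <car><manufacturer>Ford</manufacturer><model>Fiesta</model></car>
</cars>
"""
print(distinct_manufacturers(sample))  # ['Audi', 'Ford']
```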
After that, the two sources are combined (Merge Join) with a join operation similar to one in SQL. In this case it is a FULL OUTER JOIN, so you can work out which manufacturers are new (present only in the XML) and which should be deleted (present only in the database). I split the records into two groups according to these two conditions (new, deleted).
Finally I add the new manufacturers through an OLE DB Destination, and delete the missing manufacturers with an OLE DB Command. In the latter case I assume there is a stored procedure in SQL (called DeleteManufacturer(@ManufacturerName)) which deletes the manufacturer and all attached models and cars (a cascade delete).
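The full outer merge join and the two-way split can be illustrated with a short Python sketch. It walks two sorted, duplicate-free lists in step, which is also why both inputs had to be sorted first. The function name and the sample names are made up for the example; in the real package the "new" list would feed the insert and the "deleted" list would feed the DeleteManufacturer call.

```python
def split_new_and_deleted(feed_names, db_names):
    """Full outer merge join of two sorted, duplicate-free name lists.

    Returns (new, deleted): names only in the feed, names only in the database.
    """
    new, deleted = [], []
    i = j = 0
    while i < len(feed_names) or j < len(db_names):
        if j == len(db_names) or (i < len(feed_names) and feed_names[i] < db_names[j]):
            new.append(feed_names[i])  # only in the feed: insert it
            i += 1
        elif i == len(feed_names) or db_names[j] < feed_names[i]:
            deleted.append(db_names[j])  # only in the database: delete it
            j += 1
        else:
            i += 1  # present on both sides: nothing to do
            j += 1
    return new, deleted


print(split_new_and_deleted(["Audi", "Ford", "Tesla"], ["Audi", "Saab"]))
# (['Ford', 'Tesla'], ['Saab'])
```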
The other two data flow tasks should be implemented in the same way. If you also need to update the matching records, the Conditional Split must have three conditions, with a new branch attached to the third one. There, another OLE DB Command can be used with an UPDATE statement.
As I wrote previously, once the package is ready, create a SQL Server Agent job that runs it at night (or at whatever time you wish).
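The three-condition variant can be sketched as follows, assuming each record is keyed by name and carries some attributes that may change between feeds. The keys, the `price` attribute, and the function name `classify` are illustrative assumptions, not taken from the actual feed.

```python
def classify(feed: dict, db: dict):
    """Split records into inserts, updates and deletes, keyed by name.

    `feed` and `db` map a key (e.g. a model name) to its attributes.
    """
    # Only in the feed: new rows to insert.
    to_insert = {k: feed[k] for k in feed.keys() - db.keys()}
    # Only in the database: rows to delete.
    to_delete = sorted(db.keys() - feed.keys())
    # On both sides but with different attributes: rows to update.
    to_update = {k: feed[k] for k in feed.keys() & db.keys() if feed[k] != db[k]}
    return to_insert, to_update, to_delete


# Hypothetical data for illustration only.
feed = {"Focus": {"price": 21000}, "Fiesta": {"price": 15000}}
db = {"Focus": {"price": 20000}, "Ka": {"price": 9000}}
print(classify(feed, db))
```

Rows that match and are unchanged fall into none of the three groups, so they cause no database writes.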