I’m creating a CLR user-defined function in SQL Server 2005 to clean up the data in a large number of database tables.
The task is to remove almost all HTML tags except links ('a' tags and their 'href' attributes). So I divided the problem into two stages: (1) creating a user-defined SQL Server function, and (2) creating a SQL Server script that updates all the involved tables by calling the CLR function.
For the user-defined function, given the restricted environment, I prefer to use only native libraries. That means not using the Html Agility Pack, for example.
In JavaScript, this regular expression apparently does the right job:
<\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>
At least, according to http://www.pagecolumn.com/tool/regtest.htm
But I don’t know how to translate that (especially the capturing groups) into C# code so that I can use the captured text as part of the output.
For instance, if the input is: <a href="http://example.com">some text</a>
how can I save the text "http://example.com" and "some text" as part of the output in C# code, while at the same time stripping any other HTML tags (and their content)?
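For reference, here is a minimal sketch of how capturing groups can be read in C# with `System.Text.RegularExpressions`. It is not a full HTML parser, and it uses a slightly tightened variant of the pattern above (an assumption on my part: the href value is quoted, and the match is lazy so that two links on one line are not merged). The names `LinkExtractor` and `ExtractLinks` are hypothetical, and wrapping the method as a SQL CLR function is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Group 1 = href value (assumed to be quoted), group 2 = anchor content.
    // In a verbatim string, "" stands for a single double-quote character.
    private static readonly Regex AnchorPattern = new Regex(
        @"<\s*a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""'][^>]*>(.*?)<\s*/\s*a\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Used to drop tags nested inside the anchor text, e.g. <b>...</b>.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Returns one (href, text) pair per link; everything outside links is ignored.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        List<KeyValuePair<string, string>> links =
            new List<KeyValuePair<string, string>>();

        foreach (Match m in AnchorPattern.Matches(html ?? string.Empty))
        {
            string href = m.Groups[1].Value;
            string text = AnyTag.Replace(m.Groups[2].Value, string.Empty).Trim();
            links.Add(new KeyValuePair<string, string>(href, text));
        }
        return links;
    }
}
```

With the example input, `LinkExtractor.ExtractLinks("<a href=\"http://example.com\">some text</a>")` should return a single pair whose `Key` is `http://example.com` and whose `Value` is `some text`. Unquoted href values, or URLs that themselves contain a quote character, are not handled by this sketch.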
In the end, I wrote a separate .NET console program that combines HtmlAgilityPack (HAP) with querying SQL Server directly. In the program I used a naive regular expression to isolate the fragments, retrieved the href values and anchor texts with HAP, and then composed the final output, stripping out all characters except letters, numbers, and some punctuation.
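The last step, keeping only letters, numbers, and some punctuation, could look roughly like the sketch below. The exact whitelist is a guess, and `TextCleaner` and `Clean` are hypothetical names rather than the code from the actual program.

```csharp
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Assumed whitelist: Unicode letters, digits, whitespace and a few punctuation marks.
    // Anything not listed in the character class is removed.
    private static readonly Regex Disallowed =
        new Regex(@"[^\p{L}\p{N}\s.,;:!?'""()/\-]");

    // Collapses runs of whitespace left behind by the removal.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string Clean(string input)
    {
        string kept = Disallowed.Replace(input ?? string.Empty, string.Empty);
        return Whitespace.Replace(kept, " ").Trim();
    }
}
```

Note that this whitelist would also drop characters such as `=` and `&`, so it would need to be widened if href values with query strings have to survive unchanged.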