I have a table with a column called ExcelLinks that contain records like this:
=INDEX(‘\\san1\engData[BT_500.0_Structural_Position.xls]Concrete’!$B$4:$IK$83,MATCH($K$9,’\\san1\engData[BT_500.0_Structural_Position.xls]Concrete’!$A$4:$A$83,0),MATCH(C212,’\\san1\engData[BT_500.0_Structural_Position.xls]Concrete’!$B$3:$IK$3,0))/1000000
=INDEX(‘\\san1\engData[GK_600.0_Pumps.xls]Pumps’!$B$4:$BD$39,MATCH($K$9,’\\san1\engData[TT_640.0_Generator.xls]Generator’!$A$4:$A$39,0),MATCH(C214,’\\san1\engData[GK_600.0_Pumps.xls]Pumps’!$B$3:$BD$3,0))/1000000
=INDEX(‘\\san1\engData[TT_640.0_Generator.xls]Generator’!$B$4:$HU$83,MATCH($K$9,’\\san1\engData[GK_600.0_Pumps.xls]Pumps’!$A$4:$A$83,0),MATCH(C218,’\\san1\engData[TT_640.0_Generator.xls]Generator’!$B$3:$HU$3,0))/1000000
The ideal output would be:
_______________________________________
| Row | LinkCount | UniqueLinkCount |
| 1 | 3 | 1 |
| 2 | 3 | 2 |
| 3 | 3 | 2 |
I want to query this data and see the number of files and unique files used per record.
I did a search online and couldn’t find anything that does this.
I’m thinking I’ll make a cursor and for each record I’ll detect chars starting with \\ and ending with '!$ and count the number of files.
The hard bit is the ExcelLinks with the =INDEX and MATCH functions that use multiple interlinks (that could be different files).
There’s over 12 million records in this table so I am concerned about the performance using a cursor.
There are some better ways to do this with Oracle using RegEx’s. I know that SQL Server doesn’t have RegEx and am willing to write/use a CLR stored proc if that’s the easiest option.
First, grab this string splitting CLR function from Adam Machanic. Compile the code into a DLL (using csc if you don’t have Visual Studio), copy the DLL to your server, and then register the DLL as follows (you’ll have to replace some variable parts here, such as the file path, what you want to call the assembly, etc.):
With that in place, the query itself is quite easy. Let’s create a simple table variable holding a few rows (I shortened the paths for brevity):
Results:
This relies on
\\not appearing naturally in the data other than as the beginning of a file path, and that all file paths reside on a network share.This is probably not the most efficient you can get – I’m sure some RegEx wizard can improve this using that approach instead of splitting (here is a good article to get you started), but that’s not my forte. A large portion of the cost is going to be the I/O required to scan the entire table, rather than the counting or the replacing.
If you can’t use CLR, you can substitute that function for any number of non-CLR versions (here is an example that would be a functionally suitable replacement), but keep in mind other approaches will likely suffer from less optimal performance.