I need to test if a string (filenames with their complete path) contains another one in MSSQL.
My script needs to check if the file we are about to commit is present in the database under a specific column (pre-hook script).
I cannot really change the data definition of the column, but we are currently using a TEXT column and the files are separated by a newline character. I tried to use the T-SQL function CONTAINS, but the overall performance is not really good.
Would it be a better idea to load all the data into a PHP array and do the comparison locally?
I'm not quite sure what the best way to do this is.
Update: There are about 194 530 rows in the database.
The main thing to keep in mind when doing a search through a string is that you want to limit the length of the string you are searching through. Right now, you have multiple path+filename values tucked into a single row-column pair – as I’ve mentioned above, this is poorly normalized (and is part of the reason you’re having trouble doing lookups).
Given that you can’t really change the schema of the table you’re having trouble with, a better alternative might be creating a structure to work with the metadata that describes the files stored within a certain row.
For example, one option might be to create a table that contains filename-rowID pairs, where each row of the original table is linked to the parsed-out filenames within the TEXT column of that row. That gives you the option of limiting your search by first doing a lookup on a shorter string (the filename), and then using that constraint to help search a smaller number of rows to satisfy the path+filename combination and achieve a unique result.
If you have a large number of files with identical names, another option might be to implement a hash index, using rowIDs from your original table and a hash of each path+filename from that row using CHECKSUM() or whatever hashing function you have available.
Using an ‘indexing’ table like this one does add overhead: you have to maintain the metadata as the original table gets updated, but it also means you’re doing your heavy lifting ahead of time and making future queries of the data much faster.
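To make the idea concrete, here is a rough sketch of such an indexing table. It is only an illustration under several assumptions: it uses Python with an in-memory SQLite database rather than SQL Server so that it can be run as-is, the table and column names (commits, id, files, file_index) are made up, and zlib.crc32 stands in for CHECKSUM(). The same shape should carry over to T-SQL, but treat it as a starting point rather than a finished implementation.

```python
import sqlite3
import zlib


def path_hash(path: str) -> int:
    """Cheap 32-bit hash of a full path (a stand-in for CHECKSUM())."""
    return zlib.crc32(path.encode("utf-8"))


def demo_connection() -> sqlite3.Connection:
    """Build a tiny, hypothetical version of the original table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE commits (id INTEGER PRIMARY KEY, files TEXT)")
    conn.executemany(
        "INSERT INTO commits (id, files) VALUES (?, ?)",
        [
            (1, "/repo/src/main.php\n/repo/README.md"),
            (2, "/repo/src/main.php\r\n/repo/lib/util.php\n"),
        ],
    )
    return conn


def build_index(conn: sqlite3.Connection) -> None:
    """(Re)build the side table from the newline-separated TEXT column."""
    conn.execute("DROP TABLE IF EXISTS file_index")
    conn.execute(
        "CREATE TABLE file_index ("
        " row_id INTEGER NOT NULL,"
        " filename TEXT NOT NULL,"
        " full_path TEXT NOT NULL,"
        " path_hash INTEGER NOT NULL)"
    )
    rows = conn.execute("SELECT id, files FROM commits").fetchall()
    for row_id, files in rows:
        for path in files.splitlines():
            path = path.strip()
            if not path:
                continue  # skip blank lines left by trailing newlines
            # Last path component, accepting either slash style.
            name = path.replace("\\", "/").rsplit("/", 1)[-1]
            conn.execute(
                "INSERT INTO file_index VALUES (?, ?, ?, ?)",
                (row_id, name, path, path_hash(path)),
            )
    conn.execute("CREATE INDEX ix_file_index_filename ON file_index (filename)")
    conn.execute("CREATE INDEX ix_file_index_hash ON file_index (path_hash)")
    conn.commit()


def find_rows_by_name(conn: sqlite3.Connection, path: str) -> list:
    """Narrow by the short filename first, then confirm the full path."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    cur = conn.execute(
        "SELECT DISTINCT row_id FROM file_index"
        " WHERE filename = ? AND full_path = ? ORDER BY row_id",
        (name, path),
    )
    return [r[0] for r in cur]


def find_rows(conn: sqlite3.Connection, path: str) -> list:
    """Narrow by hash, then compare the full path to rule out collisions."""
    cur = conn.execute(
        "SELECT DISTINCT row_id FROM file_index"
        " WHERE path_hash = ? AND full_path = ? ORDER BY row_id",
        (path_hash(path), path),
    )
    return [r[0] for r in cur]
```

In a pre-commit hook you would call something like find_rows(conn, path_being_committed) and treat a non-empty result as "already present". Both lookups compare whole paths for equality, so a partial path does not match by accident the way a substring search could. The part this sketch leaves out is keeping file_index in sync: here it is simply rebuilt from scratch, whereas in practice you would probably update it whenever the original table changes.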