What is the best way to store and index URLs in SQL Server 2005?
I have a WebPage table that stores metadata and content about web pages. I also have many other tables related to the WebPage table. They all use the URL as a key.
The problem is that URLs can be very long, and using them as a key makes the indexes larger and slower. How much, I don’t know, but I have read many times that indexing large fields is to be avoided. Assuming a URL is nvarchar(400), that is up to 800 bytes per key value, which is an enormous field to use as a primary key.
What are the alternatives?
How much pain is there likely to be in using the URL as a key instead of a smaller field?
I have looked into giving the WebPage table an identity column and using that as the primary key for a WebPage. This keeps all the associated indexes smaller and more efficient, but it makes importing data a bit of a pain. Each import for the associated tables first has to look up the id of a URL before inserting data into the tables.
I have also played around with using a hash of the URL to create a smaller index, but I am still not sure it is the best way of doing things. It wouldn’t be a unique index, and it would be subject to a small number of collisions. So I am unsure what foreign key would be used in this case…
There will be millions of records about web pages stored in the database, and there will be a lot of batch updating. There will also be quite a lot of activity reading and aggregating the data.
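For concreteness, the hash idea might look something like the sketch below. It assumes a `dbo.WebPage` table with an identity column `Id` and a `Url nvarchar(400)` column; the column name `UrlHash`, the index name, and the sample URL are hypothetical. Because `CHECKSUM` values can collide, the lookup still compares the full URL, and the identity column rather than the hash would remain the key that other tables reference.

```sql
-- Assumption: dbo.WebPage(Id INT IDENTITY, Url NVARCHAR(400)) already exists.
-- Add a 4-byte computed hash of the URL (UrlHash is a hypothetical name).
ALTER TABLE dbo.WebPage ADD UrlHash AS CHECKSUM(Url);
GO

-- Non-unique index on the small hash instead of the wide URL column.
CREATE NONCLUSTERED INDEX IX_WebPage_UrlHash ON dbo.WebPage (UrlHash);
GO

-- Lookup: seek on the hash, then compare the full URL to weed out collisions.
DECLARE @Url NVARCHAR(400);
SET @Url = N'http://example.com/';

SELECT Id
FROM dbo.WebPage
WHERE UrlHash = CHECKSUM(@Url)
  AND Url = @Url;
```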
Any thoughts?
I’d use a normal identity column as the primary key. You say that this makes importing data a bit of a pain.
Yes, but the pain is probably worth it, and the techniques you learn in the process will be invaluable on future projects.
On SQL Server 2005, you can create a user-defined function GetUrlId that looks something like
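A minimal sketch of such a function, assuming the URLs live in a table `dbo.WebPage` with an identity column `Id` and a `Url nvarchar(400)` column (both the table layout and the column names are assumptions, so adjust them to your schema):

```sql
-- Assumed schema (hypothetical names):
--   dbo.WebPage(Id INT IDENTITY PRIMARY KEY, Url NVARCHAR(400) NOT NULL UNIQUE)
CREATE FUNCTION dbo.GetUrlId (@Url NVARCHAR(400))
RETURNS INT
AS
BEGIN
    DECLARE @Id INT;

    -- @Id keeps its initial NULL if no row matches the URL.
    SELECT @Id = Id
    FROM dbo.WebPage
    WHERE Url = @Url;

    RETURN @Id;
END
GO
```

The lookup is only cheap if `Url` itself is indexed, for example through a unique constraint. That index is as wide as the URL, but it exists once on WebPage rather than in every related table.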
This will return the ID for URLs already in your URL table, and NULL for any URL not already recorded. You can then call this function inline in your import statements – something like
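As an illustration, an import that calls `dbo.GetUrlId` inline might be sketched as follows. The staging table `dbo.PageLinkImport`, the destination table `dbo.PageLink`, and their columns are hypothetical names chosen for this example:

```sql
-- Hypothetical tables:
--   dbo.PageLinkImport(Url, LinkText)  staging table holding raw URLs
--   dbo.PageLink(WebPageId, LinkText)  destination keyed by WebPage id
INSERT INTO dbo.PageLink (WebPageId, LinkText)
SELECT dbo.GetUrlId(s.Url), s.LinkText
FROM dbo.PageLinkImport AS s
WHERE dbo.GetUrlId(s.Url) IS NOT NULL;  -- skip URLs not yet recorded
```

The scalar function is evaluated for each staged row, which is where most of the extra cost comes from.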
This is probably slower than a proper join statement, but for one-time or occasional import routines it might make things easier.
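For comparison, a set-based version of the same import might look like the sketch below. It reuses hypothetical names: a staging table `dbo.PageLinkImport(Url, LinkText)`, a destination `dbo.PageLink(WebPageId, LinkText)`, and `dbo.WebPage(Id, Url)` with `Id` as the identity column.

```sql
-- Step 1: record any staged URLs that WebPage does not know about yet.
INSERT INTO dbo.WebPage (Url)
SELECT DISTINCT s.Url
FROM dbo.PageLinkImport AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.WebPage AS w WHERE w.Url = s.Url);

-- Step 2: resolve every staged URL to its id with a single join.
INSERT INTO dbo.PageLink (WebPageId, LinkText)
SELECT w.Id, s.LinkText
FROM dbo.PageLinkImport AS s
INNER JOIN dbo.WebPage AS w
    ON w.Url = s.Url;
```

With millions of rows and frequent batch updates, resolving all ids in one join is likely to scale better than a per-row function call.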