I have a table with about 130,000 records containing telephone numbers. The numbers are all formatted like this: +4311234567. They always include the international country code, the local area code and then the phone number, sometimes followed by an extension.
There is a web service which looks up the caller's number in the table. That service already works. But now the client wants the service to also return a result when someone calls from a company whose number is already in the database, even if that caller's extension is not.
Example table:

| **id** | **telephonenumber** | **name** |
|---|---|---|
| 1 | +431234567 | company A |
| 2 | +431234567890 | employee in company A |
| 3 | +4398765432 | company b |
Now if somebody from company A calls with a different extension, for example +43123456777, then it should return id 1. The problem is that I don't know how many digits the extensions have. It could be 3, 4 or more digits.
Are there any patterns for this kind of string matching?
The data is stored in a SQL Server 2005 database.
Thanks
EDIT:
I am getting the telephone numbers from a CRM system. I've talked with the admin of the CRM and he is trying to send me the data in a different format:

| **id** | **telephonenumber** | **extension** | **name** |
|---|---|---|---|
| 1 | +431234567 | | company A |
| 2 | +431234567 | 890 | employee in company A |
| 3 | +4398765432 | | company b |
Given that the number of digits in the extension can differ for each company, and the number of digits in the base number can differ for each country and area code, this is a tricky problem to solve efficiently.
Even if you get the data table split into base number and extension, you still have to split the incoming number into base number and extension, which I actually think complicates things.
What I would be inclined to try is:
Original format
For example, searching for “+43123456777”:
The main failure mode of this approach is if a company has variable length extension numbers. For instance consider what happens if both 431234567890 and 43123456789 are valid numbers but only the second one is in the database. If the incoming number is 431234567890, then 43123456789 will be matched in error.
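A minimal sketch of how I read the original-format approach, assuming it means "strip one trailing digit at a time until an exact match is found". The `DIRECTORY` dict, the `find_by_stripping` name and the `MIN_LENGTH` cut-off are hypothetical stand-ins for the real table and lookup, not anything from the question:

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for the table: telephonenumber -> id.
DIRECTORY: Dict[str, int] = {
    "+431234567": 1,
    "+431234567890": 2,
    "+4398765432": 3,
}

# Assumed cut-off so that a bare country or area code can never match.
MIN_LENGTH = 6


def find_by_stripping(incoming: str, directory: Dict[str, int]) -> Optional[int]:
    """Drop trailing digits until the remainder is an exact entry."""
    candidate = incoming
    while len(candidate) >= MIN_LENGTH:
        if candidate in directory:      # one lookup per candidate length
            return directory[candidate]
        candidate = candidate[:-1]      # strip the last digit and retry
    return None


# The example from the question resolves to company A.
company_a = find_by_stripping("+43123456777", DIRECTORY)

# The failure mode: only the shorter number is stored, so the longer
# (also valid, but different) number is matched to it in error.
wrong_match = find_by_stripping("+431234567890", {"+43123456789": 9})
```

Each pass through the loop would be one database query in the real service, which is where the linear cost discussed under the implementation notes comes from.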
Split format
This is a little more complex, but more robust.
For example, searching for “+43123456777”:
Implementation notes
This algorithm, as noted above, does have some efficiency problems. If the database lookup is expensive, the cost grows linearly with the length of the telephone number, especially in the case where no similar numbers exist in the database (for example, if the incoming number is from Kazakhstan, but there are no Kazakhstan numbers in the database).
You could add some optimisations relatively easily though. If most of the companies you deal with use 3 or 4 digit extensions, you could start by stripping, say, 4 digits off the end and then doing a binary chop until you reach an answer. For a 15 digit number this would cut the work to 4 or 5 lookups in many cases, and at most 6.
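One way the binary chop could look, as a rough sketch rather than a definitive implementation. It assumes each lookup is a "does any stored number start with this prefix?" test (the equivalent of `LIKE 'prefix%'`), because that test is monotonic in the prefix length and so can be bisected. For simplicity it chops over the whole number instead of first stripping 4 digits. All names here (`has_prefix`, `longest_shared_prefix`, `find_by_binary_chop`, `MIN_LENGTH`) are hypothetical:

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple

MIN_LENGTH = 6  # assumed cut-off so a bare country/area code never matches


def has_prefix(sorted_numbers: List[str], prefix: str) -> bool:
    """One 'lookup': does any stored number start with prefix?"""
    i = bisect_left(sorted_numbers, prefix)
    return i < len(sorted_numbers) and sorted_numbers[i].startswith(prefix)


def longest_shared_prefix(sorted_numbers: List[str], incoming: str) -> Tuple[str, int]:
    """Binary search for the longest prefix of incoming that some stored
    number also starts with. Returns the prefix and the lookup count."""
    lookups = 0
    lo, hi = 0, len(incoming)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        lookups += 1
        if has_prefix(sorted_numbers, incoming[:mid]):
            lo = mid        # a match exists, so try a longer prefix
        else:
            hi = mid - 1    # no match, so the answer must be shorter
    return incoming[:lo], lookups


def find_by_binary_chop(directory: Dict[str, int], incoming: str) -> Optional[int]:
    prefix, _ = longest_shared_prefix(sorted(directory), incoming)
    # The shared prefix is usually the stored number itself; if not,
    # fall back to stripping digits one at a time from there.
    while len(prefix) >= MIN_LENGTH:
        if prefix in directory:
            return directory[prefix]
        prefix = prefix[:-1]
    return None


directory = {"+431234567": 1, "+431234567890": 2, "+4398765432": 3}
shared, lookups = longest_shared_prefix(sorted(directory), "+43123456777")
```

For a number with no close relatives in the table, such as the Kazakhstan case, the search collapses to a very short prefix after a handful of lookups instead of one lookup per digit.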
Also, every time you narrow the selection, you could select only within the previous selection rather than having to select within the whole database.
Additional implementation notes
Having finally worked out how Unreason's answer works, I can see that it is a much simpler, more elegant solution. I wish I'd thought of the simplicity of looking for the database number in the incoming number rather than the other way around.
My only concern is that performing this on every `telephonenumber` in the database might impose excessive demands on the server. I would suggest benchmarking that solution under maximum stress to see if it causes problems. If not, fine: use that. If it does, consider implementing the simple form of my algorithm and doing the stress tests again. If the performance is still too low, try my binary search suggestion.
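For reference, a hedged sketch of the "look for the database number in the incoming number" idea as I understand it. Unreason's answer is not reproduced here, so the query is my own reading of it. It uses an in-memory SQLite table as a stand-in for the real one, and the `phones` table name, the `find_caller` function and the "longest match wins" ordering are assumptions. On SQL Server 2005 the same query would need `+` for concatenation, `LEN` and `TOP 1` instead of `||`, `LENGTH` and `LIMIT 1`:

```python
import sqlite3
from typing import Optional

# In-memory stand-in for the real table; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phones (id INTEGER PRIMARY KEY, telephonenumber TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO phones VALUES (?, ?, ?)",
    [
        (1, "+431234567", "company A"),
        (2, "+431234567890", "employee in company A"),
        (3, "+4398765432", "company b"),
    ],
)


def find_caller(incoming: str) -> Optional[int]:
    """Return the id of the longest stored number that the incoming
    number starts with, or None if no stored number is a prefix of it."""
    row = conn.execute(
        "SELECT id FROM phones "
        "WHERE ? LIKE telephonenumber || '%' "          # stored number is a prefix
        "ORDER BY LENGTH(telephonenumber) DESC LIMIT 1",  # prefer the most specific
        (incoming,),
    ).fetchone()
    return row[0] if row else None
```

Because the `LIKE` pattern is built from the column rather than the parameter, an index on `telephonenumber` is unlikely to help, which is why the whole table may be scanned on every call and why benchmarking matters.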