I’m looking to create an Embodied Agent for handling search requests on a website. The agent needs to handle simple questions and provide a series of website links as an answer.
All the articles are in a database. Each article has a title field, and a series of tags to categorize the article.
At this point, my simple algorithm would be:
- Split the question up into a series of words.
- Remove all common words like “a”, “the”, “how”, etc.
- Create a “where” clause, searching the article body, article title, and tags for the remaining words.
- Display the list, possibly ranked with those articles with matches in the title first, tags second, and article body third.
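For illustration, the steps above could be sketched roughly as follows. This is only a minimal sketch under several assumptions: the table name `articles`, the columns `id`, `title`, `tags` and `body` (with tags flattened into one text column), the stop-word list, and the 3/2/1 weights are all hypothetical, and the `?` placeholders assume a driver with qmark-style parameters.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "how", "do", "i", "is", "to", "of", "in", "what", "for"}


def extract_terms(question):
    """Lowercase the question, split it into words, drop stop words and duplicates."""
    terms = []
    for word in re.findall(r"[a-z0-9]+", question.lower()):
        if word not in STOP_WORDS and word not in terms:
            terms.append(word)
    return terms


def build_query(terms):
    """Return (sql, params) for a parameterized, ranked LIKE search.

    Each term adds 3 points for a title match, 2 for a tag match and
    1 for a body match (assumed weights). Table and column names are assumptions.
    """
    score_parts, params = [], []
    for term in terms:
        pattern = f"%{term}%"
        score_parts.append(
            "(CASE WHEN title LIKE ? THEN 3 ELSE 0 END"
            " + CASE WHEN tags LIKE ? THEN 2 ELSE 0 END"
            " + CASE WHEN body LIKE ? THEN 1 ELSE 0 END)"
        )
        params.extend([pattern] * 3)
    score = " + ".join(score_parts) if score_parts else "0"
    sql = (
        "SELECT id, title, score "
        f"FROM (SELECT id, title, {score} AS score FROM articles) AS ranked "
        "WHERE score > 0 ORDER BY score DESC, id"
    )
    return sql, params
```

Because the weights are summed per term, an article matching several words in the body can outrank one matching a single word in the title; whether that is desirable is a judgment call.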
Is there a better algorithm for converting an English question into a SQL query? Are there specific details that should be tracked along with each article by the article author to further improve search results? Are there details that should be recorded over time while the search is in use to further improve search results?
UPDATE: The website will be running on IIS, with the latest ASP.NET. The backend database will be SQL Server.
There really isn’t an easy solution for true English query parsing. Most search engines simply eliminate noise words, as you’re proposing, and search for the remaining terms. If you’re using Microsoft SQL Server, you may want to look at Full-Text Search (SQL Server). If you can use SQL Server 2012 or later, you may also want to read Semantic Search (SQL Server). If you’re using MySQL, see 12.9. Full-Text Search Functions.
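As a rough sketch of what the SQL Server full-text route might look like, the remaining terms can be joined into a search condition and passed to `CONTAINSTABLE`, which returns a `KEY` and a `RANK` for each match. The table `articles`, its columns, and the `?` placeholder style are assumptions here, and the query presumes a full-text index already exists on the searched columns.

```python
def fulltext_condition(terms):
    """Join terms into a full-text search condition such as '"reset" OR "password"'.

    Double quotes are stripped from each term so it cannot break out of its quoting.
    """
    quoted = ['"{}"'.format(t.replace('"', "")) for t in terms if t.strip()]
    return " OR ".join(quoted)


# T-SQL sketch with hypothetical table and column names; the single
# parameter is the string produced by fulltext_condition().
FULLTEXT_SQL = """
SELECT a.id, a.title, ft.[RANK]
FROM CONTAINSTABLE(articles, (title, body), ?) AS ft
JOIN articles AS a ON a.id = ft.[KEY]
ORDER BY ft.[RANK] DESC;
"""
```

With this approach the ranking comes from the full-text engine rather than from hand-written weights, so the title/tags/body ordering from the question would need to be layered on separately if it still matters.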