I have a system that lets a user search for whatever they want and gathers content from different places into one page.
I restrict the search results by a keyword/label or a few keywords, so the user won’t get junk he never asked for.
I always stick to the main market/label theme (keyword) so the search doesn’t go wrong.
At the beginning all was fine, but as I got deeper into developing this system, I started to understand that I cannot predict or filter the content that will be retrieved.
The system is automatic, e.g. when you search for “Cristiano Ronaldo” I’d like to get his pictures, videos, tweets, news and other stuff.
When I construct a page out of all this, to improve my search engine optimization, I use the most frequently repeated words in the content to offer even more, in links like “See more”, or to generate more pages based on one user search.
I ran into a problem when the automatic content crawler started to bring back junk content.
I search for “virgin atlantic” and it brings me the airline information, which is what I want. Using parts of the content and keywords from that information, I go looking further, and it brings me Virginia, which is related, but not what I want.
Then it brings east/west, and then United States, and then it goes deeper and deeper in the wrong direction.
That was the background. My real question: are there any algorithms, theories or other material to read on this, and is it possible to recognize the theme/direction/meaning/relevance of content/keywords relative to the main theme I set up manually?
So if I say "go look only for sport-related content", it will not bring me news about Ronaldo’s new girlfriend, but his statistics, career data and things like that.
I don’t mind having a person filter the content manually and tell the AI ACCEPT/DECLINE, so it will learn what to bring and what not, according to the requested theme/pattern.
Would a neural network or any other AI algorithm work for recognizing content?
Short answer: have a look at hidden Markov models, Bayesian nets and semantic web research. One could fill whole libraries with research on this topic.
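To give a flavour of the ACCEPT/DECLINE idea from the question, here is a minimal sketch of a naive Bayes text filter. Note that this is naive Bayes, a much simpler relative of Bayesian nets, and not a full Bayesian network. The class name, the method names and the four training sentences are all made up for illustration; a real system would need far more labelled examples and better tokenization.

```python
import math
from collections import Counter

LABELS = ("ACCEPT", "DECLINE")

def tokenize(text):
    # Very crude tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()

class NaiveBayesFilter:
    """Hypothetical ACCEPT/DECLINE filter trained from human-labelled content."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        # A human reviewer supplies the label for each piece of content.
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        # Assumes at least one training example per label.
        vocab = set(self.word_counts["ACCEPT"]) | set(self.word_counts["DECLINE"])
        total_words = sum(self.word_counts[label].values())
        # Log prior: how often this label was given so far.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            count = self.word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        return score

    def classify(self, text):
        return max(LABELS, key=lambda label: self.log_score(text, label))

# Toy training data, invented for this example.
nb = NaiveBayesFilter()
nb.train("ronaldo scored two goals in the match", "ACCEPT")
nb.train("career statistics goals assists per season", "ACCEPT")
nb.train("ronaldo seen with new girlfriend at party", "DECLINE")
nb.train("celebrity gossip about his girlfriend", "DECLINE")

sport_label = nb.classify("goals and assists this season")     # ACCEPT
gossip_label = nb.classify("gossip about his new girlfriend")   # DECLINE
```

Every manual ACCEPT/DECLINE decision becomes one more call to `train`, so the filter improves as the reviewer keeps working. With only a handful of examples it will be wrong often.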
Long answer:
The problem with AI usually is that these types of problems are very, very hard. Yes, there is loads of theory. But implementing those theories is another thing. I’ve seen companies build some kind of engine that they are very proud of. But then they are usually too tool-focused, and forget what problem they actually want to solve. That is what I would call the AI black-box problem. You have an algorithm such as hidden Markov models, neural nets, Bayesian nets, Kalman filters, support vector machines, etc. Then you throw a bunch of data at it and it outputs a bunch of parameterized models. But often it is not possible to trace the internal state.
So if you want to solve the semantic web problem, you have picked one of the hardest problems there is. How do you tell the computer what you are looking for? Well, Google uses the link structure to retrieve information. Then there are the semantic web proponents, who say the content provider should add a bunch of metadata. I think this approach has largely failed. There are always new startups trying to do new things in this area. Palantir is perhaps one of those data mining companies coming through.
So I suggest you start by learning the basics using toy problems: pick up a textbook such as Russell/Norvig, take a class, which you can now do online (http://www.udacity.com/overview/Course/cs373/CourseRev/apr2012), and go from there. Nothing wrong with playing around with hard problems, but it is easy to get frustrated. Make sure that your problem is solvable in finite time and with finite resources. (I say that having worked for 5 years on an almost impossible problem myself.)
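As one possible toy problem to start from, you could try to catch the "virgin atlantic" to "Virginia" drift described in the question. The sketch below compares each candidate follow-up query with the original theme using cosine similarity over plain word counts, and drops candidates that fall below a threshold. The function names, the example texts and the threshold of 0.2 are assumptions picked for illustration, not tuned values.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Word-count vector; a crude stand-in for a proper text representation.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Cosine of the angle between two word-count vectors (0.0 = no shared words).
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def keep_on_theme(seed_text, candidates, threshold=0.2):
    # Always compare against the ORIGINAL theme, not the previous hop,
    # so errors do not accumulate as the crawler goes deeper.
    seed = bag_of_words(seed_text)
    return [c for c in candidates
            if cosine_similarity(seed, bag_of_words(c)) >= threshold]

# Invented example data.
seed_theme = "virgin atlantic airline flights london airport"
candidates = [
    "virgin atlantic flights from london",
    "virginia state east coast united states",
]
kept = keep_on_theme(seed_theme, candidates)
```

Here `kept` contains only the first candidate, because the second one shares no words with the seed theme. Plain word overlap will miss synonyms and related terms, which is exactly where the harder techniques mentioned above come in.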