Hopefully this makes sense. I have a PHP script, magpierss, that parses an RSS feed and inserts the data into MySQL, and it works fine. I get the various parts of the RSS items into variables to make them easier to work with, so getting the pieces of the RSS feed is not a problem.
However, my goal is to be able to have it filter the stories and only import certain ones. I want to automate this as much as possible, with some allowance for false positives/negatives because they will be manually verified after.
What I want to be able to do is set a list of keywords and ‘weights’ for each word. So when a new RSS item is parsed the script will create a ‘score’ based on the weights of the words in the description field.
For example:
stackoverflow = 10
very = 7
helpful = 8
So “stackoverflow very helpful” would get a score of 25.
“stackoverflow is always very helpful” would still get a score of 25, because ‘is’ and ‘always’ are not keywords with weights assigned to them.
And “something random here” would get a score of 0, for having no keywords.
Then I could play with keyword weights and scores to figure out the best settings for filtering the rss feed.
Most of this I can figure out. I just need to know a way to parse the description of the item, and assign weights to specified keywords to create a ‘score’.
PHP comes with some functions that would help, such as strpos() and preg_match(). The former searches for a literal substring, while preg_match() tests the text against a regular expression. You should create an array of the keywords and their weights, then loop through each one and check whether the description contains that keyword. If so, you add that keyword’s weight to a running score. Here’s a simple example:
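A minimal sketch of that loop, assuming PHP 7 or later. The `$weights` array and the `score_description()` helper are hypothetical names made up for illustration, not part of magpierss. It uses preg_match() with `\b` word boundaries rather than strpos(), so that a keyword like “very” does not also match inside “every”, and each keyword counts once however often it appears:

```php
<?php
// Hypothetical keyword => weight table; adjust the values to tune the filter.
$weights = [
    'stackoverflow' => 10,
    'very'          => 7,
    'helpful'       => 8,
];

/**
 * Sum the weights of every keyword found in $description.
 * A keyword contributes its weight once, regardless of how many times it occurs.
 */
function score_description(string $description, array $weights): int
{
    $score = 0;
    foreach ($weights as $keyword => $weight) {
        // preg_quote() escapes any regex metacharacters in the keyword.
        // \b anchors the match to whole words; the i flag ignores case.
        $pattern = '/\b' . preg_quote((string) $keyword, '/') . '\b/i';
        if (preg_match($pattern, $description) === 1) {
            $score += $weight;
        }
    }
    return $score;
}

echo score_description('stackoverflow very helpful', $weights), "\n";           // 25
echo score_description('stackoverflow is always very helpful', $weights), "\n"; // 25
echo score_description('something random here', $weights), "\n";                // 0
```

To filter the feed, you would compare the returned score against a cutoff of your choosing and only insert the items that reach it; the right cutoff is something to find by experiment.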
Something like that anyway. There are other ways to do it, but this should get you started.
Good luck.