Any examples, tips, or guidance for the following scenario?
I have retrieved updates from several different news websites. I then want to analyse that information to predict current trends in the world.
When searching for this idea, I could only find information on data mining, but that is aimed at database systems. While data mining is similar to what I am trying to do, the information in a database is more specific than what I have retrieved from websites. Could someone guide me on this? I would really appreciate any help you can give.
Thanks.
First of all, you need some training data from the past, meaning a collection of old news together with the state of the trend you want to analyze at different points in time.
Then, you have to decide how to quantify this information. If the trend is something like “mobile phones sold”, you can just take the number of phones sold.
News is harder to quantify. For example, you could measure the word frequencies in the training news and take the n least frequent words as features (similar to spam filters).
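As a rough illustration of the word-frequency idea, here is a minimal sketch using only the Python standard library. The helper names (tokenize, pick_vocabulary, to_features), the sample headlines, and the min_count cutoff are all made up for this example; the cutoff in particular is an assumption added so that one-off words such as typos are not picked as features.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def pick_vocabulary(documents, n, min_count=2):
    """Return the n least frequent words that occur at least min_count times.

    min_count is an assumption of this sketch, not part of the original advice.
    """
    counts = Counter(word for doc in documents for word in tokenize(doc))
    candidates = [(count, word) for word, count in counts.items() if count >= min_count]
    candidates.sort()  # rarest first, ties broken alphabetically
    return [word for _, word in candidates[:n]]

def to_features(document, vocabulary):
    """Turn one document into a list of counts, one per vocabulary word."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]

# Hypothetical training news, invented for illustration.
old_news = [
    "Phone maker reports record sales of new phone",
    "Chip shortage delays new phone launch",
    "Record sales expected despite chip shortage",
]

vocabulary = pick_vocabulary(old_news, n=3)
features = [to_features(doc, vocabulary) for doc in old_news]
print(vocabulary)
print(features)
```

Each news item becomes a fixed-length list of numbers, which is the form most classifiers expect as input.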
After that, you train a classifier on these features and the trend values from the past.
A good choice is the “Random Forest” algorithm, since it usually works well with its default settings and needs very little parameter tuning.
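The training step could look roughly like the sketch below. It assumes the scikit-learn library is installed and uses its RandomForestClassifier; the feature rows and labels are invented toy data, and encoding the trend as 1 (went up) or 0 (went down) is an assumption made only for this example.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: each row holds word counts for one past news item.
X_train = [
    [0, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 2, 1],
    [2, 0, 0],
    [1, 0, 0],
]
# Assumed labels: 1 = the trend went up afterwards, 0 = it went down.
y_train = [1, 0, 0, 1, 0, 0]

# Mostly default settings; random_state only makes the run repeatable.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict the trend direction for the word counts of a new news item.
prediction = model.predict([[0, 1, 1]])
print(prediction)
```

With real data you would need far more examples than this, and you should hold some of them back to check how well the predictions match what actually happened.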
You will need a lot of background knowledge to actually implement this plan. “The Elements of Statistical Learning” by Hastie, Tibshirani and Friedman is a good book to learn from. It can be downloaded for free from the authors’ homepage.