I want to teach myself enough machine learning so that I can, to begin with, understand enough to put to use available open source ML frameworks that will allow me to do things like:
-
Go through the HTML source of pages
from a certain site and “understand”
which sections form the content,
which the advertisements and which
form the metadata ( neither the
content, nor the ads – for eg. –
TOC, author bio etc ) -
Go through the HTML source of pages
from disparate sites and “classify”
whether the site belongs to a
predefined category or not ( list of
categories will be supplied
beforhand )1. -
… similar classification tasks on
text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, taking the neural net approach will take a lot of training and maintainance than putting SVMs to use?
I understand that SVMs are well suited to ( binary ) classification tasks like mine, and open source framworks like libSVM are fairly mature?
In that case, what subjects and topics
does a computer science graduate need
to learn right now, so that the above
requirements can be solved, putting
these frameworks to use?
I would like to stay away from Java, is possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch, but, to begin with putting the various frameworks available to use ( I do not know enough to decide which though ), and I should be able to fix things should they go wrong.
Recommendations from you on learning specific portions of statistics and probability theory is nothing unexpected from my side, so say that if required!
I will modify this question if needed, depending on all your suggestions and feedback.
Seems like a pretty complicated task to me; step 2, classification, is “easy” but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.