I’ve been thinking about this for a while now, so I thought I would ask for suggestions:
I have a crawler that enters the root of some site (could be anything from http://www.StackOverFlow.com, http://www.SomeDudesPersonalSite.se or even http://www.Facebook.com). Then I need to determine what “kind of homepage” I’m visiting. Different types could for instance be:
- Forum
- Blog
- Link catalog
- Social media site
- News site
- “One man site”
I’ve been brainstorming for a while, and the best solution seems to be some heuristic with a point system. By this I mean that each trend the crawler detects gives some points to one or more of the types, and the program then guesses the type with the most points.
But this is where I get stuck. How do you detect trends?
- Catalogs could be easy: If the number of outgoing links per indexed page is very high, catalogs should get several points.
- News sites/Blogs could be easy: If a large share of the indexed pages carry a date/time, those types should get several points.
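To make the point system concrete, here is a rough sketch of how those two rules could be scored. Everything in it is an assumption for illustration: the names `CrawlStats`, `score_site` and `guess_types`, the thresholds, and the point values are made up, and the catalog rule is read as "many outgoing links per indexed page".

```python
from dataclasses import dataclass

SITE_TYPES = ("forum", "blog", "link catalog", "social media", "news", "one man site")


@dataclass
class CrawlStats:
    """Hypothetical counters the crawler would collect for one site."""
    pages_indexed: int
    outgoing_links: int
    pages_with_datetime: int


def score_site(stats, links_per_page_threshold=20.0, dated_share_threshold=0.5):
    """Give points to site types based on two simple trends.

    The thresholds and point values are guesses and would need tuning.
    """
    scores = {site_type: 0 for site_type in SITE_TYPES}
    if stats.pages_indexed == 0:
        return scores  # nothing crawled, so no evidence either way

    # Trend 1: many outgoing links per indexed page hints at a link catalog.
    links_per_page = stats.outgoing_links / stats.pages_indexed
    if links_per_page >= links_per_page_threshold:
        scores["link catalog"] += 3

    # Trend 2: a large share of dated pages hints at a news site or a blog.
    dated_share = stats.pages_with_datetime / stats.pages_indexed
    if dated_share >= dated_share_threshold:
        scores["news"] += 2
        scores["blog"] += 2

    return scores


def guess_types(scores):
    """Return every type tied for the highest score, or [] if nothing scored."""
    best = max(scores.values())
    if best == 0:
        return []
    return [site_type for site_type, points in scores.items() if points == best]


catalog_like = CrawlStats(pages_indexed=100, outgoing_links=5000, pages_with_datetime=10)
dated_site = CrawlStats(pages_indexed=200, outgoing_links=400, pages_with_datetime=180)
print(guess_types(score_site(catalog_like)))
print(guess_types(score_site(dated_site)))
```

With only these two rules a heavily dated site ties between "blog" and "news", which is exactly why more trends are needed to tell them apart.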
But I can’t really find many more trends than that.
So my question is:
Any ideas on how to do this?
Thanks so much.
You could train a neural network to recognise them. Give it the number and types of links as input, and maybe the types of HTML tags used as well.
I think otherwise you’re just going to be second-guessing what makes a site what it is.
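As a rough sketch of what the input could look like, the snippet below turns one page into a fixed-length list of numbers (internal links, external links, and counts of a few tags) using only Python's standard library. The names `FeatureParser`, `extract_features` and the choice of `TRACKED_TAGS` are my own assumptions, not a recommended feature set. Training the network itself is not shown; it would need labelled example sites and whatever library you prefer.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical selection of tags whose counts might help tell site types apart.
TRACKED_TAGS = ("a", "form", "img", "time", "article", "table")


class FeatureParser(HTMLParser):
    """Counts tags and splits http(s) links into internal and external."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc.lower()
        self.tag_counts = Counter()
        self.internal_links = 0
        self.external_links = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before comparing hosts.
        target = urlparse(urljoin(self.base_url, href))
        if target.scheme not in ("http", "https"):
            return  # skip mailto:, javascript:, and similar
        if target.netloc.lower() == self.base_host:
            self.internal_links += 1
        else:
            self.external_links += 1


def extract_features(html, base_url):
    """Return [internal links, external links, count per tracked tag]."""
    parser = FeatureParser(base_url)
    parser.feed(html)
    parser.close()
    features = [float(parser.internal_links), float(parser.external_links)]
    features += [float(parser.tag_counts[tag]) for tag in TRACKED_TAGS]
    return features


sample = (
    '<html><body><a href="/about">About</a>'
    '<a href="http://example.org/x">X</a>'
    '<a href="mailto:a@b.c">Mail</a>'
    '<img src="a.png"><time datetime="2020-01-01">Jan 1</time>'
    "</body></html>"
)
print(extract_features(sample, "http://example.com/"))
```

Averaging or summing these vectors over all crawled pages of a site would give one input row per site for the classifier.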