I am looking to implement a simple forward indexer in PHP. Yes, I do understand that PHP is hardly the best tool for the task, but I want to do it anyway. The rationale behind it is simple: I want one, and in PHP.
Let us make a few basic assumptions:
- The entire Interweb consists of about five thousand HTML and/or plain-text documents. Each document resides within a particular domain (UID). No other proprietary/arcane formats exist in our imaginary cavemanesque Interweb.
- The result of our awesome PHP-based forward indexing algorithm should be along the lines of:
UID1 -> index.html -> helen,she,was,champion,with,freckles
UID1 -> foo.html -> chicken,farmers,go,home,eat,sheep
UID2 -> blah.html -> next,week,on,badgerwatch
UID2 -> gah.txt -> one,one,and,one,is,not,numberwang
Ideally, I would love to see solutions that take into account, even at their most elementary, the concepts of tokenization/word boundary disambiguation/part-of-speech-tagging.
Of course, I do realise this is wishful thinking, and will therefore humbly welcome any worthy attempts at parsing said imaginary documents by:
- Extracting the real textual content within the document as a list of words in the order in which they are presented.
- All the while, ignoring any garbage such as <script> and <html> tags, to compute a list of UIDs (which could be, for instance, a domain) followed by the document name (the resource within the domain) and finally the list of words for that document. I do realise that HTML tags play an important role in the semantic placement of text within a document, but at this stage I do not care.
- Bear in mind that a solution that can build the list of words WHILE reading the document is cooler than one which needs to read in the whole document first.
At this stage, I do not care about the wheres or hows of storage. Even a rudimentary set of ‘print’ statements will suffice.
Thanks in advance, hope this was clear enough.
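Take a look at Simple HTML DOM:
http://simplehtmldom.sourceforge.net/
You load a page with file_get_html() and read its plaintext property, and that will give you all the text.
Want to iterate over just the links? You can query for them with find('a').
It is very useful and powerful.
Check it out.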
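For comparison, here is a minimal sketch of the forward index described in the question, with a few loud assumptions: it swaps in PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Simple HTML DOM, it decides between HTML and plain text by file extension alone, and the function names tokenize(), html_words() and forward_index() are hypothetical helpers made up for this illustration. The tokenizer is deliberately naive: no stemming, no part-of-speech tagging, just lowercase words in document order.

```php
<?php
// Naive tokenizer (hypothetical helper): lowercase ASCII letters, then split
// on every run of characters that is not a letter or a digit.
function tokenize(string $text): array
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? [] : $words; // false means the input was not valid UTF-8
}

// Hypothetical helper: return the words of an HTML document in document
// order, skipping anything inside <script> or <style>.
function html_words(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Tokenize each text node separately so that words in adjacent
    // elements (<p>a</p><p>b</p>) are not glued together.
    $xpath = new DOMXPath($dom);
    $words = [];
    foreach ($xpath->query('//text()[not(ancestor::script or ancestor::style)]') as $node) {
        $words = array_merge($words, tokenize($node->nodeValue));
    }
    return $words;
}

// Build UID -> document name -> word list from an in-memory corpus shaped
// as [uid => [document name => raw content]] (an assumption of this sketch).
function forward_index(array $corpus): array
{
    $index = [];
    foreach ($corpus as $uid => $documents) {
        foreach ($documents as $name => $content) {
            $ext = strtolower(pathinfo((string) $name, PATHINFO_EXTENSION));
            $index[$uid][$name] = in_array($ext, ['html', 'htm'], true)
                ? html_words($content)
                : tokenize($content);
        }
    }
    return $index;
}

// Usage: print the index in the format shown in the question.
$corpus = [
    'UID1' => ['index.html' => '<html><head><script>var x = 1;</script></head>'
        . '<body><p>Helen, she was champion with freckles</p></body></html>'],
    'UID2' => ['gah.txt' => 'One, one and one is not numberwang'],
];
foreach (forward_index($corpus) as $uid => $documents) {
    foreach ($documents as $name => $words) {
        echo $uid, ' -> ', $name, ' -> ', implode(',', $words), PHP_EOL;
    }
}
```

Note that this version parses each document in full before producing any words, so it does not meet the "build the list while reading" wish. A word split across inline tags (for example foo<b>bar</b>) also comes out as two tokens.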