Question

Asked: May 11, 2026

I am looking to implement a simple forward indexer in PHP. Yes, I do understand that PHP is hardly the best tool for the task, but I want to do it anyway. The rationale behind it is simple: I want one, and in PHP.

Let us make a few basic assumptions:

The entire Interweb consists of about five thousand HTML and/or plain-text documents.
Each document resides within a particular domain (UID).
No other proprietary/arcane formats exist in our imaginary cavemanesque Interweb.

The result of our awesome PHP-based forward indexing algorithm should be along the lines of:

UID1 -> index.html -> helen,she,was,champion,with,freckles

UID1 -> foo.html -> chicken,farmers,go,home,eat,sheep

UID2 -> blah.html -> next,week,on,badgerwatch

UID2 -> gah.txt -> one,one,and,one,is,not,numberwang
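
For illustration, the output format above maps naturally onto a nested PHP array keyed by UID and then by document name. The sketch below is only one possible shape; the function name `formatForwardIndex` and the sample data layout are assumptions made for this example, not part of any existing library.

```php
<?php
// Hypothetical structure: UID => document name => ordered list of words.
$forwardIndex = [
    'UID1' => [
        'index.html' => ['helen', 'she', 'was', 'champion', 'with', 'freckles'],
        'foo.html'   => ['chicken', 'farmers', 'go', 'home', 'eat', 'sheep'],
    ],
    'UID2' => [
        'blah.html' => ['next', 'week', 'on', 'badgerwatch'],
    ],
];

/**
 * Render each document as "UID -> document -> word,word,...".
 * (Illustrative helper; the name is an assumption of this sketch.)
 */
function formatForwardIndex(array $index): array
{
    $lines = [];
    foreach ($index as $uid => $documents) {
        foreach ($documents as $name => $words) {
            $lines[] = $uid . ' -> ' . $name . ' -> ' . implode(',', $words);
        }
    }
    return $lines;
}

// A rudimentary "print" of the whole index, one document per line.
echo implode(PHP_EOL, formatForwardIndex($forwardIndex)), PHP_EOL;
```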

Ideally, I would love to see solutions that take into account, even at their most elementary, the concepts of tokenization/word boundary disambiguation/part-of-speech-tagging.
Of course, I do realise this is wishful thinking, and will therefore humbly welcome any worthy attempt at parsing said imaginary documents by:

Extracting the real textual content within the document as a list of words in the order in which they are presented.
All the while, ignoring any garbage such as <script> and <html> tags to compute a list of UIDs (which could be, for instance, a domain) followed by document name (the resource within the domain) and finally the list of words for that document. I do realise that HTML tags play an important role in the semantic placement of text within a document, but at this stage I do not care.
Bear in mind a solution that can build the list of words WHILE reading the document is cooler than one which needs to read in the whole document first.
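
To make the "while reading" requirement concrete, here is a minimal sketch of a streaming tokenizer written as a PHP generator. It reads the document in small chunks, drops everything between `<` and `>`, discards the contents of `<script>` and `<style>` blocks, and yields words in document order. The function name `streamWords` and the chunk size are assumptions for illustration. The sketch deliberately simplifies: it only recognises ASCII letters and digits, does not decode entities (so `&amp;` would yield `amp`), does not treat HTML comments specially, and expects the closing tag to be written exactly as `</script>` or `</style>`.

```php
<?php
/**
 * Yield lowercase words from an HTML or plain-text stream, chunk by chunk.
 * Hypothetical helper for illustration only; see the simplifications above.
 *
 * @param resource $handle readable stream, e.g. the result of fopen()
 */
function streamWords($handle, int $chunkSize = 4096): Generator
{
    $word = '';      // word currently being built
    $tag = null;     // text collected since '<', or null when outside a tag
    $closing = null; // e.g. '</script>' while skipping script/style content
    $tail = '';      // last few characters seen while skipping

    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $length = strlen($chunk);
        for ($i = 0; $i < $length; $i++) {
            $char = $chunk[$i];

            if ($closing !== null) {
                // Inside <script> or <style>: discard until the closing tag.
                $tail = substr($tail . strtolower($char), -strlen($closing));
                if ($tail === $closing) {
                    $closing = null;
                    $tail = '';
                }
                continue;
            }

            if ($tag !== null) {
                if ($char !== '>') {
                    $tag .= $char;
                    continue;
                }
                // End of a tag: does it open a script or style block?
                if (preg_match('/^(script|style)\b/i', $tag, $m)) {
                    $closing = '</' . strtolower($m[1]) . '>';
                }
                $tag = null;
                continue;
            }

            if ($char === '<') {
                // A tag starts here, which also ends the current word.
                if ($word !== '') {
                    yield $word;
                    $word = '';
                }
                $tag = '';
            } elseif (ctype_alnum($char)) {
                $word .= strtolower($char);
            } elseif ($word !== '') {
                yield $word;
                $word = '';
            }
        }
    }
    if ($word !== '') {
        yield $word; // flush the last word at end of input
    }
}

// Usage: an in-memory stream stands in for a real document here.
$handle = fopen('php://memory', 'r+');
fwrite($handle, '<html><script>var x = 1;</script><p>Helen, she was champion</p></html>');
rewind($handle);
foreach (streamWords($handle, 8) as $w) {
    echo $w, PHP_EOL;
}
fclose($handle);
```

Because the parser state lives outside the chunk loop, a tag or word that straddles a chunk boundary is still handled correctly, and memory use stays independent of document size.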

At this stage, I do not care about the wheres or hows of storage. Even a rudimentary set of ‘print’ statements will suffice.

Thanks in advance, hope this was clear enough.


1 Answer

Editorial Team · Answer 1 · 2026-05-11T16:56:18+00:00


Take a look at

http://simplehtmldom.sourceforge.net/

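You do something like:

// file_get_html() fetches a URL (or file) and returns a simple_html_dom object
$p = file_get_html("http://www.page.com/");
// find() with an index returns a single element rather than an array
$text = $p->find("body", 0)->plaintext;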

And that will give you all the text.
Want to iterate over just the links?

foreach ($p->find("a") as $link)
{
    echo $link->innertext;
}

It is very useful and powerful.
Check it out.
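
Once you have the plain text as a string, turning it into the word list the question asks for is a short step. The following is a rough sketch only: `wordsFromText` is a made-up helper name, and it assumes that any run of characters other than letters and digits is a word boundary (so "don't" becomes "don" and "t"). Lowercasing uses `strtolower()`, which only affects ASCII letters in recent PHP versions.

```php
<?php
/**
 * Split plain text into lowercase words, preserving document order.
 * Hypothetical helper; the boundary rule is an assumption of this sketch.
 */
function wordsFromText(string $text): array
{
    // \p{L} matches any letter, \p{N} any digit; everything else separates words.
    $parts = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        return []; // e.g. the input was not valid UTF-8
    }
    return array_map('strtolower', $parts);
}

// Example with a literal string standing in for the extracted page text.
$text = 'Helen: she was champion, with freckles!';
echo 'UID1 -> index.html -> ', implode(',', wordsFromText($text)), PHP_EOL;
```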
