I would like to ask about known PHP libraries that could help me parse *.txt files into sentences.
I have to parse very large text files, so I decided to write a stream parser (sentence by sentence).
I thought it would be nice to iterate over the file sentence by sentence, something like:
    foreach (new SentenceIterator("./data/huge.txt") as $sentence)
    {
        // do something...
    }
The main idea is that the file should not be loaded into memory completely.
What I have tried:
    $f = fopen("./data/huge.txt", "r");
    $dataBytes = 64;
    $buffer = '';
    while (!feof($f))
    {
        $data = fread($f, $dataBytes);
        $dotPosition = strpos($data, '.');
        if (false !== $dotPosition)
        {
            $sentence = $buffer . substr($data, 0, $dotPosition);
            // correct cursor position
            fseek($f, -1 * $dotPosition, SEEK_CUR);
            // clear buffer
            $buffer = '';
            continue;
        }
        $buffer .= $data;
    }
But in this case I get corrupted (truncated) sentences.
Could someone suggest existing libraries, or show me how to fix my code?
Thx in advance.
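A likely cause of the truncated sentences in the loop above: after a dot is found, `fseek` rewinds by `$dotPosition` bytes, but the unread remainder of the chunk is `strlen($data) - $dotPosition - 1` bytes, so the cursor ends up in the wrong place. In addition, whatever is left in `$buffer` at the end of the file is never emitted. Below is a minimal sketch of one way around both problems: it never seeks and keeps the unconsumed remainder in the buffer instead. The `SentenceIterator` class is hypothetical (it only mirrors the interface wished for above), and it assumes, as the original code does, that a `.` always ends a sentence, so abbreviations and decimals will be split incorrectly.

```php
<?php

/**
 * Hypothetical sentence iterator: not part of PHP or any library.
 * Reads the file in small chunks so it is never fully loaded into memory.
 */
final class SentenceIterator implements IteratorAggregate
{
    private $path;
    private $chunkSize;

    public function __construct(string $path, int $chunkSize = 64)
    {
        $this->path = $path;
        $this->chunkSize = $chunkSize;
    }

    public function getIterator(): Generator
    {
        $handle = fopen($this->path, 'r');
        if ($handle === false) {
            throw new RuntimeException("Cannot open {$this->path}");
        }

        try {
            $buffer = '';
            while (!feof($handle)) {
                $chunk = fread($handle, $this->chunkSize);
                if ($chunk === false) {
                    break;
                }
                $buffer .= $chunk;

                // Emit every complete sentence currently in the buffer.
                // Assumption: '.' is the only sentence terminator.
                while (($dot = strpos($buffer, '.')) !== false) {
                    // The sentence keeps its trailing dot.
                    $sentence = trim(substr($buffer, 0, $dot + 1));
                    // Keep the unread remainder instead of seeking back.
                    $buffer = substr($buffer, $dot + 1);
                    if ($sentence !== '') {
                        yield $sentence;
                    }
                }
            }

            // Text after the last dot is emitted as a final sentence.
            $tail = trim($buffer);
            if ($tail !== '') {
                yield $tail;
            }
        } finally {
            fclose($handle);
        }
    }
}
```

With this in place, `foreach (new SentenceIterator("./data/huge.txt") as $sentence)` should work as written in the question. Memory use is bounded by the chunk size plus the longest sentence, because only the unfinished sentence is held in the buffer.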
Sorry for the inconvenience.
After some digging I found a solution: the SPL library.
It provides a class called SplFileObject, which implements Iterator, RecursiveIterator and SeekableIterator, and allows a file to be read line by line.
The updated, working code is: