I have a phrase counter function, part of a class, that I would like to adapt to run on very large data sets:
<?php
private static function do_phrase_count($words, $multidim, $count, $pcount){
    // Treat a flat word list as a single word group.
    if($multidim === false){
        $words = array($words);
    }
    $tally = array();
    $arraycount = 0; // number of distinct phrases currently in $tally
    foreach($words as $wordgroup){
        // The last valid start index is count - $pcount.
        $max = count($wordgroup) - $pcount;
        for($x = 0; $x <= $max; $x++){
            $cutoff = $x + $pcount;
            $spacekey = false;
            $phrase = '';
            // Join $pcount consecutive words with single spaces.
            for($y = $x; $y < $cutoff; $y++){
                if($spacekey) $phrase .= ' ';
                else $spacekey = true;
                $phrase .= $wordgroup[$y];
            }
            if(isset($tally[$phrase])){
                $tally[$phrase]++;
            }
            else{
                $tally[$phrase] = 1;
                $arraycount++;
            }
            // Once the tally passes 100K phrases, keep only the top 50K.
            if($arraycount > 99999){
                arsort($tally);
                // preserve_keys = true keeps numeric phrases such as "2019" as keys.
                $tally = array_slice($tally, 0, 50000, true);
                $arraycount = count($tally);
            }
        }
    }
    arsort($tally);
    $out = array_slice($tally, 0, $count, true);
    return $out;
}
- $words is an array of words to check
- $multidim is a boolean showing if the array is nested (an array of word arrays) or flat
- $count is the number of elements to be returned
- $pcount is the number of words in a phrase
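To make the parameters concrete, here is a minimal standalone sketch of the same sliding-window idea. The function name count_phrases and the sample sentence are made up for illustration; this is not the class method itself, just a simplified flat-array version of it.

```php
<?php
// Hypothetical standalone helper (not the class method): counts every run of
// $pcount consecutive words in a flat word list.
function count_phrases(array $words, int $pcount): array
{
    $tally = [];
    $last = count($words) - $pcount; // last valid start index
    for ($x = 0; $x <= $last; $x++) {
        // Take $pcount words starting at offset $x and join them with spaces.
        $phrase = implode(' ', array_slice($words, $x, $pcount));
        $tally[$phrase] = ($tally[$phrase] ?? 0) + 1;
    }
    arsort($tally); // most frequent phrases first
    return $tally;
}

$words = explode(' ', 'the cat sat on the mat and the cat sat down');
// Keep the two most frequent two-word phrases; preserve_keys keeps the phrases as keys.
$top = array_slice(count_phrases($words, 2), 0, 2, true);
print_r($top);
```

In this sample, "the cat" and "cat sat" each occur twice and every other two-word phrase occurs once, so those two end up in $top.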
As $tally grows, the array_key_exists check gets slower with every iteration, so at a certain point I need to decrease the size of the tally array.
I was considering using a limit (100K) to stop the script from adding new array elements to $tally, or even using a percentage of total words, but once I stop adding new elements to the array I lose the ability to track trends that pop up later. (If I’m analysing data from a whole year, by the time I get to June, I won’t be able to see “summer time” as a trend.)
Anyone have a solution as to how to limit my tally array to keep the script zinging without losing the ability to track trends?
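One possible direction, sketched under assumptions: instead of freezing the array at a limit, prune it back to its highest counts whenever it grows past the limit, and keep accepting new phrases afterwards. The helper name prune_tally and the tiny limits below are hypothetical and chosen only to keep the example readable.

```php
<?php
// Hypothetical helper: once the tally holds more than $limit distinct phrases,
// keep only the $keep highest counts. New phrases can still be added later,
// so trends that start late are not locked out.
function prune_tally(array $tally, int $limit, int $keep): array
{
    if (count($tally) <= $limit) {
        return $tally;
    }
    arsort($tally);
    return array_slice($tally, 0, $keep, true); // preserve the phrase keys
}

$stream = [
    'winter', 'winter', 'winter', 'snow', 'snow', 'rain',
    'summer time', 'summer time', 'summer time', 'summer time',
];

$tally = [];
foreach ($stream as $phrase) {
    $tally[$phrase] = ($tally[$phrase] ?? 0) + 1;
    $tally = prune_tally($tally, 3, 2); // tiny limits, for illustration only
}
print_r($tally);
```

The trade-off is that counts become approximate: here the first "summer time" is pruned away together with "rain", so the phrase re-enters the tally and finishes with a count of 3 rather than 4. A late trend still surfaces, but it is undercounted by whatever was dropped.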
UPDATE: I changed the script per your suggestions. Thank you for helping. I also figured out a solution to cut down on the size of the array.
isset() should be faster than array_key_exists().
PS: test sample
Result:
Of course, the amount of keys is limited here.
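A rough sketch of how the two lookups could be timed side by side. The key count, loop size and variable names are arbitrary assumptions, and timings will vary by PHP version and machine, so no figures are claimed here.

```php
<?php
// Hypothetical micro-benchmark; sizes are arbitrary.
$tally = [];
for ($i = 0; $i < 100000; $i++) {
    $tally['phrase ' . $i] = 1;
}

// Half of the 200000 lookups below hit an existing key, half miss.
$start = microtime(true);
$issetHits = 0;
for ($i = 0; $i < 200000; $i++) {
    if (isset($tally['phrase ' . $i])) {
        $issetHits++;
    }
}
$issetTime = microtime(true) - $start;

$start = microtime(true);
$existsHits = 0;
for ($i = 0; $i < 200000; $i++) {
    if (array_key_exists('phrase ' . $i, $tally)) {
        $existsHits++;
    }
}
$existsTime = microtime(true) - $start;

printf("isset: %.4fs, array_key_exists: %.4fs\n", $issetTime, $existsTime);

// The two are not interchangeable when values can be null:
$sample = ['x' => null];
var_dump(isset($sample['x']));            // bool(false)
var_dump(array_key_exists('x', $sample)); // bool(true)
```

For a tally whose values are always integer counts, the null difference does not matter, so either check gives the same answer.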