I’m dealing with a database with about 30 tables, and 10 million unique entries.
I am trying to use PHP to present that data in a certain format using the echo “function” and placing the variables using {$variable}.
The data is hierarchical, so I used joins to pull in several columns; the resulting table had roughly 15 columns.
I ran the PHP file in Google Chrome, and it ran for about 1 hour on a pretty decent Core 2 Duo machine.
But the result set stopped at about 18 thousand entries – I had put no limit on the query by the way.
The most important part of my question is: how do I run this file to get all the results? I don’t want to sit there and set the offset over and over; if there is another way, I would be very grateful.
Secondarily – and I know you probably need more information, just not sure what – can I make the process faster? I’m planning on re-running it on a better machine, but are there other ways?
Thanks
Update:
<?php
include ('includes/functions.php');
$connection = connectdb();
$result = runquery('
SELECT taxonomic_rank.rank as shortrank, scientific_name_element.name_element as shortname, sne.name_element as pname, tr.rank as prank
FROM taxon_name_element
LEFT JOIN scientific_name_element ON taxon_name_element.scientific_name_element_id = scientific_name_element.id
LEFT JOIN taxon ON taxon_name_element.taxon_id = taxon.id
LEFT JOIN taxonomic_rank ON taxonomic_rank.id = taxon.taxonomic_rank_id
LEFT JOIN taxon_name_element AS tne ON taxon_name_element.parent_id = tne.taxon_id
LEFT JOIN scientific_name_element AS sne ON sne.id = tne.scientific_name_element_id
LEFT JOIN taxon AS tax ON tax.id = tne.taxon_id
LEFT JOIN taxonomic_rank AS tr ON tr.id = tax.taxonomic_rank_id');
set_time_limit(0);
ini_set('max_execution_time', 0);
while ($taxon_name_element = mysql_fetch_array($result)) {
    if ($taxon_name_element['shortrank'] == 'species') {
        $subitem = $taxon_name_element['pname']."_".$taxon_name_element['shortname'];
    } else {
        $subitem = $taxon_name_element['shortrank']."_".$taxon_name_element['shortname'];
    }
    $parentitem = $taxon_name_element['prank']."_".$taxon_name_element['pname'];
    echo
"\n<!-- http://invertnet.ill/med#{$subitem}\" -->\n
<owl:Class rdf:about=\"http://invertnet.ill/med#{$subitem}\">
<rdfs:label xml:lang=\"en\">{$subitem}</rdfs:label>
<rdfs:subClassOf rdf:resource=\"http://invertnet.ill/med#{$parentitem}\"/>
</owl:Class>\n\n";
}
echo "<br>".count($taxon_name_element)." number of stuff";
?>
Judging from the lines quoted below, this doesn’t look like a slow-query issue.
“I ran the php file in Google Chrome, and it ran for about 1 hour on a pretty decent core2duo machine.
But the result set stopped at about 18 thousand entries – I had put no limit on the query by the way”
A browser isn’t the best medium for displaying 10 million records, not Chrome at least :-). My suggestion is to add some pagination to your PHP file so that you do not have to set the offset manually every time. A simple previous/next link showing, say, 10000 records per page would do.
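A minimal sketch of that pagination idea, assuming the page number arrives as a `page` query-string parameter and the total row count comes from a separate COUNT(*) query. The helper names `pageBounds` and `pagerLinks` are made up for illustration and are not part of your code.

```php
<?php
// Hypothetical helpers: the names and the 10000 page size are illustrative.
const PAGE_SIZE = 10000;

/** Turn a 1-based page number into LIMIT/OFFSET values. */
function pageBounds(int $page, int $pageSize = PAGE_SIZE): array
{
    $page = max(1, $page);
    return ['limit' => $pageSize, 'offset' => ($page - 1) * $pageSize];
}

/** Build previous/next links; $totalRows is assumed to come from a COUNT(*) query. */
function pagerLinks(int $page, int $totalRows, int $pageSize = PAGE_SIZE): string
{
    $lastPage = max(1, (int) ceil($totalRows / $pageSize));
    $page = min(max(1, $page), $lastPage);
    $links = [];
    if ($page > 1) {
        $links[] = '<a href="?page=' . ($page - 1) . '">previous</a>';
    }
    if ($page < $lastPage) {
        $links[] = '<a href="?page=' . ($page + 1) . '">next</a>';
    }
    return implode(' | ', $links);
}

// Usage: append the bounds to the original query text before running it.
$bounds = pageBounds((int) ($_GET['page'] ?? 1));
$sqlSuffix = sprintf(' LIMIT %d OFFSET %d', $bounds['limit'], $bounds['offset']);
```

Note that LIMIT/OFFSET only gives stable pages if the query also has an ORDER BY on some unique column; which column suits your schema is something I can’t tell from here.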
If it does not absolutely have to run in a browser, another option is to write all the output to a text file.
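Roughly, the file-based variant could look like the sketch below. It assumes each row has already been reduced to a `subitem` and a `parentitem` string, the way your loop builds them; `renderClass` and `writeRows` are hypothetical names, and the template is copied from your echo statement.

```php
<?php
// Hypothetical helpers; function names and row keys are illustrative.

/** Render one row as an OWL class block, following the question's echo template. */
function renderClass(string $subitem, string $parentitem): string
{
    return "<owl:Class rdf:about=\"http://invertnet.ill/med#{$subitem}\">\n"
         . "<rdfs:label xml:lang=\"en\">{$subitem}</rdfs:label>\n"
         . "<rdfs:subClassOf rdf:resource=\"http://invertnet.ill/med#{$parentitem}\"/>\n"
         . "</owl:Class>\n\n";
}

/** Write rows one at a time to $path and return how many were written. */
function writeRows(iterable $rows, string $path): int
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path} for writing");
    }
    $count = 0;
    foreach ($rows as $row) {
        // Each row goes straight to disk, so nothing accumulates in memory.
        fwrite($handle, renderClass($row['subitem'], $row['parentitem']));
        $count++;
    }
    fclose($handle);
    return $count;
}
```

Because `writeRows` accepts any iterable, the rows could come from a generator that fetches from the result set one row at a time, and the script could then be started from the command line instead of a browser.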
Some notes on the query too: is there a specific reason for LEFT JOINing each table twice? It seems to have something to do with taxon_name_element.parent_id, but since I’m not sure of the requirements or the table schema, I can’t comment on it. If the query is running too slowly, do consider optimizing it.
EDIT 1 – I’ve worked on your query a little. Since you want both the name of the element and its parent’s name, I think it is possible to do this with a simpler query, without joining the same tables twice. It will need some extra logic in code, though.
A few observations from the query:
- It reads taxon_name_element and taxonomic_rank for both the element and its parent.
- From taxon_name_element.parent_id = tne.taxon_id, I learn that both the element and its parent are in the same table, taxon_name_element.
Now let us see the simpler query:
The result set will now contain both taxon_id and parent_id. So the idea is to store all results in an array such that the KEY is set to the parent_id. Like:
Hope that makes sense! Of course, the above is only needed if the original query is expensive.
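A sketch of that idea, under several assumptions: the query text and the aliases `name` and `rank_name` are guesses based on the column names in your original query, taxon_id is assumed to be unique within taxon_name_element, and `indexRows` and `pairWithParents` are hypothetical names. Besides grouping rows under parent_id, the sketch also keeps a second index by taxon_id so that each parent’s name and rank can be looked up.

```php
<?php
// Assumed single-pass query: aliases taxon_id, parent_id, name, rank_name are illustrative.
const SIMPLER_QUERY = '
SELECT tne.taxon_id, tne.parent_id,
       sne.name_element AS name, tr.`rank` AS rank_name
FROM taxon_name_element AS tne
LEFT JOIN scientific_name_element AS sne ON sne.id = tne.scientific_name_element_id
LEFT JOIN taxon AS t ON t.id = tne.taxon_id
LEFT JOIN taxonomic_rank AS tr ON tr.id = t.taxonomic_rank_id';

/** Group rows under their parent_id and index every row by taxon_id. */
function indexRows(iterable $rows): array
{
    $byParent = [];
    $byId = [];
    foreach ($rows as $row) {
        $byId[$row['taxon_id']] = $row;
        if ($row['parent_id'] !== null) {        // root elements have no parent
            $byParent[$row['parent_id']][] = $row;
        }
    }
    return [$byParent, $byId];
}

/** Pair each child with its parent's name and rank, resolved in PHP instead of SQL. */
function pairWithParents(iterable $rows): array
{
    [$byParent, $byId] = indexRows($rows);
    $pairs = [];
    foreach ($byParent as $parentId => $children) {
        $parent = $byId[$parentId] ?? null;
        if ($parent === null) {
            continue;                            // parent row missing from the result set
        }
        foreach ($children as $child) {
            $pairs[] = [
                'shortname' => $child['name'],
                'shortrank' => $child['rank_name'],
                'pname'     => $parent['name'],
                'prank'     => $parent['rank_name'],
            ];
        }
    }
    return $pairs;
}
```

Unlike the LEFT JOIN version, this sketch drops elements that have no parent instead of returning them with empty parent columns. It also holds every row in memory at once, which may be too much for the full 10 million entries.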
CAUTION: the above code isn’t tested and I only hope that it works. Minor changes or fixes might be needed 😉