I am trying to read a 12MB+ file that contains a large HTML table, which looks like this:
<table>
<tr>
<td>a</td>
<td>b</td>
<td>c</td>
<td>d</td>
<td>e</td>
</tr>
<tr>
<td>a</td>
<td>b</td>
<td>c</td>
<td>d</td>
<td>e</td>
</tr>
<tr>..... up to 20,000+ rows....</tr>
</table>
Now this is how I’m scraping it:
<?php
require_once 'phpQuery-onefile.php';
$d = phpQuery::newDocumentFile('http://localhost/test.html');
$last_index = 20000;
for ($i = 1; $i <= $last_index; $i++)
{
    // each selector below is evaluated against the whole document
    $set['c1'] = $d['tr:eq('.$i.') td:eq(0)']->text();
    $set['c2'] = $d['tr:eq('.$i.') td:eq(1)']->text();
    $set['c3'] = $d['tr:eq('.$i.') td:eq(2)']->text();
    $set['c4'] = $d['tr:eq('.$i.') td:eq(3)']->text();
    $set['c5'] = $d['tr:eq('.$i.') td:eq(4)']->text();
}
// code to insert to db here...
?>
My benchmark says it takes around 5.25 hours to scrape and insert 1,000 rows into the database. At that rate, the whole 20,000+ rows would take at least 105 hours, so around 4 to 5 days.
My local machine is running on:
- XAMPP
- Win 7
- proc, i3 2100 3.1GHz
- ram, G.Skill RipJaws X 4GB dual
- HDD, old SATA
Is there any way I can speed up the process? Maybe I’m scraping it the wrong way? Note that the file is stored locally, which is why I used http://localhost/test.html.
Slightly faster solution:
for ($i = 1; $i <= $last_index; $i++)
{
    $r = $d['tr:eq('.$i.')'];
    $set['c1'] = $r['td:eq(0)']->text();
    $set['c2'] = $r['td:eq(1)']->text();
    $set['c3'] = $r['td:eq(2)']->text();
    $set['c4'] = $r['td:eq(3)']->text();
    $set['c5'] = $r['td:eq(4)']->text();
}
// code to insert to db here...
?>
I have never worked with phpQuery, but that looks like a very sub-optimal way to parse a huge document: it’s possible that phpQuery has to walk through the whole thing every time you make it load a row using tr:eq('.$i.').
The much more straightforward (and probably also much faster) way would be to simply walk through each tr element of the document, and deal with each element’s children in a foreach loop. You wouldn’t even need phpQuery for that.
See How to Parse XML File in PHP for a variety of solutions.
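As a rough illustration of that single-pass idea, here is a minimal sketch using PHP's built-in DOM extension instead of phpQuery. The helper name extractRows() and the inline sample table are made up for the example, and the speed-up over the tr:eq() approach is an expectation, not something measured here.

```php
<?php
// Sketch only: extractRows() is a hypothetical helper, not part of phpQuery.
// It parses the HTML once and visits every <tr> exactly once.
function extractRows(string $html): array
{
    // Collect libxml warnings internally so imperfect HTML does not spam output.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        $cells = [];
        // Walk the row's direct children and keep only the <td> elements.
        foreach ($tr->childNodes as $child) {
            if ($child->nodeName === 'td') {
                $cells[] = trim($child->textContent);
            }
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Small inline sample standing in for the real 12MB file (assumption).
$sample = '<table><tr><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td></tr></table>';
foreach (extractRows($sample) as $cells) {
    $set = [
        'c1' => $cells[0] ?? null,
        'c2' => $cells[1] ?? null,
        'c3' => $cells[2] ?? null,
        'c4' => $cells[3] ?? null,
        'c5' => $cells[4] ?? null,
    ];
    // code to insert $set into the db here...
}
```

For the real document, something like extractRows(file_get_contents('test.html')) should work, assuming test.html is readable from the script's working directory. Reading it from disk also avoids the extra HTTP round trip through localhost.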