I am trying to scrape specific html tags including their data from a google products page. I want to get all the <li> tags within this ordered list and put them in a list.
Here is the code:
<td valign="top">
<div id="center_col">
<div id="res">
<div id="ires">
<ol>
<li class="g">
<div class="pslires">
<div class="psliimg">
<a href=
"https://www.google.com">
</a>
</div>
<div class="psliprice">
<div>
<b>$59.99</b> used
</div><cite>google auctions</cite>
</div>
<div class="pslimain">
<h3 class="r"><a href=
"https://www.google.com">
google</a></h3>
<div>
dummy data </div>
</div>
</div>
</li>
<li class="g">
<div class="pslires">
<div class="psliimg">
<a href=
"https://www.google.com">
</a>
</div>
<div class="psliprice">
<div>
<b>$59.99</b> used
</div><cite>google auctions</cite>
</div>
<div class="pslimain">
<h3 class="r"><a href=
"https://www.google.com">
google</a></h3>
<div>
dummy data </div>
</div>
</div>
</li>
<li class="g">
<div class="pslires">
<div class="psliimg">
<a href=
"https://www.google.com">
</a>
</div>
<div class="psliprice">
<div>
<b>$59.99</b> used
</div><cite>google auctions</cite>
</div>
<div class="pslimain">
<h3 class="r"><a href=
"https://www.google.com">
google</a></h3>
<div>
dummy data </div>
</div>
</div>
</li>
<li class="g">
<div class="pslires">
<div class="psliimg">
<a href=
"https://www.google.com">
</a>
</div>
<div class="psliprice">
<div>
<b>$59.99</b> used
</div><cite>google auctions</cite>
</div>
<div class="pslimain">
<h3 class="r"><a href=
"https://www.google.com">
google</a></h3>
<div>
dummy data </div>
</div>
</div>
</li>
</ol>
</div>
</div>
</div>
<div id="foot">
<p class="flc" id="bfl" style="margin:19px 0 0;text-align:center"><a href=
"/support/websearch/bin/answer.py?answer=134479&hl=en">Search Help</a>
<a href=
"/quality_form?q=Pioneer+Automotive+PF-555-2000&hl=en&tbm=shop">Give us
feedback</a></p>
<div class="flc" id="fll" style="margin:19px auto 19px auto;text-align:center">
<a href="/">Google Home</a> <a href=
"/intl/en/ads">Advertising Programs</a> <a href="/services">Business
Solutions</a> <a href="/intl/en/policies/">Privacy & Terms</a> <a href=
"/intl/en/about.html">About Google</a>
</div>
</div>
</td>
I want to get all the <li class="g"> tags and the data in each of them. Is that possible?
instead of using a regex using something like an xml parser may be more useful to your situation. Load it up into an xml document and then use something like SelectNodes to get out your data you are looking for
http://msdn.microsoft.com/en-us/library/4bektfx9.aspx