I am trying to parse a page like this one, and I simply want to get the paragraphs after the header (the introduction, I guess).
I want all the content (including the paragraph tags) between <table class="infobox vcard"> and <table id="toc">. Using a simple CSS selector to get even the first paragraph:
div#bodyContent div#mw-content-text.mw-content-ltr p
does not always work, because sometimes something in the infobox table contains a paragraph. Also, the number of introductory paragraphs varies. If someone has a better approach than the one I'm going for here, I am open to that as well.
—
Additional code requested, shortened as much as possible:
use strict;
use warnings;
use URI;
use HTTP::Request;
use LWP::UserAgent;
use HTML::Query 'Query';

# Fetch the page.
my $pageurl = "http://en.wikipedia.org/wiki/Wayne_Rooney";
my $wikiurl = URI->new($pageurl);
my $wikirequest = HTTP::Request->new(GET => $wikiurl);
my $wikiua = LWP::UserAgent->new;
my $wikiresponse = $wikiua->request($wikirequest);
my $pagetoparse = $wikiresponse->decoded_content;

# Select every paragraph inside the content div (this also matches infobox paragraphs).
my $q2 = Query(text => $pagetoparse);
my @wikiintro = $q2->query('div#bodyContent div#mw-content-text.mw-content-ltr p')->get_elements();

my $pageintro;
if (@wikiintro) {
    # Workaround: skip the first match when it is the infobox's "Appearances (Goals)" paragraph.
    if (index($wikiintro[0]->as_text(), "Appearances (Goals)") != -1) {
        $pageintro = $wikiintro[1]->as_text();
    } else {
        $pageintro = $wikiintro[0]->as_text();
    }
} else {
    $pageintro = "unavailable";
}
One way is to use the non-core module HTML::TreeBuilder.
Content of script.pl:
Run it, providing the URL as its only argument:
With the following output (I hope it is close to what you expect):
EDIT: To also get the tags in the result, use printf qq|%s\n|, $p->as_HTML; instead of $p->as_text.
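For illustration, here is a minimal sketch of the HTML::TreeBuilder approach. It is not the original script.pl. It assumes the page layout described in the question: the introductory paragraphs are direct children of div#mw-content-text, and the table of contents is an element with id="toc". The helper name intro_paragraphs and its $keep_tags flag are hypothetical names introduced here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the introductory paragraphs of a page.
# Assumes the intro <p> elements are direct children of div#mw-content-text
# and that the table of contents carries id="toc".
sub intro_paragraphs {
    my ($html, $keep_tags) = @_;
    my $tree    = HTML::TreeBuilder->new_from_content($html);
    my $content = $tree->look_down(_tag => 'div', id => 'mw-content-text');
    my @intro;
    for my $child ($content ? $content->content_list : ()) {
        next unless ref $child;                       # skip bare text nodes
        last if ($child->attr('id') // '') eq 'toc';  # stop at the table of contents
        next unless $child->tag eq 'p';               # infobox <p> are nested deeper, so never seen here
        # An empty optional-end-tags map makes as_HTML emit the closing </p>.
        push @intro, $keep_tags ? $child->as_HTML(undef, undef, {}) : $child->as_text;
    }
    $tree->delete;    # release the parse tree
    return @intro;
}

# Fetch and print only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    require LWP::UserAgent;
    my $res = LWP::UserAgent->new->get($url);
    $res->is_success or die $res->status_line, "\n";
    binmode STDOUT, ':encoding(UTF-8)';
    printf qq|%s\n|, $_ for intro_paragraphs($res->decoded_content);
}
```

Saved as script.pl, it could be run as perl script.pl http://en.wikipedia.org/wiki/Wayne_Rooney. Because only direct children of the content div are inspected, a paragraph nested inside the infobox table is never picked up, and any number of introductory paragraphs is collected until the table of contents is reached. Passing a true second argument to intro_paragraphs returns each paragraph with its tags.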