I’m using curl to retrieve information from wikipedia. So far I’ve been successful in

Asked: May 11, 2026

I'm using cURL to retrieve information from Wikipedia. So far I've managed to retrieve the basic article text, but I would really like to retrieve it as HTML.

Here is my code:

$s = curl_init();

// Find the matching Wikipedia article through the Yahoo BOSS search API.
$url = 'http://boss.yahooapis.com/ysearch/web/v1/site:en.wikipedia.org+'.$article_name.'?appid=myID';
curl_setopt($s, CURLOPT_URL, $url);
curl_setopt($s, CURLOPT_HEADER, false);
curl_setopt($s, CURLOPT_RETURNTRANSFER, 1);

$rs = curl_exec($s);

$rs = Zend_Json::decode($rs);

$rs = $rs['ysearchresponse']['resultset_web'];

// Take the first search hit and strip the URL prefix to get the article title.
$rs = array_shift($rs);
$article = str_replace('http://en.wikipedia.org/wiki/', '', $rs['url']);

// Ask the MediaWiki API for the content of the article's latest revision.
$url = 'http://en.wikipedia.org/w/api.php?';
$url .= 'format=json';
$url .= sprintf('&action=query&titles=%s&rvprop=content&prop=revisions&redirects=1', $article);

curl_setopt($s, CURLOPT_URL, $url);
curl_setopt($s, CURLOPT_HEADER, false);
curl_setopt($s, CURLOPT_RETURNTRANSFER, 1);

$rs = curl_exec($s);
//curl_close( $s );
$rs = Zend_Json::decode($rs);

// array_pop() takes its argument by reference, so nested calls raise a notice.
// Walk the response explicitly instead: query -> pages -> first page -> first revision.
$pages = $rs['query']['pages'];
$page = array_shift($pages);
$rs = array_shift($page['revisions']);
$articleText = $rs['*'];

However, the text retrieved this way isn't suitable for display. It's all in this kind of format:

'''Aix-les-Bains''' is a [[Communes of
France|commune]] in the [[Savoie]]
[[Departments of France|department]]
in the [[Rhône-Alpes]] [[regions of
France|region]] in southeastern
[[France]].

It lies near the [[Lac du Bourget]],
{{convert|9|km|mi|abbr=on}} by rail
north of [[Chambéry]].

==History==
''Aix'' derives from [[Latin]] ''Aquae'' (literally,
"waters"; ''cf'' [[Aix-la-Chapelle]]
(Aachen) or [[Aix-en-Provence]]), and
Aix was a bath during the [[Roman
Empire]], even before it was renamed
''Aquae Gratianae'' to commemorate the
[[Emperor Gratian]], who was
assassinated not far away, in
[[Lyon]], in [[383]]. Numerous Roman
remains survive. [[Image:IMG 0109 Lake
Promenade.jpg|thumb|left|Lac du
Bourget Promenade]]

How do I get the HTML of the Wikipedia article?

UPDATE: Thanks, but I'm fairly new to this. Right now I'm trying to run an XPath query (for the first time) and can't seem to get any results. I actually need to know a couple of things:

How do I request just a part of an article?
How do I get the HTML of the requested article?

I went through this URL on data mining from Wikipedia. It gave me the idea of making a second request to the Wikipedia API with the retrieved wikitext as a parameter, which would return the HTML, although that hasn't worked so far. I don't want to just grab the whole article as a mess of HTML and dump it.

My application shows some locations and cities pinpointed on a map. You click on a city marker and it requests details of that city via Ajax, to be shown in an adjacent div. I want to get this information from Wikipedia dynamically. I'll worry about cities that have no article later on; for now I just need to make sure the basic flow works.

Does anyone know of a nice working example that does what I'm looking for, i.e. one that reads and parses selected portions of a Wikipedia article?

According to the URL provided, I should POST the wikitext to the Wikipedia API for it to return parsed HTML. The issue is that when I POST the wikitext I get no response, only an error saying I'm denied access. If I send the wikitext via GET instead, it parses with no issue, but that of course fails when I have far too much text to parse.

Is this a problem with the Wikipedia API? I've been hacking at it for two days now with no luck at all.

1 Answer

Editorial Team · Answer 1 · 2026-05-11T17:07:13+00:00

The simplest solution would probably be to grab the page itself (e.g. http://en.wikipedia.org/wiki/Combination ) and then extract the content of <div id="content">, potentially with an XPath query.
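
A minimal sketch of that approach, assuming PHP's bundled cURL and DOM extensions are available. The helper names fetchPage() and extractElementHtml() and the User-Agent string are made up for illustration, and the sketch assumes the page really wraps the article in a div with that id, which may change with the site's skin:

```php
<?php
// Hypothetical helper: download a URL with cURL and return the body, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Assumption: a descriptive User-Agent of your own; this value is a placeholder.
    curl_setopt($ch, CURLOPT_USERAGENT, 'CityInfoDemo/0.1 (contact@example.com)');
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Hypothetical helper: return the outer HTML of the first <div> with the given id,
// or null if no such element exists. $id is assumed to contain no double quotes.
function extractElementHtml(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog is a common trick to make loadHTML() treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(sprintf('//div[@id="%s"]', $id));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $doc->saveHTML($nodes->item(0));
}
```

To use it, pass the result of fetchPage('http://en.wikipedia.org/wiki/Combination') to extractElementHtml() with 'content' as the id, after checking that the fetch did not return null. To pull out only part of the article, the same DOMXPath object can be queried with a narrower expression, for example one that selects just the first paragraphs inside that div.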
