I’m trying to fetch the four or five events listed for this day in history and add a plaintext representation of each one to an array in PHP.
So far, I’m using this code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://en.wikipedia.org/w/api.php?action=featuredfeed&feed=onthisday&feedformat=rss');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 3);
curl_setopt($ch, CURLOPT_USERAGENT, 'My random user agent'); // Needed for Wikipedia to prevent IP blocking
$contents = trim(curl_exec($ch));
curl_close($ch);
$xml = simplexml_load_string($contents);
$json = json_encode($xml);
$array = json_decode($json, true);
$noOfDays = count($array['channel']['item']);
$r = $noOfDays - 1;
$input = $array['channel']['item'][$r]['description'];
I know this is not very dynamic or efficient, but one person is going to be calling this page once a day, so it’s not terribly important.
At this point, $input contains a block of HTML, which looks something like this:
<p><b><a href="/wiki/April_6" title="April 6">April 6</a></b>: <b><a href="/wiki/Good_Friday" title="Good Friday">Good Friday</a></b> (Western Christianity, 2012); <b><a href="/wiki/Fast_of_the_Firstborn" title="Fast of the Firstborn">Fast of the Firstborn</a></b> begins at dawn and <b><a href="/wiki/Passover" title="Passover">Passover</a></b> begins at sunset (Judaism, 2012)
</p>
<div style="float:right;margin-left:0.5em">
<p><a href="/wiki/File:Sir_Arthur_Wellesley,_1st_Duke_of_Wellington.png" class="image" title="Arthur Wellesley, the Earl of Wellington"><img alt="Arthur Wellesley, the Earl of Wellington" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/83/Sir_Arthur_Wellesley%2C_1st_Duke_of_Wellington.png/78px-Sir_Arthur_Wellesley%2C_1st_Duke_of_Wellington.png" width="78" height="100" /></a>
</p>
</div>
<li style="-moz-float-edge: content-box">
<a href="/wiki/1250" title="1250">1250</a> – <a href="/wiki/Seventh_Crusade" title="Seventh Crusade">Seventh Crusade</a>: Egyptian <a href="/wiki/Ayyubid" title="Ayyubid" class="mw-redirect">Ayyubids</a> <b><a href="/wiki/Battle_of_Fariskur" title="Battle of Fariskur">annihilated the crusader army</a></b> and captured King <a href="/wiki/Louis_IX_of_France" title="Louis IX of France">Louis IX of France</a> as a hostage.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1320" title="1320">1320</a> – The <b><a href="/wiki/Declaration_of_Arbroath" title="Declaration of Arbroath">Declaration of Arbroath</a></b>, a declaration of <a href="/wiki/Scottish_independence" title="Scottish independence">Scottish independence</a>, was adopted.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1812" title="1812">1812</a> – <a href="/wiki/Peninsular_War" title="Peninsular War">Peninsular War</a>: After a <b><a href="/wiki/Siege_of_Badajoz_(1812)" title="Siege of Badajoz (1812)">three-week siege</a></b>, the <a href="/wiki/Anglo-Portuguese_Army" title="Anglo-Portuguese Army">Anglo-Portuguese Army</a>, under the <a href="/wiki/Arthur_Wellesley,_1st_Duke_of_Wellington" title="Arthur Wellesley, 1st Duke of Wellington">Earl of Wellington</a> <i>(pictured)</i>, captured <a href="/wiki/Badajoz" title="Badajoz">Badajoz</a>, Spain and forced the surrender of the French garrison.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1947" title="1947">1947</a> – The <a href="/wiki/1st_Tony_Awards" title="1st Tony Awards">first</a> <b><a href="/wiki/Tony_Award" title="Tony Award">Tony Awards</a></b>, recognizing achievement in live American <a href="/wiki/Theatre" title="Theatre">theatre</a>, were handed out at the <a href="/wiki/Waldorf-Astoria_Hotel" title="Waldorf-Astoria Hotel">Waldorf-Astoria Hotel</a> in <a href="/wiki/New_York_City" title="New York City">New York City</a>.
<li style="-moz-float-edge: content-box">
<a href="/wiki/2008" title="2008">2008</a> – Egyptian workers staged <b><a href="/wiki/2008_Egyptian_general_strike" title="2008 Egyptian general strike">an illegal general strike</a></b>, two days before <a href="/wiki/Egyptian_municipal_elections,_2008" title="Egyptian municipal elections, 2008">key municipal elections</a>.
</li>
</ul>
<p>More anniversaries: <span class="nowrap"><a href="/wiki/April_5" title="April 5">April 5</a> –</span> <span class="nowrap"><b><a href="/wiki/April_6" title="April 6">April 6</a></b> –</span> <span class="nowrap"><a href="/wiki/April_7" title="April 7">April 7</a></span>
</p>
<div style="text-align: right;" class="noprint"><span class="nowrap"><b><a href="/wiki/Wikipedia:Selected_anniversaries/April" title="Wikipedia:Selected anniversaries/April">Archive</a></b> –</span> <span class="nowrap"><b><a href="https://lists.wikimedia.org/mailman/listinfo/daily-article-l" class="extiw" title="mail:daily-article-l">By email</a></b> –</span> <span class="nowrap"><b><a href="/wiki/List_of_historical_anniversaries" title="List of historical anniversaries">List of historical anniversaries</a></b></span></div>
<div style="text-align: right;"><small>It is now <span class="nowrap">April 6, 2012</span> (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>) – <span class="plainlinks" id="purgelink"><span class="nowrap"><a class="external text" href="//en.wikipedia.org/w/index.php?title=MediaWiki:Ffeed-onthisday-transcludeme&action=purge">Refresh this page</a></span></span></small></div>
The only parts I’m interested in are the bits after each <li style="-moz-float-edge: content-box">
I’ve got no idea why they didn’t close these <li> tags properly, but there you go.
So the essence of what I want to do is take the actual information, strip away the links, and add each item to an array, which should look something like this:
Array (
[0] => 1250 – Seventh Crusade: Egyptian Ayyubids annihilated the crusader army and captured King Louis IX of France as a hostage.
[1] => Next one...
[2] => And another...
)
There’s also a slight problem regarding the &nbsp; at the end of this line. How would I translate that into plaintext? I have a feeling HTML parsing may be the answer.
I’ve already tried regex and HTML parsing, but as the tags don’t close I’ve had some difficulty doing this.
Any suggestions?
As @zzzzBov points out, closing tags are optional in HTML (but not in XHTML). Unfortunately, this is one of several things that make HTML incompatible with XML (and XML parsers). For your task, I would recommend parsing the DOM using a library like phpQuery or PHP Simple HTML DOM Parser.
In phpQuery your code would look something like this:
As for &nbsp;, try html_entity_decode().
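If you would rather not pull in a library, PHP's built-in DOMDocument should be able to do the same job, since its HTML parser tolerates unclosed <li> tags. What follows is only a minimal sketch, assuming the DOM extension is enabled; extractListItems is a hypothetical helper name, and the sample $input is a shortened stand-in for the feed HTML from the question.

```php
<?php
// Hypothetical helper (not from the original post): returns the plain text
// of every <li> element found in an HTML fragment.
function extractListItems(string $html): array
{
    $doc = new DOMDocument();
    // The fragment is not a valid document, so keep libxml's warnings quiet.
    $previous = libxml_use_internal_errors(true);
    // The meta tag is there so that libxml reads the fragment as UTF-8.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = [];
    foreach ($doc->getElementsByTagName('li') as $li) {
        // textContent drops the tags and decodes entities; &nbsp; comes
        // back as U+00A0, which is swapped for an ordinary space here.
        $text = str_replace("\u{00A0}", ' ', $li->textContent);
        // Collapse newlines and repeated spaces left over from the markup.
        $items[] = trim(preg_replace('/\s+/u', ' ', $text));
    }
    return $items;
}

// Shortened sample in the same shape as the feed: unclosed <li> tags,
// links around the year, and an &nbsp; entity.
$input = '<p><b>April 6</b></p>'
    . '<li><a href="/wiki/1250">1250</a> – <a href="/wiki/Seventh_Crusade">Seventh Crusade</a>: captured King&nbsp;Louis IX.'
    . '<li><a href="/wiki/1320">1320</a> – The <b>Declaration of Arbroath</b> was adopted.</li>';

print_r(extractListItems($input));
```

On this path html_entity_decode() is not needed, because textContent already returns decoded text. If you go the strip_tags() route instead, html_entity_decode($text, ENT_QUOTES, 'UTF-8') followed by the same U+00A0 replacement should give a similar result.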