I'm having trouble parsing an HTML page using Beautiful Soup 3 and Python 2.6.

Editorial Team

Asked: June 14, 2026 (2026-06-14T22:34:40+00:00)

I'm having trouble parsing an HTML page using Beautiful Soup 3 and Python 2.6.

The HTML content is this:

content='<div class="egV2_EventReportCardLeftBlockShortWidth">
<span class="egV2_EventReportCardTitle">When</span>
<span class="egV2_EventReportCardBody">
<meta itemprop="startDate" content="2012-11-23T10:00:00.0000000">
<span class='egV2_archivedDateEnded'>STARTS</span>Fri 23 Nov,10:00AM<br/>
<meta itemprop="endDate" content="2012-12-03T18:00:00.0000000">
<span class='egV2_archivedDateEnded'>ENDS</span>Mon 03 Dec,6:00PM</span>
<span class="egV2_EventReportCardBody"></span>
<div class="egV2_div_cal" onclick=" showExportEvent()">
<div class="egV2_div_cal_outerFix">
<div class="egV2_div_cal_InnerAdjust"> Cal </div>
</div></div></div>'

I want to get the string 'Fri 23 Nov,10:00AM' out of the middle and into a variable, so I can concatenate it with other values and send it back to a PHP page.

To read this content, I use the following code (the content above comes from reading this HTML page: http://everguide.com.au/melbourne/event/2012-nov-23/life-with-bird-spring-warehouse-sale/):

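import urllib2
from BeautifulSoup import BeautifulSoup

# Page being scraped (the event URL mentioned above)
URL = 'http://everguide.com.au/melbourne/event/2012-nov-23/life-with-bird-spring-warehouse-sale/'

req = urllib2.Request(URL)
response = urllib2.urlopen(req)
html = response.read()
soup = BeautifulSoup(html.decode('utf-8'))

# Event name: join every text node under the itemprop="name" element
for node in soup.findAll(itemprop="name"):
    n = ''.join(node.findAll(text=True))
# "When" block: join every text node under the div, at any depth
for node in soup.findAll("div", { "class" : "egV2_EventReportCardLeftBlockShortWidth" }):
    d = ''.join(node.findAll(text=True))
print n, "|", d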

Which returns:

[(ssh user)]# python testscrape.py

LIFE with BIRD Spring Warehouse Sale | 
When
<span class="egV2_EventReportCardDateTitle">STARTS</span>
STARTSFri 23 Nov,10:00AMENDSMon 03 Dec,6:00PM
<span class="egV2_EventReportCardDateTitle">ENDS</span>



 Cal 



[(ssh user)]#

(And it includes all those line breaks, etc.)

So you can see there at the end, I'm grouping both of those stripped strings into one printout, with a separator character in the middle so PHP can read the string back as one and then break it apart.

The problem is that the Python code can read that page and store the text, but it includes all that rubbish (tags, etc.), which is confusing the PHP app.

I really just want returned:

Fri 23 Nov,10:00AM

Is it because I'm using the findAll(text=True) method?

How can I drill down and get just the text in that div, without the span tags too?

Any help would be greatly appreciated, thank you.

Rick – Melbourne.


1 Answer

Editorial Team · Answer 1 · 2026-06-14T22:34:42+00:00

Why not try something like this:

In [95]: soup = BeautifulSoup(content)

In [96]: soup.find("span", {"class": "egV2_archivedDateEnded"})
Out[96]: <span class="egV2_archivedDateEnded">STARTS</span>

In [97]: soup.find("span", {"class": "egV2_archivedDateEnded"}).next
Out[97]: u'STARTS'

In [98]: soup.find("span", {"class": "egV2_archivedDateEnded"}).next.next
Out[98]: u'Fri 23 Nov,10:00AM'

or even

In [99]: soup.find("span", {"class": "egV2_archivedDateEnded"}).nextSibling
Out[99]: u'Fri 23 Nov,10:00AM'
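
For anyone without Beautiful Soup 3 to hand, here is a rough, dependency-free sketch of the same "take the text node right after the span" idea, using Python 3's standard-library html.parser instead of Beautiful Soup. This is a swapped-in technique, not the library call from the session above. The class name TextAfterSpan and the variable names are hypothetical, the HTML is trimmed from the question, and the event name is copied from the question's output.

```python
from html.parser import HTMLParser


class TextAfterSpan(HTMLParser):
    """Collect the text that directly follows each <span class=target_class>.

    Hypothetical helper: a rough stand-in for .nextSibling on the span.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.in_target = False     # currently inside a matching span
        self.capture_next = False  # a matching span has just closed
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == self.target_class:
            self.in_target = True
        else:
            # Another tag came first, so the next text is not a direct sibling.
            self.capture_next = False

    def handle_endtag(self, tag):
        if tag == "span" and self.in_target:
            self.in_target = False
            self.capture_next = True
        else:
            self.capture_next = False

    def handle_data(self, data):
        if self.capture_next:
            if data.strip():
                self.found.append(data.strip())
            self.capture_next = False


# Trimmed copy of the HTML from the question.
content = """<span class="egV2_EventReportCardBody">
<meta itemprop="startDate" content="2012-11-23T10:00:00.0000000">
<span class='egV2_archivedDateEnded'>STARTS</span>Fri 23 Nov,10:00AM<br/>
<meta itemprop="endDate" content="2012-12-03T18:00:00.0000000">
<span class='egV2_archivedDateEnded'>ENDS</span>Mon 03 Dec,6:00PM</span>"""

parser = TextAfterSpan("egV2_archivedDateEnded")
parser.feed(content)
parser.close()

start_text = parser.found[0]                   # text after the STARTS span
name = "LIFE with BIRD Spring Warehouse Sale"  # event name from the question's output
line = name + "|" + start_text                 # single string for the PHP side to split on "|"
print(line)
```

The key difference from findAll(text=True) is that only the one text node sitting immediately after the matching span is kept, rather than every text node nested anywhere under the outer div.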
