Editorial Team
Asked: June 18, 2026

I am trying to parse through an HTML page which simplified looks like this:


<div class="anotherclass part">
  <a href="http://example.com" >
    <div class="column abc"><strike>&#163;3.99</strike><br>&#163;3.59</div>
    <div class="column def"></div>
    <div class="column ghi">1 Feb 2013</div>
    <div class="column jkl">
      <h4>A title</h4>
      <p>
        <img class="image" src="http://example.com/image.jpg">A, List, Of, Terms, To, Extract - 1 Feb 2013</p>
    </div>
  </a>
</div>

I am a beginner at Python, and I have read and re-read the BeautifulSoup documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

I have got this code:

import re

from BeautifulSoup import BeautifulSoup

with open("file.html") as fp:
  html = fp.read()

soup = BeautifulSoup(html)

# In the simplified HTML above, the class "part" is on the outer <div>, not on the <a>
parts = soup.findAll('div', attrs={"class": re.compile('part', re.IGNORECASE)})
for part in parts:
  mypart={}

  # ghi
  mypart['ghi'] = part.find(attrs={"class": re.compile('ghi')} ).string
  # def
  mypart['def'] = part.find(attrs={"class": re.compile('def')} ).string
  # h4
  mypart['title'] = part.find('h4').string

  # jkl
  mypart['other'] = part.find('p').string

  # abc
  pattern = re.compile( r'\&\#163\;(\d{1,}\.?\d{2}?)' )
  theprices = re.findall( pattern, str(part) )
  if len(theprices) == 2:
    mypart['price'] = theprices[1]
    mypart['rrp'] = theprices[0]
  elif len(theprices) == 1:
    mypart['price'] = theprices[0]
    mypart['rrp'] = theprices[0]
  else:
    mypart['price'] = None
    mypart['rrp'] = None

I want to extract any text from the classes def and ghi, which I think my script does correctly.

I also want to extract the two prices from abc, which my script does in a rather clunky fashion at the moment. Sometimes there are two prices, sometimes one, and sometimes none in this part.

Finally, I want to extract the "A, List, Of, Terms, To, Extract" part from class jkl, which my script fails to do. I thought getting the string part of the p tag would work, but I cannot understand why it does not. The date in this part always matches the date in class ghi, so it should be easy to replace/remove it.

Any advice? Thank you!

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Added an answer on June 18, 2026 at 5:10 am

    First, if you pass convertEntities=bs.BeautifulSoup.HTML_ENTITIES when constructing the soup, as in

    soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)
    

    then HTML entities such as &#163; will be converted to their corresponding Unicode character, such as £. This will allow you to use a simpler regex to identify the prices.
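
As a rough illustration of why decoding the entities helps, here is a minimal sketch that uses the Python 3 standard library's html.unescape in place of BeautifulSoup's convertEntities. This is a swapped-in technique, not what the code in this answer uses. The sample string and the PRICE pattern are assumptions made for the example:

```python
import html
import re

# Sample markup taken from the question's "abc" column.
raw = '<strike>&#163;3.99</strike><br>&#163;3.59'

# Decode numeric and named entities: "&#163;" becomes the pound sign.
decoded = html.unescape(raw)

# With a literal pound sign in the text, the pattern no longer has to
# spell out the escaped entity character by character.
PRICE = re.compile(r'£(\d+\.\d{2})')
prices = PRICE.findall(decoded)
print(prices)
```

Compare the pattern with the one in the question, which had to escape each character of the entity.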


    Now, given part, you can find the text content in the <div> with the prices using its contents attribute:

    In [37]: part.find(attrs={"class": re.compile('abc')}).contents
    Out[37]: [<strike>£3.99</strike>, <br />, u'\xa33.59']
    

    All we need to do is extract the number from each item, or skip it if there is no number:

    def parse_price(text):
        try:
            return float(re.search(r'\d*\.\d+', text).group())
        except (TypeError, ValueError, AttributeError):
            return None
    
    price = []
    for item in part.find(attrs={"class": re.compile('abc')}).contents:
        value = parse_price(item.string)
        # Compare against None so that a genuine 0.0 price is not skipped
        if value is not None:
            price.append(value)
    

    At this point price will be a list of 0, 1, or 2 floats.
    We would like to say

    mypart['rrp'], mypart['price'] = price
    

    but that would not work if price is [] or contains only one item.

    Your method of handling the three cases with if..else is okay; it is the most straightforward and arguably the most readable way to proceed. But it is also a bit mundane. If you’d like something a little more terse, you could do the following.

    Since we want to repeat the same price if price contains only one item, you might be led to think about itertools.cycle.

    In the case where price is the empty list, [], we want itertools.cycle([None]), but otherwise we could use itertools.cycle(price).

    So to combine both cases into one expression, we could use

    price = itertools.cycle(price or [None])
    mypart['rrp'], mypart['price'] = next(price), next(price)
    

    The next function peels off the values in the iterator price one by one. Since price is cycling through its values, it will never end; it will just keep yielding the values in sequence and then start over again if necessary, which is just what we want.
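
To see the three cases side by side, here is a small self-contained sketch. The helper name rrp_and_price is hypothetical (it is not part of the code in this answer) and simply wraps the two lines shown earlier:

```python
import itertools


def rrp_and_price(prices):
    """Return an (rrp, price) pair from a list of 0, 1 or 2 prices."""
    # An empty list is falsy, so fall back to [None] and cycle over that.
    cycled = itertools.cycle(prices or [None])
    return next(cycled), next(cycled)


print(rrp_and_price([3.99, 3.59]))  # two prices: rrp first, then price
print(rrp_and_price([3.59]))        # one price: repeated for both
print(rrp_and_price([]))            # no price: both are None
```

Wrapping the idiom in a function also keeps the name price from being rebound from a list to an iterator.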


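    As for why part.find('p').string does not work: in BeautifulSoup 3, a tag's string attribute is only set when the tag has exactly one child and that child is a string. Here the <p> contains both an <img> tag and some text, so string is None. The A, List, Of, Terms, To, Extract - 1 Feb 2013 text can instead be obtained, again, through the contents attribute: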

    # jkl
    mypart['other'] = [item for item in part.find('p').contents
                       if not isinstance(item, bs.Tag) and item.string.strip()]
    

    So, the full runnable code would look like:

    import BeautifulSoup as bs
    import os
    import re
    import itertools as IT
    
    def parse_price(text):
        try:
            return float(re.search(r'\d*\.\d+', text).group())
        except (TypeError, ValueError, AttributeError):
            return None
    
    filename = os.path.expanduser("~/tmp/file.html")
    with open(filename) as fp:
        html = fp.read()
    
    soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)
    
    for part in soup.findAll('div', attrs={"class": re.compile('(?i)part')}):
        mypart = {}
        # abc
        price = []
        for item in part.find(attrs={"class": re.compile('abc')}).contents:
            value = parse_price(item.string)
            # Compare against None so that a genuine 0.0 price is not skipped
            if value is not None:
                price.append(value)
    
        price = IT.cycle(price or [None])
        mypart['rrp'], mypart['price'] = next(price), next(price)
    
        # jkl
        mypart['other'] = [item for item in part.find('p').contents
                           if not isinstance(item, bs.Tag) and item.string.strip()]
    
        print(mypart)
    

    which yields

    {'price': 3.59, 'other': [u'A, List, Of, Terms, To, Extract - 1 Feb 2013'], 'rrp': 3.99}
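
The question also asks for just the terms, without the trailing date. The code above stops short of that, so here is a minimal sketch of one way to do it, assuming the date text from the ghi column appears verbatim at the end of the string after a " - " separator. The function name extract_terms and its parameters are made up for this example:

```python
def extract_terms(text, date):
    """Split 'A, B, C - <date>' into ['A', 'B', 'C']."""
    text = text.strip()
    suffix = ' - ' + date.strip()
    # Drop the trailing date only if it is really there.
    if text.endswith(suffix):
        text = text[:-len(suffix)]
    # Split on commas and discard empty pieces.
    return [term.strip() for term in text.split(',') if term.strip()]


terms = extract_terms('A, List, Of, Terms, To, Extract - 1 Feb 2013',
                      '1 Feb 2013')
print(terms)
```

In the loop above, the two arguments would be the single string in mypart['other'] and the text of the ghi column.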
    