I extract some code from a web page ( http://www.opensolaris.org/os/community/on/flag-days/all/ ) like follows, <tr

Question

0

Asked: May 11, 20262026-05-11T09:16:32+00:00 2026-05-11T09:16:32+00:00

I extract some code from a web page ( http://www.opensolaris.org/os/community/on/flag-days/all/ ) like follows, <tr

0

I extract some code from a web page (http://www.opensolaris.org/os/community/on/flag-days/all/) like follows,

<tr class='build'>   <th colspan='0'>Build 110</th> </tr> <tr class='arccase project flagday'>   <td>Feb-25</td>   <td></td>   <td></td>   <td></td>   <td>     <a href='../pages/2009022501/'>Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />     cpupm keyword mode extensions - <a href='/os/community/arc/caselog/2008/777/'>PSARC/2008/777</a><br />     CPU Deep Idle Keyword - <a href='/os/community/arc/caselog/2008/663/'>PSARC/2008/663</a><br />   </td> </tr>

and there are some relative url path in it, now I want to search it with regular expression and replace them with absolute url path. Since I know urljoin can do the replace work like that,

>>> urljoin('http://www.opensolaris.org/os/community/on/flag-days/all/', ...         '/os/community/arc/caselog/2008/777/') 'http://www.opensolaris.org/os/community/arc/caselog/2008/777/'

Now I want to know that how to search them using regular expressions, and finally tanslate the code to,

<tr class='build'>   <th colspan='0'>Build 110</th> </tr> <tr class='arccase project flagday'>   <td>Feb-25</td>   <td></td>   <td></td>   <td></td>   <td>     <a href='http://www.opensolaris.org/os/community/on/flag-days/all//pages/2009022501/'>Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />     cpupm keyword mode extensions - <a href='http://www.opensolaris.org/os/community/arc/caselog/2008/777/'>PSARC/2008/777</a><br />     CPU Deep Idle Keyword - <a href='http://www.opensolaris.org/os/community/arc/caselog/2008/663/'>PSARC/2008/663</a><br />   </td> </tr>

My knowledge of regular expressions is so poor that I want to know how to do that. Thanks

I have finished the work using Beautiful Soup, haha~ Thx for everybody!

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

score 0 · Answer 1 · 2026-05-11T09:16:32+00:00

First, I’d recommend using a HTML parser, such as BeautifulSoup. HTML is not a regular language, and thus can’t be parsed fully by regular expressions alone. Parts of HTML can be parsed though.

If you don’t want to use a full HTML parser, you could use something like this to approximate the work:

import re, urlparse  find_re = re.compile(r'\bhref\s*=\s*('[^']*'|\'[^\']*\'|[^'\'<>=\s]+)')  def fix_urls(document, base_url):     ret = []     last_end = 0     for match in find_re.finditer(document):         url = match.group(1)         if url[0] in '\''':             url = url.strip(url[0])         parsed = urlparse.urlparse(url)         if parsed.scheme == parsed.netloc == '': #relative to domain             url = urlparse.urljoin(base_url, url)             ret.append(document[last_end:match.start(1)])             ret.append(''%s'' % (url,))             last_end = match.end(1)     ret.append(document[last_end:])     return ''.join(ret)

Example:

>>> document = '''<tr class='build'><th colspan='0'>Build 110</th></tr> <tr class='arccase project flagday'><td>Feb-25</td><td></td><td></td><td></td><td><a href='../pages/2009022501/'>Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href='/os/community/arc/caselog/2008/777/'>PSARC/2008/777</a><br /> CPU Deep Idle Keyword - <a href='/os/community/arc/caselog/2008/663/'>PSARC/2008/663</a><br /></td></tr>''' >>> fix_urls(document,'http://www.opensolaris.org/os/community/on/flag-days/all/') '<tr class='build'><th colspan='0'>Build 110</th></tr> <tr class='arccase project flagday'><td>Feb-25</td><td></td><td></td><td></td><td><a href='http://www.opensolaris.org/os/community/on/flag-days/pages/2009022501/'>Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href='http://www.opensolaris.org/os/community/arc/caselog/2008/777/'>PSARC/2008/777</a><br /> CPU Deep Idle Keyword - <a href='http://www.opensolaris.org/os/community/arc/caselog/2008/663/'>PSARC/2008/663</a><br /></td></tr>' >>>

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I extract some code from a web page ( http://www.opensolaris.org/os/community/on/flag-days/all/ ) like follows, <tr

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply