I am using Python and Beautiful Soup to obtain url of available software from

Question

Asked: June 8, 2026

I am using Python and Beautiful Soup to obtain the URLs of available software from the Civic Commons – Social Media page. I want the links to all of the Social Media software (spread across 20 pages). So far I am only able to get the URLs of the software listed on the first page.

Below is the Python code that I wrote for obtaining these values.

from bs4 import BeautifulSoup
import re
import urllib2

base_url = "http://civiccommons.org"
url = "http://civiccommons.org/software-functions/social-media"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

list_of_links = [] 
for link_tag in soup.findAll('a', href=re.compile('^/apps/.*')):
   string_temp_link = base_url+link_tag.get('href')
   list_of_links.append(string_temp_link)

list_of_links = list(set(list_of_links))  

for link_item in list_of_links:
   print link_item

print ("\n")

#Newly added code to get all Next Page links from a url    
next_page_links = [] 
for link_tag in soup.findAll('a', href=re.compile('^/.*page=')):
   string_temp_link = base_url+link_tag.get('href')
   next_page_links.append(string_temp_link)
for next_page in next_page_links:
   print next_page

I used the regex '^/apps/.*' to get the list of software.

I would like to know if there is a better approach to crawling through the following pages. I am able to match the next-page links using the regex '^/.*page=', but this gives a list with repeated pages.

How can I do this in a better way?

1 Answer

Editorial Team · Answer 1 · 2026-06-08T03:47:25+00:00

Looking at the page, there are 5 pages, the last of which is “…?page=4”, so we know there’s the first page, then page=1 through page=4…

<li class="pager-last last">
<a href="/software-licenses/gpl?page=4" title="Go to last page">last »</a>
</li>

So you could retrieve that by the class (or by title), then parse the href…

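from urlparse import urlparse, parse_qs  # Python 2 module

# last_href is the href of the "last" pager link, e.g. "/software-licenses/gpl?page=4"
last_page = int(parse_qs(urlparse(last_href).query)['page'][0])
for pageno in xrange(1, last_page + 1):
    pass # do something useful here like building a url string with pageno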
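To make the placeholder in that loop concrete, here is a minimal sketch of one way to turn the "last" pager href into the full list of page URLs. It assumes Python 3 (so `urllib.parse` stands in for the Python 2 `urlparse` module used above), and the helper name `build_page_urls` and the example listing URL are only illustrative; they are not part of the original code.

```python
from urllib.parse import urlparse, parse_qs, urljoin


def build_page_urls(listing_url, last_href):
    """Return the URLs of every page of a paginated listing.

    listing_url: URL of the first page (it has no ?page= parameter).
    last_href:   href of the "last" pager link, e.g. "/software-licenses/gpl?page=4".
    (Hypothetical helper, written for illustration only.)
    """
    # Resolve the relative href against the listing URL, then read ?page=N.
    last_url = urljoin(listing_url, last_href)
    last_page = int(parse_qs(urlparse(last_url).query)["page"][0])

    # The first page is the bare listing URL; the rest are ?page=1 .. ?page=N.
    urls = [listing_url]
    for pageno in range(1, last_page + 1):
        urls.append("{}?page={}".format(listing_url, pageno))
    return urls


# With Beautiful Soup, last_href could be read from the pager markup shown above,
# for example: soup.find("li", class_="pager-last").a["href"]
pages = build_page_urls(
    "http://civiccommons.org/software-licenses/gpl",  # illustrative listing URL
    "/software-licenses/gpl?page=4",
)
for page_url in pages:
    print(page_url)
```

Because each page URL is generated exactly once from the page count, this avoids the repeated entries that come from matching every pager link with a regex. Each URL in `pages` could then be fetched and scanned for the `/apps/` links in the same way as the first page.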
