I’m trying to write a basic web crawler in Python. The trouble I have


Asked: June 14, 2026


I’m trying to write a basic web crawler in Python. The trouble I have is parsing the page to extract URLs. I’ve tried both BeautifulSoup and regex, but I cannot find an efficient solution.

As an example, I’m trying to extract all the member URLs from Facebook’s GitHub page (https://github.com/facebook?tab=members). The code I’ve written extracts the member URLs:

import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup  # assumed package; the original imports were not shown

def getMembers(url):
    # Retrieve every user from the company page,
    # e.g. url = "https://github.com/facebook?tab=members"
    text = urllib2.urlopen(url).read()
    soup = BeautifulSoup(text)
    memberList = []

    # Each <ul class="members-list"> holds one <li> per member
    data = soup.findAll('ul', attrs={'class': 'members-list'})
    for ul in data:
        for item in ul.findAll('li'):
            memberList.append("https://github.com" + str(item.a['href']))

    return memberList

However, this takes quite a while to parse, and I was wondering whether I could do it more efficiently, since the crawling process takes too long.


1 Answer

Editorial Team · Answer 1 · 2026-06-14T04:58:03+00:00


I suggest using the GitHub API, which lets you do exactly what you want to accomplish. After that, it’s only a matter of running the response through a JSON parser and you are done.

http://developer.github.com/v3/orgs/members/
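As a rough illustration of that suggestion, here is a minimal sketch that asks the organization members endpoint (GET /orgs/{org}/members) for JSON instead of parsing HTML. It is only one possible approach: the function names, the User-Agent string, and the page-until-empty loop are my own assumptions, and it uses Python 3's urllib.request rather than the urllib2 module from the question. Without authentication the endpoint returns public members only and is rate limited.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def member_urls_from_json(payload):
    """Extract profile URLs from the JSON text of a members response."""
    # Each entry in the response list is a user object with an "html_url" field.
    return [member["html_url"] for member in json.loads(payload)]


def get_members(org, per_page=100):
    """Collect profile URLs of an organization's public members, page by page."""
    urls = []
    page = 1
    while True:
        api_url = "%s/orgs/%s/members?per_page=%d&page=%d" % (
            API_ROOT, org, per_page, page)
        # Hypothetical User-Agent value; any descriptive string should do.
        request = urllib.request.Request(
            api_url, headers={"User-Agent": "member-crawler-sketch"})
        with urllib.request.urlopen(request) as response:
            batch = member_urls_from_json(response.read().decode("utf-8"))
        if not batch:
            # An empty page means there are no more members to fetch.
            break
        urls.extend(batch)
        page += 1
    return urls
```

Calling get_members("facebook") should then yield the same kind of list as getMembers in the question, without downloading or parsing any HTML.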
