Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6775325
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 26, 20262026-05-26T15:53:37+00:00 2026-05-26T15:53:37+00:00

So I am relatively new to python, and in order to learn, I have

  • 0

So I am relatively new to python, and in order to learn, I have started writing a program that goes online to wikipedia, finds the first link in the overview section of a random article, follows that link and keeps going until it either enters a loop or finds the philosophy page (as detailed here) and then repeats this process for a new random article a specified number of times. I then want to collect the results in some form of useful data structure, so that I can pass the data to R using the Rpy library so that I can draw some sort of network diagram (R is pretty good at drawing things like that) with each node in the diagram representing the pages visited, and the arrows that paths taken from the starting article to the philosophy page.

So I have no problem getting python to return the fairly structured html from wiki but there are some problems that I can’t quite figure out. Up till now I have selected the first link using a cssselector from the lxml library. It selects for the first link ( in an a tag) that is a direct descendant of a p tag, that is a direct descendant of a div tag with class=”mw-content-ltr” like this:

    user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)'
    values = {'name' : 'David Kavanagh',
      'location' : 'Belfast',
      'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }
    encodes = urllib.urlencode(values)
    req = urllib2.Request(url, encodes, headers)
    page = urllib2.urlopen(req)
    root = parse(page).getroot()
    return root.cssselect("div.mw-content-ltr>p>a")[0].get('href')

This code resides in a function which I use to find the first link in the page. It works for the most part but the problem is if the first link is inside some other tag as opposed to being a direct descendant of a p tag like let’s say a b tag or something then I miss it. As you can see from the wiki article above, links in italics or inside parentheses aren’t eligible for the game, which means that I never get a link in italics (good) but frequently do get links that are inside parentheses (bad) and sometimes miss the first link on a page like the first link on the Chair article, which is stool, but it is in bold, so I don’t get it. I have tried removing the direct descendant stipulation but then I frequently get links that are “above” the overview section, that are usually in the side box, in a p tag, in a table, in the same div as the overview section.

So the first part of my question is:

How could I use cssselectors or some other function or library to select the first link in the overview section that is not inside parentheses or in italics. I thought about using regular expressions to look through the raw html but that seems like a very clunky solution and I thought that there might be something a bit nicer out there that I haven’t thought of.

So currently I am storing the results in a list of lists. So I have a list called paths, in which there are lists that contain strings that contain the title of the wiki article.

The second part of the question is:
How can I traverse this list of lists to represent the multiple convergent paths? Is storing the results like this a good idea? Since the end diagram should look something like an upside down tree, I thought about making some kind of tree class, but that seems like a lot of work for something that is conceptually, fairly simple.

Any ideas or suggestions would be greatly appreciated.
Cheers,
Davy

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-26T15:53:38+00:00Added an answer on May 26, 2026 at 3:53 pm

    I’ll just answer the second question:

    For a start, just keep a dict mapping one Wikipedia article title to the next. This will make it easy and fast to check I’ve you’ve hit an article you’ve already found before. Essentially this is just storing a directed graph’s vertices, indexed by their origins.

    If you get to the point where a Python dict is not efficient enough (it does have a significant memory overhead, once you have millions of items memory can be an issue) you can find a more efficient graph data structure to suit your needs.

    EDIT

    Ok, I’ll answer the first question as well…

    For the first part, I highly recommend using the MediaWiki API instead of getting the HTML version and parsing it. The API enables querying for certain types of links, for instance just inter-wiki links or just inter-language links. Additionally, there are Python client libraries for this API, which should make using it from Python code simple.

    Don’t parse a website’s HTML if it provides a comprehensive and well-documented API!

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I am relatively new to Python, and I am experimenting with writing the following
Relatively new to python. I recently posted a question in regards to validating that
I'm relatively new to python, have done it a bit for my physics university
I'm relatively new to Python and am having problems programming with Scapy, the Python
I am relatively new to python and app engine, and I just finished my
I'm relatively new to the Python world, but this seems very straight forward. Google
I'm a relatively new convert to Python. I've written some code to grab/graph data
I'm relatively new to wxPython (but not Python itself), so forgive me if I've
I'm relatively new to J2ME and about to begin my first serious project. My
I am relatively new to Python and was hoping someone could explain the following

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.