I’m a newbie programmer trying to jump into Python by building a script

Asked: May 18, 2026 (2026-05-18T09:34:26+00:00)

I’m a newbie programmer trying to jump into Python by building a script that scrapes http://en.wikipedia.org/wiki/2000s_in_film and extracts a list of "Movie Title (Year)" entries.
My HTML source looks like:

<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>

I’d like all the li tags following the first h3 tag and stopping at the next h2 tag, including all nested li tags.

firstH3 = soup.find('h3')

…correctly finds the place I’d like to start.

firstH3 = soup.find('h3') # Start here
uls = []
for nextSibling in firstH3.findNextSiblings():
    if nextSibling.name == 'h2':
        break
    if nextSibling.name == 'ul':
        uls.append(nextSibling)

…gives me a list, uls, of ul elements whose li contents are what I need.
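For anyone trying this on a current install, here is a minimal sketch of the same sibling walk. It assumes the bs4 package (Beautiful Soup 4) and its find_next_siblings spelling; SAMPLE_HTML and collect_uls are made-up names for illustration, and the markup is a trimmed version of the sample above with one extra list after the h2 to show where the walk stops.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, modelled on the structure described in the question.
SAMPLE_HTML = """
<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <li>Parent item
        <ul>
            <li>Nested list items</li>
        </ul>
    </li>
</ul>
<h2>Header 2 (end here)</h2>
<ul>
    <li>Past the h2, should be ignored</li>
</ul>
"""


def collect_uls(html):
    """Return the top-level ul tags between the first h3 and the next h2."""
    soup = BeautifulSoup(html, "html.parser")
    first_h3 = soup.find("h3")
    uls = []
    # find_next_siblings() yields only tags at the same nesting level as the h3.
    for sibling in first_h3.find_next_siblings():
        if sibling.name == "h2":
            break  # stop at the section boundary
        if sibling.name == "ul":
            uls.append(sibling)
    return uls


uls = collect_uls(SAMPLE_HTML)
print(len(uls))
```

Only the two top-level lists before the h2 are collected; the nested ul is not a sibling of the h3, so it stays inside its parent and is reached later through the li search.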

Excerpt of the uls list:

<ul>
...
    <li><i><a href="/wiki/Agent_Cody_Banks" title="Agent Cody Banks">Agent Cody Banks</a></i> (2003)</li>
    <li><i><a href="/wiki/Agent_Cody_Banks_2:_Destination_London" title="Agent Cody Banks 2: Destination London">Agent Cody Banks 2: Destination London</a></i> (2004)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup" title="Air Bud: World Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Bud:_Seventh_Inning_Fetch" title="Air Bud: Seventh Inning Fetch">Air Bud: Seventh Inning Fetch</a></i> (2002)</li>
            <li><i><a href="/wiki/Air_Bud:_Spikes_Back" title="Air Bud: Spikes Back">Air Bud: Spikes Back</a></i> (2003)</li>
            <li><i><a href="/wiki/Air_Buddies" title="Air Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee" title="Akeelah and the Bee">Akeelah and the Bee</a></i> (2006)</li>
...
</ul>

But I’m unsure of where to go from here.

Update:

Final Code:

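lis = []
for ul in uls:
    for li in ul.findAll('li'):
        # Skip parent items: their text repeats the nested LIs below them.
        if li.find('ul'):
            continue
        lis.append(li)

for li in lis:
    # Python 2: encode so non-ASCII titles print safely.
    # On Python 3, print(li.text) is enough.
    print(li.text.encode("utf-8"))

The if…continue skips the LI’s that contain UL’s, since their text would duplicate the nested LI’s, which findAll already returns on their own. Using continue rather than break matters: break would stop at the first such LI and drop everything after it in that UL.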
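To see the effect of filtering out parent items on the excerpt above, here is a small sketch. It assumes bs4 (Beautiful Soup 4); SAMPLE_UL is a shortened copy of the excerpt and leaf_items is a hypothetical helper name, not something from the original script.

```python
from bs4 import BeautifulSoup

# Shortened copy of the excerpt from the question.
SAMPLE_UL = """
<ul>
    <li><i><a href="/wiki/Agent_Cody_Banks">Agent Cody Banks</a></i> (2003)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee">Akeelah and the Bee</a></i> (2006)</li>
</ul>
"""


def leaf_items(ul):
    """Return every li under ul that does not itself contain a nested ul."""
    return [li for li in ul.find_all("li") if li.find("ul") is None]


soup = BeautifulSoup(SAMPLE_UL, "html.parser")
titles = [li.get_text().strip() for li in leaf_items(soup.find("ul"))]
for title in titles:
    print(title)
```

The "Air Bud series:" item is skipped, while its nested entries and the items that follow it are all kept.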

Print output is now:

102 Dalmatians(2000)

10th & Wolf(2006)

11:14(2006)

12:08 East of Bucharest(2006)

13 Going on 30(2004)

1408(2007)

…
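If the end goal is separate title and year values, printed lines like these could be split with a regular expression. This is only a sketch using the standard re module; ENTRY_RE and split_entry are hypothetical names, and it assumes every entry ends with a four-digit year in parentheses.

```python
import re

# Title, optional whitespace, then a four-digit year in parentheses at the end.
ENTRY_RE = re.compile(r"^(?P<title>.+?)\s*\((?P<year>\d{4})\)\s*$")


def split_entry(text):
    """Return (title, year) for 'Title (Year)' text, or None if it does not match."""
    match = ENTRY_RE.match(text.strip())
    if match is None:
        return None
    return match.group("title"), int(match.group("year"))


lines = [
    "102 Dalmatians(2000)",
    "10th & Wolf(2006)",
    "11:14(2006)",
    "13 Going on 30(2004)",
    "1408(2007)",
    "Air Bud series:",  # a heading-style item with no year
]
parsed = [split_entry(line) for line in lines]
for item in parsed:
    print(item)
```

Entries without a trailing year, such as a series heading, come back as None and can be filtered out.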

1 Answer

Editorial Team · Answer 1 · 2026-05-18T09:34:27+00:00

.findAll() searches recursively by default, so it also returns nested li elements:

for ul in uls:
    for li in ul.findAll('li'):
        print(li)

Output:

<li>List items</li>
<li>Etc...</li>
<li>List items</li>
<li>Nested list items</li>
<li>Nested list items</li>
<li>List items</li>
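As a rough illustration of that recursive behaviour, the sketch below compares the default search with recursive=False, which limits the search to direct children. It assumes bs4 (Beautiful Soup 4), where the method is spelled find_all; the html string is a made-up sample in the spirit of the question's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with one nested list.
html = """
<ul>
    <li>List items</li>
    <li>Parent item
        <ul>
            <li>Nested list items</li>
        </ul>
    </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ul")

# Default: every li descendant, at any depth.
all_items = outer.find_all("li")

# recursive=False: only li tags that are direct children of the outer ul.
direct_items = outer.find_all("li", recursive=False)

print(len(all_items))
print(len(direct_items))
```

Note that the text of a parent li still includes the text of its nested items, which is why printing every li from the recursive search can show nested entries twice.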
